PyNLPl-1.2.9/LICENSE

GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.
PyNLPl-1.2.9/MANIFEST.in

include README.rst
include LICENSE
include requirements.txt
recursive-include pynlpl *.py
include pynlpl/tests/test.sh
include pynlpl/tests/evaluation_timbl/*
PyNLPl-1.2.9/PKG-INFO

Metadata-Version: 1.1
Name: PyNLPl
Version: 1.2.9
Summary: PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl contains modules for basic tasks, clients for interfacing with servers, and modules for parsing several file formats common in NLP, most notably FoLiA.
Home-page: https://github.com/proycon/pynlpl
Author: Maarten van Gompel
Author-email: proycon@anaproy.nl
License: GPL
Keywords: nlp computational_linguistics search ngrams language_models linguistics toolkit
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: POSIX
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
PyNLPl-1.2.9/PyNLPl.egg-info/PKG-INFO
PyNLPl-1.2.9/PyNLPl.egg-info/SOURCES.txt

LICENSE
MANIFEST.in
README.rst
requirements.txt
setup.cfg
setup.py
PyNLPl.egg-info/PKG-INFO
PyNLPl.egg-info/SOURCES.txt
PyNLPl.egg-info/dependency_links.txt
PyNLPl.egg-info/entry_points.txt
PyNLPl.egg-info/not-zip-safe
PyNLPl.egg-info/requires.txt
PyNLPl.egg-info/top_level.txt
pynlpl/__init__.py
pynlpl/algorithms.py
pynlpl/common.py
pynlpl/datatypes.py
pynlpl/evaluation.py
pynlpl/fsa.py
pynlpl/net.py
pynlpl/search.py
pynlpl/statistics.py
pynlpl/tagger.py
pynlpl/textprocessors.py
pynlpl/clients/__init__.py
pynlpl/clients/cornetto.py
pynlpl/clients/freeling.py
pynlpl/clients/frogclient.py
pynlpl/formats/__init__.py
pynlpl/formats/cgn.py
pynlpl/formats/cql.py
pynlpl/formats/dutchsemcor.py
pynlpl/formats/folia.py
pynlpl/formats/foliaset.py
pynlpl/formats/fql.py
pynlpl/formats/giza.py
pynlpl/formats/imdi.py
pynlpl/formats/moses.py
pynlpl/formats/sonar.py
pynlpl/formats/taggerdata.py
pynlpl/formats/timbl.py
pynlpl/lm/__init__.py
pynlpl/lm/client.py
pynlpl/lm/lm.py
pynlpl/lm/server.py
pynlpl/lm/srilm.py
pynlpl/mt/__init__.py
pynlpl/mt/wordalign.py
pynlpl/tests/__init__.py
pynlpl/tests/cgn.py
pynlpl/tests/cql.py
pynlpl/tests/datatypes.py
pynlpl/tests/evaluation.py
pynlpl/tests/folia.py
pynlpl/tests/folia_benchmark.py
pynlpl/tests/formats.py
pynlpl/tests/fql.py
pynlpl/tests/search.py
pynlpl/tests/statistics.py
pynlpl/tests/test.sh
pynlpl/tests/textprocessors.py
pynlpl/tests/FoLiA/setup.py
pynlpl/tests/FoLiA/foliatools/__init__.py
pynlpl/tests/FoLiA/foliatools/alpino2folia.py
pynlpl/tests/FoLiA/foliatools/cgn2folia.py
pynlpl/tests/FoLiA/foliatools/dcoi2folia.py
pynlpl/tests/FoLiA/foliatools/folia2annotatedtxt.py
pynlpl/tests/FoLiA/foliatools/folia2columns.py
pynlpl/tests/FoLiA/foliatools/folia2dcoi.py
pynlpl/tests/FoLiA/foliatools/folia2html.py
pynlpl/tests/FoLiA/foliatools/folia2rst.py
pynlpl/tests/FoLiA/foliatools/folia2txt.py
pynlpl/tests/FoLiA/foliatools/foliacat.py
pynlpl/tests/FoLiA/foliatools/foliacorrect.py
pynlpl/tests/FoLiA/foliatools/foliacount.py
pynlpl/tests/FoLiA/foliatools/foliafreqlist.py
pynlpl/tests/FoLiA/foliatools/foliaid.py
pynlpl/tests/FoLiA/foliatools/foliamerge.py
pynlpl/tests/FoLiA/foliatools/foliaquery.py
pynlpl/tests/FoLiA/foliatools/foliaquery1.py
pynlpl/tests/FoLiA/foliatools/foliasetdefinition.py
pynlpl/tests/FoLiA/foliatools/foliaspec.py
pynlpl/tests/FoLiA/foliatools/foliaspec2json.py
pynlpl/tests/FoLiA/foliatools/foliatextcontent.py
pynlpl/tests/FoLiA/foliatools/foliatree.py
pynlpl/tests/FoLiA/foliatools/foliavalidator.py
pynlpl/tests/FoLiA/foliatools/rst2folia.py
pynlpl/tests/FoLiA/foliatools/xslt.py
pynlpl/tests/FoLiA/schemas/generaterng.py
pynlpl/tests/evaluation_timbl/test
pynlpl/tests/evaluation_timbl/test.IB1.O.gr.k1.out
pynlpl/tests/evaluation_timbl/timbltest.sh
pynlpl/tests/evaluation_timbl/train
pynlpl/tools/__init__.py
pynlpl/tools/computepmi.py
pynlpl/tools/foliasplitcgnpostags.py
pynlpl/tools/freqlist.py
pynlpl/tools/frogwrapper.py
pynlpl/tools/phrasetableserver.py
pynlpl/tools/reflow.py
pynlpl/tools/sampler.py
pynlpl/tools/sonar2folia.py
pynlpl/tools/sonarlemmafreqlist.py

PyNLPl-1.2.9/PyNLPl.egg-info/dependency_links.txt
PyNLPl-1.2.9/PyNLPl.egg-info/entry_points.txt

[console_scripts]
pynlpl-computepmi = pynlpl.tools.computepmi:main
pynlpl-makefreqlist = pynlpl.tools.freqlist:main
pynlpl-sampler = pynlpl.tools.sampler:main
PyNLPl-1.2.9/PyNLPl.egg-info/not-zip-safe
PyNLPl-1.2.9/PyNLPl.egg-info/requires.txt

lxml>=2.2
httplib2>=0.6
rdflib
PyNLPl-1.2.9/PyNLPl.egg-info/top_level.txt

pynlpl
PyNLPl-1.2.9/README.rst

PyNLPl - Python Natural Language Processing Library
=====================================================
.. image:: https://travis-ci.org/proycon/pynlpl.svg?branch=master
:target: https://travis-ci.org/proycon/pynlpl
.. image:: http://readthedocs.org/projects/pynlpl/badge/?version=latest
:target: http://pynlpl.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status
.. image:: http://applejack.science.ru.nl/lamabadge.php/pynlpl
:target: http://applejack.science.ru.nl/languagemachines/
.. image:: https://zenodo.org/badge/759484.svg
:target: https://zenodo.org/badge/latestdoi/759484
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language
Processing. It contains various modules useful for common, and less common, NLP
tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and
frequency lists, and to build simple language models. There are also more
complex data types and algorithms. Moreover, there are parsers for file formats
common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to
interface with various NLP specific servers. PyNLPl most notably features a
very extensive library for working with FoLiA XML (Format for Linguistic
Annotation).
The library is divided into several packages and modules. It works on Python
2.7, as well as Python 3.
The following modules are available:
- ``pynlpl.datatypes`` - Extra datatypes (priority queues, patterns, tries)
- ``pynlpl.evaluation`` - Evaluation & experiment classes (parameter search, wrapped
progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool)
- ``pynlpl.formats.cgn`` - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags
- ``pynlpl.formats.folia`` - Extensive library for reading and manipulating
documents in the `FoLiA <https://proycon.github.io/folia>`_ format (Format for Linguistic Annotation).
- ``pynlpl.formats.fql`` - Extensive library for the FoLiA Query Language (FQL),
built on top of ``pynlpl.formats.folia``. FQL is currently documented `here
<https://github.com/proycon/foliadocserve/blob/master/README.rst>`__.
- ``pynlpl.formats.cql`` - Parser for the Corpus Query Language (CQL), as also used by
Corpus Workbench and Sketch Engine. Contains a converter to FQL.
- ``pynlpl.formats.giza`` - Module for reading GIZA++ word alignment data
- ``pynlpl.formats.moses`` - Module for reading Moses phrase-translation tables.
- ``pynlpl.formats.sonar`` - Largely obsolete module for pre-releases of the
SoNaR corpus, use ``pynlpl.formats.folia`` instead.
- ``pynlpl.formats.timbl`` - Module for reading Timbl output (consider using
`python-timbl <https://github.com/proycon/python-timbl>`_ instead though)
- ``pynlpl.lm.lm`` - Module for simple language models, including a reader for
ARPA language model data (as used by SRILM).
- ``pynlpl.search`` - Various search algorithms (Breadth-first, depth-first,
beam-search, hill climbing, A star, various variants of each)
- ``pynlpl.statistics`` - Frequency lists, Levenshtein, common statistics and
information theory functions
- ``pynlpl.textprocessors`` - Simple tokeniser, n-gram extraction
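As a quick illustration, the sketch below combines two of these modules for
n-gram extraction and frequency counting. This is a minimal sketch, not
authoritative documentation; exact method signatures may differ slightly
between versions.

.. code:: python

    from pynlpl.statistics import FrequencyList
    from pynlpl.textprocessors import Windower

    tokens = "this is a test . this is another test .".split()

    freqlist = FrequencyList()
    for bigram in Windower(tokens, 2):  # yields n-grams, padded with <begin>/<end> markers
        freqlist.count(bigram)

    for bigram, count in freqlist:      # iterates over (type, count) pairs
        print(" ".join(bigram), count)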
Installation
--------------------
Download and install the latest stable version directly from the Python Package
Index with ``pip install pynlpl`` (or ``pip3`` for Python 3 on most
systems). For global installations prepend ``sudo``.
Alternatively, clone this repository and run ``python setup.py install`` (or
``python3 setup.py install`` for Python 3 on most systems). Prepend ``sudo`` for
global installations.
This software may also be found in certain Linux distributions; the latest
versions of Debian/Ubuntu ship it as ``python-pynlpl`` and ``python3-pynlpl``.
PyNLPl is also included in our `LaMachine <https://proycon.github.io/LaMachine>`_ distribution.
Documentation
--------------------
API Documentation can be found `here <http://pynlpl.readthedocs.io>`__.
PyNLPl-1.2.9/pynlpl/__init__.py

"""PyNLPl, pronounced as "pineapple", is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used, for example, for the computation of n-grams, frequency lists and distributions, and language models. There are also more complex data types, such as Priority Queues, and search algorithms, such as Beam Search.

The library is divided into several packages and modules. It is designed for Python 2.6 and upwards, including Python 3."""
VERSION = "1.2.9"
PyNLPl-1.2.9/pynlpl/algorithms.py
###############################################################
# PyNLPl - Algorithms
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Licensed under GPLv3
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
def sum_to_n(n, size, limit=None): #from http://stackoverflow.com/questions/2065553/python-get-all-numbers-that-add-up-to-a-number
    """Produce all lists of `size` positive integers in decreasing order
    that add up to `n`."""
    if size == 1:
        yield [n]
        return
    if limit is None:
        limit = n
    start = (n + size - 1) // size
    stop = min(limit, n - size + 1) + 1
    for i in range(start, stop):
        for tail in sum_to_n(n - i, size - 1, i):
            yield [i] + tail
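
# Example (added for illustration; output verified by hand):
#   list(sum_to_n(5, 2)) == [[3, 2], [4, 1]]
# i.e. every way to write 5 as a sum of two positive integers, largest part first.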
def consecutivegaps(n, leftmargin = 0, rightmargin = 0):
    """Compute all possible single consecutive gaps in any sequence of the specified length. Returns
    (beginindex, length) tuples. Runs in O(n(n+1) / 2) time. Argument is the length of the sequence rather than the sequence itself"""
    begin = leftmargin
    while begin < n:
        length = (n - rightmargin) - begin
        while length > 0:
            yield (begin, length)
            length -= 1
        begin += 1
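
# Example (added for illustration; output verified by hand):
#   list(consecutivegaps(3)) == [(0, 3), (0, 2), (0, 1), (1, 2), (1, 1), (2, 1)]
# i.e. every contiguous (begin, length) gap in a sequence of length 3.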
def possiblesplits(n, minsplits=2, maxsplits=0):
    """Returns lists of (index,length) tuples, representing all possible splits of a sequence of length n."""
    if not maxsplits:
        maxsplits = n
    for nrsplits in range(minsplits, maxsplits + 1):
        for split in sum_to_n(n, nrsplits):
            split_with_indices = []
            begin = 0
            for length in split:
                split_with_indices.append( (begin, length) )
                begin += length
            yield split_with_indices
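
# Example (added for illustration; output verified by hand):
#   list(possiblesplits(3)) == [[(0, 2), (2, 1)], [(0, 1), (1, 1), (2, 1)]]
# Note that because sum_to_n() yields partitions in decreasing order, only
# splits with non-increasing segment lengths are produced; e.g. the
# [(0, 1), (1, 2)] split of a length-3 sequence is not among the results.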
def bytesize(n):
    """Return the required size in bytes to encode the specified integer"""
    for i in range(1, 1000):
        if n < 2**(8*i):
            return i
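
# Example (added for illustration): bytesize(255) == 1, bytesize(256) == 2,
# and bytesize(2**32 - 1) == 4.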
PyNLPl-1.2.9/pynlpl/clients/__init__.py

"""This package contains clients for communicating with specific servers"""
PyNLPl-1.2.9/pynlpl/clients/cornetto.py

# -*- coding: utf-8 -*-
###############################################################
# PyNLPl - Remote Cornetto Client
# Adapted from code by Fons Laan (ILPS-ISLA, UvA)
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Licensed under GPLv3
#
# This is a Python library for connecting to a Cornetto database.
# Originally coded by Fons Laan (ILPS-ISLA, University of Amsterdam)
# for DutchSemCor.
#
# The library currently has only a minimal set of functionality compared
# to the original. It will be extended on a need-to basis.
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import sys
import httplib2 # version 0.6.0+
if sys.version < '3':
    import urlparse
    import httplib
else:
    from urllib import parse as urlparse # renamed to urllib.parse in Python 3.0
    import http.client as httplib #renamed in Python 3
import urllib, base64
from sys import stderr
#import pickle
printf = lambda x: sys.stdout.write(x+ "\n")
from lxml import etree
class CornettoClient(object):
def __init__(self, user='gast',password='gast',host='debvisdic.let.vu.nl', port=9002, path = '/doc', scheme='https',debug=False):
self.host = host
self.port = port
self.path = path
self.scheme = scheme
self.debug = debug
self.userid = user
self.passwd = password
def connect(self):
if self.debug:
printf( "cornettodb/views/remote_open()" )
# permission denied on cornetto with apache
# http = httplib2.Http( ".cache" )
try:
http = httplib2.Http(disable_ssl_certificate_validation=True)
except TypeError:
            print("[CornettoClient] WARNING: Older version of httplib2! Can not disable_ssl_certificate_validation", file=stderr)
http = httplib2.Http() #for older httplib2
# VU DEBVisDic authentication
http.add_credentials( self.userid, self.passwd )
params = ""
# query = "action=init" # obsolete
query = "action=connect"
fragment = ""
db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), self.path, params, query, fragment )
db_url = urlparse.urlunparse( db_url_tuple )
if self.debug:
printf( "db_url: %s" % db_url )
printf( "http.request()..." );
try:
resp, content = http.request( db_url, "GET" )
if self.debug:
printf( "resp:\n%s" % resp )
printf( "content:\n%s" % content )
        except Exception:
            printf( "...failed." )
# when CORNETTO_HOST is off-line, we do not have a response
resp = None
content = None
return http, resp, content
def get_syn_ids_by_lemma(self, lemma):
"""Returns a list of synset IDs based on a lemma"""
if not isinstance(lemma,unicode):
lemma = unicode(lemma,'utf-8')
http, resp, content = self.connect()
params = ""
fragment = ""
path = "cdb_syn"
if self.debug:
printf( "cornettodb/views/query_remote_syn_lemma: db_opt: %s" % path )
query_opt = "dict_search"
if self.debug:
printf( "cornettodb/views/query_remote_syn_lemma: query_opt: %s" % query_opt )
qdict = {}
qdict[ "action" ] = "queryList"
qdict[ "word" ] = lemma.encode('utf-8')
        query = urlencode( qdict )
db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), path, params, query, fragment )
db_url = urlparse.urlunparse( db_url_tuple )
if self.debug:
printf( "db_url: %s" % db_url )
resp, content = http.request( db_url, "GET" )
if self.debug:
printf( "resp:\n%s" % resp )
printf( "content:\n%s" % content )
# printf( "content is of type: %s" % type( content ) )
dict_list = []
dict_list = eval( content ) # string to list
synsets = []
items = len( dict_list )
if self.debug:
printf( "items: %d" % items )
# syn dict: like lu dict, but without pos: part-of-speech
        for d in dict_list:
            if self.debug:
                printf( d )
            seq_nr = d[ "seq_nr" ] # sense number
            value = d[ "value" ] # lexical unit identifier
            form = d[ "form" ] # lemma
            label = d[ "label" ] # label to be shown
            if self.debug:
                printf( "seq_nr: %s" % seq_nr )
                printf( "value: %s" % value )
                printf( "form: %s" % form )
                printf( "label: %s" % label )
            if value != "":
                synsets.append( value )
return synsets
def get_lu_ids_by_lemma(self, lemma, targetpos = None):
"""Returns a list of lexical unit IDs based on a lemma and a pos tag"""
if not isinstance(lemma,unicode):
lemma = unicode(lemma,'utf-8')
http, resp, content = self.connect()
params = ""
fragment = ""
path = "cdb_lu"
query_opt = "dict_search"
qdict = {}
qdict[ "action" ] = "queryList"
qdict[ "word" ] = lemma.encode('utf-8')
        query = urlencode( qdict )
db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), path, params, query, fragment )
db_url = urlparse.urlunparse( db_url_tuple )
if self.debug:
printf( "db_url: %s" % db_url )
resp, content = http.request( db_url, "GET" )
if self.debug:
printf( "resp:\n%s" % resp )
printf( "content:\n%s" % content )
# printf( "content is of type: %s" % type( content ) )
dict_list = []
dict_list = eval( content ) # string to list
ids = []
items = len( dict_list )
if self.debug:
printf( "items: %d" % items )
for d in dict_list:
if self.debug:
printf( d )
seq_nr = d[ "seq_nr" ] # sense number
value = d[ "value" ] # lexical unit identifier
form = d[ "form" ] # lemma
label = d[ "label" ] # label to be shown
pos = d[ "pos" ] # label to be shown
if self.debug:
printf( "seq_nr: %s" % seq_nr )
printf( "value: %s" % value )
printf( "form: %s" % form )
printf( "label: %s" % label )
if value != "" and ((not targetpos) or (targetpos and pos == targetpos)):
ids.append( value )
return ids
def get_synset_xml(self,syn_id):
"""
call cdb_syn with synset identifier -> returns the synset xml;
"""
http, resp, content = self.connect()
params = ""
fragment = ""
path = "cdb_syn"
if self.debug:
printf( "cornettodb/views/query_remote_syn_id: db_opt: %s" % path )
# output_opt: plain, html, xml
# 'xml' is actually xhtml (with markup), but it is not valid xml!
# 'plain' is actually valid xml (without markup)
output_opt = "plain"
if self.debug:
printf( "cornettodb/views/query_remote_syn_id: output_opt: %s" % output_opt )
action = "runQuery"
if self.debug:
printf( "cornettodb/views/query_remote_syn_id: action: %s" % action )
printf( "cornettodb/views/query_remote_syn_id: query: %s" % syn_id )
qdict = {}
qdict[ "action" ] = action
qdict[ "query" ] = syn_id
qdict[ "outtype" ] = output_opt
        query = urlencode( qdict )
db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), path, params, query, fragment )
db_url = urlparse.urlunparse( db_url_tuple )
if self.debug:
printf( "db_url: %s" % db_url )
resp, content = http.request( db_url, "GET" )
if self.debug:
printf( "resp:\n%s" % resp )
# printf( "content:\n%s" % content )
# printf( "content is of type: %s" % type( content ) ) #
xml_data = eval( content )
return etree.fromstring( xml_data )
def get_lus_from_synset(self, syn_id):
"""Returns a list of (word, lu_id) tuples given a synset ID"""
root = self.get_synset_xml(syn_id)
elem_synonyms = root.find( ".//synonyms" )
lus = []
for elem_synonym in elem_synonyms:
synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute
# synonym_str ends with ":"
synonym = synonym_str.split( ':' )[ 0 ].strip()
lus.append( (synonym, elem_synonym.get( "c_lu_id") ) )
return lus
def get_lu_from_synset(self, syn_id, lemma = None):
"""Returns (lu_id, synonyms=[(word, lu_id)] ) tuple given a synset ID and a lemma"""
if not lemma:
return self.get_lus_from_synset(syn_id) #alias
if not isinstance(lemma,unicode):
lemma = unicode(lemma,'utf-8')
root = self.get_synset_xml(syn_id)
elem_synonyms = root.find( ".//synonyms" )
lu_id = None
synonyms = []
for elem_synonym in elem_synonyms:
synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute
# synonym_str ends with ":"
synonym = synonym_str.split( ':' )[ 0 ].strip()
if synonym != lemma:
synonyms.append( (synonym, elem_synonym.get("c_lu_id")) )
if self.debug:
printf( "synonym add: %s" % synonym )
else:
lu_id = elem_synonym.get( "c_lu_id" ) # get "c_lu_id" attribute
if self.debug:
printf( "lu_id: %s" % lu_id )
printf( "synonym skip lemma: %s" % synonym )
return lu_id, synonyms
##################################################################################################################
# ORIGINAL AND AS-OF-YET UNUSED CODE (included for later porting)
##################################################################################################################
"""
--------------------------------------------------------------------------------
Original Author: Fons Laan, ILPS-ISLA, University of Amsterdam
Original Project: DutchSemCor
Original Name: cornettodb/views.py
Original Version: 0.2
Goal: Cornetto views definitions
Original functions:
index( request )
local_open()
remote_open( debug )
search( request )
search_local( dict_in, search_query )
search_remote( dict_in, search_query )
cornet_check_lusyn( utf8_lemma )
query_remote_lusyn_id( syn_id_debug, http, utf8_lemma, syn_id )
query_cornet( keyword, category )
query_remote_syn_lemma( debug, http, utf8_lemma )
query_remote_syn_id( debug, http, utf8_lemma, syn_id, domains_abbrev )
query_remote_lu_lemma( debug, http, utf8_lemma )
query_remote_lu_id( debug, http, lu_id )
FL-04-Sep-2009: Created
FL-03-Nov-2009: Removed http global: sometimes it was None; missed initialization?
FL-01-Feb-2010: Added Category filtering
FL-15-Feb-2010: Tag counts -> separate qx query
FL-07-Apr-2010: Merge canonical + textual examples
FL-10-Jun-2010: Latest Change
MvG-29-Sep-2010: Turned into minimal CornettoClient class, some new functions added, many old ones disabled until necessary
"""
# def query_remote(self, dict_in, search_query ):
# if self.debug: printf( "cornettodb/views/query_remote" )
# http, resp, content = self.remote_open()
# if resp is None:
# raise Exception("No response")
# status = int( resp.get( "status" ) )
# if self.debug: printf( "status: %d" % status )
# if status != 200:
# # e.g. 400: Bad Request, 404: Not Found
# raise Exception("Error in request")
# path = dict_in[ 'dbopt' ]
# if self.debug: printf( "cornettodb/views/query_remote: db_opt: %s" % path )
# output_opt = dict_in[ 'outputopt' ]
# if self.debug: printf( "cornettodb/views/query_remote: output_opt: %s" % output_opt )
# query_opt = dict_in[ 'queryopt' ]
# if self.debug: printf( "cornettodb/views/query_remote: query_opt: %s" % query_opt )
# params = ""
# fragment = ""
# qdict = {}
# if query_opt == "dict_search":
# # query = "action=queryList&word=" + search_query
# qdict[ "action" ] = "queryList"
# qdict[ "word" ] = search_query
# elif query_opt == "query_entry":
# # query = "action=runQuery&query=" + search_query
# # query += "&outtype=" + output_opt
# qdict[ "action" ] = "runQuery"
# qdict[ "query" ] = search_query
# qdict[ "outtype" ] = output_opt
# # instead of "subtree" there is also "tree" and "full subtree"
# elif query_opt == "subtree_entry":
# # query = "action=subtree&query=" + search_query
# # query += "&arg=ILR" # ILR = Internal Language Relations, RILR = Reversed ...
# # query += "&outtype=" + output_opt
# qdict[ "action" ] = "subtree"
# qdict[ "query" ] = search_query
# qdict[ "arg" ] = "ILR" # ILR = Internal Language Relations, RILR = Reversed ...
# qdict[ "outtype" ] = output_opt
# # More functions, see DEBVisDic docu:
# # Save entry
# # Delete entry
# # Next sense number
# # "Translate" synsets
# query = urllib.urlencode( qdict )
# db_url_tuple = ( self.scheme, self.host+ ':' + str(self.post), self.path, params, query, fragment )
# db_url = urlparse.urlunparse( db_url_tuple )
# if self.debug: printf( "db_url: %s" % db_url )
# resp, content = http.request( db_url, "GET" )
# printf( "resp:\n%s" % resp )
# if self.debug: printf( "content:\n%s" % content )
# if content.startswith( '[' ) and content.endswith( ']' ):
# reply = eval( content ) # string -> list
# islist = True
# else:
# reply = content
# islist = False
# return reply
# def cornet_check_lusyn( self, utf8_lemma ):
# http, resp, content = remote_open( self.debug )
# # get the raw (unfiltered) lexical unit identifiers for this lemma
# lu_ids_lemma = query_remote_lu_lemma( http, utf8_lemma )
# # get the synset identifiers for this lemma
# syn_ids_lemma = query_remote_syn_lemma( http, utf8_lemma )
# lu_ids_syn = []
# for syn_id in syn_ids_lemma:
# lu_id = query_remote_lusyn_id( http, utf8_lemma, syn_id )
# lu_ids_syn.append( lu_id )
# return lu_ids_lemma, syn_ids_lemma, lu_ids_syn
# def query_remote_lusyn_id( http, utf8_lemma, syn_id ):
# """
# query_remote_lusyn_id\
# call cdb_syn with synset identifier -> synset xml -> lu_id lemma
# """
# scheme = settings.CORNETTO_PROTOCOL
# netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT )
# params = ""
# fragment = ""
# path = "cdb_syn"
# if self.debug:
# printf( "cornettodb/views/query_remote_lusyn_id: db_opt: %s" % path )
# # output_opt: plain, html, xml
# # 'xml' is actually xhtml (with markup), but it is not valid xml!
# # 'plain' is actually valid xml (without markup)
# output_opt = "plain"
# if self.debug:
# printf( "cornettodb/views/query_remote_lusyn_id: output_opt: %s" % output_opt )
# action = "runQuery"
# if self.debug:
# printf( "cornettodb/views/query_remote_lusyn_id: action: %s" % action )
# printf( "cornettodb/views/query_remote_lusyn_id: query: %s" % syn_id )
#
# qdict = {}
# qdict[ "action" ] = action
# qdict[ "query" ] = syn_id
# qdict[ "outtype" ] = output_opt
# query = urllib.urlencode( qdict )
# db_url_tuple = ( scheme, netloc, path, params, query, fragment )
# db_url = urlparse.urlunparse( db_url_tuple )
# if self.debug:
# printf( "db_url: %s" % db_url )
# resp, content = http.request( db_url, "GET" )
# if self.debug:
# printf( "resp:\n%s" % resp )
# # printf( "content:\n%s" % content )
# # printf( "content is of type: %s" % type( content ) ) #
# xml_data = eval( content )
# root = etree.fromstring( xml_data )
# synonyms = []
# # find anywhere in the tree
# elem_synonyms = root.find( ".//synonyms" )
# for elem_synonym in elem_synonyms:
# synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute
# # synonym_str ends with ":"
# synonym = synonym_str.split( ':' )[ 0 ].strip()
# utf8_synonym = synonym.encode( 'utf-8' )
# if utf8_synonym != utf8_lemma:
# synonyms.append( synonym )
# if self.debug:
# printf( "synonym add: %s" % synonym )
# else:
# lu_id = elem_synonym.get( "c_lu_id" ) # get "c_lu_id" attribute
# if self.debug:
# printf( "lu_id: %s" % lu_id )
# printf( "synonym skip lemma: %s" % synonym )
# return lu_id
# def query_cornet( annotator_id, utf8_lemma, category ):
# """\
# cornet_query()
# A variant of query_remote(), combining several queries for the dutchsemcor GUI
# -1- call cdb_syn with lemma -> syn_ids;
# -2- for each syn_id, call cdb_syn ->synset xml
# -3- for each synset xml, find lu_id
# -4- for each lu_id, call cdb_lu ->lu xml
# -5- collect required info from lu & syn xml
# """
# self.debug = False # this function
# printf( "cornettodb/views/cornet_query()" )
# if utf8_lemma is None or utf8_lemma == "":
# printf( "No lemma" )
# return
# else:
# printf( "lemma: %s" % utf8_lemma.decode( 'utf-8' ).encode( 'latin-1' ) )
# printf( "category: %s" % category )
# http, resp, content = remote_open( self.debug )
# if resp is None:
# template = "cornettodb/error.html"
# dictionary = { 'DSC_HOME' : settings.DSC_HOME }
# return template, dictionary
# status = int( resp.get( "status" ) )
# printf( "status: %d" % status )
# if status != 200:
# # e.g. 400: Bad Request, 404: Not Found
# printf( "status: %d\nreason: %s" % ( resp.status, resp.reason ) )
# dict_err = \
# {
# "status" : settings.CORNETTO_HOST + " error: " + str(status),
# "msg" : resp.reason
# }
# return dict_err
# # read the domain cvs file, and return the dictionaries
# domains_dutch, domains_abbrev = get_domains()
# syn_ids = [] # used syn_ids, skipping filtered
# lu_ids = [] # used lu_ids, skipping filtered
# lu_ids_syn = [] # lu_ids derived from syn_ids, unfiltered
# # get the raw (unfiltered) synset identifiers for this lemma
# syn_lemma_debug = False
# syn_ids_raw = query_remote_syn_lemma( syn_lemma_debug, http, utf8_lemma )
# # get the raw (unfiltered) lexical unit identifiers for this lemma
# lu_lemma_debug = False
# lu_ids_raw = query_remote_lu_lemma( lu_lemma_debug, http, utf8_lemma )
# # required lu info from the lu xml:
# resumes_lu = []
# morphos_lu = []
# examplestext_lulist = [] # list-of-lists
# examplestype_lulist = [] # list-of-lists
# examplessubtype_lulist = [] # list-of-lists
# # required syn info from the synset xml:
# definitions_syn = [] # list
# differentiaes_syn = [] # list
# synonyms_synlist = [] # list-of-lists
# relations_synlist = [] # list-of-lists
# hyperonyms_synlist = [] # list-of-lists
# hyponyms_synlist = [] # list-of-lists
# relations_synlist = [] # list-of-lists
# relnames_synlist = [] # list-of-lists
# domains_synlist = [] # list-of-lists
# remained = 0 # maybe less than lu_ids because of category filtering
# for syn_id in syn_ids_raw:
# if self.debug:
# printf( "syn_id: %s" % syn_id )
# syn_id_debug = False
# lu_id, definition, differentiae, synonyms, hyperonyms, hyponyms, relations, relnames, domains = \
# query_remote_syn_id( syn_id_debug, http, utf8_lemma, syn_id, domains_abbrev )
# lu_ids_syn.append( lu_id )
# lu_id_debug = False
# if self.debug:
# printf( "lu_id: %s" % lu_id )
# formcat, morpho, resume, examples_text, examples_type, examples_subtype = \
# query_remote_lu_id( lu_id_debug, http, lu_id )
# if not ( \
# ( category == '?' ) or \
# ( category == 'a' and formcat == 'adj' ) or \
# ( category == 'n' and formcat == 'noun' ) or \
# ( category == 'v' and formcat == 'verb' ) ):
# if self.debug:
# printf( "filtered category: formcat=%s, lu_id=%s" % (formcat, lu_id) )
# continue
# # collect all information
# syn_ids.append( syn_id )
# lu_ids.append( lu_id )
# definitions_syn.append( definition )
# differentiaes_syn.append( differentiae )
# synonyms_synlist.append( synonyms )
# hyperonyms_synlist.append( hyperonyms )
# relations_synlist.append( relations )
# relnames_synlist.append( relnames )
# hyponyms_synlist.append( hyponyms )
# domains_synlist.append( domains )
# resumes_lu.append( resume )
# morphos_lu.append( morpho )
# examplestext_lulist.append( examples_text )
# examplestype_lulist.append( examples_type )
# examplessubtype_lulist.append( examples_subtype )
# if self.debug:
# printf( "morpho: %s\nresume: %s\nexamples:" % (morpho, resume) )
# for canoexample in canoexamples:
# printf( canoexample.encode('latin-1') ) # otherwise fails with non-ascii chars
# for textexample in textexamples:
# printf( textexample.encode('latin-1') ) # otherwise fails with non-ascii chars
# lusyn_mismatch = False # assume no problem
# # Compare number of lu ids with syn_ids
# if len( lu_ids_raw ) != len( syn_ids_raw): # length mismatch
# lusyn_mismatch = True
# printf( "query_cornet: %d lu ids, %d syn ids: NO MATCH" % (len(lu_ids_raw), len(syn_ids_raw) ) )
# # Check lu_ids from syn to lu_ids_raw (from lemma)
# for i in range( len( lu_ids_raw ) ):
# lu_id_raw = lu_ids_raw[ i ]
# try:
# idx = lu_ids_syn.index( lu_id_raw )
# if lu_ids_syn.count( lu_id_raw ) != 1:
# lusyn_mismatch = True
# printf( "query_cornet: %s lu id: DUPLICATES" % lu_id_raw )
# except:
# lusyn_mismatch = True
# printf( "query_cornet: %s lu id: NOT FOUND" % lu_id_raw )
# dictlist = []
# for i in range( len( syn_ids ) ):
# # printf( "i: %d" % i )
# dict = {}
# dict[ "no" ] = i
# lu_id = lu_ids[ i ]
# dict[ "lu_id" ] = lu_id
# syn_id = syn_ids[ i ]
# dict[ "syn_id" ] = syn_id
# dict[ "tag_count" ] = '?'
# resume = resumes_lu[ i ]
# dict[ "resume" ] = resume
# morpho = morphos_lu[ i ]
# dict[ "morpho" ] = morpho
# examplestext = examplestext_lulist[ i ]
# dict[ "examplestext"] = examplestext
# examplestype = examplestype_lulist[ i ]
# dict[ "examplestype"] = examplestype
# examplessubtype = examplessubtype_lulist[ i ]
# dict[ "examplessubtype"] = examplessubtype
# definition = definitions_syn[ i ]
# dict[ "definition" ] = definition
# differentiae = differentiaes_syn[ i ]
# dict[ "differentiae" ] = differentiae
# synonyms = synonyms_synlist[ i ]
# dict[ "synonyms"] = synonyms
# hyperonyms = hyperonyms_synlist[ i ]
# dict[ "hyperonyms"] = hyperonyms
# hyponyms = hyponyms_synlist[ i ]
# dict[ "hyponyms"] = hyponyms
# relations = relations_synlist[ i ]
# dict[ "relations"] = relations
# relnames = relnames_synlist[ i ]
# dict[ "relnames"] = relnames
# domains = domains_synlist[ i ]
# dict[ "domains"] = domains
# dictlist.append( dict )
# # pack in "superdict"
# result = \
# {
# "status" : "ok",
# "source" : "cornetto",
# "lusyn_mismatch" : lusyn_mismatch,
# "lusyn_retrieved" : len( syn_ids_raw ),
# "lusyn_remained" : len( syn_ids ),
# "lists_data" : dictlist
# }
# return result
# def query_remote_lu_lemma( utf8_lemma ):
# """\
# call cdb_lu with lemma -> yields lexical units
# """
# scheme = settings.CORNETTO_PROTOCOL
# netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT )
# params = ""
# fragment = ""
# path = "cdb_lu"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_lemma: db_opt: %s" % path )
# action = "queryList"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_lemma: action: %s" % action )
# qdict = {}
# qdict[ "action" ] = action
# qdict[ "word" ] = utf8_lemma
# query = urllib.urlencode( qdict )
# db_url_tuple = ( scheme, netloc, path, params, query, fragment )
# db_url = urlparse.urlunparse( db_url_tuple )
# if self.debug:
# printf( "db_url: %s" % db_url )
# resp, content = http.request( db_url, "GET" )
# if self.debug:
# printf( "resp:\n%s" % resp )
# printf( "content:\n%s" % content )
# # printf( "content is of type: %s" % type( content ) )
# dict_list = []
# dict_list = eval( content ) # string to list
# ids = []
# items = len( dict_list )
# if self.debug:
# printf( "items: %d" % items )
# # lu dict: like syn dict, but with pos: part-of-speech
# for dict in dict_list:
# if self.debug:
# printf( dict )
# seq_nr = dict[ "seq_nr" ] # sense number
# value = dict[ "value" ] # lexical unit identifier
# form = dict[ "form" ] # lemma
# pos = dict[ "pos" ] # part of speech
# label = dict[ "label" ] # label to be shown
# if self.debug:
# printf( "seq_nr: %s" % seq_nr )
# printf( "value: %s" % value )
# printf( "form: %s" % form )
# printf( "pos: %s" % pos )
# printf( "label: %s" % label )
# if value != "":
# ids.append( value )
# return ids
# def lemma2formcats( utf8_lemma ):
# """\
# get the form-cats for this lemma.
# """
# self.debug = False
# http, resp, content = remote_open( self.debug )
# if resp is None:
# template = "cornettodb/error.html"
# dictionary = { 'DSC_HOME' : settings.DSC_HOME }
# return template, dictionary
# status = int( resp.get( "status" ) )
# if status != 200:
# # e.g. 400: Bad Request, 404: Not Found
# printf( "status: %d\nreason: %s" % ( resp.status, resp.reason ) )
# template = "cornettodb/error.html"
# message = "Cornetto " + _( "initialization" )
# dict = \
# {
# 'DSC_HOME': settings.DSC_HOME,
# 'message': message,
# 'status': resp.status,
# 'reason': resp.reason, \
# }
# return template, dictionary
# # get the lexical unit identifiers for this lemma
# lu_ids = query_remote_lu_lemma( self.debug, http, utf8_lemma )
# scheme = settings.CORNETTO_PROTOCOL
# netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT )
# params = ""
# fragment = ""
# path = "cdb_lu"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_id_formcat: db_opt: %s" % path )
# output_opt = "plain"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_id_formcat: output_opt: %s" % output_opt )
# action = "runQuery"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_id_formcat: action: %s" % action )
# formcats = []
# for lu_id in lu_ids:
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_id_formcat: query: %s" % lu_id )
# qdict = {}
# qdict[ "action" ] = action
# qdict[ "query" ] = lu_id
# qdict[ "outtype" ] = output_opt
# query = urllib.urlencode( qdict )
# db_url_tuple = ( scheme, netloc, path, params, query, fragment )
# db_url = urlparse.urlunparse( db_url_tuple )
# if self.debug:
# printf( "db_url: %s" % db_url )
# resp, content = http.request( db_url, "GET" )
# if self.debug:
# printf( "resp:\n%s" % resp )
# xml_data = eval( content )
# root = etree.fromstring( xml_data )
# # morpho
# morpho = ""
# elem_form = root.find( ".//form" )
# if elem_form is not None:
# formcat = elem_form.get( "form-cat" ) # get "form-cat" attribute
# if formcat is None:
# formcat = '?'
# count = formcats.count( formcat )
# if count == 0:
# formcats.append( formcat )
# return formcats
# def query_remote_lu_id(lu_id ):
# """\
# call cdb_lu with lexical unit identifier -> yields the lexical unit xml;
# from the xml collect the morpho-syntax, resumes+definitions, examples.
# """
# scheme = settings.CORNETTO_PROTOCOL
# netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT )
# params = ""
# fragment = ""
# path = "cdb_lu"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_id: db_opt: %s" % path )
# # output_opt: plain, html, xml
# # 'xml' is actually xhtml (with markup), but it is not valid xml!
# # 'plain' is actually valid xml (without markup)
# output_opt = "plain"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_id: output_opt: %s" % output_opt )
# action = "runQuery"
# if self.debug:
# printf( "cornettodb/views/query_remote_lu_id: action: %s" % action )
# printf( "cornettodb/views/query_remote_lu_id: query: %s" % lu_id )
#
# qdict = {}
# qdict[ "action" ] = action
# qdict[ "query" ] = lu_id
# qdict[ "outtype" ] = output_opt
# query = urllib.urlencode( qdict )
# db_url_tuple = ( scheme, netloc, path, params, query, fragment )
# db_url = urlparse.urlunparse( db_url_tuple )
# if self.debug:
# printf( "db_url: %s" % db_url )
# resp, content = http.request( db_url, "GET" )
# if self.debug:
# printf( "resp:\n%s" % resp )
# # printf( "content:\n%s" % content )
# # printf( "content is of type: %s" % type( content ) ) #
# xml_data = eval( content )
# root = etree.fromstring( xml_data )
# # morpho
# morpho = ""
# elem_form = root.find( ".//form" )
# if elem_form is not None:
# formcat = elem_form.get( "form-cat" ) # get "form-cat" attribute
# if formcat is not None:
# if formcat == "adj":
# morpho = 'a'
# elif formcat == "noun":
# morpho = 'n'
# elem_article = root.find( ".//sy-article" )
# if elem_article is not None and elem_article.text is not None:
# article = elem_article.text # lidwoord
# morpho += "-" + article
# elem_count = root.find( ".//sem-countability" )
# if elem_count is not None and elem_count.text is not None:
# countability = elem_count.text
# if countability == "count":
# morpho += "-t"
# elif countability == "uncount":
# morpho += "-nt"
# elif formcat == "verb":
# morpho = 'v'
# elem_trans = root.find( ".//sy-trans" )
# if elem_trans is not None and elem_trans.text is not None:
# transitivity = elem_trans.text
# if transitivity == "tran":
# morpho += "-tr"
# elif transitivity == "intr":
# morpho += "-intr"
# else: # should not occur
# morpho += "-"
# morpho += transitivity
# elem_separ = root.find( ".//sy-separ" )
# if elem_separ is not None and elem_separ.text is not None:
# separability = elem_separ.text
# if separability == "sch":
# morpho += "-sch"
# elif separability == "onsch":
# morpho += "-onsch"
# else: # should not occur
# morpho += "-"
# morpho += separability
# elem_reflexiv = root.find( ".//sy-reflexiv" )
# if elem_reflexiv is not None and elem_reflexiv.text is not None:
# reflexivity = elem_reflexiv.text
# if reflexivity == "refl":
# morpho += "-refl"
# elif reflexivity == "nrefl":
# morpho += "-nrefl"
# else: # should not occur
# morpho += "-"
# morpho += reflexivity
# elif formcat == "adverb":
# morpho = 'd'
# else:
# morpho = '?'
# # find anywhere in the tree
# elem_resume = root.find( ".//sem-resume" )
# if elem_resume is not None:
# resume = elem_resume.text
# else:
# resume = ""
# examples_text = []
# examples_type = []
# examples_subtype = []
# # find anywhere in the tree
# examples = root.findall( ".//example" )
# for example in examples:
# example_id = example.get( "r_ex_id" )
# elem_type = example.find( "syntax_example/sy-type" )
# if elem_type is not None:
# type_text = elem_type.text
# if type_text is None:
# type_text = ""
# else:
# type_text = ""
# elem_subtype = example.find( "syntax_example/sy-subtype" )
# if elem_subtype is not None:
# subtype_text = elem_subtype.text
# if subtype_text is None:
# subtype_text = ""
# else:
# subtype_text = ""
# # there can be a canonical and/or textual example,
# # they share the type and subtype
# elem_canonical = example.find( "form_example/canonicalform" ) # find child
# if elem_canonical is not None and elem_canonical.text is not None:
# example_text = elem_canonical.text
# example_out = example_text.encode( "iso-8859-1", "replace" )
# if self.debug:
# printf( "subtype, r_ex_id: %s: %s" % ( example_id, example_out ) )
# if subtype_text != "idiom":
# examples_text.append( example_text )
# examples_type.append( type_text )
# examples_subtype.append( subtype_text )
# else:
# if self.debug:
# printf( "filter idiom: %s" % example_out)
#
# elem_textual = example.find( "form_example/textualform" ) # find child
# if elem_textual is not None and elem_textual.text is not None:
# example_text = elem_textual.text
# example_out = example_text.encode( "iso-8859-1", "replace" )
# if self.debug:
# printf( "subtype r_ex_id: %s: %s" % ( example_id, example_out ) )
# if subtype_text != "idiom":
# examples_text.append( example_text )
# examples_type.append( type_text )
# examples_subtype.append( subtype_text )
# else:
# if self.debug:
# printf( "filter idiom: %s" % example_out)
# return formcat, morpho, resume, examples_text, examples_type, examples_subtype
# def get_synset(self, syn_id, utf8_lemma, domains_abbrev ):
# """Parse synset data"""
# root = self.get_synset_xml(syn_id)
# synonyms = []
# # find anywhere in the tree
# elem_synonyms = root.find( ".//synonyms" )
# for elem_synonym in elem_synonyms:
# synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute
# # synonym_str ends with ":"
# synonym = synonym_str.split( ':' )[ 0 ].strip()
# utf8_synonym = synonym.encode( 'utf-8' )
# if utf8_synonym != utf8_lemma:
# synonyms.append( synonym )
# if self.debug:
# printf( "synonym add: %s" % synonym )
# else:
# lu_id = elem_synonym.get( "c_lu_id" ) # get "c_lu_id" attribute
# if self.debug:
# printf( "lu_id: %s" % lu_id )
# printf( "synonym skip lemma: %s" % synonym )
# definition = ""
# elem_definition = root.find( ".//definition" )
# if elem_definition is not None and elem_definition.text is not None:
# definition = elem_definition.text
# differentiae = ""
# elem_differentiae = root.find( "./differentiae/" )
# if elem_differentiae is not None and elem_differentiae.text is not None:
# differentiae = elem_differentiae.text
# if self.debug:
# print( "definition: %s" % definition.encode( 'utf-8' ) )
# print( "differentiae: %s" % differentiae.encode( 'utf-8' ) )
# hyperonyms = []
# hyponyms = []
# relations_all = []
# relnames_all = []
# # find internal anywhere in the tree
# elem_intrelations = root.find( ".//wn_internal_relations" )
# for elem_relation in elem_intrelations:
# relations = []
# relation_str = elem_relation.get( "target-previewtext" ) # get "target-previewtext" attribute
# name = elem_relation.get( "relation_name" )
# target = elem_relation.get( "targer" )
# relation_list = relation_str.split( ',' )
# for relation_str in relation_list:
# relation = relation_str.split( ':' )[ 0 ].strip()
# relations.append( relation )
# relations_all.append( relation )
# relnames_all.append( name )
# if name == "HAS_HYPERONYM":
# if self.debug:
# printf( "target: %s" % target )
# hyperonyms.append( relations )
# elif name == "HAS_HYPONYM":
# if self.debug:
# printf( "target: %s" % target )
# hyponyms.append( relations )
# # we could keep the relation sub-lists separate on the basis of their "target" attribute
# # but for now we flatten the lists
# hyperonyms = flatten( hyperonyms )
# hyponyms = flatten( hyponyms )
# if self.debug:
# printf( "hyperonyms: %s" % hyperonyms )
# printf( "hyponyms: %s" % hyponyms )
# domains = []
# # find anywhere in the tree
# wn_domains = root.find( ".//wn_domains" )
# for dom_relation in wn_domains:
# domains_en = dom_relation.get( "term" ) # get "term" attribute
# if self.debug:
# if domains_en:
# printf( "domain: %s" % domains_en )
#
# # use dutch domain name[s], abbreviated
# domain_list = domains_en.split( ' ' )
# for domain_en in domain_list:
# try:
# domain_nl = domains_abbrev[ domain_en ]
# if domain_nl.endswith( '.' ): # remove trailing '.'
# domain_nl = domain_nl[ : -1] # remove last character
# except:
# printf( "failed to convert domain: %s" % domain_en )
# domain_nl = domain_en
# if domains.count( domain_nl ) == 0: # append if new
# domains.append( domain_nl )
# return lu_id, definition, differentiae, synonyms, hyperonyms, hyponyms, relations_all, relnames_all, domains
PyNLPl-1.2.9/pynlpl/clients/freeling.py 0000664 0001750 0000144 00000007556 12201265173 020653 0 ustar proycon users 0000000 0000000 ###############################################################
# PyNLPl - FreeLing Library
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Radboud University Nijmegen
#
# Licensed under GPLv3
#
# This is a Python library for on-the-fly communication with
# a FreeLing server, allowing on-the-fly lemmatisation and
# PoS-tagging. It is recommended to pass your data on a
# sentence-by-sentence basis to FreeLingClient.process()
#
# Make sure to start the FreeLing analyzer with the --server
# and --flush flags!
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u
import socket
import sys
class FreeLingClient(object):
    def __init__(self, host, port, encoding='utf-8', timeout=120.0):
        """Initialise the client and connect to a FreeLing server listening on the specified host and port"""
        self.encoding = encoding
        self.BUFSIZE = 10240
        self.socket = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
        self.socket.settimeout(timeout)
        self.socket.connect( (host,int(port)) )
        self.socket.sendall(b'RESET_STATS\0') #send as bytes, so this also works on Python 3
        r = self.socket.recv(self.BUFSIZE)
        if r.strip(b'\0') != b'FL-SERVER-READY':
            raise Exception("Server not ready")
def process(self, sourcewords, debug=False):
"""Process a list of words, passing it to the server and realigning the output with the original words"""
if isinstance( sourcewords, list ) or isinstance( sourcewords, tuple ):
sourcewords_s = " ".join(sourcewords)
else:
sourcewords_s = sourcewords
sourcewords = sourcewords.split(' ')
        self.socket.sendall(sourcewords_s.encode(self.encoding) + b'\n\0')
if debug: print("Sent:",sourcewords_s.encode(self.encoding),file=sys.stderr)
results = []
done = False
while not done:
data = b""
while not data:
buffer = self.socket.recv(self.BUFSIZE)
if debug: print("Buffer: ["+repr(buffer)+"]",file=sys.stderr)
                if buffer[-1:] == b'\0': #compare a one-byte slice so this works on both Python 2 and 3
data += buffer[:-1]
done = True
break
else:
data += buffer
data = u(data,self.encoding)
if debug: print("Received:",data,file=sys.stderr)
for i, line in enumerate(data.strip(' \t\0\r\n').split('\n')):
if not line.strip():
done = True
break
else:
cols = line.split(" ")
subwords = cols[0].lower().split("_")
if len(cols) > 2: #this seems a bit odd?
for word in subwords: #split multiword expressions
results.append( (word, cols[1], cols[2], i, len(subwords) > 1 ) ) #word, lemma, pos, index, multiword?
sourcewords = [ w.lower() for w in sourcewords ]
alignment = []
for i, sourceword in enumerate(sourcewords):
found = False
best = 0
distance = 999999
for j, (targetword, lemma, pos, index, multiword) in enumerate(results):
if sourceword == targetword and abs(i-j) < distance:
found = True
best = j
distance = abs(i-j)
if found:
alignment.append(results[best])
else:
alignment.append((None,None,None,None,False)) #no alignment found
return alignment
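#Usage sketch, assuming a FreeLing analyzer was started in server mode
#(with --server --flush) on a hypothetical port 50005:
#
#   client = FreeLingClient('localhost', 50005)
#   for word, lemma, pos, index, multiword in client.process('el gato negro'.split(' ')):
#       print(word, lemma, pos)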
PyNLPl-1.2.9/pynlpl/clients/frogclient.py 0000644 0001750 0000144 00000012631 13271635402 021204 0 ustar proycon users 0000000 0000000 ###############################################################
# PyNLPl - Frog Client - Version 1.4.1
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Derived from code by Rogier Kraf
#
# Licensed under GPLv3
#
# This is a Python library for on-the-fly communication with
# a Frog/Tadpole Server. Allowing on-the-fly lemmatisation and
# PoS-tagging. It is recommended to pass your data on a
# sentence-by-sentence basis to FrogClient.process()
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u
import socket
class FrogClient:
def __init__(self,host="localhost",port=12345, server_encoding="utf-8", returnall=False, timeout=120.0):
"""Create a client connecting to a Frog or Tadpole server."""
self.BUFSIZE = 4096
self.socket = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
self.socket.settimeout(timeout)
self.socket.connect( (host,int(port)) )
self.server_encoding = server_encoding
self.returnall = returnall
def process(self,input_data, source_encoding="utf-8", return_unicode = True, oldfrog=False):
"""Receives input_data in the form of a str or unicode object, passes this to the server, with proper consideration for the encodings, and returns the Frog output as a list of tuples: (word,pos,lemma,morphology), each of these is a proper unicode object unless return_unicode is set to False, in which case raw strings will be returned. Return_unicode is no longer optional, it is fixed to True, parameter is still there only for backwards-compatibility."""
if isinstance(input_data, list) or isinstance(input_data, tuple):
input_data = " ".join(input_data)
input_data = u(input_data, source_encoding) #decode (or preferably do this in an earlier stage)
input_data = input_data.strip(' \t\n')
s = input_data.encode(self.server_encoding) +b'\r\n'
if not oldfrog: s += b'EOT\r\n'
self.socket.sendall(s) #send to socket in desired encoding
output = []
done = False
while not done:
data = b""
while not data.endswith(b'\n'):
moredata = self.socket.recv(self.BUFSIZE)
if not moredata: break
data += moredata
data = u(data,self.server_encoding)
for line in data.strip(' \t\r\n').split('\n'):
if line == "READY":
done = True
break
elif line:
line = line.split('\t') #split on tab
if len(line) > 4 and line[0].isdigit(): #first column is token number
if line[0] == '1' and output:
if self.returnall:
output.append( (None,None,None,None, None,None,None, None) )
else:
output.append( (None,None,None,None) )
fields = line[1:]
parse1=parse2=ner=chunk=""
word,lemma,morph,pos = fields[0:4]
if len(fields) > 5:
ner = fields[5]
if len(fields) > 6:
chunk = fields[6]
                        if len(fields) >= 9: #nine or more fields means both parse columns are present
parse1 = fields[7]
parse2 = fields[8]
if len(fields) < 5:
raise Exception("Can't process response line from Frog: ", repr(line), " got unexpected number of fields ", str(len(fields) + 1))
if self.returnall:
output.append( (word,lemma,morph,pos,ner,chunk,parse1,parse2) )
else:
output.append( (word,lemma,morph,pos) )
return output
def process_aligned(self,input_data, source_encoding="utf-8", return_unicode = True):
output = self.process(input_data, source_encoding, return_unicode)
outputwords = [ x[0] for x in output ]
inputwords = input_data.strip(' \t\n').split(' ')
alignment = self.align(inputwords, outputwords)
for i, _ in enumerate(inputwords):
targetindex = alignment[i]
            if targetindex is None:
if self.returnall:
yield (None,None,None,None,None,None,None,None)
else:
yield (None,None,None,None)
else:
yield output[targetindex]
def align(self,inputwords, outputwords):
"""For each inputword, provides the index of the outputword"""
alignment = []
cursor = 0
for inputword in inputwords:
if len(outputwords) > cursor and outputwords[cursor] == inputword:
alignment.append(cursor)
cursor += 1
elif len(outputwords) > cursor+1 and outputwords[cursor+1] == inputword:
alignment.append(cursor+1)
cursor += 2
else:
alignment.append(None)
cursor += 1
return alignment
def __del__(self):
self.socket.close()
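#Usage sketch, assuming a Frog server is listening on a hypothetical port 12345:
#
#   client = FrogClient('localhost', 12345)
#   for word, lemma, morph, pos in client.process("Dit is een test ."):
#       print(word, lemma, pos)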
PyNLPl-1.2.9/pynlpl/common.py 0000664 0001750 0000144 00000010304 12201265173 016670 0 ustar proycon users 0000000 0000000 #!/usr/bin/env python
#-*- coding:utf-8 -*-
###############################################################
# PyNLPl - Common functions
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Licensed under GPLv3
#
# This contains very common functions and language extensions
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import datetime
from sys import stderr, version
## From http://code.activestate.com/recipes/413486/ (r7)
def Enum(*names):
##assert names, "Empty enums are not supported" # <- Don't like empty enums? Uncomment!
class EnumClass(object):
__slots__ = names
def __iter__(self): return iter(constants)
def __len__(self): return len(constants)
def __getitem__(self, i): return constants[i]
def __repr__(self): return 'Enum' + str(names)
def __str__(self): return 'enum ' + str(constants)
class EnumValue(object):
__slots__ = ('__value')
def __init__(self, value): self.__value = value
Value = property(lambda self: self.__value)
EnumType = property(lambda self: EnumType)
def __hash__(self): return hash(self.__value)
def __cmp__(self, other):
# C fans might want to remove the following assertion
# to make all enums comparable by ordinal value {;))
assert self.EnumType is other.EnumType, "Only values from the same enum are comparable"
return cmp(self.__value, other.__value)
def __invert__(self): return constants[maximum - self.__value]
def __bool__(self): return bool(self.__value)
def __nonzero__(self): return bool(self.__value) #Python 2.x
def __repr__(self): return str(names[self.__value])
maximum = len(names) - 1
constants = [None] * len(names)
for i, each in enumerate(names):
val = EnumValue(i)
setattr(EnumClass, each, val)
constants[i] = val
constants = tuple(constants)
EnumType = EnumClass()
return EnumType
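#For example:
#   Colour = Enum('RED', 'GREEN', 'BLUE')
#   Colour.RED        #an EnumValue whose repr() is 'RED'
#   Colour[1]         #GREEN (lookup by ordinal value)
#   len(Colour)       #3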
def u(s, encoding = 'utf-8', errors='strict'):
    #ensure s is properly unicode; wrapper for Python 2.6/2.7
if version < '3':
#ensure the object is unicode
if isinstance(s, unicode):
return s
else:
return unicode(s, encoding,errors=errors)
else:
#will work on byte arrays
if isinstance(s, str):
return s
else:
return str(s,encoding,errors=errors)
def b(s):
#ensure s is bytestring
if version < '3':
#ensure the object is unicode
if isinstance(s, str):
return s
else:
return s.encode('utf-8')
else:
#will work on byte arrays
if isinstance(s, bytes):
return s
else:
return s.encode('utf-8')
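#For example, on Python 3:
#   u(b'caf\xc3\xa9')   #returns 'café' (bytes decoded as UTF-8)
#   b('café')           #returns b'caf\xc3\xa9' (string encoded as UTF-8)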
def isstring(s): #Is this a proper string?
return isinstance(s, str) or (version < '3' and isinstance(s, unicode))
def log(msg, **kwargs):
"""Generic log method. Will prepend timestamp.
Keyword arguments:
system - Name of the system/module
indent - Integer denoting the desired level of indentation
streams - List of streams to output to
stream - Stream to output to (singleton version of streams)
"""
if 'debug' in kwargs:
if 'currentdebug' in kwargs:
if kwargs['currentdebug'] < kwargs['debug']:
return False
else:
return False #no currentdebug passed, assuming no debug mode and thus skipping message
s = "[" + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "] "
if 'system' in kwargs:
s += "[" + system + "] "
if 'indent' in kwargs:
s += ("\t" * int(kwargs['indent']))
s += u(msg)
if s[-1] != '\n':
s += '\n'
if 'streams' in kwargs:
streams = kwargs['streams']
elif 'stream' in kwargs:
streams = [kwargs['stream']]
else:
streams = [stderr]
for stream in streams:
stream.write(s)
return s
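#Usage sketch ('mymodule' is just a placeholder system name):
#   log("Loading lexicon", system="mymodule", indent=1)
#writes "[<timestamp>] [mymodule] \tLoading lexicon" to stderr and returns that string.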
PyNLPl-1.2.9/pynlpl/datatypes.py 0000664 0001750 0000144 00000043726 12201265173 017414 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - Data Types
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Based in large part on MIT licensed code from
# AI: A Modern Approach : http://aima.cs.berkeley.edu/python/utils.html
# Peter Norvig
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
"""This library contains various extra data types, based to a certain extend on MIT-licensed code from Peter Norvig, AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html"""
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u
import random
import bisect
import array
from sys import version as PYTHONVERSION
class Queue(object): #from AI: A Modern Approach : http://aima.cs.berkeley.edu/python/utils.html
"""Queue is an abstract class/interface. There are three types:
Python List: A Last In First Out Queue (no Queue object necessary).
FIFOQueue(): A First In First Out Queue.
PriorityQueue(lt): Queue where items are sorted by lt, (default <).
Each type supports the following methods and functions:
q.append(item) -- add an item to the queue
q.extend(items) -- equivalent to: for item in items: q.append(item)
q.pop() -- return the top item from the queue
    len(q) -- number of items in q (also q.__len__())."""
def extend(self, items):
"""Append all elements from items to the queue"""
for item in items: self.append(item)
#note: A Python list is a LIFOQueue / Stack
class FIFOQueue(Queue): #adapted from AI: A Modern Approach : http://aima.cs.berkeley.edu/python/utils.html
    """A First-In-First-Out Queue"""
    def __init__(self, data = None):
        self.data = data if data is not None else [] #avoid a shared mutable default argument
        self.start = 0
def append(self, item):
self.data.append(item)
def __len__(self):
return len(self.data) - self.start
def extend(self, items):
"""Append all elements from items to the queue"""
self.data.extend(items)
def pop(self):
"""Retrieve the next element in line, this will remove it from the queue"""
e = self.data[self.start]
self.start += 1
if self.start > 5 and self.start > len(self.data)//2:
self.data = self.data[self.start:]
self.start = 0
return e
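#For example:
#   q = FIFOQueue()
#   q.extend([1, 2, 3])
#   q.pop()   #returns 1 (first in, first out)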
class PriorityQueue(Queue): #Heavily adapted/extended, originally from AI: A Modern Approach : http://aima.cs.berkeley.edu/python/utils.html
    """A queue in which the maximum (or minimum) element is returned first,
    as determined by either an external score function f (by default calling
    the objects score() method). If minimize=True, the item with minimum f(x) is
    returned first; otherwise it is the item with maximum f(x) or x.score().
    length can be set to an integer > 0. Items will only be added to the queue if they score better than or equal to the worst-scoring item. If set to zero, length is unbounded.
    blockworse can be set to True if you want to prohibit adding worse-scoring items to the queue. Only items scoring better than the *BEST* one are added.
    blockequal can be set to True if you also want to prohibit adding equally-scoring items to the queue.
    (Both parameters default to False)
    """
def __init__(self, data =[], f = lambda x: x.score, minimize=False, length=0, blockworse=False, blockequal=False,duplicates=True):
self.data = []
self.f = f
self.minimize=minimize
self.length = length
self.blockworse=blockworse
self.blockequal=blockequal
self.duplicates= duplicates
self.bestscore = None
for item in data:
self.append(item)
def append(self, item):
"""Adds an item to the priority queue (in the right place), returns True if successfull, False if the item was blocked (because of a bad score)"""
f = self.f(item)
if callable(f):
score = f()
else:
score = f
if not self.duplicates:
for s, i in self.data:
if s == score and item == i:
#item is a duplicate, don't add it
return False
if self.length and len(self.data) == self.length:
            #Fixed-length priority queue: abort when the queue is full and the new item scores worse than the worst-scoring item.
if self.minimize:
worstscore = self.data[-1][0]
if score >= worstscore:
return False
else:
worstscore = self.data[0][0]
if score <= worstscore:
return False
        if self.blockworse and self.bestscore is not None:
            if self.minimize:
                if score > self.bestscore:
                    return False
            else:
                if score < self.bestscore:
                    return False
        if self.blockequal and self.bestscore is not None:
            if self.bestscore == score:
                return False
        if (self.bestscore is None) or (self.minimize and score < self.bestscore) or (not self.minimize and score > self.bestscore):
self.bestscore = score
bisect.insort(self.data, (score, item))
if self.length:
#fixed length queue: queue is now too long, delete worst items
while len(self.data) > self.length:
if self.minimize:
del self.data[-1]
else:
del self.data[0]
return True
    def __contains__(self, item):
        #items are stored internally as (score, item) tuples
        return any( item == storeditem for _, storeditem in self.data )
def __len__(self):
return len(self.data)
def __iter__(self):
"""Iterate over all items, in order from best to worst!"""
if self.minimize:
f = lambda x: x
else:
f = reversed
for score, item in f(self.data):
yield item
def __getitem__(self, i):
"""Item 0 is always the best item!"""
if isinstance(i, slice):
indices = i.indices(len(self))
if self.minimize:
return PriorityQueue([ self.data[j][1] for j in range(*indices) ],self.f, self.minimize, self.length, self.blockworse, self.blockequal)
else:
return PriorityQueue([ self.data[(-1 * j) - 1][1] for j in range(*indices) ],self.f, self.minimize, self.length, self.blockworse, self.blockequal)
else:
if self.minimize:
return self.data[i][1]
else:
return self.data[(-1 * i) - 1][1]
def pop(self):
"""Retrieve the next element in line, this will remove it from the queue"""
if self.minimize:
return self.data.pop(0)[1]
else:
return self.data.pop()[1]
def score(self, i):
"""Return the score for item x (cheap lookup), Item 0 is always the best item"""
if self.minimize:
return self.data[i][0]
else:
return self.data[(-1 * i) - 1][0]
def prune(self, n):
"""prune all but the first (=best) n items"""
if self.minimize:
self.data = self.data[:n]
else:
self.data = self.data[-1 * n:]
def randomprune(self,n):
"""prune down to n items at random, disregarding their score"""
self.data = random.sample(self.data, n)
def stochasticprune(self,n):
"""prune down to n items, chance of an item being pruned is reverse proportional to its score"""
        raise NotImplementedError
def prunebyscore(self, score, retainequalscore=False):
"""Deletes all items below/above a certain score from the queue, depending on whether minimize is True or False. Note: It is recommended (more efficient) to use blockworse=True / blockequal=True instead! Preventing the addition of 'worse' items."""
if retainequalscore:
if self.minimize:
f = lambda x: x[0] <= score
else:
f = lambda x: x[0] >= score
else:
if self.minimize:
f = lambda x: x[0] < score
else:
f = lambda x: x[0] > score
        self.data = list(filter(f, self.data)) #list() so this also works on Python 3, where filter returns an iterator
def __eq__(self, other):
return (self.data == other.data) and (self.minimize == other.minimize)
def __repr__(self):
return repr(self.data)
def __add__(self, other):
"""Priority queues can be added up, as long as they all have minimize or maximize (rather than mixed). In case of fixed-length queues, the FIRST queue in the operation will be authorative for the fixed lengthness of the result!"""
assert (isinstance(other, PriorityQueue) and self.minimize == other.minimize)
return PriorityQueue(self.data + other.data, self.f, self.minimize, self.length, self.blockworse, self.blockequal)
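#Usage sketch with plain integers and an identity score function (so no
#score() method is needed on the items):
#   pq = PriorityQueue(f=lambda x: x, minimize=True)
#   for n in (3, 1, 2): pq.append(n)
#   pq[0]      #1 (item 0 is always the best item)
#   list(pq)   #[1, 2, 3], iteration goes from best to worst
#   pq.pop()   #1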
class Tree(object):
"""Simple tree structure. Nodes are themselves trees."""
    def __init__(self, value = None, children = None):
        self.parent = None
        self.value = value
        self.children = None #always initialise, append() relies on this attribute existing
        if children:
            for c in children:
                self.append(c)
def leaf(self):
"""Is this a leaf node or not?"""
return not self.children
def __len__(self):
if not self.children:
return 0
else:
return len(self.children)
def __bool__(self):
return True
    def __iter__(self):
        """Iterate over the children of this node"""
        if self.children:
            for c in self.children:
                yield c
def append(self, item):
"""Add an item to the Tree"""
if not isinstance(item, Tree):
return ValueError("Can only append items of type Tree")
if not self.children: self.children = []
item.parent = self
self.children.append(item)
    def __getitem__(self, index):
        """Retrieve a specific item, by index, from the Tree"""
        assert isinstance(index,int)
        return self.children[index]
def __str__(self):
return str(self.value)
def __unicode__(self): #Python 2.x
return u(self.value)
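#For example:
#   root = Tree('S', [Tree('NP'), Tree('VP')])
#   len(root)                 #2
#   root[0].value             #'NP'
#   root[0].parent is root    #True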
class Trie(object):
"""Simple trie structure. Nodes are themselves tries, values are stored on the edges, not the nodes."""
def __init__(self, sequence = None):
self.parent = None
self.children = None
self.value = None
if sequence:
self.append(sequence)
def leaf(self):
"""Is this a leaf node or not?"""
return not self.children
def root(self):
"""Returns True if this is the root of the Trie"""
return not self.parent
def __len__(self):
if not self.children:
return 0
else:
return len(self.children)
def __bool__(self):
return True
def __iter__(self):
if self.children:
for key in self.children.keys():
yield key
def items(self):
if self.children:
for key, trie in self.children.items():
yield key, trie
def __setitem__(self, key, subtrie):
if not isinstance(subtrie, Trie):
return ValueError("Can only set items of type Trie, got " + str(type(subtrie)))
if not self.children: self.children = {}
subtrie.value = key
subtrie.parent = self
self.children[key] = subtrie
    def append(self, sequence):
        if not sequence:
            return self
        if not self.children:
            self.children = {}
        if sequence[0] not in self.children:
            self.children[sequence[0]] = Trie()
        return self.children[sequence[0]].append( sequence[1:] )
def find(self, sequence):
if not sequence:
return self
elif self.children and sequence[0] in self.children:
return self.children[sequence[0]].find(sequence[1:])
else:
return False
    def __contains__(self, key):
        return self.children is not None and key in self.children
    def __getitem__(self, key):
        return self.children[key]
def size(self):
"""Size is number of nodes under the trie, including the current node"""
if self.children:
return sum( ( c.size() for c in self.children.values() ) ) + 1
else:
return 1
def path(self):
"""Returns the path to the current node"""
if self.parent:
return (self,) + self.parent.path()
else:
return (self,)
def depth(self):
"""Returns the depth of the current node"""
if self.parent:
return 1 + self.parent.depth()
else:
return 1
def sequence(self):
if self.parent:
if self.value:
return (self.value,) + self.parent.sequence()
else:
return self.parent.sequence()
else:
return (self,)
    def walk(self, leavesonly=True, maxdepth=None, _depth = 0):
        """Depth-first search, walking through the trie, returning all encountered nodes (by default only leaves)"""
        if self.children:
            if not maxdepth or (maxdepth and _depth < maxdepth):
                for key, child in self.children.items():
                    if child.leaf():
                        yield child
                    else:
                        if not leavesonly:
                            yield child #also yield non-leaf nodes when leavesonly is disabled
                        for results in child.walk(leavesonly, maxdepth, _depth + 1):
                            yield results
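#For example:
#   trie = Trie()
#   trie.append('to')
#   trie.append('tea')
#   trie.find('te')   #returns the subtrie below 't','e' (or False if absent)
#   trie.size()       #5: the root plus nodes for 't', 'o', 'e' and 'a'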
FIXEDGAP = 128
DYNAMICGAP = 129
if PYTHONVERSION > '3':
#only available for Python 3
class Pattern:
def __init__(self, data, classdecoder=None):
assert isinstance(data, bytes)
self.data = data
self.classdecoder = classdecoder
        @staticmethod
        def fromstring(s, classencoder): #static
            data = b''
            for token in s.split():
                data += classencoder[token]
            return Pattern(data)
        def __str__(self):
            s = ""
            for cls in self.iterbytes():
                s += self.classdecoder[int.from_bytes(cls, 'big')]
            return s
        def iterbytes(self, begin=0, end=0):
            i = 0
            l = len(self.data)
            n = 0
            doslice = (end != begin)
            while i < l:
                size = self.data[i]
                if (size < 128): #everything from 128 onward is reserved for markers
                    if not doslice:
                        yield self.data[i+1:i+1+size]
                    else:
                        if n >= begin and n < end: #test before incrementing, so token 0 is reachable
                            yield self.data[i+1:i+1+size]
                        n += 1
                    i += 1 + size
                else:
                    raise ValueError("Size >= 128")
def __iter__(self):
for b in self.iterbytes():
yield Pattern(b, self.classdecoder)
def __bytes__(self):
return self.data
        def __len__(self):
            """Return the number of tokens in the pattern"""
            i = 0
            l = len(self.data)
            n = 0
            while i < l:
                size = self.data[i]
                if (size < 128):
                    n += 1
                    i += 1 + size
                else:
                    raise ValueError("Size >= 128")
            return n
def __getitem__(self, index):
assert isinstance(index, int)
for b in self.iterbytes(index,index+1):
return Pattern(b, self.classdecoder)
        def __getslice__(self, begin, end): #note: only consulted on Python 2; Python 3 routes slices through __getitem__
            slicedata = b''
            for b in self.iterbytes(begin,end):
                slicedata += b
            return slicedata
def __add__(self, other):
assert isinstance(other, Pattern)
return Pattern(self.data + other.data, self.classdecoder)
def __eq__(self, other):
return self.data == other.data
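        #Usage sketch with a hypothetical class encoding in which each token is a
        #length byte followed by that many payload bytes, as iterbytes() expects:
        #   classdecoder = {2: 'to', 3: 'be'}
        #   p = Pattern(b'\x01\x02\x01\x03', classdecoder)
        #   len(p)    #2 tokens
        #   str(p)    #'tobe' (codes mapped back through the class decoder)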
class PatternSet:
def __init__(self):
self.data = set()
def add(self, pattern):
self.data.add(pattern.data)
def remove(self, pattern):
self.data.remove(pattern.data)
def __len__(self):
return len(self.data)
def __bool__(self):
return len(self.data) > 0
def __contains__(self, pattern):
return pattern.data in self.data
def __iter__(self):
for patterndata in self.data:
yield Pattern(patterndata)
class PatternMap:
def __init__(self, default=None):
self.data = {}
self.default = default
def __getitem__(self, pattern):
assert isinstance(pattern, Pattern)
            if self.default is not None:
try:
return self.data[pattern.data]
except KeyError:
return self.default
else:
return self.data[pattern.data]
def __setitem__(self, pattern, value):
self.data[pattern.data] = value
def __delitem__(self, pattern):
del self.data[pattern.data]
def __len__(self):
return len(self.data)
def __bool__(self):
return len(self.data) > 0
def __contains__(self, pattern):
return pattern.data in self.data
def __iter__(self):
for patterndata in self.data:
yield Pattern(patterndata)
def items(self):
for patterndata, value in self.data.items():
yield Pattern(patterndata), value
#class SuffixTree(object):
# def __init__(self):
# self.data = {}
#
#
# def append(self, seq):
# if len(seq) > 1:
# for item in seq:
# self.append(item)
# else:
#
#
# def compile(self, s):
PyNLPl-1.2.9/pynlpl/evaluation.py 0000644 0001750 0000144 00000064406 13372265040 017564 0 ustar proycon users 0000000 0000000 ###############################################################
# PyNLPl - Evaluation Library
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Licensed under GPLv3
#
# This is a Python library with classes and functions for evaluation
# and experiments.
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u, isstring
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
import io
from pynlpl.statistics import FrequencyList
from collections import defaultdict
try:
import numpy as np
except ImportError:
np = None
import subprocess
import itertools
import time
import random
import math
import copy
import datetime
import os.path
def auc(x, y, reorder=False): #from sklearn, http://scikit-learn.org, licensed under BSD License
"""Compute Area Under the Curve (AUC) using the trapezoidal rule
    This is a general function, given points on a curve. For computing the area
under the ROC-curve, see :func:`auc_score`.
Parameters
----------
x : array, shape = [n]
x coordinates.
y : array, shape = [n]
y coordinates.
reorder : boolean, optional (default=False)
If True, assume that the curve is ascending in the case of ties, as for
an ROC curve. If the curve is non-ascending, the result will be wrong.
Returns
-------
auc : float
Examples
--------
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75
See also
--------
auc_score : Computes the area under the ROC curve
"""
if np is None:
raise ImportError("No numpy installed")
# XXX: Consider using ``scipy.integrate`` instead, or moving to
# ``utils.extmath``
if not isinstance(x, np.ndarray): x = np.array(x)
    if not isinstance(y, np.ndarray): y = np.array(y)
if x.shape[0] < 2:
raise ValueError('At least 2 points are needed to compute'
' area under curve, but x.shape = %s' % x.shape)
if reorder:
# reorder the data points according to the x axis and using y to
# break ties
x, y = np.array(sorted(points for points in zip(x, y))).T
h = np.diff(x)
else:
h = np.diff(x)
if np.any(h < 0):
h *= -1
            assert not np.any(h < 0), ("Reordering is not turned on, and "
                                       "the x array is not increasing: %s" % x)
area = np.sum(h * (y[1:] + y[:-1])) / 2.0
return area
def mae(absolute_error_values):
if np is None:
return sum(absolute_error_values) / len(absolute_error_values)
else:
return np.mean(absolute_error_values)
def rmse(squared_error_values):
if np is None:
return math.sqrt(sum(squared_error_values)/len(squared_error_values))
else:
return math.sqrt(np.mean(squared_error_values))
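# A quick sanity check of the two helpers above (hand-computable values):
#   mae([1.0, 2.0, 3.0]) == 2.0
#   rmse([1.0, 4.0, 9.0]) == math.sqrt(14/3) #note: the inputs are *squared* errors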
class ProcessFailed(Exception):
pass
class ConfusionMatrix(FrequencyList):
"""Confusion Matrix"""
def __str__(self):
"""Print Confusion Matrix in table form"""
o = "== Confusion Matrix == (hor: goals, vert: observations)\n\n"
keys = set()
for goalkey,observationkey in self._count.keys():
keys.add(goalkey)
keys.add(observationkey)
keys = sorted( keys)
linemask = "%20s"
cells = ['']
for keyH in keys:
l = len(keyH)
if l < 4:
l = 4
elif l > 15:
l = 15
linemask += " %" + str(l) + "s"
cells.append(keyH)
linemask += "\n"
o += linemask % tuple(cells)
for keyV in keys:
linemask = "%20s"
cells = [keyV]
for keyH in keys:
l = len(keyH)
if l < 4:
l = 4
elif l > 15:
l = 15
linemask += " %" + str(l) + "d"
                try:
                    count = self._count[(keyH, keyV)]
                except KeyError:
                    count = 0
cells.append(count)
linemask += "\n"
o += linemask % tuple(cells)
return o
class ClassEvaluation(object):
    def __init__(self, goals=None, observations=None, missing=None, encoding='utf-8'):
        #avoid mutable default arguments: a shared list/dict default would leak state between instances
        if goals is None: goals = []
        if observations is None: observations = []
        if missing is None: missing = {}
        assert len(observations) == len(goals)
        self.observations = copy.copy(observations)
        self.goals = copy.copy(goals)
        self.classes = set(self.observations + self.goals)
        self.tp = defaultdict(int)
        self.fp = defaultdict(int)
        self.tn = defaultdict(int)
        self.fn = defaultdict(int)
        self.missing = missing
        self.encoding = encoding
        self.computed = False
        if self.observations:
            self.compute()
def append(self, goal, observation):
self.goals.append(goal)
self.observations.append(observation)
self.classes.add(goal)
self.classes.add(observation)
self.computed = False
def precision(self, cls=None, macro=False):
if not self.computed: self.compute()
if cls:
if self.tp[cls] + self.fp[cls] > 0:
return self.tp[cls] / (self.tp[cls] + self.fp[cls])
else:
#return float('nan')
return 0
else:
if len(self.observations) > 0:
if macro:
return sum( ( self.precision(x) for x in set(self.goals) ) ) / len(set(self.classes))
else:
return sum( ( self.precision(x) for x in self.goals ) ) / len(self.goals)
else:
#return float('nan')
return 0
def recall(self, cls=None, macro=False):
if not self.computed: self.compute()
if cls:
if self.tp[cls] + self.fn[cls] > 0:
return self.tp[cls] / (self.tp[cls] + self.fn[cls])
else:
#return float('nan')
return 0
else:
if len(self.observations) > 0:
if macro:
return sum( ( self.recall(x) for x in set(self.goals) ) ) / len(set(self.classes))
else:
return sum( ( self.recall(x) for x in self.goals ) ) / len(self.goals)
else:
#return float('nan')
return 0
def specificity(self, cls=None, macro=False):
if not self.computed: self.compute()
if cls:
if self.tn[cls] + self.fp[cls] > 0:
return self.tn[cls] / (self.tn[cls] + self.fp[cls])
else:
#return float('nan')
return 0
else:
if len(self.observations) > 0:
if macro:
return sum( ( self.specificity(x) for x in set(self.goals) ) ) / len(set(self.classes))
else:
return sum( ( self.specificity(x) for x in self.goals ) ) / len(self.goals)
else:
#return float('nan')
return 0
def accuracy(self, cls=None):
if not self.computed: self.compute()
if cls:
if self.tp[cls] + self.tn[cls] + self.fp[cls] + self.fn[cls] > 0:
return (self.tp[cls]+self.tn[cls]) / (self.tp[cls] + self.tn[cls] + self.fp[cls] + self.fn[cls])
else:
#return float('nan')
return 0
else:
if len(self.observations) > 0:
return sum( ( self.tp[x] for x in self.tp ) ) / len(self.observations)
else:
#return float('nan')
return 0
def fscore(self, cls=None, beta=1, macro=False):
if not self.computed: self.compute()
if cls:
prec = self.precision(cls)
rec = self.recall(cls)
if prec * rec > 0:
return (1 + beta*beta) * ((prec * rec) / (beta*beta * prec + rec))
else:
#return float('nan')
return 0
else:
if len(self.observations) > 0:
if macro:
return sum( ( self.fscore(x,beta) for x in set(self.goals) ) ) / len(set(self.classes))
else:
return sum( ( self.fscore(x,beta) for x in self.goals ) ) / len(self.goals)
else:
#return float('nan')
return 0
def tp_rate(self, cls=None, macro=False):
if not self.computed: self.compute()
if cls:
if self.tp[cls] > 0:
return self.tp[cls] / (self.tp[cls] + self.fn[cls])
else:
return 0
else:
if len(self.observations) > 0:
if macro:
return sum( ( self.tp_rate(x) for x in set(self.goals) ) ) / len(set(self.classes))
else:
return sum( ( self.tp_rate(x) for x in self.goals ) ) / len(self.goals)
else:
#return float('nan')
return 0
def fp_rate(self, cls=None, macro=False):
if not self.computed: self.compute()
if cls:
if self.fp[cls] > 0:
return self.fp[cls] / (self.tn[cls] + self.fp[cls])
else:
return 0
else:
if len(self.observations) > 0:
if macro:
return sum( ( self.fp_rate(x) for x in set(self.goals) ) ) / len(set(self.classes))
else:
return sum( ( self.fp_rate(x) for x in self.goals ) ) / len(self.goals)
else:
#return float('nan')
return 0
def auc(self, cls=None, macro=False):
if not self.computed: self.compute()
if cls:
tpr = self.tp_rate(cls)
fpr = self.fp_rate(cls)
return auc([0,fpr,1], [0,tpr,1])
else:
if len(self.observations) > 0:
if macro:
return sum( ( self.auc(x) for x in set(self.goals) ) ) / len(set(self.classes))
else:
return sum( ( self.auc(x) for x in self.goals ) ) / len(self.goals)
else:
#return float('nan')
return 0
def __iter__(self):
for g,o in zip(self.goals, self.observations):
yield g,o
def compute(self):
self.tp = defaultdict(int)
self.fp = defaultdict(int)
self.tn = defaultdict(int)
self.fn = defaultdict(int)
for cls, count in self.missing.items():
self.fn[cls] = count
for goal, observation in self:
if goal == observation:
self.tp[observation] += 1
elif goal != observation:
self.fp[observation] += 1
self.fn[goal] += 1
l = len(self.goals) + sum(self.missing.values())
for o in self.classes:
self.tn[o] = l - self.tp[o] - self.fp[o] - self.fn[o]
self.computed = True
def confusionmatrix(self, casesensitive =True):
return ConfusionMatrix(zip(self.goals, self.observations), casesensitive)
def outputmetrics(self):
o = "Accuracy: " + str(self.accuracy()) + "\n"
o += "Samples: " + str(len(self.goals)) + "\n"
o += "Correct: " + str(sum( ( self.tp[x] for x in set(self.goals)) ) ) + "\n"
o += "Recall (microav): "+ str(self.recall()) + "\n"
o += "Recall (macroav): "+ str(self.recall(None,True)) + "\n"
o += "Precision (microav): " + str(self.precision()) + "\n"
o += "Precision (macroav): "+ str(self.precision(None,True)) + "\n"
o += "Specificity (microav): " + str(self.specificity()) + "\n"
o += "Specificity (macroav): "+ str(self.specificity(None,True)) + "\n"
o += "F-score1 (microav): " + str(self.fscore()) + "\n"
o += "F-score1 (macroav): " + str(self.fscore(None,1,True)) + "\n"
return o
def __str__(self):
if not self.computed: self.compute()
o = "%-15s TP\tFP\tTN\tFN\tAccuracy\tPrecision\tRecall(TPR)\tSpecificity(TNR)\tF-score\n" % ("")
for cls in sorted(set(self.classes)):
cls = u(cls)
o += "%-15s %d\t%d\t%d\t%d\t%4f\t%4f\t%4f\t%4f\t%4f\n" % (cls, self.tp[cls], self.fp[cls], self.tn[cls], self.fn[cls], self.accuracy(cls), self.precision(cls), self.recall(cls),self.specificity(cls), self.fscore(cls) )
return o + "\n" + self.outputmetrics()
def __unicode__(self): #Python 2.x
return str(self)
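# A minimal usage sketch with toy labels (values hand-checked against the
# definitions above):
#   e = ClassEvaluation(goals=['cat','dog','cat'], observations=['cat','cat','cat'])
#   e.accuracy() == 2/3 #two of three observations match their goal
#   e.precision('cat') == 2/3 #tp=2, fp=1
#   e.recall('dog') == 0 #the single 'dog' goal was missed (fn=1)
#   print(e) #per-class table plus the summary from outputmetrics()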
class OrdinalEvaluation(ClassEvaluation):
    def __init__(self, goals=None, observations=None, missing=None, encoding='utf-8'):
        ClassEvaluation.__init__(self, goals, observations, missing, encoding=encoding)
def compute(self):
        assert all(isinstance(cls, int) for cls in self.classes), "OrdinalEvaluation requires integer classes"
ClassEvaluation.compute(self)
self.error = defaultdict(list)
self.squared_error = defaultdict(list)
for goal, observation in self:
self.error[observation].append(abs(goal-observation))
self.squared_error[observation].append(abs(goal-observation)**2)
def mae(self, cls=None):
if not self.computed: self.compute()
if cls:
return mae(self.error[cls])
else:
return mae(sum([self.error[x] for x in set(self.goals)], []))
def rmse(self, cls=None):
if not self.computed: self.compute()
if cls:
return rmse(self.squared_error[cls])
else:
return rmse(sum([self.squared_error[x] for x in set(self.goals)], []))
class AbstractExperiment(object):
def __init__(self, inputdata = None, **parameters):
self.inputdata = inputdata
self.parameters = self.defaultparameters()
for parameter, value in parameters.items():
self.parameters[parameter] = value
self.process = None
self.creationtime = datetime.datetime.now()
self.begintime = self.endtime = 0
def defaultparameters(self):
return {}
def duration(self):
if self.endtime and self.begintime:
return self.endtime - self.begintime
else:
return 0
def start(self):
"""Start as a detached subprocess, immediately returning execution to caller."""
raise Exception("Not implemented yet, make sure to overload the start() method in your Experiment class")
def done(self, warn=True):
"""Is the subprocess done?"""
if not self.process:
raise Exception("Not implemented yet or process not started yet, make sure to overload the done() method in your Experiment class")
self.process.poll()
        if self.process.returncode is None:
return False
elif self.process.returncode > 0:
raise ProcessFailed()
else:
self.endtime = datetime.datetime.now()
return True
def run(self):
if hasattr(self,'start'):
self.start()
self.wait()
else:
raise Exception("Not implemented yet, make sure to overload the run() method!")
def startcommand(self, command, cwd, stdout, stderr, *arguments, **parameters):
argdelimiter=' '
printcommand = True
cmd = command
if arguments:
cmd += ' ' + " ".join([ u(x) for x in arguments])
if parameters:
for key, value in parameters.items():
if key == 'argdelimiter':
argdelimiter = value
elif key == 'printcommand':
printcommand = value
                elif isinstance(value, bool) and value:
cmd += ' ' + key
elif key[-1] != '=':
cmd += ' ' + key + argdelimiter + str(value)
else:
cmd += ' ' + key + str(value)
if printcommand:
print("STARTING COMMAND: " + cmd, file=stderr)
self.begintime = datetime.datetime.now()
if not cwd:
self.process = subprocess.Popen(cmd, shell=True,stdout=stdout,stderr=stderr)
else:
self.process = subprocess.Popen(cmd, shell=True,cwd=cwd,stdout=stdout,stderr=stderr)
#pid = process.pid
#os.waitpid(pid, 0) #wait for process to finish
return self.process
    def wait(self):
        while not self.done():
            time.sleep(1)
def score(self):
raise Exception("Not implemented yet, make sure to overload the score() method")
def delete(self):
pass
def sample(self, size):
"""Return a sample of the input data"""
raise Exception("Not implemented yet, make sure to overload the sample() method")
class ExperimentPool(object):
def __init__(self, size):
self.size = size
self.queue = []
self.running = []
def append(self, experiment):
assert isinstance(experiment, AbstractExperiment)
self.queue.append( experiment )
def __len__(self):
return len(self.queue)
def __iter__(self):
return iter(self.queue)
def start(self, experiment):
experiment.start()
self.running.append( experiment )
def poll(self, haltonerror=True):
done = []
for experiment in self.running:
try:
if experiment.done():
done.append( experiment )
except ProcessFailed:
print("ERROR: One experiment in the pool failed: " + repr(experiment.inputdata) + repr(experiment.parameters), file=stderr)
if haltonerror:
raise
else:
done.append( experiment )
for experiment in done:
self.running.remove( experiment )
return done
def run(self, haltonerror=True):
while True:
#check how many processes are done
done = self.poll(haltonerror)
for experiment in done:
yield experiment
#start new processes
while self.queue and len(self.running) < self.size:
self.start( self.queue.pop(0) )
if not self.queue and not self.running:
break
class WPSParamSearch(object):
"""ParamSearch with support for Wrapped Progressive Sampling"""
def __init__(self, experimentclass, inputdata, size, parameterscope, poolsize=1, sizefunc=None, prunefunc=None, constraintfunc = None, delete=True): #parameterscope: {'parameter':[values]}
self.ExperimentClass = experimentclass
self.inputdata = inputdata
self.poolsize = poolsize #0 or 1: sequential execution (uses experiment.run() ), >1: parallel execution using ExperimentPool (uses experiment.start() )
self.maxsize = size
self.delete = delete #delete intermediate experiments
if self.maxsize == -1:
self.sizefunc = lambda x,y: self.maxsize
else:
            if sizefunc is not None:
self.sizefunc = sizefunc
else:
self.sizefunc = lambda i, maxsize: round((maxsize/100.0)*i*i)
#prunefunc should return a number between 0 and 1, indicating how much is pruned. (for example: 0.75 prunes three/fourth of all combinations, retaining only 25%)
        if prunefunc is not None:
self.prunefunc = prunefunc
else:
self.prunefunc = lambda i: 0.5
        if constraintfunc is not None:
self.constraintfunc = constraintfunc
else:
self.constraintfunc = lambda x: True
#compute all parameter combinations:
if isinstance(parameterscope, dict):
verboseparameterscope = [ self._combine(x,y) for x,y in parameterscope.items() ]
else:
verboseparameterscope = [ self._combine(x,y) for x,y in parameterscope ]
        self.parametercombinations = [ (x,0) for x in itertools.product(*verboseparameterscope) if self.constraintfunc(dict(x)) ] #list of (parametercombination, score) tuples
def _combine(self,name, values): #TODO: can't we do this inline in a list comprehension?
l = []
for value in values:
l.append( (name, value) )
return l
def searchbest(self):
solution = None
for s in iter(self):
solution = s
return solution[0]
def test(self,i=None):
#sample size elements from inputdata
if i is None or self.maxsize == -1:
data = self.inputdata
else:
size = int(self.sizefunc(i, self.maxsize))
if size > self.maxsize:
return []
data = self.ExperimentClass.sample(self.inputdata, size)
#run on ALL available parameter combinations and retrieve score
newparametercombinations = []
if self.poolsize <= 1:
#Don't use experiment pool, sequential execution
for parameters,score in self.parametercombinations:
experiment = self.ExperimentClass(data, **dict(parameters))
experiment.run()
newparametercombinations.append( (parameters, experiment.score()) )
if self.delete:
experiment.delete()
else:
#Use experiment pool, parallel execution
pool = ExperimentPool(self.poolsize)
for parameters,score in self.parametercombinations:
pool.append( self.ExperimentClass(data, **dict(parameters)) )
for experiment in pool.run(False):
newparametercombinations.append( (experiment.parameters, experiment.score()) )
if self.delete:
experiment.delete()
return newparametercombinations
def __iter__(self):
i = 0
while True:
i += 1
newparametercombinations = self.test(i)
#prune the combinations, keeping only the best
prune = int(round(self.prunefunc(i) * len(newparametercombinations)))
self.parametercombinations = sorted(newparametercombinations, key=lambda v: v[1])[prune:]
yield [ x[0] for x in self.parametercombinations ]
if len(self.parametercombinations) <= 1:
break
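# A minimal usage sketch, assuming MyExperiment is a user-defined
# AbstractExperiment subclass implementing start()/score()/sample():
#   search = WPSParamSearch(MyExperiment, inputdata, len(inputdata),
#                           {'learningrate': [0.01, 0.1], 'window': [1, 2, 3]})
#   best = search.searchbest() #the winning parameter combination, as (name, value) pairs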
class ParamSearch(WPSParamSearch):
"""A simpler version of ParamSearch without Wrapped Progressive Sampling"""
def __init__(self, experimentclass, inputdata, parameterscope, poolsize=1, constraintfunc = None, delete=True): #parameterscope: {'parameter':[values]}
prunefunc = lambda x: 0
super(ParamSearch, self).__init__(experimentclass, inputdata, -1, parameterscope, poolsize, None,prunefunc, constraintfunc, delete)
def __iter__(self):
for parametercombination, score in sorted(self.test(), key=lambda v: v[1]):
yield parametercombination, score
def filesampler(files, testsetsize = 0.1, devsetsize = 0, trainsetsize = 0, outputdir = '', encoding='utf-8'):
"""Extract a training set, test set and optimally a development set from one file, or multiple *interdependent* files (such as a parallel corpus). It is assumed each line contains one instance (such as a word or sentence for example)."""
    if isstring(files):
        files = [files] #wrap a single filename in a list rather than iterating over its characters
    elif not isinstance(files, list):
        files = list(files)
total = 0
for filename in files:
f = io.open(filename,'r', encoding=encoding)
count = 0
for line in f:
count += 1
f.close()
if total == 0:
total = count
elif total != count:
raise Exception("Size mismatch, when multiple files are specified they must contain the exact same amount of lines! (" +str(count) + " vs " + str(total) +")")
#support for relative values:
if testsetsize < 1:
testsetsize = int(total * testsetsize)
if devsetsize < 1 and devsetsize > 0:
devsetsize = int(total * devsetsize)
if testsetsize >= total or devsetsize >= total or testsetsize + devsetsize >= total:
raise Exception("Test set and/or development set too large! No samples left for training set!")
trainset = {}
testset = {}
devset = {}
for i in range(1,total+1):
trainset[i] = True
    for i in random.sample(list(trainset.keys()), int(testsetsize)): #list() because random.sample requires a sequence
testset[i] = True
del trainset[i]
if devsetsize > 0:
        for i in random.sample(list(trainset.keys()), int(devsetsize)):
devset[i] = True
del trainset[i]
if trainsetsize > 0:
newtrainset = {}
        for i in random.sample(list(trainset.keys()), int(trainsetsize)):
newtrainset[i] = True
trainset = newtrainset
for filename in files:
if not outputdir:
ftrain = io.open(filename + '.train','w',encoding=encoding)
else:
ftrain = io.open(outputdir + '/' + os.path.basename(filename) + '.train','w',encoding=encoding)
if not outputdir:
ftest = io.open(filename + '.test','w',encoding=encoding)
else:
ftest = io.open(outputdir + '/' + os.path.basename(filename) + '.test','w',encoding=encoding)
if devsetsize > 0:
if not outputdir:
fdev = io.open(filename + '.dev','w',encoding=encoding)
else:
fdev = io.open(outputdir + '/' + os.path.basename(filename) + '.dev','w',encoding=encoding)
f = io.open(filename,'r',encoding=encoding)
for linenum, line in enumerate(f):
if linenum+1 in trainset:
ftrain.write(line)
elif linenum+1 in testset:
ftest.write(line)
elif devsetsize > 0 and linenum+1 in devset:
fdev.write(line)
f.close()
ftrain.close()
ftest.close()
if devsetsize > 0: fdev.close()
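# A minimal usage sketch (hypothetical file names): split a parallel corpus
# into 80% train, 10% test and 10% dev, keeping the two files in sync:
#   filesampler(['corpus.nl', 'corpus.en'], testsetsize=0.1, devsetsize=0.1)
# which writes corpus.nl.train/.test/.dev and corpus.en.train/.test/.dev.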
PyNLPl-1.2.9/pynlpl/formats/ 0000755 0001750 0000144 00000000000 13442242642 016505 5 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/formats/__init__.py 0000664 0001750 0000144 00000000125 12201265173 020612 0 ustar proycon users 0000000 0000000 """This package contains modules for reading and/or writing specific file formats"""
PyNLPl-1.2.9/pynlpl/formats/cgn.py 0000664 0001750 0000144 00000007360 12201265173 017632 0 ustar proycon users 0000000 0000000 #-*- coding:utf-8 -*-
###############################################################
# PyNLPl - Corpus Gesproken Nederlands
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Licensed under GPLv3
#
# Classes for reading CGN (still to be added). Most notably, contains a function for decoding
# PoS features like "N(soort,ev,basis,onz,stan)" into a data structure.
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
from pynlpl.formats import folia
from pynlpl.common import Enum
class InvalidTagException(Exception):
pass
class InvalidFeatureException(Exception):
pass
subsets = {
'ntype': ['soort','eigen'],
'getal': ['ev','mv','getal',],
'genus': ['zijd','onz','masc','fem','genus'],
'naamval': ['stan','gen','dat','nomin','obl','bijz'],
'spectype': ['afgebr','afk','deeleigen','symb','vreemd','enof','meta','achter','comment','onverst'],
'conjtype': ['neven','onder'],
'vztype': ['init','versm','fin'],
'npagr': ['agr','evon','rest','evz','mv','agr3','evmo','rest3','evf'],
'lwtype': ['bep','onbep'],
'vwtype': ['pers','pr','refl','recip','bez','vb','vrag','betr','excl','aanw','onbep'],
'pdtype': ['adv-pron','pron','det','grad'],
'status': ['vol','red','nadr'],
'persoon': ['1','2','2v','2b','3','3p','3m','3v','3o','persoon'],
'positie': ['prenom','postnom', 'nom','vrij'],
'buiging': ['zonder','met-e','met-s'],
'getal-n' : ['zonder-v','mv-n','zonder-n'],
'graad' : ['basis','comp','sup','dim'],
'wvorm': ['pv','inf','vd','od'],
'pvtijd': ['tgw','verl','conj'],
'pvagr': ['ev','mv','met-t'],
'numtype': ['hoofd','rang'],
'dial': ['dial'],
}
constraints = {
'getal':['N','VNW'],
'npagr':['VNW','LID'],
'pvagr':['WW'],
}
def parse_cgn_postag(rawtag, raisefeatureexceptions = False):
    """decodes PoS features like "N(soort,ev,basis,onz,stan)" into a PosAnnotation data structure,
    based on the CGN tag overview compiled by Matje van de Camp"""
    global subsets, constraints #the docstring must precede this statement for Python to recognise it as such
begin = rawtag.find('(')
if rawtag[-1] == ')' and begin > 0:
tag = folia.PosAnnotation(None, cls=rawtag,set='http://ilk.uvt.nl/folia/sets/cgn')
head = rawtag[0:begin]
tag.append( folia.Feature, subset='head',cls=head)
rawfeatures = rawtag[begin+1:-1].split(',')
for rawfeature in rawfeatures:
if rawfeature:
found = False
for subset, classes in subsets.items():
if rawfeature in classes:
if subset in constraints:
if not head in constraints[subset]:
continue #constraint not met!
found = True
tag.append( folia.Feature, subset=subset,cls=rawfeature)
break
if not found:
print("\t\tUnknown feature value: " + rawfeature + " in " + rawtag, file=stderr)
if raisefeatureexceptions:
raise InvalidFeatureException("Unknown feature value: " + rawfeature + " in " + rawtag)
else:
continue
return tag
else:
raise InvalidTagException("Not a valid CGN tag")
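# A minimal usage sketch (output shape inferred from the code above):
#   tag = parse_cgn_postag("N(soort,ev,basis,onz,stan)")
# yields a folia.PosAnnotation with head feature 'N' and the subset features
# ntype=soort, getal=ev, graad=basis, genus=onz and naamval=stan.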
PyNLPl-1.2.9/pynlpl/formats/cql.py 0000644 0001750 0000144 00000023615 12526407557 017657 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - Corpus Query Language (CQL)
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://proycon.github.com/folia
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Parser and interpreter for a basic subset of the Corpus Query Language
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
from __future__ import print_function, unicode_literals, division, absolute_import
from pynlpl.fsa import State, NFA
import re
import sys
OPERATORS = ('=','!=')
MAXINTERVAL = 99
class SyntaxError(Exception):
pass
class ValueExpression(object):
def __init__(self, values):
self.values = values #disjunction
@staticmethod
def parse(s,i):
values = ""
assert s[i] == '"'
i += 1
while not (s[i] == '"' and s[i-1] != "\\"):
values += s[i]
i += 1
values = values.split("|")
return ValueExpression(values), i+1
def __len__(self):
return len(self.values)
def __iter__(self):
for x in self.values:
yield x
def __getitem__(self,index):
return self.values[index]
class AttributeExpression(object):
def __init__(self, attribute, operator, valueexpression):
self.attribute = attribute
self.operator = operator
self.valueexpr = valueexpression
@staticmethod
def parse(s,i):
while s[i] == " ":
i +=1
if s[i] == '"':
#no attribute and no operator, use defaults:
attribute = "word"
operator = "="
else:
attribute = ""
while s[i] not in (' ','!','>','<','='):
attribute += s[i]
i += 1
if not attribute:
raise SyntaxError("Expected attribute name, none found")
operator = ""
while s[i] in (' ','!','>','<','='):
if s[i] != ' ':
operator += s[i]
i += 1
if operator not in OPERATORS:
raise SyntaxError("Expected operator, got '" + operator + "'")
if s[i] != '"':
raise SyntaxError("Expected start of value expression (doublequote) in position " + str(i) + ", got " + s[i])
valueexpr, i = ValueExpression.parse(s,i)
return AttributeExpression(attribute,operator, valueexpr), i
class TokenExpression(object):
def __init__(self, attribexprs=[], interval=None):
self.attribexprs = attribexprs
self.interval = interval
@staticmethod
def parse(s,i):
attribexprs = []
while s[i] == " ":
i +=1
if s[i] == '"':
attribexpr,i = AttributeExpression.parse(s,i)
attribexprs.append(attribexpr)
elif s[i] == "[":
i += 1
while True:
while s[i] == " ":
i +=1
if s[i] == "&":
attribexpr,i = AttributeExpression.parse(s,i+1)
attribexprs.append(attribexpr)
elif s[i] == "]":
i += 1
break
elif not attribexprs:
attribexpr,i = AttributeExpression.parse(s,i)
attribexprs.append(attribexpr)
else:
raise SyntaxError("Unexpected char whilst parsing token expression, position " + str(i) + ": " + s[i])
else:
raise SyntaxError("Expected token expression starting with either \" or [, got: " + s[i])
if i == len(s):
interval = None #end of query!
elif s[i] == "{":
#interval expression, find end:
interval = None
            for j in range(i+1, len(s)):
                if s[j] == "}":
                    interval = s[i+1:j]
                    break #stop at the first closing brace
if interval is None:
raise SyntaxError("Interval expression started but no end-brace found")
i += len(interval) + 2
try:
if ',' in interval:
interval = tuple(int(x) for x in interval.split(","))
if len(interval) != 2:
raise SyntaxError("Invalid interval: " + interval)
elif '-' in interval: #alternative
interval = tuple(int(x) for x in interval.split("-"))
if len(interval) != 2:
raise SyntaxError("Invalid interval: " + interval)
else:
interval = (int(interval),int(interval))
except ValueError:
raise SyntaxError("Invalid interval: " + interval)
elif s[i] == "?":
interval = (0,1)
i += 1
elif s[i] == "+":
interval = (1,MAXINTERVAL)
i += 1
elif s[i] == "*":
interval = (0,MAXINTERVAL)
i += 1
else:
interval = None
return TokenExpression(attribexprs,interval),i
def __len__(self):
return len(self.attribexprs)
def __iter__(self):
for x in self.attribexprs:
yield x
def __getitem__(self,index):
return self.attribexprs[index]
def nfa(self, nextstate):
"""Returns an initial state for an NFA"""
if self.interval:
mininterval, maxinterval = self.interval #pylint: disable=unpacking-non-sequence
nextstate2 = nextstate
for i in range(maxinterval):
state = State(transitions=[(self,self.match, nextstate2)])
if i+1> mininterval:
if nextstate is not nextstate2: state.transitions.append((self,self.match, nextstate))
if maxinterval == MAXINTERVAL:
state.epsilon.append(state)
break
nextstate2 = state
return state
else:
state = State(transitions=[(self,self.match, nextstate)])
return state
def match(self, value):
match = False
        for attribexpr in self:
annottype = attribexpr.attribute
if annottype == 'text': annottype = 'word'
if attribexpr.operator == "!=":
negate = True
elif attribexpr.operator == "=":
negate = False
else:
raise Exception("Unexpected operator " + attribexpr.operator)
if len(attribexpr.valueexpr) > 1:
expr = re.compile("^(" + "|".join(attribexpr.valueexpr) + ")$")
else:
expr = re.compile("^" + attribexpr.valueexpr[0] + '$')
match = (expr.match(value[annottype]) is not None)
if negate:
match = not match
if not match:
return False
return True
class Query(object):
def __init__(self, s):
self.tokenexprs = []
i = 0
l = len(s)
while i < l:
if s[i] == " ":
i += 1
else:
tokenexpr,i = TokenExpression.parse(s,i)
self.tokenexprs.append(tokenexpr)
def __len__(self):
return len(self.tokenexprs)
def __iter__(self):
for x in self.tokenexprs:
yield x
def __getitem__(self,index):
return self.tokenexprs[index]
def nfa(self):
"""convert the expression into an NFA"""
finalstate = State(final=True)
nextstate = finalstate
for tokenexpr in reversed(self):
state = tokenexpr.nfa(nextstate)
nextstate = state
return NFA(state)
def __call__(self, tokens, debug=False):
"""Execute the CQL expression, pass a list of tokens/annotations using keyword arguments: word, pos, lemma, etc"""
if not tokens:
raise Exception("Pass a list of tokens/annotation using keyword arguments! (word,pos,lemma, or others)")
#convert the expression into an NFA
nfa = self.nfa()
if debug:
print(repr(nfa), file=sys.stderr)
return list(nfa.find(tokens,debug))
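# A minimal usage sketch (hypothetical token data): each token is a dict keyed
# by annotation layer; the query matches a determiner, an adjective and a noun
# in sequence:
#   q = Query('[ pos = "det" ] [ pos = "adj" ] [ pos = "noun" ]')
#   tokens = [{'word': 'the', 'pos': 'det'},
#             {'word': 'big', 'pos': 'adj'},
#             {'word': 'cat', 'pos': 'noun'}]
#   matches = q(tokens) #list of matching token sequences, found via the NFA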
def cql2fql(cq):
fq = "SELECT FOR SPAN "
if not isinstance(cq, Query):
cq = Query(cq)
for i, token in enumerate(cq):
if i > 0: fq += " & "
fq += "w"
if token.interval:
fq += " {" + str(token.interval[0]) + "," + str(token.interval[1])+ "} "
else:
fq += " "
if token.attribexprs:
fq += "WHERE "
for j, attribexpr in enumerate(token):
if j > 0:
fq += " AND "
fq += "("
if attribexpr.operator == "!=":
operator = "NOTMATCHES"
elif attribexpr.operator == "=":
operator = "MATCHES"
else:
raise Exception("Invalid operator: " + attribexpr.operator)
if attribexpr.attribute in ("word","text"):
if len(attribexpr.valueexpr) > 1:
fq += "text " + operator + " \"^(" + "|".join(attribexpr.valueexpr) + ")$\" "
else:
fq += "text " + operator + " \"^" + attribexpr.valueexpr[0] + "$\" "
else:
annottype = attribexpr.attribute
if annottype == "tag":
annottype = "pos"
elif annottype == "lempos":
raise Exception("lempos not supported in CQL to FQL conversion, use pos and lemma separately")
fq += annottype + " HAS class "
if len(attribexpr.valueexpr) > 1:
fq += operator + " \"^(" + "|".join(attribexpr.valueexpr) + ")$\" "
else:
fq += operator + " \"^" + attribexpr.valueexpr[0] + "$\" "
fq += ")"
return fq
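# A minimal sketch of the conversion, traced through the code above:
#   cql2fql('"cat"')
# returns: SELECT FOR SPAN w WHERE (text MATCHES "^cat$" )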
PyNLPl-1.2.9/pynlpl/formats/dutchsemcor.py 0000664 0001750 0000144 00000020201 12201265173 021370 0 ustar proycon users 0000000 0000000 #-*- coding:utf-8 -*-
###############################################################
# PyNLPl - DutchSemCor
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Licensed under GPLv3
#
# Modified by Ruben Izquierdo
# We need also to store the TIMBL distance to the nearest neighboor
#
# Collection of formats for the DutchSemCor project
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
from pynlpl.formats.timbl import TimblOutput
from pynlpl.statistics import Distribution
import io
class WSDSystemOutput(object):
def __init__(self, filename = None):
self.data = {}
self.distances={}
self.maxDistance=1
if filename:
self.load(filename)
def append(self, word_id, senses,distance=0):
        # Commented by Ruben: there are some IDs that are repeated in all sonar test files...
#assert (not word_id in self.data)
if isinstance(senses, Distribution):
self.data[word_id] = ( (x,y) for x,y in senses ) #PATCH UNDONE (#TODO: this is a patch, something's not right in Distribution?)
self.distances[word_id]=distance
if distance > self.maxDistance:
self.maxDistance=distance
return
else:
assert isinstance(senses, list) and len(senses) >= 1
self.distances[word_id]=distance
if distance > self.maxDistance:
self.maxDistance=distance
if len(senses[0]) == 1:
#not a (sense_id, confidence) tuple! compute equal confidence for all elements automatically:
confidence = 1 / float(len(senses))
self.data[word_id] = [ (x,confidence) for x in senses ]
else:
fulldistr = True
for sense, confidence in senses:
                if confidence is None:
fulldistr = False
break
if fulldistr:
self.data[word_id] = Distribution(senses)
else:
self.data[word_id] = senses
def getMaxDistance(self):
return self.maxDistance
def __iter__(self):
for word_id, senses in self.data.items():
yield word_id, senses,self.distances[word_id]
def __len__(self):
return len(self.data)
def __getitem__(self, word_id):
"""Returns the sense distribution for the given word_id"""
return self.data[word_id]
def load(self, filename):
f = io.open(filename,'r',encoding='utf-8')
for line in f:
fields = line.strip().split(" ")
word_id = fields[0]
if len(fields[1:]) == 1:
#only one sense, no confidence expressed:
self.append(word_id, [(fields[1],None)])
else:
senses = []
distance=-1
for i in range(1,len(fields),2):
if i+1==len(fields):
#The last field is the distance
if fields[i][:4]=='+vdi': #Support for previous format of wsdout
distance=float(fields[i][4:])
else:
distance=float(fields[i])
else:
if fields[i+1] == '?': fields[i+1] = None
senses.append( (fields[i], fields[i+1]) )
self.append(word_id, senses,distance)
f.close()
def save(self, filename):
f = io.open(filename,'w',encoding='utf-8')
for word_id, senses,distance in self:
f.write(word_id)
for sense, confidence in senses:
                if confidence is None: confidence = "?"
f.write(" " + str(sense) + " " + str(confidence))
if word_id in self.distances.keys():
f.write(' '+str(self.distances[word_id]))
f.write("\n")
f.close()
def out(self, filename):
for word_id, senses,distance in self:
print(word_id,distance,end="")
for sense, confidence in senses:
                if confidence is None: confidence = "?"
print(" " + sense + " " + str(confidence),end="")
print()
def senses(self, bestonly=False):
"""Returns a list of all predicted senses"""
l = []
for word_id, senses,distance in self:
for sense, confidence in senses:
if not sense in l: l.append(sense)
if bestonly:
break
return l
def loadfromtimbl(self, filename):
timbloutput = TimblOutput(io.open(filename,'r',encoding='utf-8'))
for i, (features, referenceclass, predictedclass, distribution, distance) in enumerate(timbloutput):
            if distance is not None:
#distance='+vdi'+str(distance)
distance=float(distance)
if len(features) == 0:
print("WARNING: Empty feature vector in " + filename + " (line " + str(i+1) + ") skipping!!",file=stderr)
continue
word_id = features[0] #note: this is an assumption that must be adhered to!
if distribution:
self.append(word_id, distribution,distance)
def fromTimblToWsdout(self,fileTimbl,fileWsdout):
timbloutput = TimblOutput(io.open(fileTimbl,'r',encoding='utf-8'))
wsdoutfile = io.open(fileWsdout,'w',encoding='utf-8')
for i, (features, referenceclass, predictedclass, distribution, distance) in enumerate(timbloutput):
if len(features) == 0:
print("WARNING: Empty feature vector in " + fileTimbl + " (line " + str(i+1) + ") skipping!!",file=stderr)
continue
word_id = features[0] #note: this is an assumption that must be adhered to!
if distribution:
wsdoutfile.write(word_id+' ')
for sense, confidence in distribution:
                    if confidence is None: confidence = '?'
wsdoutfile.write(sense+' '+str(confidence)+' ')
wsdoutfile.write(str(distance)+'\n')
wsdoutfile.close()
class DataSet(object): #for testsets/trainingsets
def __init__(self, filename):
self.sense = {} #word_id => (sense_id, lemma,pos)
self.targetwords = {} #(lemma,pos) => [sense_id]
f = io.open(filename,'r',encoding='utf-8')
for line in f:
if len(line) > 0 and line[0] != '#':
fields = line.strip('\n').split('\t')
word_id = fields[0]
sense_id = fields[1]
lemma = fields[2]
pos = fields[3]
self.sense[word_id] = (sense_id, lemma, pos)
if not (lemma,pos) in self.targetwords:
self.targetwords[(lemma,pos)] = []
if not sense_id in self.targetwords[(lemma,pos)]:
self.targetwords[(lemma,pos)].append(sense_id)
f.close()
def __getitem__(self, word_id):
return self.sense[self._sanitize(word_id)]
def getsense(self, word_id):
return self.sense[self._sanitize(word_id)][0]
def getlemma(self, word_id):
return self.sense[self._sanitize(word_id)][1]
def getpos(self, word_id):
return self.sense[self._sanitize(word_id)][2]
def _sanitize(self, word_id):
return u(word_id)
def __contains__(self, word_id):
return (self._sanitize(word_id) in self.sense)
def __iter__(self):
for word_id, (sense, lemma, pos) in self.sense.items():
yield (word_id, sense, lemma, pos)
def senses(self, lemma, pos):
return self.targetwords[(lemma,pos)]
PyNLPl-1.2.9/pynlpl/formats/folia.py 0000644 0001750 0000144 00001376513 13442227145 020172 0 ustar proycon users 0000000 0000000 # -*- coding: utf-8 -*-
#----------------------------------------------------------------
# PyNLPl - FoLiA Format Module
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
#
# https://proycon.github.io/folia
# https://github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Module for reading, editing and writing FoLiA XML
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
#pylint: disable=redefined-builtin,trailing-whitespace,superfluous-parens,bad-classmethod-argument,wrong-import-order,wrong-import-position,ungrouped-imports
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import sys
from copy import copy, deepcopy
from datetime import datetime
from collections import OrderedDict
import inspect
import itertools
import glob
import os
import re
try:
import io
except ImportError:
#old-Python 2.6 fallback
import codecs as io
import multiprocessing
import bz2
import gzip
import random
from lxml import etree as ElementTree
from lxml.builder import ElementMaker
if sys.version < '3':
from StringIO import StringIO #pylint: disable=import-error,wrong-import-order
from urllib import urlopen #pylint: disable=no-name-in-module,wrong-import-order
else:
from io import StringIO, BytesIO #pylint: disable=wrong-import-order,ungrouped-imports
from urllib.request import urlopen #pylint: disable=E0611,wrong-import-order,ungrouped-imports
if sys.version < '3':
from codecs import getwriter #pylint: disable=wrong-import-order,ungrouped-imports
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
from pynlpl.common import u, isstring
from pynlpl.formats.foliaset import SetDefinition, DeepValidationError
import pynlpl.algorithms
LXE=True #use lxml instead of built-in ElementTree (default)
#foliaspec:version:FOLIAVERSION
#The FoLiA version
FOLIAVERSION = "1.5.1"
LIBVERSION = FOLIAVERSION + '.88' #== FoLiA version + library revision
#0.9.1.31 is the first version with Python 3 support
#foliaspec:namespace:NSFOLIA
#The FoLiA XML namespace
NSFOLIA = "http://ilk.uvt.nl/folia"
print("WARNING: The FoLiA library pynlpl.formats.folia is being used but this version is now deprecated and is replaced by FoLiAPy (pip install folia), see https://github.com/proycon/foliapy. Please update your software if you are a developer, if you are an end-user you can safely ignore this message.",file=sys.stderr)
NSDCOI = "http://lands.let.ru.nl/projects/d-coi/ns/1.0"
nslen = len(NSFOLIA) + 2
nslendcoi = len(NSDCOI) + 2
TMPDIR = "/tmp/" #will be used for downloading temporary data (external subdocuments)
DOCSTRING_GENERIC_ATTRIBS = """ id (str): An ID for the element. IDs must be unique for the entire document. They may not contain colons or spaces, and must start with a letter. (they must adhere to XML's NCName type). This is a generic FoLiA attribute.
set (str): The FoLiA set for this element. This is a generic FoLiA attribute.
cls (str): The class for this element. This is a generic FoLiA attribute.
annotator (str): A name or ID for the annotator. This is a generic FoLiA attribute.
annotatortype: Should be either ``AnnotatorType.MANUAL`` or ``AnnotatorType.AUTO``, indicating whether the annotation was performed manually or by an automated process. This is a generic FoLiA attribute.
    confidence (float): A value between 0 and 1 indicating the degree of confidence the annotator has that the annotation is correct. This is a generic FoLiA attribute.
    n (int): An index number to indicate the element is part of a sequence (does not affect the placement of the element).
src (str): Speech annotation attribute, refers to a media file (audio/video) that this element describes. This is a generic FoLiA attribute.
speaker (str): Speech annotation attribute: a name or ID of the speaker. This is a generic FoLiA attribute.
begintime (str): Speech annotation attribute: the time (in ``hh:mm:ss.mmm`` format, relative to the media file in ``src``) when the audio that this element describes starts. This is a generic FoLiA attribute.
    endtime (str): Speech annotation attribute: the time (in ``hh:mm:ss.mmm`` format, relative to the media file in ``src``) when the audio that this element describes ends. This is a generic FoLiA attribute.
    textclass (str): Refers to the textclass from which this annotation is derived (defaults to "current"). This is a generic FoLiA attribute.
contents (list): Alternative for ``*args``, exists for purely syntactic reasons.
"""
ILLEGAL_UNICODE_CONTROL_CHARACTERS = {} #XML does not like unicode control characters
for ordinal in range(0x20):
if chr(ordinal) not in '\t\r\n':
ILLEGAL_UNICODE_CONTROL_CHARACTERS[ordinal] = None
class Mode:
MEMORY = 0 #The entire FoLiA structure will be loaded into memory. This is the default and is required for any kind of document manipulation.
XPATH = 1 #The full XML structure will be loaded into memory, but conversion to FoLiA objects occurs only upon querying. The full power of XPath is available.
class AnnotatorType:
UNSET = None
AUTO = "auto"
MANUAL = "manual"
#foliaspec:attributes
#Defines all common FoLiA attributes (as part of the Attrib enumeration)
class Attrib:
ID, CLASS, ANNOTATOR, CONFIDENCE, N, DATETIME, BEGINTIME, ENDTIME, SRC, SPEAKER, TEXTCLASS, METADATA = range(12)
#foliaspec:annotationtype
#Defines all annotation types (as part of the AnnotationType enumeration)
class AnnotationType:
TEXT, TOKEN, DIVISION, PARAGRAPH, LIST, FIGURE, WHITESPACE, LINEBREAK, SENTENCE, POS, LEMMA, DOMAIN, SENSE, SYNTAX, CHUNKING, ENTITY, CORRECTION, ERRORDETECTION, PHON, SUBJECTIVITY, MORPHOLOGICAL, EVENT, DEPENDENCY, TIMESEGMENT, GAP, NOTE, ALIGNMENT, COMPLEXALIGNMENT, COREFERENCE, SEMROLE, METRIC, LANG, STRING, TABLE, STYLE, PART, UTTERANCE, ENTRY, TERM, DEFINITION, EXAMPLE, PHONOLOGICAL, PREDICATE, OBSERVATION, SENTIMENT, STATEMENT = range(46)
#Alternative is a special one, not declared and not used except for ID generation
class TextCorrectionLevel: #THIS IS NOW COMPLETELY OBSOLETE AND ONLY HERE FOR BACKWARD COMPATIBILITY!
CORRECTED, UNCORRECTED, ORIGINAL, INLINE = range(4)
class MetaDataType: #THIS IS NOW COMPLETELY OBSOLETE AND ONLY HERE FOR BACKWARD COMPATIBILITY! Metadata type is a free-fill field with only native predefined
NATIVE = "native"
CMDI = "cmdi"
IMDI = "imdi"
class NoSuchAnnotation(Exception):
"""Exception raised when the requested type of annotation does not exist for the selected element"""
pass
class NoSuchText(Exception):
"""Exception raised when the requested type of text content does not exist for the selected element"""
pass
class NoSuchPhon(Exception):
"""Exception raised when the requested type of phonetic content does not exist for the selected element"""
pass
class InconsistentText(Exception):
"""Exception raised when the the text of a structural element is inconsistent with text on deeper levels"""
pass
class DuplicateAnnotationError(Exception):
pass
class DuplicateIDError(Exception):
"""Exception raised when an identifier that is already in use is assigned again to another element"""
pass
class NoDefaultError(Exception):
pass
class UnresolvableTextContent(Exception):
pass
class MalformedXMLError(Exception):
pass
class ParseError(Exception):
def __init__(self, msg, cause=None):
self.cause = cause
Exception.__init__(self, msg)
class ModeError(Exception):
pass
class MetaDataError(Exception):
pass
class DocumentNotLoaded(Exception): #for alignments to external documents
pass
class GenerateIDException(Exception):
pass
class CorrectionHandling:
EITHER,CURRENT, ORIGINAL = range(3)
def checkversion(version, REFVERSION=FOLIAVERSION):
"""Checks FoLiA version, returns 1 if the document is newer than the library, -1 if it is older, 0 if it is equal"""
try:
for refversion, docversion in zip([int(x) for x in REFVERSION.split('.')], [int(x) for x in version.split('.')]):
if docversion > refversion:
return 1 #doc is newer than library
elif docversion < refversion:
return -1 #doc is older than library
return 0 #versions are equal
except ValueError:
raise ValueError("Unable to parse document FoLiA version, invalid syntax")
def parsetime(s):
    """Internal function to parse a time in ``HH:MM:SS.mmm`` format.

    Returns:
        a four-tuple ``(hours,minutes,seconds,milliseconds)``
    """
    try:
        fields = s.split('.')
        subfields = fields[0].split(':')
        H = int(subfields[0])
        M = int(subfields[1])
        S = int(subfields[2])
        if len(fields) > 1:
            m = int(fields[1]) #milliseconds follow the dot
        else:
            m = 0
        return (H,M,S,m)
    except (ValueError, IndexError):
        raise ValueError("Invalid timestamp, must be in HH:MM:SS.mmm format: " + s)
def parsecommonarguments(object, doc, annotationtype, required, allowed, **kwargs):
"""Internal function to parse common FoLiA attributes and sets up the instance accordingly. Do not invoke directly."""
object.doc = doc #The FoLiA root document
if required is None:
required = tuple()
if allowed is None:
allowed = tuple()
supported = required + allowed
if 'generate_id_in' in kwargs:
try:
kwargs['id'] = kwargs['generate_id_in'].generate_id(object.__class__)
except GenerateIDException:
pass #ID could not be generated, just skip
del kwargs['generate_id_in']
if 'id' in kwargs:
if Attrib.ID not in supported:
raise ValueError("ID is not supported on " + object.__class__.__name__)
isncname(kwargs['id'])
object.id = kwargs['id']
del kwargs['id']
elif Attrib.ID in required:
raise ValueError("ID is required for " + object.__class__.__name__)
else:
object.id = None
if 'set' in kwargs:
if Attrib.CLASS not in supported and not object.SETONLY:
raise ValueError("Set is not supported on " + object.__class__.__name__)
if not kwargs['set']:
object.set ="undefined"
else:
object.set = kwargs['set']
del kwargs['set']
if object.set:
if doc and (not (annotationtype in doc.annotationdefaults) or not (object.set in doc.annotationdefaults[annotationtype])):
if object.set in doc.alias_set:
object.set = doc.alias_set[object.set]
elif doc.autodeclare:
doc.annotations.append( (annotationtype, object.set ) )
doc.annotationdefaults[annotationtype] = {object.set: {} }
else:
raise ValueError("Set '" + object.set + "' is used for " + object.__class__.__name__ + ", but has no declaration!")
elif annotationtype in doc.annotationdefaults and len(doc.annotationdefaults[annotationtype]) == 1:
object.set = list(doc.annotationdefaults[annotationtype].keys())[0]
elif object.ANNOTATIONTYPE == AnnotationType.TEXT:
object.set = "undefined" #text content needs never be declared (for backward compatibility) and is in set 'undefined'
elif Attrib.CLASS in required: #or (hasattr(object,'SETONLY') and object.SETONLY):
raise ValueError("Set is required for " + object.__class__.__name__)
if 'class' in kwargs:
if not Attrib.CLASS in supported:
raise ValueError("Class is not supported for " + object.__class__.__name__)
object.cls = kwargs['class']
del kwargs['class']
elif 'cls' in kwargs:
if not Attrib.CLASS in supported:
raise ValueError("Class is not supported on " + object.__class__.__name__)
object.cls = kwargs['cls']
del kwargs['cls']
elif Attrib.CLASS in required:
raise ValueError("Class is required for " + object.__class__.__name__)
if object.cls and not object.set:
if doc and doc.autodeclare:
if not (annotationtype, 'undefined') in doc.annotations:
doc.annotations.append( (annotationtype, 'undefined') )
doc.annotationdefaults[annotationtype] = {'undefined': {} }
object.set = 'undefined'
else:
raise ValueError("Set is required for " + object.__class__.__name__ + ". Class '" + object.cls + "' assigned without set.")
if 'annotator' in kwargs:
if not Attrib.ANNOTATOR in supported:
raise ValueError("Annotator is not supported for " + object.__class__.__name__)
object.annotator = kwargs['annotator']
del kwargs['annotator']
elif doc and annotationtype in doc.annotationdefaults and object.set in doc.annotationdefaults[annotationtype] and 'annotator' in doc.annotationdefaults[annotationtype][object.set]:
object.annotator = doc.annotationdefaults[annotationtype][object.set]['annotator']
elif Attrib.ANNOTATOR in required:
raise ValueError("Annotator is required for " + object.__class__.__name__)
if 'annotatortype' in kwargs:
if not Attrib.ANNOTATOR in supported:
raise ValueError("Annotatortype is not supported for " + object.__class__.__name__)
if kwargs['annotatortype'] == 'auto' or kwargs['annotatortype'] == AnnotatorType.AUTO:
object.annotatortype = AnnotatorType.AUTO
elif kwargs['annotatortype'] == 'manual' or kwargs['annotatortype'] == AnnotatorType.MANUAL:
object.annotatortype = AnnotatorType.MANUAL
else:
raise ValueError("annotatortype must be 'auto' or 'manual', got " + repr(kwargs['annotatortype']))
del kwargs['annotatortype']
elif doc and annotationtype in doc.annotationdefaults and object.set in doc.annotationdefaults[annotationtype] and 'annotatortype' in doc.annotationdefaults[annotationtype][object.set]:
object.annotatortype = doc.annotationdefaults[annotationtype][object.set]['annotatortype']
elif Attrib.ANNOTATOR in required:
raise ValueError("Annotatortype is required for " + object.__class__.__name__)
if 'confidence' in kwargs:
if not Attrib.CONFIDENCE in supported:
raise ValueError("Confidence is not supported")
if kwargs['confidence'] is not None:
try:
object.confidence = float(kwargs['confidence'])
assert object.confidence >= 0.0 and object.confidence <= 1.0
except:
raise ValueError("Confidence must be a floating point number between 0 and 1, got " + repr(kwargs['confidence']) )
del kwargs['confidence']
elif Attrib.CONFIDENCE in required:
raise ValueError("Confidence is required for " + object.__class__.__name__)
if 'n' in kwargs:
if not Attrib.N in supported:
raise ValueError("N is not supported for " + object.__class__.__name__)
object.n = kwargs['n']
del kwargs['n']
elif Attrib.N in required:
raise ValueError("N is required for " + object.__class__.__name__)
if 'datetime' in kwargs:
if not Attrib.DATETIME in supported:
raise ValueError("Datetime is not supported")
if isinstance(kwargs['datetime'], datetime):
object.datetime = kwargs['datetime']
else:
#try:
object.datetime = parse_datetime(kwargs['datetime'])
#except:
# raise ValueError("Unable to parse datetime: " + str(repr(kwargs['datetime'])))
del kwargs['datetime']
elif doc and annotationtype in doc.annotationdefaults and object.set in doc.annotationdefaults[annotationtype] and 'datetime' in doc.annotationdefaults[annotationtype][object.set]:
object.datetime = doc.annotationdefaults[annotationtype][object.set]['datetime']
elif Attrib.DATETIME in required:
raise ValueError("Datetime is required for " + object.__class__.__name__)
if 'src' in kwargs:
if not Attrib.SRC in supported:
raise ValueError("Source is not supported for " + object.__class__.__name__)
object.src = kwargs['src']
del kwargs['src']
elif Attrib.SRC in required:
raise ValueError("Source is required for " + object.__class__.__name__)
if 'begintime' in kwargs:
if not Attrib.BEGINTIME in supported:
raise ValueError("Begintime is not supported for " + object.__class__.__name__)
object.begintime = parsetime(kwargs['begintime'])
del kwargs['begintime']
elif Attrib.BEGINTIME in required:
raise ValueError("Begintime is required for " + object.__class__.__name__)
if 'endtime' in kwargs:
if not Attrib.ENDTIME in supported:
raise ValueError("Endtime is not supported for " + object.__class__.__name__)
object.endtime = parsetime(kwargs['endtime'])
del kwargs['endtime']
elif Attrib.ENDTIME in required:
raise ValueError("Endtime is required for " + object.__class__.__name__)
if 'speaker' in kwargs:
if not Attrib.SPEAKER in supported:
raise ValueError("Speaker is not supported for " + object.__class__.__name__)
object.speaker = kwargs['speaker']
del kwargs['speaker']
elif Attrib.SPEAKER in required:
raise ValueError("Speaker is required for " + object.__class__.__name__)
if 'auth' in kwargs:
if kwargs['auth'] in ('no','false'):
object.auth = False
else:
object.auth = bool(kwargs['auth'])
del kwargs['auth']
else:
object.auth = object.__class__.AUTH
if 'text' in kwargs:
if kwargs['text']:
object.settext(kwargs['text'])
del kwargs['text']
if 'phon' in kwargs:
if kwargs['phon']:
object.setphon(kwargs['phon'])
del kwargs['phon']
if 'textclass' in kwargs:
if not Attrib.TEXTCLASS in supported:
raise ValueError("Textclass is not supported for " + object.__class__.__name__)
object.textclass = kwargs['textclass']
del kwargs['textclass']
else:
if Attrib.TEXTCLASS in supported:
object.textclass = "current"
if 'metadata' in kwargs:
if not Attrib.METADATA in supported:
raise ValueError("Metadata is not supported for " + object.__class__.__name__)
object.metadata = kwargs['metadata']
if doc:
try:
doc.submetadata[kwargs['metadata']]
except KeyError:
raise KeyError("No such metadata defined: " + kwargs['metadata'])
del kwargs['metadata']
if object.XLINK:
if 'href' in kwargs:
object.href =kwargs['href']
del kwargs['href']
if 'xlinktype' in kwargs:
object.xlinktype = kwargs['xlinktype']
del kwargs['xlinktype']
if 'xlinkrole' in kwargs:
object.xlinkrole = kwargs['xlinkrole']
del kwargs['xlinkrole']
if 'xlinklabel' in kwargs:
object.xlinklabel = kwargs['xlinklabel']
del kwargs['xlinklabel']
if 'xlinkshow' in kwargs:
object.xlinkshow = kwargs['xlinkshow']
            del kwargs['xlinkshow']
if 'xlinktitle' in kwargs:
object.xlinktitle = kwargs['xlinktitle']
del kwargs['xlinktitle']
if doc and doc.debug >= 2:
print(" @id = ", repr(object.id),file=stderr)
print(" @set = ", repr(object.set),file=stderr)
print(" @class = ", repr(object.cls),file=stderr)
print(" @annotator = ", repr(object.annotator),file=stderr)
print(" @annotatortype= ", repr(object.annotatortype),file=stderr)
print(" @confidence = ", repr(object.confidence),file=stderr)
print(" @n = ", repr(object.n),file=stderr)
print(" @datetime = ", repr(object.datetime),file=stderr)
#set index
if object.id and doc:
if object.id in doc.index:
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Duplicate ID not permitted:" + object.id,file=stderr)
raise DuplicateIDError("Duplicate ID not permitted: " + object.id)
else:
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Adding to index: " + object.id,file=stderr)
doc.index[object.id] = object
#Parse feature attributes (shortcut for feature specification for some elements)
for c in object.ACCEPTED_DATA:
if issubclass(c, Feature):
if c.SUBSET in kwargs:
if kwargs[c.SUBSET]:
object.append(c,cls=kwargs[c.SUBSET])
del kwargs[c.SUBSET]
return kwargs
def norm_spaces(s):
    r"""Normalize spaces: split on whitespace (\n, \r, \t, space) and rejoin (faster than an s/\s+/ / regexp)"""
    return ' '.join(s.split())
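#Example (illustrative): norm_spaces("to  the\n\tmoon") returns "to the moon"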
def parse_datetime(s): #source: http://stackoverflow.com/questions/2211362/how-to-parse-xsddatetime-format
    """Parses an xsd:dateTime string. Returns a datetime instance, or None if the string could not be parsed (timezone information is discarded)."""
    m = re.match(r""" ^
    (?P<year>-?[0-9]{4}) - (?P<month>[0-9]{2}) - (?P<day>[0-9]{2})
    T (?P<hour>[0-9]{2}) : (?P<minute>[0-9]{2}) : (?P<second>[0-9]{2})
    (?P<microsecond>\.[0-9]{1,6})?
    (?P<tz>
      Z | (?P<tz_hr>[-+][0-9]{2}) : (?P<tz_min>[0-9]{2})
    )?
    $ """, s, re.X)
if m is not None:
values = m.groupdict()
#if values["tz"] in ("Z", None):
# tz = 0
#else:
# tz = int(values["tz_hr"]) * 60 + int(values["tz_min"])
if values["microsecond"] is None:
values["microsecond"] = 0
else:
values["microsecond"] = values["microsecond"][1:]
values["microsecond"] += "0" * (6 - len(values["microsecond"]))
values = dict((k, int(v)) for k, v in values.items() if not k.startswith("tz"))
try:
return datetime(**values) # , tz
except ValueError:
pass
return None
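#Example (illustrative): parse_datetime("2011-12-25T13:44:55") yields
#datetime(2011, 12, 25, 13, 44, 55); strings that do not match the
#xsd:dateTime pattern yield None.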
def xmltreefromstring(s):
"""Internal function, deals with different Python versions, unicode strings versus bytes, and with the leak bug in lxml"""
if sys.version < '3':
#Python 2
if isinstance(s,unicode): #pylint: disable=undefined-variable
s = s.encode('utf-8')
try:
return ElementTree.parse(StringIO(s), ElementTree.XMLParser(collect_ids=False))
except TypeError:
return ElementTree.parse(StringIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!!
else:
#Python 3
if isinstance(s,str):
s = s.encode('utf-8')
try:
return ElementTree.parse(BytesIO(s), ElementTree.XMLParser(collect_ids=False))
except TypeError:
return ElementTree.parse(BytesIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!!
def xmltreefromfile(filename):
"""Internal function to read an XML file"""
try:
return ElementTree.parse(filename, ElementTree.XMLParser(collect_ids=False))
except TypeError:
return ElementTree.parse(filename, ElementTree.XMLParser()) #older lxml, may leak!!
def makeelement(E, tagname, **kwargs):
"""Internal function"""
if sys.version < '3':
try:
kwargs2 = {}
for k,v in kwargs.items():
kwargs2[k.encode('utf-8')] = v.encode('utf-8')
#return E._makeelement(tagname.encode('utf-8'), **{ k.encode('utf-8'): v.encode('utf-8') for k,v in kwargs.items() } ) #In one go fails on some older Python 2.6s
return E._makeelement(tagname.encode('utf-8'), **kwargs2 ) #pylint: disable=protected-access
except ValueError as e:
try:
#older versions of lxml may misbehave, compensate:
e = E._makeelement(tagname.encode('utf-8')) #pylint: disable=protected-access
for k,v in kwargs.items():
e.attrib[k.encode('utf-8')] = v
return e
except ValueError:
print(e,file=stderr)
print("tagname=",tagname,file=stderr)
print("kwargs=",kwargs,file=stderr)
raise e
else:
return E._makeelement(tagname,**kwargs) #pylint: disable=protected-access
def commonancestors(Class, *args):
"""Generator function to find common ancestors of a particular type for any two or more FoLiA element instances.
The function produces all common ancestors of the type specified, starting from the closest one up to the most distant one.
Parameters:
Class: The type of ancestor to find, should be the :class:`AbstractElement` class or any subclass thereof (not an instance!)
*args: The elements to find the common ancestors of, elements are instances derived from :class:`AbstractElement`
Yields:
instance derived from :class:`AbstractElement`: A common ancestor of the arguments, an instance of the specified ``Class``.
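    Example (a sketch; ``word1`` and ``word2`` are assumed to be :class:`Word` instances in the same document)::
        for sentence in commonancestors(folia.Sentence, word1, word2):
            print(sentence.id)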
"""
commonancestors = None #pylint: disable=redefined-outer-name
for sibling in args:
ancestors = list( sibling.ancestors(Class) )
if commonancestors is None:
commonancestors = copy(ancestors)
else:
removeancestors = []
for a in commonancestors: #pylint: disable=not-an-iterable
if not a in ancestors:
removeancestors.append(a)
for a in removeancestors:
commonancestors.remove(a)
if commonancestors:
for commonancestor in commonancestors:
yield commonancestor
class AbstractElement(object):
"""Abstract base class from which all FoLiA elements are derived.
This class implements many generic methods that are available on all FoLiA elements.
To see if an element is a FoLiA element, as opposed to any other python object, do::
isinstance(x, AbstractElement)
Generic FoLiA attributes can be accessed on all instances derived from this class:
* ``element.id`` (str) - The unique identifier of the element
* ``element.set`` (str) - The set the element pertains to.
* ``element.cls`` (str) - The assigned class, i.e. the actual value of \
    the annotation, defined in the set. Classes correspond with tagsets in the case of many annotation types. \
Note that since *class* is already a reserved keyword in python, the library consistently uses ``cls`` everywhere.
* ``element.annotator`` (str) - The name or ID of the annotator who added/modified this element
* ``element.annotatortype`` - The type of annotator, can be either ``folia.AnnotatorType.MANUAL`` or ``folia.AnnotatorType.AUTO``
    * ``element.confidence`` (float) - A confidence value, a float between 0 and 1, expressing how confident the annotator is that this annotation is correct.
* ``element.datetime`` (datetime.datetime) - The date and time when the element was added/modified.
* ``element.n`` (str) - An ordinal label, used for instance in enumerated list contexts, numbered sections, etc..
The following generic attributes are specific to a speech context:
    * ``element.src`` (str) - A URL or filename referring to the audio or video file containing the speech. Access this attribute using the ``element.speech_src()`` method, as it is inheritable from ancestors.
    * ``element.speaker`` (str) - The name or ID of the speaker. Access this attribute using the ``element.speech_speaker()`` method, as it is inheritable from ancestors.
* ``element.begintime`` (4-tuple) - The time in the above source fragment when the phonetic content of this element starts, this is a ``(hours, minutes,seconds,milliseconds)`` tuple.
* ``element.endtime`` (4-tuple) - The time in the above source fragment when the phonetic content of this element ends, this is a ``(hours, minutes,seconds,milliseconds)`` tuple.
Not all attributes are allowed, unset or unavailable attributes will always default to ``None``.
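    Example (a sketch; ``word`` is assumed to be an instance of a concrete subclass such as :class:`Word`)::
        print(word.id, word.cls, word.annotator) #unset attributes are None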
Note:
This class should never be instantiated directly, as it is abstract!
See also:
:meth:`AbstractElement.__init__`
"""
def __init__(self, doc, *args, **kwargs):
"""Constructor for most FoLiA elements.
Parameters:
doc (:class:`Document`): The FoLiA document this element will pertain to. It will not be automatically added though.
*args: Child elements to add to this element, mostly instances derived from :class:`AbstractElement`
Keyword Arguments:
{generic_attribs}
generate_id_in (:class:`AbstractElement`): Instead of providing an explicit ID, the library can attempt to automatically generate an ID based on a convention where suffixes are applied to the ID of the parent element. This keyword argument takes the intended parent element (an instance derived from :class:`AbstractElement`) as value.
Not all of the generic FoLiA attributes are applicable to all elements. The class properties ``REQUIRED_ATTRIBS`` and ``OPTIONAL_ATTRIBS`` prescribe which are required or allowed.
""".format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS)
if not isinstance(doc, Document) and not doc is None:
raise Exception("Expected first parameter to be instance of Document, got " + str(type(doc)))
self.doc = doc
self.parent = None
self.data = []
kwargs = parsecommonarguments(self, doc, self.ANNOTATIONTYPE, self.REQUIRED_ATTRIBS, self.OPTIONAL_ATTRIBS,**kwargs)
for child in args:
self.append(child)
if 'contents' in kwargs:
if isinstance(kwargs['contents'], list):
for child in kwargs['contents']:
self.append(child)
else:
self.append(kwargs['contents'])
del kwargs['contents']
for key in kwargs:
if key[0] == '{': #this is a parameter in a different alien namespace, ignore it
continue
elif key not in ("processor","space"): #ignore some FoLiA 2.0 attributes for limited forward compatibility
raise ValueError("Parameter '" + key + "' not supported by " + self.__class__.__name__)
def __getattr__(self, attr):
"""Internal method"""
#overriding getattr so we can get defaults here rather than needing a copy on each element, saves memory
if attr in ('set','cls','confidence','annotator','annotatortype','datetime','n','href','src','speaker','begintime','endtime','xlinktype','xlinktitle','xlinklabel','xlinkrole','xlinkshow','label', 'textclass', 'metadata'):
return None
else:
return super(AbstractElement, self).__getattribute__(attr)
#def __del__(self):
# if self.doc and self.doc.debug:
# print >>stderr, "[PyNLPl FoLiA DEBUG] Removing " + repr(self)
# for child in self.data:
# del child
# self.doc = None
# self.parent = None
# del self.data
def description(self):
"""Obtain the description associated with the element.
Raises:
:class:`NoSuchAnnotation` if there is no associated description."""
for e in self:
if isinstance(e, Description):
return e.value
raise NoSuchAnnotation
def textcontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT):
"""Get the text content explicitly associated with this element (of the specified class).
Unlike :meth:`text`, this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the :class:`TextContent` instance rather than the actual text!
Parameters:
cls (str): The class of the text content to obtain, defaults to ``current``.
correctionhandling: Specifies what content to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care.
Returns:
            The text content (:class:`TextContent`)
Raises:
:class:`NoSuchText` if there is no text content for the element
See also:
:meth:`text`
:meth:`phoncontent`
:meth:`phon`
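        Example (a sketch; ``word`` is assumed to be an existing :class:`Word` that carries text)::
            tc = word.textcontent()
            print(tc.cls, tc.text())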
"""
if not self.PRINTABLE: #only printable elements can hold text
raise NoSuchText
#Find explicit text content (same class)
for e in self:
if isinstance(e, TextContent):
if cls is None or e.cls == cls:
return e
elif isinstance(e, Correction):
try:
return e.textcontent(cls, correctionhandling)
except NoSuchText:
pass
raise NoSuchText
def stricttext(self, cls='current'):
"""Alias for :meth:`text` with ``strict=True``"""
return self.text(cls,strict=True)
def findcorrectionhandling(self, cls):
"""Find the proper correctionhandling given a textclass by looking in the underlying corrections where it is reused"""
if cls == "current":
return CorrectionHandling.CURRENT
elif cls == "original":
return CorrectionHandling.ORIGINAL #backward compatibility
else:
correctionhandling = None
#but any other class may be anything
#Do we have corrections at all? otherwise no need to bother
for correction in self.select(Correction):
#yes, in which branch is the text class found?
found = False
hastext = False
if correction.hasnew():
found = True
doublecorrection = correction.new().count(Correction) > 0
if doublecorrection: return None #skipping text validation, correction is too complex (nested) to handle for now
for t in correction.new().select(TextContent):
hastext = True
if t.cls == cls:
if correctionhandling is not None and correctionhandling != CorrectionHandling.CURRENT:
return None #inconsistent
else:
correctionhandling = CorrectionHandling.CURRENT
break
elif correction.hascurrent():
found = True
doublecorrection = correction.current().count(Correction) > 0
if doublecorrection: return None #skipping text validation, correction is too complex (nested) to handle for now
for t in correction.current().select(TextContent):
hastext = True
if t.cls == cls:
if correctionhandling is not None and correctionhandling != CorrectionHandling.CURRENT:
return None #inconsistent
else:
correctionhandling = CorrectionHandling.CURRENT
break
if correction.hasoriginal():
found = True
doublecorrection = correction.original().count(Correction) > 0
if doublecorrection: return None #skipping text validation, correction is too complex (nested) to handle for now
for t in correction.original().select(TextContent):
hastext = True
if t.cls == cls:
if correctionhandling is not None and correctionhandling != CorrectionHandling.ORIGINAL:
return None #inconsistent
else:
correctionhandling = CorrectionHandling.ORIGINAL
break
            if correctionhandling is None:
                #well, we couldn't find our textclass in any correction, just fall back to current and let text validation fail if needed
                correctionhandling = CorrectionHandling.CURRENT
            return correctionhandling
def textvalidation(self, warnonly=None):
"""Run text validation on this element. Checks whether any text redundancy is consistent and whether offsets are valid.
Parameters:
warnonly (bool): Warn only (True) or raise exceptions (False). If set to None then this value will be determined based on the document's FoLiA version (Warn only before FoLiA v1.5)
Returns:
bool
"""
if warnonly is None and self.doc and self.doc.version:
warnonly = (checkversion(self.doc.version, '1.5.0') < 0) #warn only for documents older than FoLiA v1.5
valid = True
for cls in self.doc.textclasses:
if self.hastext(cls, strict=True) and not isinstance(self, (Linebreak, Whitespace)):
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] Text validation on " + repr(self),file=stderr)
correctionhandling = self.findcorrectionhandling(cls)
if correctionhandling is None:
#skipping text validation, correction is too complex (nested) to handle for now; just assume valid (benefit of the doubt)
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] SKIPPING Text validation on " + repr(self) + ", too complex to handle (nested corrections or inconsistent use)",file=stderr)
return True #just assume it's valid then
strictnormtext = self.text(cls,retaintokenisation=False,strict=True, normalize_spaces=True)
deepnormtext = self.text(cls,retaintokenisation=False,strict=False, normalize_spaces=True)
if strictnormtext != deepnormtext:
valid = False
deviation = 0
for i, (c1,c2) in enumerate(zip(strictnormtext,deepnormtext)):
if c1 != c2:
deviation = i
break
msg = "Text for " + self.__class__.__name__ + ", ID " + str(self.id) + ", class " + cls + ", is inconsistent: EXPECTED (after normalization) *****>\n" + deepnormtext + "\n****> BUT FOUND (after normalization) ****>\n" + strictnormtext + "\n******* DEVIATION POINT: " + strictnormtext[max(0,deviation-10):deviation] + "<*HERE*>" + strictnormtext[deviation:deviation+10]
if warnonly:
print("TEXT VALIDATION ERROR: " + msg,file=sys.stderr)
else:
raise InconsistentText(msg)
#validate offsets
tc = self.textcontent(cls)
if tc.offset is not None:
#we can't validate the reference of this element yet since it may point to higher level elements still being created!! we store it in a buffer that will
#be processed by pendingvalidation() after parsing and prior to serialisation
                    if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] Queuing element for later offset validation: " + repr(self),file=stderr)
self.doc.offsetvalidationbuffer.append( (self, cls) )
return valid
def toktext(self,cls='current'):
"""Alias for :meth:`text` with ``retaintokenisation=True``"""
return self.text(cls,retaintokenisation=True)
def text(self, cls='current', retaintokenisation=False, previousdelimiter="",strict=False, correctionhandling=CorrectionHandling.CURRENT, normalize_spaces=False):
"""Get the text associated with this element (of the specified class)
        The text will be constructed from child-elements wherever possible, as they are more specific.
If no text can be obtained from the children and the element has itself text associated with
it, then that will be used.
Parameters:
cls (str): The class of the text content to obtain, defaults to ``current``.
retaintokenisation (bool): If set, the space attribute on words will be ignored, otherwise it will be adhered to and text will be detokenised as much as possible. Defaults to ``False``.
            previousdelimiter (str): Can be set to a delimiter that was last output, useful when chaining calls to :meth:`text`. Defaults to an empty string.
            strict (bool): Set this if you are strictly interested in the text explicitly associated with the element, without recursing into children. Defaults to ``False``.
correctionhandling: Specifies what text to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current text. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the text prior to correction, and ``CorrectionHandling.EITHER`` if you don't care.
normalize_spaces (bool): Return the text with multiple spaces, linebreaks, tabs normalized to single spaces
Example::
word.text()
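            #a sketch of some of the parameters (``sentence`` is an assumed :class:`Sentence` instance):
            sentence.text(retaintokenisation=True) #keep tokenisation rather than detokenising
            word.text(cls='original', correctionhandling=folia.CorrectionHandling.ORIGINAL) #text prior to correction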
Returns:
The text of the element (``unicode`` instance in Python 2, ``str`` in Python 3)
Raises:
:class:`NoSuchText`: if no text is found at all.
"""
if strict:
return self.textcontent(cls, correctionhandling).text(normalize_spaces=normalize_spaces)
if self.TEXTCONTAINER:
s = ""
for e in self:
if isstring(e):
s += e
elif e.PRINTABLE:
if s: s += e.TEXTDELIMITER #for AbstractMarkup, will usually be ""
s += e.text()
if normalize_spaces:
return norm_spaces(s)
else:
return s
elif not self.PRINTABLE: #only printable elements can hold text
raise NoSuchText
else:
#Get text from children first
delimiter = ""
s = ""
for e in self:
#was: e.PRINTABLE and not isinstance(e, TextContent) and not isinstance(e, String):
if isinstance(e, (AbstractStructureElement, Correction, AbstractSpanAnnotation)): #AbstractSpanAnnotation is needed when requesting text() on nested span annotations
try:
s += e.text(cls,retaintokenisation, delimiter,False,correctionhandling)
#delimiter will be buffered and only printed upon next iteration, this prevents the delimiter being outputted at the end of a sequence and to be compounded with other delimiters
delimiter = e.gettextdelimiter(retaintokenisation)
except NoSuchText:
#No text, that's okay, just continue
continue
if not s and self.hastext(cls, correctionhandling):
s = self.textcontent(cls, correctionhandling).text()
if s and previousdelimiter:
s = previousdelimiter + s
if s:
if normalize_spaces:
return norm_spaces(s)
else:
return s
else:
#No text found at all :`(
raise NoSuchText
def phoncontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT):
"""Get the phonetic content explicitly associated with this element (of the specified class).
Unlike :meth:`phon`, this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the PhonContent instance rather than the actual text!
Parameters:
cls (str): The class of the phonetic content to obtain, defaults to ``current``.
correctionhandling: Specifies what content to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care.
Returns:
The phonetic content (:class:`PhonContent`)
Raises:
:class:`NoSuchPhon` if there is no phonetic content for the element
See also:
:meth:`phon`
:meth:`textcontent`
:meth:`text`
"""
        if not self.SPEAKABLE: #only speakable elements can hold phonetic content
raise NoSuchPhon
        #Find explicit phonetic content (same class)
for e in self:
if isinstance(e, PhonContent):
if cls is None or e.cls == cls:
return e
elif isinstance(e, Correction):
try:
return e.phoncontent(cls, correctionhandling)
except NoSuchPhon:
pass
raise NoSuchPhon
def speech_src(self):
"""Retrieves the URL/filename of the audio or video file associated with the element.
The source is inherited from ancestor elements if none is specified. For this reason, always use this method rather than access the ``src`` attribute directly.
Returns:
str or None if not found
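        Example (a sketch; the source may be inherited from e.g. an enclosing :class:`Speech` element)::
            src = word.speech_src()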
"""
if self.src:
return self.src
elif self.parent:
return self.parent.speech_src()
else:
return None
def speech_speaker(self):
"""Retrieves the speaker of the audio or video file associated with the element.
        The speaker is inherited from ancestor elements if none is specified. For this reason, always use this method rather than access the ``speaker`` attribute directly.
Returns:
str or None if not found
"""
if self.speaker:
return self.speaker
elif self.parent:
return self.parent.speech_speaker()
else:
return None
def phon(self, cls='current', previousdelimiter="", strict=False,correctionhandling=CorrectionHandling.CURRENT):
"""Get the phonetic representation associated with this element (of the specified class)
        The phonetic content will be constructed from child-elements wherever possible, as they are more specific.
If no phonetic content can be obtained from the children and the element has itself phonetic content associated with
it, then that will be used.
Parameters:
cls (str): The class of the phonetic content to obtain, defaults to ``current``.
            previousdelimiter (str): Can be set to a delimiter that was last output, useful when chaining calls to :meth:`phon`. Defaults to an empty string.
strict (bool): Set this if you are strictly interested in the phonetic content explicitly associated with the element, without recursing into children. Defaults to ``False``.
correctionhandling: Specifies what phonetic content to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current phonetic content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the phonetic content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care.
Example::
word.phon()
Returns:
The phonetic content of the element (``unicode`` instance in Python 2, ``str`` in Python 3)
Raises:
            :class:`NoSuchPhon`: if no phonetic content is found at all.
See also:
:meth:`phoncontent`: Retrieves the phonetic content as an element rather than a string
:meth:`text`
:meth:`textcontent`
"""
if strict:
return self.phoncontent(cls,correctionhandling).phon()
if self.PHONCONTAINER:
s = ""
for e in self:
if isstring(e):
s += e
else:
try:
if s: s += e.TEXTDELIMITER #We use TEXTDELIMITER for phon too
except AttributeError:
pass
s += e.phon()
return s
        elif not self.SPEAKABLE: #only speakable elements can hold phonetic content
raise NoSuchPhon
else:
            #Get phonetic content from children first
delimiter = ""
s = ""
for e in self:
if e.SPEAKABLE and not isinstance(e, PhonContent) and not isinstance(e,String):
try:
s += e.phon(cls, delimiter,False,correctionhandling)
#delimiter will be buffered and only printed upon next iteration, this prevents the delimiter being outputted at the end of a sequence and to be compounded with other delimiters
delimiter = e.gettextdelimiter() #We use TEXTDELIMITER for phon too
except NoSuchPhon:
                        #No phonetic content, that's okay, just continue
continue
if not s and self.hasphon(cls):
s = self.phoncontent(cls,correctionhandling).phon()
if s and previousdelimiter:
return previousdelimiter + s
elif s:
return s
else:
                #No phonetic content found at all :`(
raise NoSuchPhon
    def originaltext(self,cls='original'):
        """Alias for retrieving the original, uncorrected text.
A call to :meth:`text` with ``correctionhandling=CorrectionHandling.ORIGINAL``"""
return self.text(cls,correctionhandling=CorrectionHandling.ORIGINAL)
def gettextdelimiter(self, retaintokenisation=False):
"""Return the text delimiter for this class.
Uses the ``TEXTDELIMITER`` attribute but may return a customised one instead."""
if self.TEXTDELIMITER is None:
#no text delimiter of itself, recurse into children to inherit delimiter
for child in reversed(self):
if isinstance(child, AbstractElement):
return child.gettextdelimiter(retaintokenisation)
return ""
else:
return self.TEXTDELIMITER
def feat(self,subset):
"""Obtain the feature class value of the specific subset.
If a feature occurs multiple times, the values will be returned in a list.
Example::
sense = word.annotation(folia.Sense)
synset = sense.feat('synset')
Returns:
str or list
"""
r = None
for f in self:
if isinstance(f, Feature) and f.subset == subset:
if r: #support for multiclass features
if isinstance(r,list):
r.append(f.cls)
else:
r = [r, f.cls]
else:
r = f.cls
if r is None:
raise NoSuchAnnotation
else:
return r
def __ne__(self, other):
return not (self == other)
def __eq__(self, other): #pylint: disable=too-many-return-statements
"""Equality method, tests whether two elements are equal.
Elements are equal if all their attributes and children are equal."""
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - " + repr(self) + " vs " + repr(other),file=stderr)
        #Check if we are of the same type
if type(self) != type(other): #pylint: disable=unidiomatic-typecheck
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Type mismatch: " + str(type(self)) + " vs " + str(type(other)),file=stderr)
return False
#Check FoLiA attributes
if self.id != other.id:
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - ID mismatch: " + str(self.id) + " vs " + str(other.id),file=stderr)
return False
if self.set != other.set:
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Set mismatch: " + str(self.set) + " vs " + str(other.set),file=stderr)
return False
if self.cls != other.cls:
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Class mismatch: " + repr(self.cls) + " vs " + repr(other.cls),file=stderr)
return False
if self.annotator != other.annotator:
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Annotator mismatch: " + repr(self.annotator) + " vs " + repr(other.annotator),file=stderr)
return False
if self.annotatortype != other.annotatortype:
            if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Annotatortype mismatch: " + repr(self.annotatortype) + " vs " + repr(other.annotatortype),file=stderr)
return False
#Check if we have same amount of children:
mychildren = list(self)
yourchildren = list(other)
if len(mychildren) != len(yourchildren):
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Unequal amount of children",file=stderr)
return False
#Now check equality of children
for mychild, yourchild in zip(mychildren, yourchildren):
if mychild != yourchild:
if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Child mismatch: " + repr(mychild) + " vs " + repr(yourchild) + " (in " + repr(self) + ", id: " + str(self.id) + ")",file=stderr)
return False
#looks like we made it! \o/
return True
def __len__(self):
"""Returns the number of child elements under the current element."""
return len(self.data)
def __nonzero__(self): #Python 2.x
return True
def __bool__(self):
return True
def __hash__(self):
if self.id:
return hash(self.id)
else:
raise TypeError("FoLiA elements are only hashable if they have an ID")
def __iter__(self):
"""Iterate over all children of this element.
Example::
for annotation in word:
...
"""
return iter(self.data)
def __contains__(self, element):
"""Tests if the specified element is part of the children of the element"""
return element in self.data
def __getitem__(self, key):
try:
return self.data[key]
except KeyError:
raise
def __unicode__(self): #Python 2 only
"""Alias for :meth:`text`. Python 2 only."""
return self.text()
def __str__(self):
"""Alias for :meth:`text`"""
return self.text()
def copy(self, newdoc=None, idsuffix=""):
"""Make a deep copy of this element and all its children.
Parameters:
newdoc (:class:`Document`): The document the copy should be associated with.
idsuffix (str or bool): If set to a string, the ID of the copy will be append with this (prevents duplicate IDs when making copies for the same document). If set to ``True``, a random suffix will be generated.
Returns:
a copy of the element
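        Example (a sketch; copies a word into the same document, randomly suffixing IDs to avoid duplicates)::
            wordcopy = word.copy(word.doc, idsuffix=True)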
"""
if idsuffix is True: idsuffix = ".copy." + "%08x" % random.getrandbits(32) #random 32-bit hash for each copy, same one will be reused for all children
c = deepcopy(self)
if idsuffix:
c.addidsuffix(idsuffix)
c.setparents()
c.setdoc(newdoc)
return c
def copychildren(self, newdoc=None, idsuffix=""):
"""Generator creating a deep copy of the children of this element.
Invokes :meth:`copy` on all children, parameters are the same.
"""
if idsuffix is True: idsuffix = ".copy." + "%08x" % random.getrandbits(32) #random 32-bit hash for each copy, same one will be reused for all children
for c in self:
if isinstance(c, AbstractElement):
yield c.copy(newdoc,idsuffix)
    def addidsuffix(self, idsuffix, recursive = True):
        """Appends a suffix to this element's ID, and optionally to all child IDs as well. There is usually no need to call this directly; it is invoked implicitly by :meth:`copy`"""
if self.id: self.id += idsuffix
if recursive:
for e in self:
try:
e.addidsuffix(idsuffix, recursive)
except Exception:
pass
    def setparents(self):
        """Correct all parent relations for elements within the scope. There is usually no need to call this directly; it is invoked implicitly by :meth:`copy`"""
for c in self:
if isinstance(c, AbstractElement):
c.parent = self
c.setparents()
def setdoc(self,newdoc):
"""Set a different document. Usually no need to call this directly, invoked implicitly by :meth:`copy`"""
self.doc = newdoc
if self.doc and self.id:
self.doc.index[self.id] = self
for c in self:
if isinstance(c, AbstractElement):
c.setdoc(newdoc)
def hastext(self,cls='current',strict=True, correctionhandling=CorrectionHandling.CURRENT): #pylint: disable=too-many-return-statements
"""Does this element have text (of the specified class)
By default, and unlike :meth:`text`, this checks strictly, i.e. the element itself must have the text and it is not inherited from its children.
Parameters:
cls (str): The class of the text content to obtain, defaults to ``current``.
strict (bool): Set this if you are strictly interested in the text explicitly associated with the element, without recursing into children. Defaults to ``True``.
correctionhandling: Specifies what text to check for when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current text. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the text prior to correction, and ``CorrectionHandling.EITHER`` if you don't care.
Returns:
bool
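        Example (a sketch)::
            if word.hastext():
                print(word.text())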
"""
if not self.PRINTABLE: #only printable elements can hold text
return False
elif self.TEXTCONTAINER:
return True
else:
try:
if strict:
                    self.textcontent(cls, correctionhandling) #will raise NoSuchText when not found
return True
else:
#Check children
for e in self:
if e.PRINTABLE and not isinstance(e, TextContent):
if e.hastext(cls, strict, correctionhandling):
return True
                self.textcontent(cls, correctionhandling) #will raise NoSuchText when not found
return True
except NoSuchText:
return False
def hasphon(self,cls='current',strict=True,correctionhandling=CorrectionHandling.CURRENT): #pylint: disable=too-many-return-statements
"""Does this element have phonetic content (of the specified class)
By default, and unlike :meth:`phon`, this checks strictly, i.e. the element itself must have the phonetic content and it is not inherited from its children.
Parameters:
cls (str): The class of the phonetic content to obtain, defaults to ``current``.
strict (bool): Set this if you are strictly interested in the phonetic content explicitly associated with the element, without recursing into children. Defaults to ``True``.
correctionhandling: Specifies what phonetic content to check for when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current phonetic content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the phonetic content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care.
Returns:
bool
"""
        if not self.SPEAKABLE: #only speakable elements can hold phonetic content
return False
elif self.PHONCONTAINER:
return True
else:
try:
if strict:
self.phoncontent(cls, correctionhandling)
return True
else:
#Check children
for e in self:
if e.SPEAKABLE and not isinstance(e, PhonContent):
if e.hasphon(cls, strict, correctionhandling):
return True
                    self.phoncontent(cls, correctionhandling) #will raise NoSuchPhon when not found
return True
except NoSuchPhon:
return False
def settext(self, text, cls='current'):
"""Set the text for this element.
Arguments:
text (str): The text
cls (str): The class of the text, defaults to ``current`` (leave this unless you know what you are doing). There may be only one text content element of each class associated with the element.
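        Example::
            word.settext("house")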
"""
self.replace(TextContent, value=text, cls=cls)
def setdocument(self, doc):
"""Associate a document with this element.
Arguments:
doc (:class:`Document`): A document
Each element must be associated with a FoLiA document.
"""
assert isinstance(doc, Document)
if not self.doc:
self.doc = doc
if self.id:
if self.id in doc:
raise DuplicateIDError(self.id)
else:
                    self.doc.index[self.id] = self
for e in self: #recursive for all children
if isinstance(e,AbstractElement): e.setdocument(doc)
@classmethod
def accepts(Parentclass, Class, raiseexceptions=True, parentinstance=None):
if Class in Parentclass.ACCEPTED_DATA:
return True
else:
#Class is not in accepted data, but perhaps any of its ancestors is?
            for c in Class.__mro__: #iterate over all base/super classes (automatically recursive)
if c is not Class and c in Parentclass.ACCEPTED_DATA:
return True
if raiseexceptions:
extra = ""
if parentinstance and parentinstance.id:
extra = ' (id=' + parentinstance.id + ')'
raise ValueError("Unable to add object of type " + Class.__name__ + " to " + Parentclass.__name__ + " " + extra + ". Type not allowed as child.")
else:
return False
@classmethod
def addable(Class, parent, set=None, raiseexceptions=True):
"""Tests whether a new element of this class can be added to the parent.
This method is mostly for internal use.
        This will use the ``OCCURRENCES`` property, but may be overridden by subclasses for more customised behaviour.
Parameters:
parent (:class:`AbstractElement`): The element that is being added to
set (str or None): The set
raiseexceptions (bool): Raise an exception if the element can't be added?
Returns:
bool
Raises:
ValueError
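        Example (a sketch; tests whether a lemma annotation may still be added to ``word``)::
            if folia.LemmaAnnotation.addable(word, raiseexceptions=False):
                ...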
"""
if not parent.__class__.accepts(Class, raiseexceptions, parent):
return False
if Class.OCCURRENCES > 0:
#check if the parent doesn't have too many already
            count = parent.count(Class,None,True,[True, AbstractStructureElement]) #never descend into embedded structure annotation
if count >= Class.OCCURRENCES:
if raiseexceptions:
if parent.id:
extra = ' (id=' + parent.id + ')'
else:
extra = ''
raise DuplicateAnnotationError("Unable to add another object of type " + Class.__name__ + " to " + parent.__class__.__name__ + " " + extra + ". There are already " + str(count) + " instances of this class, which is the maximum.")
else:
return False
if Class.OCCURRENCES_PER_SET > 0 and set and Class.REQUIRED_ATTRIBS and Attrib.CLASS in Class.REQUIRED_ATTRIBS:
count = parent.count(Class,set,True, [True, AbstractStructureElement])
if count >= Class.OCCURRENCES_PER_SET:
if raiseexceptions:
if parent.id:
extra = ' (id=' + parent.id + ')'
else:
extra = ''
raise DuplicateAnnotationError("Unable to add another object of set " + set + " and type " + Class.__name__ + " to " + parent.__class__.__name__ + " " + extra + ". There are already " + str(count) + " instances of this class, which is the maximum for the set.")
else:
return False
return True
def postappend(self):
"""This method will be called after an element is added to another and does some checks.
It can do extra checks and if necessary raise exceptions to prevent addition. By default makes sure the right document is associated.
This method is mostly for internal use.
"""
#If the element was not associated with a document yet, do so now (and for all unassociated children:
if not self.doc and self.parent.doc:
self.setdocument(self.parent.doc)
if self.doc and self.doc.deepvalidation:
self.deepvalidation()
def addtoindex(self,norecurse=[]):
"""Makes sure this element (and all subelements), are properly added to the index.
Mostly for internal use."""
if self.id:
self.doc.index[self.id] = self
for e in self.data:
if all([not isinstance(e, C) for C in norecurse]):
try:
e.addtoindex(norecurse)
except AttributeError:
pass
def deepvalidation(self):
"""Perform deep validation of this element.
Raises:
:class:`DeepValidationError`
"""
if self.doc and self.doc.deepvalidation and self.set and self.set[0] != '_':
try:
self.doc.setdefinitions[self.set].testclass(self.cls)
except KeyError:
if self.cls and not self.doc.allowadhocsets:
raise DeepValidationError("Set definition " + self.set + " for " + self.XMLTAG + " not loaded!")
except DeepValidationError as e:
errormsg = str(e) + " (in set " + self.set+" for " + self.XMLTAG
if self.id:
errormsg += " with ID " + self.id
errormsg += ")"
raise DeepValidationError(errormsg)
def append(self, child, *args, **kwargs):
"""Append a child element.
Arguments:
            child (instance or class): 1) The instance to add (usually an instance derived from :class:`AbstractElement`), or 2) a class subclassed from :class:`AbstractElement`.
Keyword Arguments:
{generic_attribs}
If an *instance* is passed as first argument, it will be appended
If a *class* derived from :class:`AbstractElement` is passed as first argument, an instance will first be created and then appended.
Keyword arguments:
alternative (bool): If set to True, the element will be made into an alternative. (default to False)
Generic example, passing a pre-generated instance::
word.append( folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated::
word.append( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
        Generic example, setting text with a class::
word.append( "house", cls='original' )
Returns:
the added element
Raises:
ValueError: The element is not valid in this context
:class:`DuplicateAnnotationError`: There is already such an annotation
See also:
:meth:`add`
:meth:`insert`
:meth:`replace`
""".format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS)
#obtain the set (if available, necessary for checking addability)
if 'set' in kwargs:
set = kwargs['set']
else:
try:
set = child.set
except AttributeError:
set = None
#Check if a Class rather than an instance was passed
Class = None #do not set to child.__class__
if inspect.isclass(child):
Class = child
if Class.addable(self, set):
if 'id' not in kwargs and 'generate_id_in' not in kwargs and ((Class.REQUIRED_ATTRIBS and (Attrib.ID in Class.REQUIRED_ATTRIBS)) or Class.AUTO_GENERATE_ID):
kwargs['generate_id_in'] = self
child = Class(self.doc, *args, **kwargs)
elif args:
raise Exception("Too many arguments specified. Only possible when first argument is a class and not an instance")
dopostappend = True
#Do the actual appending
if not Class and isstring(child):
if self.TEXTCONTAINER or self.PHONCONTAINER:
#element is a text/phon container and directly allows strings as content, add the string as such:
self.data.append(u(child))
dopostappend = False
elif TextContent in self.ACCEPTED_DATA:
#you can pass strings directly (just for convenience), will be made into textcontent automatically.
child = TextContent(self.doc, child )
self.data.append(child)
child.parent = self
elif PhonContent in self.ACCEPTED_DATA:
#you can pass strings directly (just for convenience), will be made into phoncontent automatically (note that textcontent always takes precedence, so you most likely will have to do it explicitly)
child = PhonContent(self.doc, child ) #pylint: disable=redefined-variable-type
self.data.append(child)
child.parent = self
else:
raise ValueError("Unable to append object of type " + child.__class__.__name__ + " to " + self.__class__.__name__ + ". Type not allowed as child.")
elif Class or (isinstance(child, AbstractElement) and child.__class__.addable(self, set)): #(prevents calling addable again if already done above)
if 'alternative' in kwargs and kwargs['alternative']:
child = Alternative(self.doc, child, generate_id_in=self)
self.data.append(child)
child.parent = self
else:
raise ValueError("Unable to append object of type " + child.__class__.__name__ + " to " + self.__class__.__name__ + ". Type not allowed as child.")
if dopostappend: child.postappend()
return child
def insert(self, index, child, *args, **kwargs):
"""Insert a child element at specified index. Returns the added element
If an *instance* is passed as first argument, it will be appended
If a *class* derived from AbstractElement is passed as first argument, an instance will first be created and then appended.
Arguments:
            index (int): The position where to insert the child element
child: Instance or class
Keyword arguments:
alternative (bool): If set to True, the element will be made into an alternative.
corrected (bool): Used only when passing strings to be made into TextContent elements.
{generic_attribs}
Generic example, passing a pre-generated instance::
word.insert( 3, folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated::
word.insert( 3, folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text::
word.insert( 3, "house" )
Returns:
the added element
Raises:
ValueError: The element is not valid in this context
:class:`DuplicateAnnotationError`: There is already such an annotation
See also:
:meth:`append`
:meth:`replace`
""".format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS)
#obtain the set (if available, necessary for checking addability)
if 'set' in kwargs:
set = kwargs['set']
else:
try:
set = child.set
except AttributeError:
set = None
#Check if a Class rather than an instance was passed
Class = None #do not set to child.__class__
if inspect.isclass(child):
Class = child
if Class.addable(self, set):
if 'id' not in kwargs and 'generate_id_in' not in kwargs and ((Class.REQUIRED_ATTRIBS and Attrib.ID in Class.REQUIRED_ATTRIBS) or (Class.OPTIONAL_ATTRIBS and Attrib.ID in Class.OPTIONAL_ATTRIBS)):
kwargs['generate_id_in'] = self
child = Class(self.doc, *args, **kwargs)
elif args:
raise Exception("Too many arguments specified. Only possible when first argument is a class and not an instance")
#Do the actual appending
if not Class and (isinstance(child,str) or (sys.version < '3' and isinstance(child,unicode))) and TextContent in self.ACCEPTED_DATA: #pylint: disable=undefined-variable
#you can pass strings directly (just for convenience), will be made into textcontent automatically.
child = TextContent(self.doc, child )
self.data.insert(index, child)
child.parent = self
elif Class or (isinstance(child, AbstractElement) and child.__class__.addable(self, set)): #(prevents calling addable again if already done above)
if 'alternative' in kwargs and kwargs['alternative']:
child = Alternative(self.doc, child, generate_id_in=self) #pylint: disable=redefined-variable-type
self.data.insert(index, child)
child.parent = self
else:
            raise ValueError("Unable to insert object of type " + child.__class__.__name__ + " into " + self.__class__.__name__ + ". Type not allowed as child.")
child.postappend()
return child
def add(self, child, *args, **kwargs):
"""Add a child element.
        This is a higher level function that adds (appends) an annotation to an element, it will simply call :meth:`AbstractElement.append` for token annotation elements that fit within the scope. For span annotation, it will find or create the proper annotation layer and insert the element there.
Arguments:
            child (instance or class): 1) The instance to add (usually an instance derived from :class:`AbstractElement`), or 2) a class subclassed from :class:`AbstractElement`.
If an *instance* is passed as first argument, it will be appended
If a *class* derived from :class:`AbstractElement` is passed as first argument, an instance will first be created and then appended.
Keyword arguments:
alternative (bool): If set to True, the element will be made into an alternative. (default to False)
{generic_attribs}
Generic example, passing a pre-generated instance::
word.add( folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated::
word.add( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text with a class::
word.add( "house", cls='original' )
Returns:
the added element
Raises:
ValueError: The element is not valid in this context
:class:`DuplicateAnnotationError`: There is already such an annotation
See also:
            :meth:`append`
:meth:`insert`
:meth:`replace`
""".format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS)
addspanfromspanned = False #add a span annotation element from that which is spanned (i.e. a Word, Morpheme)
addspanfromstructure = False #add a span annotation elements from a structural parent which holds the span layers? (e.g. a Sentence, Paragraph)
if (inspect.isclass(child) and issubclass(child, AbstractSpanAnnotation)) or (not inspect.isclass(child) and isinstance(child, AbstractSpanAnnotation)):
layerclass = ANNOTATIONTYPE2LAYERCLASS[child.ANNOTATIONTYPE]
if isinstance(self, (Word, Morpheme)):
addspanfromspanned = True
elif isinstance(self,AbstractStructureElement): #add a span
addspanfromstructure = True
if addspanfromspanned or addspanfromstructure:
#get the set
if 'set' in kwargs:
set = kwargs['set']
else:
try:
set = self.doc.defaultset(layerclass)
except KeyError:
raise Exception("No set defined when adding span annotation and none could be inferred")
if addspanfromspanned: #pylint: disable=too-many-nested-blocks
#collect ancestors of the current element,
allowedparents = [self] + list(self.ancestors(AbstractStructureElement))
#find common ancestors of structure elements in the arguments, and check whether it has the required annotation layer, create one if necessary
for e in commonancestors(AbstractStructureElement, *[ x for x in args if isinstance(x, AbstractStructureElement)] ):
if e in allowedparents: #is the element in the list of allowed parents according to this element?
if AbstractAnnotationLayer in e.ACCEPTED_DATA or layerclass in e.ACCEPTED_DATA:
try:
layer = next(e.select(layerclass,set,True))
except StopIteration:
layer = e.append(layerclass)
if 'emptyspan' in kwargs and kwargs['emptyspan']:
del kwargs['emptyspan']
return layer.append(child,*[],**kwargs)
else:
return layer.append(child,*args,**kwargs)
raise Exception("Unable to find suitable common ancestor to create annotation layer")
elif addspanfromstructure:
layer = None
for layer in self.layers(child.ANNOTATIONTYPE, set):
pass #last one (only one actually) should be available in outer context
if layer is None:
layer = self.append(layerclass)
return layer.append(child,*args,**kwargs)
else:
#normal behaviour, append
return self.append(child,*args,**kwargs)
@classmethod
    def findreplaceables(Class, parent, set=None,**kwargs):
        """Internal method to find replaceable elements. Auxiliary function used by :meth:`AbstractElement.replace`. Can be overridden for more fine-grained control."""
return list(parent.select(Class,set,False))
def updatetext(self):
"""Recompute textual value based on the text content of the children. Only supported on elements that are a ``TEXTCONTAINER``"""
if self.TEXTCONTAINER:
s = ""
for child in self:
if isinstance(child, AbstractElement):
child.updatetext()
s += child.text()
elif isstring(child):
s += child
self.data = [s]
def replace(self, child, *args, **kwargs):
"""Appends a child element like ``append()``, but replaces any existing child element of the same type and set. If no such child element exists, this will act the same as append()
Keyword arguments:
alternative (bool): If set to True, the *replaced* element will be made into an alternative. Simply use :meth:`AbstractElement.append` if you want the added element
to be an alternative.
See :meth:`AbstractElement.append` for more information and all parameters.
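        Example (a sketch; replaces any existing lemma annotation on the word, or simply appends one if there is none)::
            word.replace(folia.LemmaAnnotation, cls="house")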
"""
if 'set' in kwargs:
set = kwargs['set']
del kwargs['set']
else:
try:
set = child.set
except AttributeError:
set = None
if inspect.isclass(child):
Class = child
replace = Class.findreplaceables(self, set, **kwargs)
elif (self.TEXTCONTAINER or self.PHONCONTAINER) and isstring(child):
#replace will replace ALL text content, removing text markup along the way!
self.data = []
return self.append(child, *args,**kwargs)
else:
Class = child.__class__
kwargs['instance'] = child
replace = Class.findreplaceables(self,set,**kwargs)
del kwargs['instance']
kwargs['set'] = set #was deleted temporarily for findreplaceables
if len(replace) == 0:
#nothing to replace, simply call append
if 'alternative' in kwargs:
del kwargs['alternative'] #has other meaning in append()
return self.append(child, *args, **kwargs)
elif len(replace) > 1:
raise Exception("Unable to replace. Multiple candidates found, unable to choose.")
elif len(replace) == 1:
if 'alternative' in kwargs and kwargs['alternative']:
#old version becomes alternative
if replace[0] in self.data:
self.data.remove(replace[0])
alt = self.append(Alternative)
alt.append(replace[0])
del kwargs['alternative'] #has other meaning in append()
else:
                #remove the old version completely
self.remove(replace[0])
e = self.append(child, *args, **kwargs)
self.updatetext()
return e
def ancestors(self, Class=None):
"""Generator yielding all ancestors of this element, effectively back-tracing its path to the root element. A tuple of multiple classes may be specified.
Arguments:
            Class: The class or classes to select (:class:`AbstractElement` or subclasses); a tuple may be given for multiple classes. Not instances!
Yields:
elements (instances derived from :class:`AbstractElement`)
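        Example (a sketch)::
            for sentence in word.ancestors(folia.Sentence):
                print(sentence.id)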
"""
e = self
while e:
if e.parent:
e = e.parent
if not Class or isinstance(e,Class):
yield e
elif isinstance(Class, tuple):
for C in Class:
if isinstance(e,C):
yield e
else:
break
def ancestor(self, *Classes):
"""Find the most immediate ancestor of the specified type, multiple classes may be specified.
Arguments:
*Classes: The possible classes (:class:`AbstractElement` or subclasses) to select from. Not instances!
Example::
paragraph = word.ancestor(folia.Paragraph)
"""
for e in self.ancestors(tuple(Classes)):
return e
raise NoSuchAnnotation
def xml(self, attribs = None,elements = None, skipchildren = False):
"""Serialises the FoLiA element and all its contents to XML.
Arguments are mostly for internal use.
Returns:
an lxml.etree.Element
See also:
:meth:`AbstractElement.xmlstring` - for direct string output
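        Example (a sketch, serialising the returned node directly with lxml)::
            from lxml import etree
            xmlbytes = etree.tostring(word.xml())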
"""
E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"})
if not attribs: attribs = {}
if not elements: elements = []
if self.id:
attribs['{http://www.w3.org/XML/1998/namespace}id'] = self.id
#Some attributes only need to be added if they are not the same as what's already set in the declaration
if not isinstance(self, AbstractAnnotationLayer):
if '{' + NSFOLIA + '}set' not in attribs: #do not override if overloaded function already set it
try:
if self.set:
if not self.ANNOTATIONTYPE in self.doc.annotationdefaults or len(self.doc.annotationdefaults[self.ANNOTATIONTYPE]) != 1 or list(self.doc.annotationdefaults[self.ANNOTATIONTYPE].keys())[0] != self.set:
if self.set != None:
if self.ANNOTATIONTYPE in self.doc.set_alias and self.set in self.doc.set_alias[self.ANNOTATIONTYPE]:
attribs['{' + NSFOLIA + '}set'] = self.doc.set_alias[self.ANNOTATIONTYPE][self.set] #use alias instead
else:
attribs['{' + NSFOLIA + '}set'] = self.set
except AttributeError:
pass
if '{' + NSFOLIA + '}class' not in attribs: #do not override if caller already set it
try:
if self.cls:
attribs['{' + NSFOLIA + '}class'] = self.cls
except AttributeError:
pass
if '{' + NSFOLIA + '}annotator' not in attribs: #do not override if caller already set it
try:
if self.annotator and ((not (self.ANNOTATIONTYPE in self.doc.annotationdefaults)) or (not ( 'annotator' in self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set])) or (self.annotator != self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set]['annotator'])):
attribs['{' + NSFOLIA + '}annotator'] = self.annotator
if self.annotatortype and ((not (self.ANNOTATIONTYPE in self.doc.annotationdefaults)) or (not ('annotatortype' in self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set])) or (self.annotatortype != self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set]['annotatortype'])):
if self.annotatortype == AnnotatorType.AUTO:
attribs['{' + NSFOLIA + '}annotatortype'] = 'auto'
elif self.annotatortype == AnnotatorType.MANUAL:
attribs['{' + NSFOLIA + '}annotatortype'] = 'manual'
except AttributeError:
pass
if '{' + NSFOLIA + '}confidence' not in attribs: #do not override if caller already set it
if self.confidence:
attribs['{' + NSFOLIA + '}confidence'] = str(self.confidence)
if '{' + NSFOLIA + '}n' not in attribs: #do not override if caller already set it
if self.n:
attribs['{' + NSFOLIA + '}n'] = str(self.n)
if '{' + NSFOLIA + '}auth' not in attribs: #do not override if caller already set it
try:
if not self.AUTH or not self.auth: #(former is static, latter isn't)
attribs['{' + NSFOLIA + '}auth'] = 'no'
except AttributeError:
pass
if '{' + NSFOLIA + '}datetime' not in attribs: #do not override if caller already set it
if self.datetime and ((not (self.ANNOTATIONTYPE in self.doc.annotationdefaults)) or (not ( 'datetime' in self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set])) or (self.datetime != self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set]['datetime'])):
attribs['{' + NSFOLIA + '}datetime'] = self.datetime.strftime("%Y-%m-%dT%H:%M:%S")
if '{' + NSFOLIA + '}src' not in attribs: #do not override if caller already set it
if self.src:
attribs['{' + NSFOLIA + '}src'] = self.src
if '{' + NSFOLIA + '}speaker' not in attribs: #do not override if caller already set it
if self.speaker:
attribs['{' + NSFOLIA + '}speaker'] = self.speaker
if '{' + NSFOLIA + '}begintime' not in attribs: #do not override if caller already set it
if self.begintime:
attribs['{' + NSFOLIA + '}begintime'] = "%02d:%02d:%02d.%03d" % self.begintime
if '{' + NSFOLIA + '}endtime' not in attribs: #do not override if caller already set it
if self.endtime:
attribs['{' + NSFOLIA + '}endtime'] = "%02d:%02d:%02d.%03d" % self.endtime
if '{' + NSFOLIA + '}textclass' not in attribs: #do not override if caller already set it
if self.textclass and self.textclass != "current":
attribs['{' + NSFOLIA + '}textclass'] = self.textclass
if '{' + NSFOLIA + '}metadata' not in attribs: #do not override if caller already set it
if self.metadata:
attribs['{' + NSFOLIA + '}metadata'] = self.metadata
if self.XLINK:
if self.href:
attribs['{http://www.w3.org/1999/xlink}href'] = self.href
if not self.xlinktype:
attribs['{http://www.w3.org/1999/xlink}type'] = "simple"
if self.xlinktype:
attribs['{http://www.w3.org/1999/xlink}type'] = self.xlinktype
if self.xlinklabel:
attribs['{http://www.w3.org/1999/xlink}label'] = self.xlinklabel
if self.xlinkrole:
attribs['{http://www.w3.org/1999/xlink}role'] = self.xlinkrole
if self.xlinkshow:
attribs['{http://www.w3.org/1999/xlink}show'] = self.xlinkshow
if self.xlinktitle:
attribs['{http://www.w3.org/1999/xlink}title'] = self.xlinktitle
omitchildren = []
#Are there predetermined Features in ACCEPTED_DATA?
for c in self.ACCEPTED_DATA:
if issubclass(c, Feature) and c.SUBSET:
#Do we have any of those?
for c2 in self.data:
if c2.__class__ is c and c.SUBSET == c2.SUBSET and c2.cls:
#Yes, serialize them as attributes
attribs[c2.SUBSET] = c2.cls
omitchildren.append(c2) #and skip them as elements
break #only one
e = makeelement(E, '{' + NSFOLIA + '}' + self.XMLTAG, **attribs)
if not skipchildren and self.data:
#append children,
            # we want to make sure that text elements are in the right order, 'current' class first
# so we first put them in a list
textelements = []
otherelements = []
for child in self:
if isinstance(child, TextContent):
if child.cls == 'current':
textelements.insert(0, child)
else:
textelements.append(child)
elif not child in omitchildren:
otherelements.append(child)
for child in textelements+otherelements:
if (self.TEXTCONTAINER or self.PHONCONTAINER) and isstring(child):
if len(e) == 0:
if e.text:
e.text += child
else:
e.text = child
else:
#add to tail of last child
if e[-1].tail:
e[-1].tail += child
else:
e[-1].tail = child
else:
                    xml = child.xml() #may return None on rare occasions, meaning we want to skip it
if not xml is None:
e.append(xml)
if elements: #extra elements
for e2 in elements:
if isinstance(e2, str) or (sys.version < '3' and isinstance(e2, unicode)):
if e.text is None:
e.text = e2
else:
e.text += e2
else:
e.append(e2)
return e
def json(self, attribs=None, recurse=True, ignorelist=False):
"""Serialises the FoLiA element and all its contents to a Python dictionary suitable for serialisation to JSON.
Example::
import json
json.dumps(word.json())
Returns:
dict
"""
jsonnode = {}
jsonnode['type'] = self.XMLTAG
if self.id:
jsonnode['id'] = self.id
if self.set:
jsonnode['set'] = self.set
if self.cls:
jsonnode['class'] = self.cls
if self.annotator:
jsonnode['annotator'] = self.annotator
if self.annotatortype:
if self.annotatortype == AnnotatorType.AUTO:
jsonnode['annotatortype'] = "auto"
elif self.annotatortype == AnnotatorType.MANUAL:
jsonnode['annotatortype'] = "manual"
if self.confidence is not None:
jsonnode['confidence'] = self.confidence
if self.n:
jsonnode['n'] = self.n
if self.auth:
jsonnode['auth'] = self.auth
if self.datetime:
jsonnode['datetime'] = self.datetime.strftime("%Y-%m-%dT%H:%M:%S")
if recurse: #pylint: disable=too-many-nested-blocks
jsonnode['children'] = []
if self.TEXTCONTAINER:
jsonnode['text'] = self.text()
if self.PHONCONTAINER:
jsonnode['phon'] = self.phon()
for child in self:
if self.TEXTCONTAINER and isstring(child):
jsonnode['children'].append(child)
elif not self.PHONCONTAINER:
#check ignore list
ignore = False
if ignorelist:
for e in ignorelist:
if isinstance(child,e):
ignore = True
break
if not ignore:
jsonnode['children'].append(child.json(attribs,recurse,ignorelist))
if attribs:
for attrib in attribs:
jsonnode[attrib] = attribs[attrib] #copy the value for this key, not the whole dict
return jsonnode
def xmlstring(self, pretty_print=False):
"""Serialises this FoLiA element and all its contents to XML.
Returns:
str: a string with XML representation for this element and all its children"""
s = ElementTree.tostring(self.xml(), xml_declaration=False, pretty_print=pretty_print, encoding='utf-8')
if sys.version < '3':
if isinstance(s, str):
s = unicode(s,'utf-8') #pylint: disable=undefined-variable
else:
if isinstance(s,bytes):
s = str(s,'utf-8')
s = s.replace('ns0:','') #ugly patch to get rid of namespace prefix
s = s.replace(':ns0','')
return s
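#A minimal usage sketch (not part of the library; the file path and element ID are hypothetical):
#
#   doc = folia.Document(file="/path/to/document.folia.xml")
#   word = doc['example.p.1.s.1.w.1']
#   print(word.xmlstring(pretty_print=True))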
def select(self, Class, set=None, recursive=True, ignore=True, node=None): #pylint: disable=bad-classmethod-argument,redefined-builtin
"""Select child elements of the specified class.
A further restriction can be made based on set.
Arguments:
Class (class): The class to select; any python class (not instance) subclassed off :class:`AbstractElement`
set (str): The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned.
recursive (bool): Select recursively? Descending into child elements? Defaults to ``True``.
ignore: A list of Classes to ignore. If set to ``True`` instead of a list, all non-authoritative elements will be skipped (this is the default behaviour and corresponds to the following elements: :class:`Alternative`, :class:`AlternativeLayer`, :class:`Suggestion`, and :class:`folia.Original`). These elements and those contained within are never *authoritative*. You may also include the boolean ``True`` as a member of a list if you want to skip the predefined non-authoritative elements in addition to your own classes.
node: Reserved for internal usage, used in recursion.
Yields:
Elements (instances derived from :class:`AbstractElement`)
Example::
for sense in text.select(folia.Sense, 'cornetto', True, [folia.Original, folia.Suggestion, folia.Alternative] ):
..
"""
#if ignorelist is True:
# ignorelist = default_ignore
if not node:
node = self
for e in self.data: #pylint: disable=too-many-nested-blocks
if (not self.TEXTCONTAINER and not self.PHONCONTAINER) or isinstance(e, AbstractElement):
if ignore is True:
try:
if not e.auth:
continue
except AttributeError:
#not all elements have auth attribute..
pass
elif ignore: #list
doignore = False
for c in ignore:
if c is True:
try:
if not e.auth:
doignore = True
break
except AttributeError:
#not all elements have auth attribute..
pass
elif c == e.__class__ or issubclass(e.__class__,c):
doignore = True
break
if doignore:
continue
if isinstance(e, Class):
if set is not None:
try:
if e.set != set:
continue
except AttributeError:
continue
yield e
if recursive:
for e2 in e.select(Class, set, recursive, ignore, e):
if set is not None:
try:
if e2.set != set:
continue
except AttributeError:
continue
yield e2
def count(self, Class, set=None, recursive=True, ignore=True, node=None):
"""Like :meth:`AbstractElement.select`, but instead of returning the elements, it merely counts them.
Returns:
int
"""
return sum(1 for i in self.select(Class,set,recursive,ignore,node) )
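#A minimal usage sketch (hypothetical ``sentence`` element and set URL):
#
#   wordcount = sentence.count(folia.Word) #all words, recursively
#   sensecount = sentence.count(folia.Sense, 'http://example.org/cornetto') #restricted to one set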
def items(self, founditems=[]): #pylint: disable=dangerous-default-value
"""Returns a depth-first flat list of *all* items below this element (not limited to AbstractElement)"""
l = []
for e in self.data:
if e not in founditems: #prevent going in recursive loops
l.append(e)
if isinstance(e, AbstractElement):
l += e.items(l)
return l
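#A minimal usage sketch (hypothetical ``sentence`` element); note that the returned list
#may contain plain strings as well as elements:
#
#   for item in sentence.items():
#       print(repr(item))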
def getmetadata(self, key=None):
"""Get the metadata that applies to this element, automatically inherited from parent elements"""
if self.metadata:
d = self.doc.submetadata[self.metadata]
elif self.parent:
d = self.parent.getmetadata()
elif self.doc:
d = self.doc.metadata
else:
return None
if key:
return d[key]
else:
return d
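#A minimal usage sketch (assumes the document carries metadata; the 'language' key is hypothetical):
#
#   meta = word.getmetadata() #full metadata mapping, inherited from ancestors if needed
#   language = word.getmetadata('language') #a single key; raises KeyError if absent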
def getindex(self, child, recursive=True, ignore=True):
"""Get the index at which an element occurs, recursive by default!
Returns:
int
"""
#breadth first search
for i, c in enumerate(self.data):
if c is child:
return i
if recursive: #pylint: disable=too-many-nested-blocks
for i, c in enumerate(self.data):
if ignore is True:
try:
if not c.auth:
continue
except AttributeError:
#not all elements have auth attribute..
pass
elif ignore: #list
doignore = False
for e in ignore:
if e is True:
try:
if not c.auth:
doignore = True
break
except AttributeError:
#not all elements have auth attribute..
pass
elif e == c.__class__ or issubclass(c.__class__,e):
doignore = True
break
if doignore:
continue
if isinstance(c, AbstractElement):
j = c.getindex(child, recursive)
if j != -1:
return i #yes, i ... not j!
return -1
def precedes(self, other):
"""Returns a boolean indicating whether this element precedes the other element"""
try:
ancestor = next(commonancestors(AbstractElement, self, other))
except StopIteration:
raise Exception("Elements share no common ancestor")
#now we just do a depth first search and see who comes first
def callback(e):
if e is self:
return True
elif e is other:
return False
return None
result = ancestor.depthfirstsearch(callback)
if result is None:
raise Exception("Unable to find relation between elements! (shouldn't happen)")
return result
def depthfirstsearch(self, function):
"""Generic depth first search algorithm using a callback function, continues as long as the callback function returns None"""
result = function(self)
if result is not None:
return result
for e in self:
result = e.depthfirstsearch(function)
if result is not None:
return result
return None
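#A minimal sketch of the callback protocol: return None to keep searching, return any
#other value to stop and propagate it (``sentence`` and the PoS class 'N' are hypothetical):
#
#   def firstnoun(element):
#       if isinstance(element, folia.PosAnnotation) and element.cls == 'N':
#           return element.parent #found: stop and return the annotated element
#       return None #continue the depth-first search
#
#   noun = sentence.depthfirstsearch(firstnoun)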
def next(self, Class=True, scope=True, reverse=False):
"""Returns the next element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no next element is found. Non-authoritative elements are never returned.
Arguments:
* ``Class``: The class to select; any python class subclassed off ``AbstractElement``, may also be a tuple of multiple classes. Set to ``True`` to constrain to the same class as that of the current instance, set to ``None`` to not constrain at all
* ``scope``: A list of classes which are never crossed looking for a next element. Set to ``True`` to constrain to a default list of structure elements (Sentence, Paragraph, Division, Event, ListItem, Caption), set to ``None`` to not constrain at all.
"""
if Class is True: Class = self.__class__
if scope is True: scope = STRUCTURESCOPE
if isinstance(Class, tuple): #issubclass() does not accept a tuple as its first argument
structural = any(issubclass(C, AbstractStructureElement) for C in Class)
else:
structural = Class is not None and issubclass(Class, AbstractStructureElement)
if reverse:
order = reversed
descendindex = -1
else:
order = lambda x: x #pylint: disable=redefined-variable-type
descendindex = 0
child = self
parent = self.parent
while parent: #pylint: disable=too-many-nested-blocks
if len(parent) > 1:
returnnext = False
for e in order(parent):
if e is child:
#we found the current item, next item will be the one to return
returnnext = True
elif returnnext and e.auth and not isinstance(e,AbstractAnnotationLayer) and (not structural or (structural and (not isinstance(e,(AbstractTokenAnnotation,TextContent)) ) )):
if structural and isinstance(e,Correction):
if not list(e.select(AbstractStructureElement)): #skip-over non-structural correction
continue
if Class is None or (isinstance(Class,tuple) and (any(isinstance(e,C) for C in Class))) or isinstance(e,Class):
return e
else:
#this is not yet the element of the type we are looking for, we are going to descend again in the very leftmost (rightmost if reversed) branch only
while e.data:
e = e.data[descendindex]
if not isinstance(e, AbstractElement):
return None #we've gone too far
if e.auth and not isinstance(e,AbstractAnnotationLayer):
if Class is None or (isinstance(Class,tuple) and (any(isinstance(e,C) for C in Class))) or isinstance(e,Class):
return e
else:
#descend deeper
continue
return None
#generational iteration
child = parent
if scope is not None and child.__class__ in scope:
#you shall not pass!
break
parent = parent.parent
return None
def previous(self, Class=True, scope=True):
"""Returns the previous element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no next element is found. Non-authoritative elements are never returned.
Arguments:
* ``Class``: The class to select; any python class subclassed off `'AbstractElement``. Set to ``True`` to constrain to the same class as that of the current instance, set to ``None`` to not constrain at all
* ``scope``: A list of classes which are never crossed looking for a next element. Set to ``True`` to constrain to a default list of structure elements (Sentence,Paragraph,Division,Event, ListItem,Caption), set to ``None`` to not constrain at all.
"""
return self.next(Class,scope, True)
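#A minimal usage sketch (hypothetical ``word`` element):
#
#   nextword = word.next() #next element of the same class, default structural scope
#   prevword = word.previous(folia.Word, [folia.Sentence]) #never crosses the sentence boundary
#   if nextword is None:
#       pass #no next word within scope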
def leftcontext(self, size, placeholder=None, scope=None):
"""Returns the left context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope"""
if size == 0: return [] #for efficiency
context = []
e = self
while len(context) < size:
e = e.previous(True,scope)
if not e: break
context.append(e)
if placeholder:
while len(context) < size:
context.append(placeholder)
context.reverse()
return context
def rightcontext(self, size, placeholder=None, scope=None):
"""Returns the right context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope"""
if size == 0: return [] #for efficiency
context = []
e = self
while len(context) < size:
e = e.next(True,scope)
if not e: break
context.append(e)
if placeholder:
while len(context) < size:
context.append(placeholder)
return context
def context(self, size, placeholder=None, scope=None):
"""Returns this word in context, {size} words to the left, the current word, and {size} words to the right"""
return self.leftcontext(size, placeholder,scope) + [self] + self.rightcontext(size, placeholder,scope)
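#A minimal usage sketch (hypothetical ``word`` element): with a placeholder the window
#always has length 2*size+1, padded at document edges:
#
#   window = word.context(3, placeholder='<EMPTY>')
#   assert len(window) == 7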
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None, origclass = None):
"""Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)"""
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if origclass: cls = origclass
preamble = []
try:
if cls.__doc__:
E2 = ElementMaker(namespace="http://relaxng.org/ns/annotation/0.9", nsmap={'a':'http://relaxng.org/ns/annotation/0.9'} )
preamble.append(E2.documentation(cls.__doc__))
except AttributeError:
pass
if cls.REQUIRED_ATTRIBS is None: cls.REQUIRED_ATTRIBS = () #bit hacky
if cls.OPTIONAL_ATTRIBS is None: cls.OPTIONAL_ATTRIBS = () #bit hacky
attribs = [ ]
if cls.REQUIRED_ATTRIBS and Attrib.ID in cls.REQUIRED_ATTRIBS:
attribs.append( E.attribute(E.data(type='ID',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='id', ns="http://www.w3.org/XML/1998/namespace") )
elif Attrib.ID in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(E.data(type='ID',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='id', ns="http://www.w3.org/XML/1998/namespace") ) )
if Attrib.CLASS in cls.REQUIRED_ATTRIBS:
#Set is a tough one, we can't require it as it may be defined in the declaration: we make it optional and need schematron to resolve this later
attribs.append( E.attribute(E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='class') )
attribs.append( E.optional( E.attribute( E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='set' ) ) )
elif Attrib.CLASS in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='class') ) )
attribs.append( E.optional( E.attribute(E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='set' ) ) )
if Attrib.ANNOTATOR in cls.REQUIRED_ATTRIBS or Attrib.ANNOTATOR in cls.OPTIONAL_ATTRIBS:
#Similarly tough
attribs.append( E.optional( E.attribute(E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='annotator') ) )
attribs.append( E.optional( E.attribute(name='annotatortype') ) )
if Attrib.CONFIDENCE in cls.REQUIRED_ATTRIBS:
attribs.append( E.attribute(E.data(type='double',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='confidence') )
elif Attrib.CONFIDENCE in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(E.data(type='double',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='confidence') ) )
if Attrib.N in cls.REQUIRED_ATTRIBS:
attribs.append( E.attribute( E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='n') )
elif Attrib.N in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute( E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='n') ) )
if Attrib.DATETIME in cls.REQUIRED_ATTRIBS:
attribs.append( E.attribute(E.data(type='dateTime',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='datetime') )
elif Attrib.DATETIME in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute( E.data(type='dateTime',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='datetime') ) )
if Attrib.BEGINTIME in cls.REQUIRED_ATTRIBS:
attribs.append(E.attribute(name='begintime') )
elif Attrib.BEGINTIME in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(name='begintime') ) )
if Attrib.ENDTIME in cls.REQUIRED_ATTRIBS:
attribs.append(E.attribute(name='endtime') )
elif Attrib.ENDTIME in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(name='endtime') ) )
if Attrib.SRC in cls.REQUIRED_ATTRIBS:
attribs.append(E.attribute(E.data(type='anyURI',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='src') )
elif Attrib.SRC in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(E.data(type='anyURI',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='src') ) )
if Attrib.SPEAKER in cls.REQUIRED_ATTRIBS:
attribs.append(E.attribute(E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='speaker') )
elif Attrib.SPEAKER in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(E.data(type='string',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'),name='speaker') ) )
if Attrib.TEXTCLASS in cls.REQUIRED_ATTRIBS:
attribs.append(E.attribute(name='textclass') )
elif Attrib.TEXTCLASS in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(name='textclass') ) )
if Attrib.METADATA in cls.REQUIRED_ATTRIBS:
attribs.append(E.attribute(name='metadata') )
elif Attrib.METADATA in cls.OPTIONAL_ATTRIBS:
attribs.append( E.optional( E.attribute(name='metadata') ) )
if cls.XLINK:
attribs += [ #loose interpretation of specs, not checking whether xlink combinations are valid
E.optional(E.attribute(name='href',ns="http://www.w3.org/1999/xlink"),E.attribute(name='type',ns="http://www.w3.org/1999/xlink") ),
E.optional(E.attribute(name='role',ns="http://www.w3.org/1999/xlink")),
E.optional(E.attribute(name='title',ns="http://www.w3.org/1999/xlink")),
E.optional(E.attribute(name='label',ns="http://www.w3.org/1999/xlink")),
E.optional(E.attribute(name='show',ns="http://www.w3.org/1999/xlink")),
]
attribs.append( E.optional( E.attribute( name='auth' ) ) )
if extraattribs:
for e in extraattribs:
attribs.append(e)
attribs.append( E.ref(name="allow_foreign_attributes") )
elements = [] #(including attributes)
if cls.TEXTCONTAINER or cls.PHONCONTAINER:
elements.append( E.text())
#We actually want to require non-empty text (E.text() is not sufficient)
#but this is not solved yet, see https://github.com/proycon/folia/issues/19
#elements.append( E.data(E.param(r".+",name="pattern"),type='string'))
#elements.append( E.data(E.param(r"(.|\n|\r)*\S+(.|\n|\r)*",name="pattern"),type='string'))
done = {}
if includechildren and cls.ACCEPTED_DATA: #pylint: disable=too-many-nested-blocks
for c in cls.ACCEPTED_DATA:
if c.__name__[:8] == 'Abstract' and inspect.isclass(c):
for c2 in globals().values():
try:
if inspect.isclass(c2) and issubclass(c2, c):
try:
if c2.XMLTAG and c2.XMLTAG not in done:
if c2.OCCURRENCES == 1:
elements.append( E.optional( E.ref(name=c2.XMLTAG) ) )
else:
elements.append( E.zeroOrMore( E.ref(name=c2.XMLTAG) ) )
if c2.XMLTAG == 'item': #nasty hack for backward compatibility with deprecated listitem element
elements.append( E.zeroOrMore( E.ref(name='listitem') ) )
done[c2.XMLTAG] = True
except AttributeError:
continue
except TypeError:
pass
elif issubclass(c, Feature) and c.SUBSET:
attribs.append( E.optional( E.attribute(name=c.SUBSET))) #features as attributes
else:
try:
if c.XMLTAG and c.XMLTAG not in done:
if cls.REQUIRED_DATA and c in cls.REQUIRED_DATA:
if c.OCCURRENCES == 1:
elements.append( E.ref(name=c.XMLTAG) )
else:
elements.append( E.oneOrMore( E.ref(name=c.XMLTAG) ) )
elif c.OCCURRENCES == 1:
elements.append( E.optional( E.ref(name=c.XMLTAG) ) )
else:
elements.append( E.zeroOrMore( E.ref(name=c.XMLTAG) ) )
if c.XMLTAG == 'item':
#nasty hack for backward compatibility with deprecated listitem element
elements.append( E.zeroOrMore( E.ref(name='listitem') ) )
done[c.XMLTAG] = True
except AttributeError:
continue
if extraelements:
for e in extraelements:
elements.append( e )
if elements:
if len(elements) > 1:
attribs.append( E.interleave(*elements) )
else:
attribs.append( *elements )
if not attribs:
attribs.append( E.empty() )
if cls.XMLTAG in ('desc','comment'):
return E.define( E.element(E.text(), *(preamble + attribs), **{'name': cls.XMLTAG}), name=cls.XMLTAG, ns=NSFOLIA)
else:
return E.define( E.element(*(preamble + attribs), **{'name': cls.XMLTAG}), name=cls.XMLTAG, ns=NSFOLIA)
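#A minimal usage sketch: serialise the RelaxNG definition for one element type. Note that
#a single definition references named patterns defined elsewhere in the full schema:
#
#   from lxml import etree
#   definition = folia.Word.relaxng()
#   print(etree.tostring(definition, pretty_print=True).decode('utf-8'))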
@classmethod
def parsexml(Class, node, doc, **kwargs): #pylint: disable=bad-classmethod-argument
"""Internal class method used for turning an XML element into an instance of the Class.
Args:
* ``node`` - XML Element
* ``doc`` - Document
Returns:
An instance of the current Class.
"""
assert issubclass(Class, AbstractElement)
if doc.preparsexmlcallback:
result = doc.preparsexmlcallback(node)
if not result:
return None
if isinstance(result, AbstractElement):
return result
dcoi = node.tag.startswith('{' + NSDCOI + '}')
args = []
if not kwargs: kwargs = {}
text = None #for dcoi support
if (Class.TEXTCONTAINER or Class.PHONCONTAINER) and node.text:
args.append(node.text)
for subnode in node: #pylint: disable=too-many-nested-blocks
#don't trip over comments
if isinstance(subnode, ElementTree._Comment): #pylint: disable=protected-access
if (Class.TEXTCONTAINER or Class.PHONCONTAINER) and subnode.tail:
args.append(subnode.tail)
else:
if subnode.tag.startswith('{' + NSFOLIA + '}'):
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Processing subnode " + subnode.tag[nslen:],file=stderr)
try:
e = doc.parsexml(subnode, Class)
except ParseError as e:
raise #just re-raise deepest parseError
except Exception as e:
#Python 3 will preserve full original traceback, Python 2 does not, original cause is explicitly passed to ParseError anyway:
raise ParseError("FoLiA exception in handling of <" + subnode.tag[len(NSFOLIA)+2:] + "> @ line " + str(subnode.sourceline) + ": [" + e.__class__.__name__ + "] " + str(e), cause=e)
if e is not None:
args.append(e)
if (Class.TEXTCONTAINER or Class.PHONCONTAINER) and subnode.tail:
args.append(subnode.tail)
elif subnode.tag.startswith('{' + NSDCOI + '}'):
#Dcoi support
if Class is Text and subnode.tag[nslendcoi:] == 'body':
for subsubnode in subnode:
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Processing DCOI subnode " + subnode.tag[nslendcoi:],file=stderr)
e = doc.parsexml(subsubnode, Class)
if e is not None:
args.append(e)
else:
if doc.debug >= 1: print( "[PyNLPl FoLiA DEBUG] Processing DCOI subnode " + subnode.tag[nslendcoi:],file=stderr)
e = doc.parsexml(subnode, Class)
if e is not None:
args.append(e)
elif doc.debug >= 1:
print("[PyNLPl FoLiA DEBUG] Ignoring subnode outside of FoLiA namespace: " + subnode.tag,file=stderr)
if dcoi:
dcoipos = dcoilemma = dcoicorrection = dcoicorrectionoriginal = None
for key, value in node.attrib.items():
if key[0] == '{' or key =='XMLid':
if key == '{http://www.w3.org/XML/1998/namespace}id' or key == 'XMLid':
key = 'id'
elif key.startswith( '{' + NSFOLIA + '}'):
key = key[nslen:]
if key == 'id':
#ID in FoLiA namespace is always a reference, passed in kwargs as follows:
key = 'idref'
elif Class.XLINK and key.startswith('{http://www.w3.org/1999/xlink}'):
key = key[30:]
if key != 'href':
key = 'xlink' + key #xlinktype, xlinkrole, xlinklabel, xlinkshow, etc..
elif key.startswith('{' + NSDCOI + '}'):
key = key[nslendcoi:]
#D-Coi support:
if dcoi:
if Class is Word and key == 'pos':
dcoipos = value
continue
elif Class is Word and key == 'lemma':
dcoilemma = value
continue
elif Class is Word and key == 'correction':
dcoicorrection = value #class
continue
elif Class is Word and key == 'original':
dcoicorrectionoriginal = value
continue
elif Class is Gap and key == 'reason':
key = 'class'
elif Class is Gap and key == 'hand':
key = 'annotator'
elif Class is Division and key == 'type':
key = 'cls'
kwargs[key] = value
#D-Coi support:
if dcoi and TextContent in Class.ACCEPTED_DATA and node.text:
text = node.text.strip()
kwargs['text'] = text
if not AnnotationType.TOKEN in doc.annotationdefaults:
doc.declare(AnnotationType.TOKEN, set='http://ilk.uvt.nl/folia/sets/ilktok.foliaset')
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found " + node.tag[nslen:],file=stderr)
instance = Class(doc, *args, **kwargs)
#if id:
# if doc.debug >= 1: print >>stderr, "[PyNLPl FoLiA DEBUG] Adding to index: " + id
# doc.index[id] = instance
if dcoi:
if dcoipos:
if not AnnotationType.POS in doc.annotationdefaults:
doc.declare(AnnotationType.POS, set='http://ilk.uvt.nl/folia/sets/cgn-legacy.foliaset')
instance.append( PosAnnotation(doc, cls=dcoipos) )
if dcoilemma:
if not AnnotationType.LEMMA in doc.annotationdefaults:
doc.declare(AnnotationType.LEMMA, set='http://ilk.uvt.nl/folia/sets/mblem-nl.foliaset')
instance.append( LemmaAnnotation(doc, cls=dcoilemma) )
if dcoicorrection and dcoicorrectionoriginal and text:
if not AnnotationType.CORRECTION in doc.annotationdefaults:
doc.declare(AnnotationType.CORRECTION, set='http://ilk.uvt.nl/folia/sets/dcoi-corrections.foliaset')
instance.correct(generate_id_in=instance, cls=dcoicorrection, original=dcoicorrectionoriginal, new=text)
if doc.parsexmlcallback:
result = doc.parsexmlcallback(instance)
if not result:
return None
if isinstance(result, AbstractElement):
return result
return instance
def resolveword(self, id):
return None
def remove(self, child):
"""Removes the child element"""
if not isinstance(child, AbstractElement):
raise ValueError("Expected AbstractElement, got " + str(type(child)))
if child.parent == self:
child.parent = None
self.data.remove(child)
#delete from index
if child.id and self.doc and child.id in self.doc.index:
del self.doc.index[child.id]
def incorrection(self):
"""Is this element part of a correction? If it is, it returns the Correction element (evaluating to True), otherwise it returns None"""
e = self.parent
while e:
if isinstance(e, Correction):
return e
if isinstance(e, AbstractStructureElement):
break
e = e.parent
return None
class Description(AbstractElement):
"""Description is an element that can be used to associate a description with almost any other FoLiA element"""
def __init__(self,doc, *args, **kwargs):
"""Required keyword arguments:
* ``value=``: The text content for the description (``str`` or ``unicode``)
"""
if 'value' in kwargs:
if kwargs['value'] is None:
self.value = ""
elif isstring(kwargs['value']):
self.value = u(kwargs['value'])
else:
if sys.version < '3':
raise Exception("value= parameter must be unicode or str instance, got " + str(type(kwargs['value'])))
else:
raise Exception("value= parameter must be str instance, got " + str(type(kwargs['value'])))
del kwargs['value']
else:
raise Exception("Description expects value= parameter")
super(Description,self).__init__(doc, *args, **kwargs)
def __nonzero__(self): #Python 2.x
return bool(self.value)
def __bool__(self):
return bool(self.value)
def __unicode__(self):
return self.value
def __str__(self):
return self.value
def xml(self, attribs = None,elements = None, skipchildren = False):
return super(Description, self).xml(attribs, [self.value],skipchildren)
def json(self,attribs =None, recurse=True, ignorelist=False):
jsonnode = {'type': self.XMLTAG, 'value': self.value}
if attribs:
for attrib in attribs:
jsonnode[attrib] = attribs[attrib] #copy the value for this key, not the key itself
return jsonnode
@classmethod
def parsexml(Class, node, doc, **kwargs):
if not kwargs: kwargs = {}
kwargs['value'] = node.text
return super(Description,Class).parsexml(node, doc, **kwargs)
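#A minimal usage sketch (hypothetical ``doc`` and ``word``; the ``description()``
#accessor on the parent element is assumed here):
#
#   word.append(folia.Description(doc, value="a tricky word"))
#   print(word.description())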
class Comment(AbstractElement):
"""Comment is an element that can be used to associate a comment with almost any other FoLiA element"""
def __init__(self,doc, *args, **kwargs):
"""Required keyword arguments:
* ``value=``: The text content for the comment (``str`` or ``unicode``)
"""
if 'value' in kwargs:
if kwargs['value'] is None:
self.value = ""
elif isstring(kwargs['value']):
self.value = u(kwargs['value'])
else:
if sys.version < '3':
raise Exception("value= parameter must be unicode or str instance, got " + str(type(kwargs['value'])))
else:
raise Exception("value= parameter must be str instance, got " + str(type(kwargs['value'])))
del kwargs['value']
else:
raise Exception("Comment expects value= parameter")
super(Comment,self).__init__(doc, *args, **kwargs)
def __nonzero__(self): #Python 2.x
return bool(self.value)
def __bool__(self):
return bool(self.value)
def __unicode__(self):
return self.value
def __str__(self):
return self.value
def xml(self, attribs = None,elements = None, skipchildren = False):
return super(Comment, self).xml(attribs, [self.value],skipchildren)
def json(self,attribs =None, recurse=True, ignorelist=False):
jsonnode = {'type': self.XMLTAG, 'value': self.value}
if attribs:
for attrib in attribs:
jsonnode[attrib] = attrib
return jsonnode
@classmethod
def parsexml(Class, node, doc, **kwargs):
if not kwargs: kwargs = {}
kwargs['value'] = node.text
return super(Comment,Class).parsexml(node, doc, **kwargs)
class AllowCorrections(object):
def correct(self, **kwargs):
"""Apply a correction (TODO: documentation to be written still)"""
if 'insertindex_offset' in kwargs:
del kwargs['insertindex_offset'] #dealt with in an earlier stage
if 'confidence' in kwargs and kwargs['confidence'] is None:
del kwargs['confidence']
if 'reuse' in kwargs:
#reuse an existing correction instead of making a new one
if isinstance(kwargs['reuse'], Correction):
c = kwargs['reuse']
else: #assume it's an index
try:
c = self.doc.index[kwargs['reuse']]
assert isinstance(c, Correction)
except Exception:
raise ValueError("reuse= must point to an existing correction (id or instance)! Got " + str(kwargs['reuse']))
suggestionsonly = (not c.hasnew(True) and not c.hasoriginal(True) and c.hassuggestions(True))
if 'new' in kwargs and c.hascurrent():
#can't add new if there's current, so first set original to current, and then delete current
if 'current' in kwargs:
raise Exception("Can't set both new= and current= !")
if 'original' not in kwargs:
kwargs['original'] = c.current()
c.remove(c.current())
else:
if 'id' not in kwargs and 'generate_id_in' not in kwargs:
kwargs['generate_id_in'] = self
kwargs2 = copy(kwargs)
for x in ['new','original','suggestion', 'suggestions','current', 'insertindex','nooriginal']:
if x in kwargs2:
del kwargs2[x]
c = Correction(self.doc, **kwargs2)
addnew = False
if 'insertindex' in kwargs:
insertindex = int(kwargs['insertindex'])
del kwargs['insertindex']
else:
insertindex = -1 #append
if 'nooriginal' in kwargs and kwargs['nooriginal']:
nooriginal = True
del kwargs['nooriginal']
else:
nooriginal = False
if 'current' in kwargs:
if 'original' in kwargs or 'new' in kwargs: raise Exception("When setting current=, original= and new= can not be set!")
if not isinstance(kwargs['current'], list) and not isinstance(kwargs['current'], tuple): kwargs['current'] = [kwargs['current']] #support both lists (for multiple elements at once), as well as single element
c.replace(Current(self.doc, *kwargs['current']))
for o in kwargs['current']: #delete current from current element
if o in self and isinstance(o, AbstractElement): #pylint: disable=unsupported-membership-test
if insertindex == -1: insertindex = self.data.index(o)
self.remove(o)
del kwargs['current']
if 'new' in kwargs:
if not isinstance(kwargs['new'], list) and not isinstance(kwargs['new'], tuple): kwargs['new'] = [kwargs['new']] #support both lists (for multiple elements at once), as well as single element
addnew = New(self.doc, *kwargs['new']) #pylint: disable=redefined-variable-type
c.replace(addnew)
for current in c.select(Current): #delete current if present
c.remove(current)
del kwargs['new']
if 'original' in kwargs and kwargs['original']:
if not isinstance(kwargs['original'], list) and not isinstance(kwargs['original'], tuple): kwargs['original'] = [kwargs['original']] #support both lists (for multiple elements at once), as well as single element
c.replace(Original(self.doc, *kwargs['original']))
for o in kwargs['original']: #delete original from current element
if o in self and isinstance(o, AbstractElement): #pylint: disable=unsupported-membership-test
if insertindex == -1: insertindex = self.data.index(o)
self.remove(o)
for o in kwargs['original']: #make sure IDs are still properly set after removal
o.addtoindex()
for current in c.select(Current): #delete current if present
c.remove(current)
del kwargs['original']
elif addnew and not nooriginal:
#original not specified, find automagically:
original = []
for new in addnew:
kwargs2 = {}
if isinstance(new, TextContent):
kwargs2['cls'] = new.cls
try:
set = new.set
except AttributeError:
set = None
#print("DEBUG: Finding replaceables within " + str(repr(self)) + " for ", str(repr(new)), " set " ,set , " args " ,repr(kwargs2),file=sys.stderr)
replaceables = new.__class__.findreplaceables(self, set, **kwargs2)
#print("DEBUG: " , len(replaceables) , " found",file=sys.stderr)
original += replaceables
if not original:
#print("DEBUG: ", self.xmlstring(),file=sys.stderr)
raise Exception("No original= specified and unable to automatically infer on " + str(repr(self)) + " for " + str(repr(new)) + " with set " + set)
else:
c.replace( Original(self.doc, *original))
for current in c.select(Current): #delete current if present
c.remove(current)
if addnew and not nooriginal:
for original in c.original():
if original in self: #pylint: disable=unsupported-membership-test
self.remove(original)
if 'suggestion' in kwargs:
kwargs['suggestions'] = [kwargs['suggestion']]
del kwargs['suggestion']
if 'suggestions' in kwargs:
for suggestion in kwargs['suggestions']:
if isinstance(suggestion, Suggestion):
c.append(suggestion)
elif isinstance(suggestion, list) or isinstance(suggestion, tuple):
c.append(Suggestion(self.doc, *suggestion))
else:
c.append(Suggestion(self.doc, suggestion))
del kwargs['suggestions']
if 'reuse' in kwargs:
if addnew and suggestionsonly:
#What was previously only a suggestion, now becomes a real correction
#If annotator, annotatortypes
#are associated with the correction as a whole, move it to the suggestions
#correction-wide annotator, annotatortypes might be overwritten
for suggestion in c.suggestions():
if c.annotator and not suggestion.annotator:
suggestion.annotator = c.annotator
if c.annotatortype and not suggestion.annotatortype:
suggestion.annotatortype = c.annotatortype
if 'annotator' in kwargs:
c.annotator = kwargs['annotator'] #pylint: disable=attribute-defined-outside-init
if 'annotatortype' in kwargs:
c.annotatortype = kwargs['annotatortype'] #pylint: disable=attribute-defined-outside-init
if 'confidence' in kwargs:
c.confidence = float(kwargs['confidence']) #pylint: disable=attribute-defined-outside-init
c.addtoindex()
del kwargs['reuse']
else:
c.addtoindex()
if insertindex == -1:
self.append(c)
else:
self.insert(insertindex, c)
return c
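#A minimal usage sketch: correct the text of a word, automatically moving the original
#text content into the correction (document, IDs, set and class names are hypothetical):
#
#   word = doc['example.p.1.s.1.w.1']
#   word.correct(new=folia.TextContent(doc, 'the'),
#                set='https://example.org/corrections', cls='spelling',
#                annotator='corrector', annotatortype=folia.AnnotatorType.MANUAL)
#
#Or record a non-authoritative suggestion instead of an authoritative correction:
#
#   word.correct(suggestion=folia.TextContent(doc, 'the'),
#                set='https://example.org/corrections', cls='spelling')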
class AllowTokenAnnotation(AllowCorrections):
"""Elements that allow token annotation (including extended annotation) must inherit from this class"""
def annotations(self,Class,set=None):
"""Obtain child elements (annotations) of the specified class.
A further restriction can be made based on set.
Arguments:
Class (class): The class to select; any python class (not instance) subclassed off :class:`AbstractElement`
Set (str): The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned.
Yields:
Elements (instances derived from :class:`AbstractElement`)
Example::
for sense in text.annotations(folia.Sense, 'http://some/path/cornetto'):
..
See also:
:meth:`AbstractElement.select`
Raises:
:class:`NoSuchAnnotation` if no such annotation exists
"""
found = False
for e in self.select(Class,set,True,default_ignore_annotations):
found = True
yield e
if not found:
raise NoSuchAnnotation()
def hasannotation(self,Class,set=None):
"""Returns an integer indicating whether such as annotation exists, and if so, how many.
See :meth:`AllowTokenAnnotation.annotations`` for a description of the parameters."""
return sum( 1 for _ in self.select(Class,set,True,default_ignore_annotations))
def annotation(self, type, set=None):
"""Obtain a single annotation element.
A further restriction can be made based on set.
Arguments:
Class (class): The class to select; any python class (not instance) subclassed off :class:`AbstractElement`
Set (str): The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned.
Returns:
An element (instance derived from :class:`AbstractElement`)
Example::
sense = word.annotation(folia.Sense, 'http://some/path/cornetto').cls
See also:
:meth:`AllowTokenAnnotation.annotations`
:meth:`AbstractElement.select`
Raises:
:class:`NoSuchAnnotation` if no such annotation exists
"""
"""Will return a **single** annotation (even if there are multiple). Raises a ``NoSuchAnnotation`` exception if none was found"""
for e in self.select(type,set,True,default_ignore_annotations):
return e
raise NoSuchAnnotation()
def alternatives(self, Class=None, set=None):
"""Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set.
Arguments:
Class (class): The python Class you want to retrieve (e.g. PosAnnotation). Or set to ``None`` to select all alternatives regardless of what type they are.
set (str): The set you want to retrieve (defaults to ``None``, which selects regardless of set)
Yields:
:class:`Alternative` elements
"""
for e in self.select(Alternative,None, True, []): #pylint: disable=too-many-nested-blocks
if Class is None:
yield e
elif len(e) >= 1: #child elements?
for e2 in e:
try:
if isinstance(e2, Class):
try:
if set is None or e2.set == set:
yield e #not e2
break #yield an alternative only once (in case there are multiple matches)
except AttributeError:
continue
except AttributeError:
continue
class AllowGenerateID(object):
"""Classes inherited from this class allow for automatic ID generation, using the convention of adding a period, the name of the element , another period, and a sequence number"""
def _getmaxid(self, xmltag):
try:
if xmltag in self.maxid:
return self.maxid[xmltag]
else:
return 0
except AttributeError:
return 0
def _setmaxid(self, child):
#print "set maxid on " + repr(self) + " for " + repr(child)
try:
self.maxid
except AttributeError:
self.maxid = {}#pylint: disable=attribute-defined-outside-init
try:
if child.id and child.XMLTAG:
fields = child.id.split(self.doc.IDSEPARATOR)
if len(fields) > 1 and fields[-1].isdigit():
if not child.XMLTAG in self.maxid:
self.maxid[child.XMLTAG] = int(fields[-1])
#print "set maxid on " + repr(self) + ", " + child.XMLTAG + " to " + fields[-1]
else:
if self.maxid[child.XMLTAG] < int(fields[-1]):
self.maxid[child.XMLTAG] = int(fields[-1])
#print "set maxid on " + repr(self) + ", " + child.XMLTAG + " to " + fields[-1]
except AttributeError:
pass
def generate_id(self, cls):
if isinstance(cls,str):
xmltag = cls
else:
try:
xmltag = cls.XMLTAG
except AttributeError:
raise GenerateIDException("Unable to generate ID, expected a class such as Alternative, Correction, etc...")
maxid = self._getmaxid(xmltag)
id = None
if self.id:
id = self.id
else:
#this element has no ID, fall back to closest parent ID:
e = self
while e.parent:
if e.id:
id = e.id
break
e = e.parent
if id is None:
raise GenerateIDException("Unable to generate ID, no parent ID could be found")
origid = id
while True:
maxid += 1
id = origid + '.' + xmltag + '.' + str(maxid)
if not self.doc or id not in self.doc.index: #extra check
break
try:
self.maxid
except AttributeError:
self.maxid = {}#pylint: disable=attribute-defined-outside-init
self.maxid[xmltag] = maxid #Set MAX ID
return id
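#A minimal sketch of the ID convention (IDs are hypothetical):
#
#   sentence.id                       # e.g. "example.p.1.s.1"
#   sentence.generate_id(folia.Word)  # -> "example.p.1.s.1.w.1" (next free sequence number)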
class AbstractStructureElement(AbstractElement, AllowTokenAnnotation, AllowGenerateID):
"""Abstract element, all structure elements inherit from this class. Never instantiated directly."""
def __init__(self, doc, *args, **kwargs):
super(AbstractStructureElement,self).__init__(doc, *args, **kwargs)
def resolveword(self, id):
for child in self:
r = child.resolveword(id)
if r:
return r
return None
def append(self, child, *args, **kwargs):
"""See ``AbstractElement.append()``"""
e = super(AbstractStructureElement,self).append(child, *args, **kwargs)
self._setmaxid(e)
return e
def postappend(self):
super(AbstractStructureElement,self).postappend()
if self.doc and self.doc.textvalidation:
self.doc.textvalidationerrors += int(not self.textvalidation())
def words(self, index = None):
"""Returns a generator of Word elements found (recursively) under this element.
Arguments:
* ``index``: If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning a generator of all
"""
if index is None:
return self.select(Word,None,True,default_ignore_structure)
else:
if index < 0:
index = self.count(Word,None,True,default_ignore_structure) + index
for i, e in enumerate(self.select(Word,None,True,default_ignore_structure)):
if i == index:
return e
raise IndexError
def paragraphs(self, index = None):
"""Returns a generator of Paragraph elements found (recursively) under this element.
Arguments:
index (int or None): If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning the generator of all
"""
if index is None:
return self.select(Paragraph,None,True,default_ignore_structure)
else:
if index < 0:
index = self.count(Paragraph,None,True,default_ignore_structure) + index
for i,e in enumerate(self.select(Paragraph,None,True,default_ignore_structure)):
if i == index:
return e
raise IndexError
def sentences(self, index = None):
"""Returns a generator of Sentence elements found (recursively) under this element
Arguments:
index (int or None): If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning a generator of all
"""
if index is None:
return self.select(Sentence,None,True,default_ignore_structure)
else:
if index < 0:
index = self.count(Sentence,None,True,default_ignore_structure) + index
for i,e in enumerate(self.select(Sentence,None,True,default_ignore_structure)):
if i == index:
return e
raise IndexError
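#A minimal usage sketch of structural queries (hypothetical ``doc``):
#
#   text = doc.data[0] #the root Text element
#   for paragraph in text.paragraphs():
#       for sentence in paragraph.sentences():
#           firstword = sentence.words(0) #index access; raises IndexError when out of range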
def layers(self, annotationtype=None,set=None):
"""Returns a list of annotation layers found *directly* under this element, does not include alternative layers"""
if inspect.isclass(annotationtype): annotationtype = annotationtype.ANNOTATIONTYPE
return [ x for x in self.select(AbstractAnnotationLayer,set,False,True) if annotationtype is None or x.ANNOTATIONTYPE == annotationtype ]
def hasannotationlayer(self, annotationtype=None,set=None):
"""Does the specified annotation layer exist?"""
l = self.layers(annotationtype, set)
return (len(l) > 0)
def __eq__(self, other):
return super(AbstractStructureElement, self).__eq__(other)
class AbstractTokenAnnotation(AbstractElement, AllowGenerateID):
"""Abstract element, all token annotation elements are derived from this class"""
def append(self, child, *args, **kwargs):
"""See ``AbstractElement.append()``"""
e = super(AbstractTokenAnnotation,self).append(child, *args, **kwargs)
self._setmaxid(e)
return e
class AbstractExtendedTokenAnnotation(AbstractTokenAnnotation):
pass
class AbstractTextMarkup(AbstractElement):
"""Abstract class for text markup elements, elements that appear with the :class:`TextContent` (``t``) element.
Markup elements pertain primarily to styling, but also have other roles.
Iterating over a
:class:`TextContent` element will first and foremost produce strings, but also
uncover these markup elements when present.
"""
def __init__(self, doc, *args, **kwargs):
"""See :meth:`AbstractElement.__init__`, text is passed as a string in ``*args``."""
if 'idref' in kwargs:
self.idref = kwargs['idref']
del kwargs['idref']
else:
self.idref = None
if 'value' in kwargs:
#for backward compatibility
kwargs['text'] = kwargs['value']
del kwargs['value']
super(AbstractTextMarkup,self).__init__(doc, *args, **kwargs)
#if self.value and (self.value != self.value.translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)):
# raise ValueError("There are illegal unicode control characters present in Text Markup Content: " + repr(self.value))
def settext(self, text):
"""Sets the text content of the markup element.
Arguments:
text (str)
"""
if not text:
raise ValueError("Empty text content elements are not allowed")
self.data = [text]
def resolve(self):
if self.idref:
return self.doc[self.idref]
else:
return self
def xml(self, attribs = None,elements = None, skipchildren = False):
"""See :meth:`AbstractElement.xml`"""
if not attribs: attribs = {}
if self.idref:
attribs['id'] = self.idref
return super(AbstractTextMarkup,self).xml(attribs,elements, skipchildren)
def json(self,attribs =None, recurse=True, ignorelist=False):
"""See :meth:`AbstractElement.json`"""
if not attribs: attribs = {}
if self.idref:
attribs['id'] = self.idref
return super(AbstractTextMarkup,self).json(attribs,recurse, ignorelist)
@classmethod
def parsexml(Class, node, doc, **kwargs):
if not kwargs: kwargs ={}
if 'id' in node.attrib:
kwargs['idref'] = node.attrib['id']
del node.attrib['id']
return super(AbstractTextMarkup,Class).parsexml(node, doc, **kwargs)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if not extraattribs: extraattribs = []
extraattribs.append( E.optional(E.attribute(name='id' ))) #id reference
return super(AbstractTextMarkup, cls).relaxng(includechildren, extraattribs, extraelements)
class TextMarkupString(AbstractTextMarkup):
"""Markup element to mark arbitrary substrings in text content (:class:`TextContent`)"""
class TextMarkupGap(AbstractTextMarkup):
"""Markup element to mark gaps in text content (:class:`TextContent`)
Only consider this element for gaps in spans of untokenised text. The use of structural element :class:`Gap` is preferred.
"""
class TextMarkupCorrection(AbstractTextMarkup):
"""Markup element to mark corrections in text content (:class:`TextContent`).
Only consider this element for corrections on untokenised text. The use of :class:`Correction` is preferred.
"""
def __init__(self, doc, *args, **kwargs):
if 'original' in kwargs:
self.original = kwargs['original']
del kwargs['original']
else:
self.original = None
super(TextMarkupCorrection,self).__init__(doc, *args, **kwargs)
def xml(self, attribs = None,elements = None, skipchildren = False):
if not attribs: attribs = {}
if self.original:
attribs['original'] = self.original
return super(TextMarkupCorrection,self).xml(attribs,elements, skipchildren)
def json(self,attribs =None, recurse=True, ignorelist=False):
if not attribs: attribs = {}
if self.original:
attribs['original'] = self.original
return super(TextMarkupCorrection,self).json(attribs,recurse,ignorelist)
@classmethod
def parsexml(Class, node, doc, **kwargs):
if not kwargs: kwargs = {}
if 'original' in node.attrib:
kwargs['original'] = node.attrib['original']
del node.attrib['original']
return super(TextMarkupCorrection,Class).parsexml(node, doc, **kwargs)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if not extraattribs: extraattribs = []
extraattribs.append( E.optional(E.attribute(name='original' )))
return super(TextMarkupCorrection, cls).relaxng(includechildren, extraattribs, extraelements)
class TextMarkupError(AbstractTextMarkup):
"""Markup element to mark gaps in text content (:class:`TextContent`)
Only consider this element for gaps in spans of untokenised text. The use of structural element :class:`ErrorDetection` is preferred.
"""
class TextMarkupStyle(AbstractTextMarkup):
"""Markup element to style text content (:class:`TextContent`), e.g. make text bold, italics, underlined, coloured, etc.."""
class TextContent(AbstractElement):
"""Text content element (``t``), holds text to be associated with whatever element the text content element is a child of.
Text content elements
on structure elements like :class:`Paragraph` and :class:`Sentence` are by definition untokenised. Only at :class:`Word` level and deeper are they by definition tokenised.
Text content elements can specify an offset that refers to text at a higher parent level. Use the following keyword arguments:
* ``ref=``: The instance to point to, this points to the element holding the text content element, not the text content element itself.
* ``offset=``: The offset where this text is found, offsets start at 0
"""
def __init__(self, doc, *args, **kwargs):
"""
Example::
text = folia.TextContent(doc, 'test')
text = folia.TextContent(doc, 'test',cls='original')
"""
#for backward compatibility:
if 'value' in kwargs:
kwargs['text'] = kwargs['value']
del kwargs['value']
if 'offset' in kwargs: #offset
self.offset = int(kwargs['offset'])
del kwargs['offset']
else:
self.offset = None
#If no class is specified, it defaults to 'current'. (FoLiA uncharacteristically predefines two classes for t: current and original)
if 'cls' not in kwargs and 'class' not in kwargs:
kwargs['cls'] = 'current'
if 'ref' in kwargs: #reference to offset
if isinstance(kwargs['ref'], AbstractElement):
if kwargs['ref'].id is None:
raise ValueError("Reference for text content must have an ID or can't act as reference!")
self.ref = kwargs['ref'].id
else:
#a string (ID) is passed, we can't resolve it yet because it may not exist at construction time, use getreference() to resolve when needed
self.ref = kwargs['ref']
del kwargs['ref']
else:
self.ref = None #no explicit reference; if the reference is implicit, getreference() will still work
super(TextContent,self).__init__(doc, *args, **kwargs)
doc.textclasses.add(self.cls)
if not self.data:
raise ValueError("Empty text content elements are not allowed")
#if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)):
# raise ValueError("There are illegal unicode control characters present in TextContent: " + repr(self.data[0]))
def text(self, normalize_spaces=False):
"""Obtain the text (unicode instance)"""
return super(TextContent,self).text(normalize_spaces=normalize_spaces) #AbstractElement will handle it now, merely overridden to get rid of parameters that don't make sense in this context
def settext(self, text):
if not text:
raise ValueError("Empty text content elements are not allowed")
self.data = [text]
#if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)):
# raise ValueError("There are illegal unicode control characters present in TextContent: " + repr(self.data[0]))
def getreference(self, validate=True):
"""Returns and validates the Text Content's reference. Raises UnresolvableTextContent when invalid"""
if self.offset is None: return None #nothing to test
if self.ref:
ref = self.doc[self.ref]
else:
ref = self.finddefaultreference()
if not ref:
raise UnresolvableTextContent("Default reference for textcontent not found!")
elif not ref.hastext(self.cls):
raise UnresolvableTextContent("Reference (ID " + str(ref.id) + ") has no such text (class=" + self.cls+")")
elif validate and self.text() != ref.textcontent(self.cls).text()[self.offset:self.offset+len(self.data[0])]:
raise UnresolvableTextContent("Reference (ID " + str(ref.id) + ", class=" + self.cls+") found but no text match at specified offset ("+str(self.offset)+")! Expected '" + self.text() + "', got '" + ref.textcontent(self.cls).text()[self.offset:self.offset+len(self.data[0])] +"'")
else:
#finally, we made it!
return ref
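#A minimal sketch of offsets (IDs and text are hypothetical): the word's text points at
#offset 6 of the sentence's text, and getreference() resolves and validates it:
#
#   sentence = folia.Sentence(doc, folia.TextContent(doc, 'Hello world'), id='example.s.1')
#   word = folia.Word(doc, folia.TextContent(doc, 'world', offset=6), id='example.s.1.w.2')
#   sentence.append(word)
#   assert word.textcontent('current').getreference() is sentence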
def deepvalidation(self):
return True
def __unicode__(self):
return self.text()
def __str__(self):
return self.text()
def __eq__(self, other):
if isinstance(other, TextContent):
return self.text() == other.text()
elif isstring(other):
return self.text() == u(other)
else:
return False
def finddefaultreference(self):
"""Find the default reference for text offsets:
The parent of the current textcontent's parent (counting only Structure Elements and Subtoken Annotation Elements)
Note: this does not return a TextContent element, but its parent. Whether the textcontent actually exists is checked later/elsewhere
"""
depth = 0
e = self
while True:
if e.parent:
e = e.parent #pylint: disable=redefined-variable-type
else:
#no parent, breaking
return False
if isinstance(e, (AbstractStructureElement, AbstractSubtokenAnnotation, String)):
depth += 1
if depth == 2:
return e
return False
#Change in behaviour (FoLiA 0.10), iter() no longer iterates over the text itself!!
#Change in behaviour (FoLiA 0.10), len() no longer return the length of the text!!
@classmethod
def findreplaceables(Class, parent, set, **kwargs):
"""(Method for internal usage, see AbstractElement)"""
#some extra behaviour for text content elements, replace also based on the 'corrected' attribute:
if 'cls' not in kwargs:
kwargs['cls'] = 'current'
replace = super(TextContent, Class).findreplaceables(parent, set, **kwargs)
replace = [ x for x in replace if x.cls == kwargs['cls']]
del kwargs['cls'] #always delete what we processed
return replace
@classmethod
def parsexml(Class, node, doc, **kwargs):
"""(Method for internal usage, see AbstractElement)"""
if not kwargs: kwargs = {}
if 'offset' in node.attrib:
kwargs['offset'] = int(node.attrib['offset'])
if 'ref' in node.attrib:
kwargs['ref'] = node.attrib['ref']
return super(TextContent,Class).parsexml(node,doc, **kwargs)
def xml(self, attribs = None,elements = None, skipchildren = False):
"""See :meth:`AbstractElement.xml`"""
attribs = {}
if self.offset is not None:
attribs['{' + NSFOLIA + '}offset'] = str(self.offset)
if self.parent and self.ref:
attribs['{' + NSFOLIA + '}ref'] = self.ref
#if self.cls != 'current' and not (self.cls == 'original' and any( isinstance(x, Original) for x in self.ancestors() ) ):
# attribs['{' + NSFOLIA + '}class'] = self.cls
#else:
# if '{' + NSFOLIA + '}class' in attribs:
# del attribs['{' + NSFOLIA + '}class']
#return E.t(self.value, **attribs)
e = super(TextContent,self).xml(attribs,elements,skipchildren)
if '{' + NSFOLIA + '}class' in e.attrib and e.attrib['{' + NSFOLIA + '}class'] == "current":
#delete 'class=current'
del e.attrib['{' + NSFOLIA + '}class']
return e
def json(self, attribs =None, recurse =True,ignorelist=False):
"""See :meth:`AbstractElement.json`"""
attribs = {}
if self.offset is not None:
attribs['offset'] = self.offset
if self.parent and self.ref:
attribs['ref'] = self.ref
return super(TextContent,self).json(attribs, recurse,ignorelist)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if not extraattribs: extraattribs = []
extraattribs.append( E.optional(E.attribute(name='offset' )))
extraattribs.append( E.optional(E.attribute(name='ref' )))
return super(TextContent, cls).relaxng(includechildren, extraattribs, extraelements)
def postappend(self):
super(TextContent,self).postappend()
found = set()
for c in self.parent:
if isinstance(c,TextContent):
if c.cls in found:
raise DuplicateAnnotationError("Can not add multiple text content elements with the same class (" + c.cls + ") to the same structural element!")
else:
found.add(c.cls)
class PhonContent(AbstractElement):
"""Phonetic content element (``ph``), holds a phonetic representation to be associated with whatever element the phonetic content element is a child of.
Phonetic content elements behave much like text content elements.
Phonetic content elements can specify an offset that refers to phonetic content at a higher parent level. Use the following keyword arguments:
* ``ref=``: The instance to point to, this points to the element holding the phonetic content element, not the phonetic content element itself.
* ``offset=``: The offset where this text is found, offsets start at 0
"""
def __init__(self, doc, *args, **kwargs):
"""
Example::
phon = folia.PhonContent(doc, 'hɛˈləʊ̯')
phon = folia.PhonContent(doc, 'hɛˈləʊ̯', cls="original")
"""
if 'offset' in kwargs: #offset
self.offset = int(kwargs['offset'])
del kwargs['offset']
else:
self.offset = None
#If no class is specified, it defaults to 'current'. (FoLiA uncharacteristically predefines two classes for phon: current and original)
if 'cls' not in kwargs and 'class' not in kwargs:
kwargs['cls'] = 'current'
if 'ref' in kwargs: #reference to offset
if isinstance(kwargs['ref'], AbstractElement):
if kwargs['ref'].id is None:
raise ValueError("Reference for phonetic content must have an ID or can't act as reference!")
self.ref = kwargs['ref'].id
else:
#a string (ID) is passed, we can't resolve it yet because it may not exist at construction time, use getreference() to resolve when needed
self.ref = kwargs['ref']
del kwargs['ref']
else:
self.ref = None #no explicit reference; if the reference is implicit, getreference() will still work
super(PhonContent,self).__init__(doc, *args, **kwargs)
if not self.data:
raise ValueError("Empty phonetic content elements are not allowed")
#if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)):
# raise ValueError("There are illegal unicode control characters present in TextContent: " + repr(self.data[0]))
def phon(self):
"""Obtain the actual phonetic representation (unicode/str instance)"""
return super(PhonContent,self).phon() #AbstractElement will handle it now, merely overridden to get rid of parameters that don't make sense in this context
def setphon(self, phon):
"""Set the representation for the phonetic content (unicode instance), called whenever phon= is passed as a keyword argument to an element constructor """
if not phon:
raise ValueError("Empty phonetic content elements are not allowed")
self.data = [phon]
#if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)):
# raise ValueError("There are illegal unicode control characters present in PhonContent: " + repr(self.data[0]))
def getreference(self, validate=True):
"""Return and validate the Phonetic Content's reference. Raises UnresolvableTextContent when invalid"""
if self.offset is None: return None #nothing to test
if self.ref:
ref = self.doc[self.ref]
else:
ref = self.finddefaultreference()
if not ref:
raise UnresolvableTextContent("Default reference for phonetic content not found!")
elif not ref.hasphon(self.cls):
raise UnresolvableTextContent("Reference has no such phonetic content (class=" + self.cls+")")
elif validate and self.phon() != ref.phoncontent(self.cls).phon()[self.offset:self.offset+len(self.data[0])]:
raise UnresolvableTextContent("Reference (class=" + self.cls+") found but no phonetic match at specified offset ("+str(self.offset)+")! Expected '" + self.phon() + "', got '" + ref.phoncontent(self.cls).phon()[self.offset:self.offset+len(self.data[0])] +"'")
else:
#finally, we made it!
return ref
def deepvalidation(self):
return True
def __unicode__(self):
return self.phon()
def __str__(self):
return self.phon()
def __eq__(self, other):
if isinstance(other, PhonContent):
return self.phon() == other.phon()
elif isstring(other):
return self.phon() == u(other)
else:
return False
#append is implemented, the default suffices
def postappend(self):
super(PhonContent,self).postappend()
found = set()
for c in self.parent:
if isinstance(c,PhonContent):
if c.cls in found:
raise DuplicateAnnotationError("Can not add multiple text content elements with the same class (" + c.cls + ") to the same structural element!")
else:
found.add(c.cls)
def finddefaultreference(self):
"""Find the default reference for text offsets:
The parent of the current textcontent's parent (counting only Structure Elements and Subtoken Annotation Elements)
Note: This returns not a TextContent element, but its parent. Whether the textcontent actually exists is checked later/elsewhere
"""
depth = 0
e = self
while True:
if e.parent:
e = e.parent #pylint: disable=redefined-variable-type
else:
#no parent, breaking
return False
if isinstance(e,AbstractStructureElement) or isinstance(e,AbstractSubtokenAnnotation):
depth += 1
if depth == 2:
return e
return False
#Change in behaviour (FoLiA 0.10), iter() no longer iterates over the text itself!!
#Change in behaviour (FoLiA 0.10), len() no longer return the length of the text!!
@classmethod
def findreplaceables(Class, parent, set, **kwargs):#pylint: disable=bad-classmethod-argument
"""(Method for internal usage, see AbstractElement)"""
#some extra behaviour for phonetic content elements: replacement is also constrained by the class
if 'cls' not in kwargs:
kwargs['cls'] = 'current'
replace = super(PhonContent, Class).findreplaceables(parent, set, **kwargs)
replace = [ x for x in replace if x.cls == kwargs['cls']]
del kwargs['cls'] #always delete what we processed
return replace
@classmethod
def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument
"""(Method for internal usage, see AbstractElement)"""
if not kwargs: kwargs = {}
if 'offset' in node.attrib:
kwargs['offset'] = int(node.attrib['offset'])
if 'ref' in node.attrib:
kwargs['ref'] = node.attrib['ref']
return super(PhonContent,Class).parsexml(node,doc, **kwargs)
def xml(self, attribs = None,elements = None, skipchildren = False):
attribs = {}
if not self.offset is None:
attribs['{' + NSFOLIA + '}offset'] = str(self.offset)
if self.parent and self.ref:
attribs['{' + NSFOLIA + '}ref'] = self.ref
#if self.cls != 'current' and not (self.cls == 'original' and any( isinstance(x, Original) for x in self.ancestors() ) ):
# attribs['{' + NSFOLIA + '}class'] = self.cls
#else:
# if '{' + NSFOLIA + '}class' in attribs:
# del attribs['{' + NSFOLIA + '}class']
#return E.t(self.value, **attribs)
e = super(PhonContent,self).xml(attribs,elements,skipchildren)
if '{' + NSFOLIA + '}class' in e.attrib and e.attrib['{' + NSFOLIA + '}class'] == "current":
#delete 'class=current'
del e.attrib['{' + NSFOLIA + '}class']
return e
def json(self, attribs =None, recurse =True,ignorelist=False):
attribs = {}
if not self.offset is None:
attribs['offset'] = self.offset
if self.parent and self.ref:
attribs['ref'] = self.ref
return super(PhonContent,self).json(attribs, recurse, ignorelist)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if not extraattribs: extraattribs = []
extraattribs.append( E.optional(E.attribute(name='offset' )))
extraattribs.append( E.optional(E.attribute(name='ref' )))
return super(PhonContent, cls).relaxng(includechildren, extraattribs, extraelements)
class Content(AbstractElement): #used for raw content, subelement for Gap
"""A container element that takes raw content, used by :class:`Gap`"""
def __init__(self,doc, *args, **kwargs):
if 'value' in kwargs:
if isstring(kwargs['value']):
self.value = u(kwargs['value'])
elif kwargs['value'] is None:
self.value = ""
else:
raise Exception("value= parameter must be unicode or str instance")
del kwargs['value']
else:
raise Exception("Content expects value= parameter")
super(Content,self).__init__(doc, *args, **kwargs)
def __nonzero__(self):
return bool(self.value)
def __bool__(self):
return bool(self.value)
def __unicode__(self):
return self.value
def __str__(self):
return self.value
def xml(self, attribs = None,elements = None, skipchildren = False):
E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"})
if not attribs:
attribs = {}
return E.content(self.value, **attribs)
def json(self,attribs =None, recurse=True, ignorelist=False):
jsonnode = {'type': self.XMLTAG, 'value': self.value}
if attribs:
for attrib in attribs:
jsonnode[attrib] = attribs[attrib]
return jsonnode
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
return E.define( E.element(E.text(), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA)
@classmethod
def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument
if not kwargs: kwargs = {}
kwargs['value'] = node.text
return Content(doc, **kwargs)
class Part(AbstractStructureElement):
"""Generic structure element used to mark a part inside another block.
Do **not** use this for morphology, use :class:`Morpheme` instead.
"""
class Gap(AbstractElement):
"""Gap element, represents skipped portions of the text.
Usually contains :class:`Content` and possibly also a :class:`Description` element"""
def __init__(self, doc, *args, **kwargs):
if 'content' in kwargs:
self.value = kwargs['content']
del kwargs['content']
elif 'description' in kwargs:
self.description = kwargs['description']
del kwargs['description']
super(Gap,self).__init__(doc, *args, **kwargs)
def content(self):
for e in self:
if isinstance(e, Content):
return e.value
return ""
class Linebreak(AbstractStructureElement, AbstractTextMarkup): #this element has a double role!!
"""Line break element, signals a line break.
This element acts both as a structure element as well as a text markup element.
"""
def __init__(self, doc, *args, **kwargs):
if 'linenr' in kwargs:
self.linenr = kwargs['linenr']
del kwargs['linenr']
else:
self.linenr = None
if 'pagenr' in kwargs:
self.pagenr = kwargs['pagenr']
del kwargs['pagenr']
else:
self.pagenr = None
if 'newpage' in kwargs and kwargs['newpage']:
self.newpage = True
del kwargs['newpage']
else:
self.newpage = False
super(Linebreak, self).__init__(doc, *args, **kwargs)
def text(self, cls='current', retaintokenisation=False, previousdelimiter="", strict=False, correctionhandling=None, normalize_spaces=False):
if normalize_spaces:
return " "
else:
return previousdelimiter.strip(' ') + "\n"
@classmethod
def parsexml(Class, node, doc):#pylint: disable=bad-classmethod-argument
kwargs = {}
if 'linenr' in node.attrib:
kwargs['linenr'] = node.attrib['linenr']
if 'pagenr' in node.attrib:
kwargs['pagenr'] = node.attrib['pagenr']
if 'newpage' in node.attrib and node.attrib['newpage'] == 'yes':
kwargs['newpage'] = True
br = Linebreak(doc, **kwargs)
if '{http://www.w3.org/1999/xlink}href' in node.attrib:
br.href = node.attrib['{http://www.w3.org/1999/xlink}href']
if '{http://www.w3.org/1999/xlink}type' in node.attrib:
br.xlinktype = node.attrib['{http://www.w3.org/1999/xlink}type']
return br
def xml(self, attribs = None,elements = None, skipchildren = False):
if attribs is None: attribs = {}
if self.linenr is not None:
attribs['{' + NSFOLIA + '}linenr'] = str(self.linenr)
if self.pagenr is not None:
attribs['{' + NSFOLIA + '}pagenr'] = str(self.pagenr)
if self.newpage:
attribs['{' + NSFOLIA + '}newpage'] = "yes"
return super(Linebreak, self).xml(attribs,elements,skipchildren)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
attribs = []
attribs.append(E.optional(E.attribute(name='pagenr')))
attribs.append(E.optional(E.attribute(name='linenr')))
attribs.append(E.optional(E.attribute(name='newpage')))
return super(Linebreak,cls).relaxng(includechildren,attribs,extraelements)
class Whitespace(AbstractStructureElement):
"""Whitespace element, signals a vertical whitespace"""
def text(self, cls='current', retaintokenisation=False, previousdelimiter="", strict=False,correctionhandling=None, normalize_spaces=False):
if normalize_spaces:
return " "
else:
return previousdelimiter.strip(' ') + "\n\n"
class Word(AbstractStructureElement, AllowCorrections):
"""Word (aka token) element. Holds a word/token and all its related token annotations."""
#will actually be determined by gettextdelimiter()
def __init__(self, doc, *args, **kwargs):
"""Constructor for words.
See :class:`AbstractElement.__init__` for all inherited keyword arguments and parameters.
Keyword arguments:
* space (bool): Indicates whether this token is followed by a space (defaults to True)
Example::
sentence.append( folia.Word, 'This')
sentence.append( folia.Word, 'is')
sentence.append( folia.Word, 'a')
sentence.append( folia.Word, 'test', space=False)
sentence.append( folia.Word, '.')
See also:
:class:`AbstractElement.__init__`
"""
self.space = True
if 'space' in kwargs:
self.space = kwargs['space']
del kwargs['space']
super(Word,self).__init__(doc, *args, **kwargs)
def sentence(self):
"""Obtain the sentence this word is a part of, otherwise return None"""
return self.ancestor(Sentence)
def paragraph(self):
"""Obtain the paragraph this word is a part of, otherwise return None"""
return self.ancestor(Paragraph)
def division(self):
"""Obtain the deepest division this word is a part of, otherwise return None"""
return self.ancestor(Division)
def pos(self,set=None):
"""Shortcut: returns the FoLiA class of the PoS annotation (will return only one if there are multiple!)"""
return self.annotation(PosAnnotation,set).cls
def lemma(self, set=None):
"""Shortcut: returns the FoLiA class of the lemma annotation (will return only one if there are multiple!)"""
return self.annotation(LemmaAnnotation,set).cls
def sense(self,set=None):
"""Shortcut: returns the FoLiA class of the sense annotation (will return only one if there are multiple!)"""
return self.annotation(SenseAnnotation,set).cls
def domain(self,set=None):
"""Shortcut: returns the FoLiA class of the domain annotation (will return only one if there are multiple!)"""
return self.annotation(DomainAnnotation,set).cls
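#Example (a sketch; assumes the word carries pos and lemma annotations in the default sets):
#   print(word.pos(), word.lemma())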
def morphemes(self,set=None):
"""Generator yielding all morphemes (in a particular set if specified). For retrieving one specific morpheme by index, use morpheme() instead"""
for layer in self.select(MorphologyLayer):
for m in layer.select(Morpheme, set):
yield m
def phonemes(self,set=None):
"""Generator yielding all phonemes (in a particular set if specified). For retrieving one specific morpheme by index, use morpheme() instead"""
for layer in self.select(PhonologyLayer):
for p in layer.select(Phoneme, set):
yield p
def morpheme(self,index, set=None):
"""Returns a specific morpheme, the n'th morpheme (given the particular set if specified)."""
for layer in self.select(MorphologyLayer):
for i, m in enumerate(layer.select(Morpheme, set)):
if index == i:
return m
raise NoSuchAnnotation
def phoneme(self,index, set=None):
"""Returns a specific phoneme, the n'th morpheme (given the particular set if specified)."""
for layer in self.select(PhonologyLayer):
for i, p in enumerate(layer.select(Phoneme, set)):
if index == i:
return p
raise NoSuchAnnotation
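#Example (a sketch, assuming the word carries a morphology layer):
#   for morpheme in word.morphemes(): #iterate over all morphemes
#       print(morpheme.text())
#   first = word.morpheme(0) #or grab one by index, raises NoSuchAnnotation if absent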
def gettextdelimiter(self, retaintokenisation=False):
"""Returns the text delimiter"""
if self.space or retaintokenisation:
return ' '
else:
return ''
def resolveword(self, id):
if id == self.id:
return self
else:
return None
def getcorrection(self,set=None,cls=None):
try:
return self.getcorrections(set,cls)[0]
except (IndexError, NoSuchAnnotation):
raise NoSuchAnnotation
def getcorrections(self, set=None,cls=None):
l = []
for correction in self.annotations(Correction):
if ((not set or correction.set == set) and (not cls or correction.cls == cls)):
l.append(correction)
return l
@classmethod
def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument
assert Class is Word
instance = super(Word,Class).parsexml(node, doc, **kwargs) #we need to check whether instance evaluates to True before setting attributes on it
if 'space' in node.attrib and instance:
if node.attrib['space'] == 'no':
instance.space = False
return instance
def xml(self, attribs = None,elements = None, skipchildren = False):
if not attribs: attribs = {}
if not self.space:
attribs['space'] = 'no'
return super(Word,self).xml(attribs,elements, False)
def json(self,attribs =None, recurse=True, ignorelist=False):
if not attribs: attribs = {}
if not self.space:
attribs['space'] = 'no'
return super(Word,self).json(attribs, recurse,ignorelist)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
if not extraattribs:
extraattribs = [ E.optional(E.attribute(name='space')) ]
else:
extraattribs.append( E.optional(E.attribute(name='space')) )
return AbstractStructureElement.relaxng(includechildren, extraattribs, extraelements, cls)
def split(self, *newwords, **kwargs):
self.sentence().splitword(self, *newwords, **kwargs)
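#Example (a sketch; splits one token into two, delegating to Sentence.splitword; the new Word instances and generate_id_in target are hypothetical):
#   word.split(folia.Word(doc, 'each', generate_id_in=sentence), folia.Word(doc, 'other', generate_id_in=sentence))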
def findspans(self, type,set=None):
"""Yields span annotation elements of the specified type that include this word.
Arguments:
type: The annotation type, can be passed as using any of the :class:`AnnotationType` member, or by passing the relevant :class:`AbstractSpanAnnotation` or :class:`AbstractAnnotationLayer` class.
set (str or None): Constrain by set
Example::
for chunk in word.findspans(folia.Chunk):
print(" Chunk class=", chunk.cls, " words=")
for word2 in chunk.wrefs(): #print all words in the chunk (of which the word is a part)
print(word2, end="")
print()
Yields:
Matching span annotation instances (derived from :class:`AbstractSpanAnnotation`)
"""
if issubclass(type, AbstractAnnotationLayer):
layerclass = type
else:
layerclass = ANNOTATIONTYPE2LAYERCLASS[type.ANNOTATIONTYPE]
e = self
while True:
if not e.parent: break
e = e.parent
for layer in e.select(layerclass,set,False):
if type is layerclass:
for e2 in layer.select(AbstractSpanAnnotation,set,True, (True, Word, Morpheme)):
if not isinstance(e2, AbstractSpanRole) and self in e2.wrefs():
yield e2
else:
for e2 in layer.select(type,set,True, (True, Word, Morpheme)):
if not isinstance(e2, AbstractSpanRole) and self in e2.wrefs():
yield e2
#for e2 in layer:
# if (type is layerclass and isinstance(e2, AbstractSpanAnnotation)) or (type is not layerclass and isinstance(e2, type)):
# if self in e2.wrefs():
# yield e2
class Feature(AbstractElement):
"""Feature elements can be used to associate subsets and subclasses with almost any
annotation element"""
def __init__(self,doc, *args, **kwargs): #pylint: disable=super-init-not-called
"""Constructor.
Keyword Arguments:
subset (str): the subset
cls (str): the class
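Example (a sketch; the 'gender' subset and its class are hypothetical and depend on the set definition)::
posannotation.append(folia.Feature, subset='gender', cls='feminine')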
"""
self.id = None
self.set = None
self.data = []
self.annotator = None
self.annotatortype = None
self.confidence = None
self.n = None
self.datetime = None
if not isinstance(doc, Document) and not (doc is None):
raise Exception("First argument of Feature constructor must be a Document instance, not " + str(type(doc)))
self.doc = doc
self.auth = True
if self.SUBSET:
self.subset = self.SUBSET
elif 'subset' in kwargs:
self.subset = kwargs['subset']
else:
raise Exception("No subset specified for " + self.__class__.__name__)
if 'cls' in kwargs:
self.cls = kwargs['cls']
elif 'class' in kwargs:
self.cls = kwargs['class']
else:
raise Exception("No class specified for " + self.__class__.__name__)
if isinstance(self.cls, datetime):
self.cls = self.cls.strftime("%Y-%m-%dT%H:%M:%S")
def xml(self):
E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"})
attribs = {}
if self.subset != self.SUBSET:
attribs['{' + NSFOLIA + '}subset'] = self.subset
attribs['{' + NSFOLIA + '}class'] = self.cls
return makeelement(E,'{' + NSFOLIA + '}' + self.XMLTAG, **attribs)
def json(self,attribs=None, recurse=True, ignorelist=False):
jsonnode= {'type': Feature.XMLTAG}
jsonnode['subset'] = self.subset
jsonnode['class'] = self.cls
return jsonnode
@classmethod
def relaxng(cls, includechildren=True, extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
return E.define( E.element(E.attribute(name='subset'), E.attribute(name='class'),name=cls.XMLTAG), name=cls.XMLTAG,ns=NSFOLIA)
def deepvalidation(self):
"""Perform deep validation of this element.
Raises:
:class:`DeepValidationError`
"""
if self.doc and self.doc.deepvalidation and self.parent.set and self.parent.set[0] != '_':
try:
self.doc.setdefinitions[self.parent.set].testsubclass(self.parent.cls, self.subset, self.cls)
except KeyError as e:
if self.parent.cls and not self.doc.allowadhocsets:
raise DeepValidationError("Set definition " + self.parent.set + " for " + self.parent.XMLTAG + " not loaded (feature validation failed)!")
except DeepValidationError as e:
errormsg = str(e) + " (in set " + self.parent.set+" for " + self.parent.XMLTAG
if self.parent.id:
errormsg += " with ID " + self.parent.id
errormsg += ")"
raise DeepValidationError(errormsg)
class ValueFeature(Feature):
"""Value feature, to be used within :class:`Metric`"""
pass
class Metric(AbstractElement):
"""Metric elements provide a key/value pair to allow the annotation of any kind of metric with any kind of annotation element.
It is used for example for statistical measures to be added to elements as annotation."""
pass
class AbstractSubtokenAnnotation(AbstractElement, AllowGenerateID):
"""Abstract element, all subtoken annotation elements are derived from this class"""
pass
class AbstractSpanAnnotation(AbstractElement, AllowGenerateID, AllowCorrections):
"""Abstract element, all span annotation elements are derived from this class"""
def xml(self, attribs = None,elements = None, skipchildren = False):
"""See :meth:`AbstractElement.xml`"""
if not attribs: attribs = {}
E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={None: "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
e = super(AbstractSpanAnnotation,self).xml(attribs, elements, True)
for child in self:
if isinstance(child, (Word, Morpheme, Phoneme)):
#Include REFERENCES to word items instead of word items themselves
attribs['{' + NSFOLIA + '}id'] = child.id
if child.PRINTABLE and child.hastext(self.textclass):
attribs['{' + NSFOLIA + '}t'] = child.text(self.textclass)
elif '{' + NSFOLIA + '}t' in attribs:
del attribs['{' + NSFOLIA + '}t'] #the attribs dict is reused across children; remove a stale 't' left by a previous sibling
e.append( E.wref(**attribs) )
elif not (isinstance(child, Feature) and child.SUBSET): #Don't add pre-defined features, they are already added as attributes
e.append( child.xml() )
return e
def append(self, child, *args, **kwargs):
"""See :meth:`AbstractElement.append`"""
#Accept Word instances instead of WordReference, references will be automagically used upon serialisation
if isinstance(child, (Word, Morpheme, Phoneme)) and WordReference in self.ACCEPTED_DATA:
#We don't really append but do an insertion so all references are in proper order
insertionpoint = len(self.data)
for i, sibling in enumerate(self.data):
if isinstance(sibling, (Word, Morpheme, Phoneme)):
try:
if not sibling.precedes(child):
insertionpoint = i
except: #happens if we can't determine common ancestors
pass
self.data.insert(insertionpoint, child)
return child
elif isinstance(child, AbstractSpanAnnotation): #(covers span roles just as well)
try:
firstword = child.wrefs(0)
except IndexError:
#we have no basis to determine an insertionpoint for this child, just append it then
return super(AbstractSpanAnnotation,self).append(child, *args, **kwargs)
insertionpoint = len(self.data)
for i, sibling in enumerate(self.data):
if isinstance(sibling, (Word, Morpheme, Phoneme)):
try:
if not sibling.precedes(firstword):
insertionpoint = i
except: #happens if we can't determine common ancestors
pass
return super(AbstractSpanAnnotation,self).insert(insertionpoint, child, *args, **kwargs)
else:
return super(AbstractSpanAnnotation,self).append(child, *args, **kwargs)
def setspan(self, *args):
"""Sets the span of the span element anew, erases all data inside.
Arguments:
*args: Instances of :class:`Word`, :class:`Morpheme` or :class:`Phoneme`
"""
self.data = []
for child in args:
self.append(child)
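#Example (a sketch; su is assumed to be a span annotation instance such as a SyntacticUnit, word1/word2 folia.Word instances in the same sentence):
#   su.setspan(word1, word2)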
def add(self, child, *args, **kwargs): #alias for append
return self.append(child, *args, **kwargs)
def hasannotation(self,Class,set=None):
"""Returns an integer indicating whether such as annotation exists, and if so, how many. See ``annotations()`` for a description of the parameters."""
return self.count(Class,set,True,default_ignore_annotations)
def annotation(self, type, set=None):
"""Will return a **single** annotation (even if there are multiple). Raises a ``NoSuchAnnotation`` exception if none was found"""
l = list(self.select(type,set,True,default_ignore_annotations))
if len(l) >= 1:
return l[0]
else:
raise NoSuchAnnotation()
def annotations(self,Class,set=None):
"""Obtain annotations. Very similar to ``select()`` but raises an error if the annotation was not found.
Arguments:
* ``Class`` - The Class you want to retrieve (e.g. PosAnnotation)
* ``set`` - The set you want to retrieve (defaults to None, which selects regardless of set)
Yields:
elements
Raises:
``NoSuchAnnotation`` if the specified annotation does not exist.
"""
found = False
for e in self.select(Class,set,True,default_ignore_annotations):
found = True
yield e
if not found:
raise NoSuchAnnotation()
def _helper_wrefs(self, targets, recurse=True):
"""Internal helper function"""
for c in self:
if isinstance(c,Word) or isinstance(c,Morpheme) or isinstance(c, Phoneme):
targets.append(c)
elif isinstance(c,WordReference):
try:
targets.append(self.doc[c.id]) #try to resolve
except KeyError:
targets.append(c) #add unresolved
elif isinstance(c, AbstractSpanAnnotation) and recurse:
#recursion
c._helper_wrefs(targets) #pylint: disable=protected-access
elif isinstance(c, Correction) and c.auth: #recurse into corrections
for e in c:
if isinstance(e, AbstractCorrectionChild) and e.auth:
for e2 in e:
if isinstance(e2, AbstractSpanAnnotation):
#recursion
e2._helper_wrefs(targets) #pylint: disable=protected-access
def wrefs(self, index = None, recurse=True):
"""Returns a list of word references, these can be Words but also Morphemes or Phonemes.
Arguments:
index (int or None): If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning the list of all
"""
targets =[]
self._helper_wrefs(targets, recurse)
if index is None:
return targets
else:
return targets[index]
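#Example (a sketch): list all words covered by a span annotation, or grab the first:
#   words = entity.wrefs() #all references (Word/Morpheme/Phoneme instances)
#   first = entity.wrefs(0) #just the first one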
def addtoindex(self,norecurse=None):
"""Makes sure this element (and all subelements), are properly added to the index"""
if not norecurse: norecurse = (Word, Morpheme, Phoneme)
if self.id:
self.doc.index[self.id] = self
for e in self.data:
if all([not isinstance(e, C) for C in norecurse]):
try:
e.addtoindex(norecurse)
except AttributeError:
pass
def copychildren(self, newdoc=None, idsuffix=""):
"""Generator creating a deep copy of the children of this element. If idsuffix is a string, if set to True, a random idsuffix will be generated including a random 32-bit hash"""
if idsuffix is True: idsuffix = ".copy." + "%08x" % random.getrandbits(32) #random 32-bit hash for each copy, same one will be reused for all children
for c in self:
if isinstance(c, Word):
yield WordReference(newdoc, id=c.id)
else:
yield c.copy(newdoc,idsuffix)
def postappend(self):
super(AbstractSpanAnnotation,self).postappend()
#If a span annotation element with wrefs x y z is added in the scope of parent span annotation element with wrefs u v w x y z, then x y z is removed from the parent span (no duplication, implicit through recursion)
e = self.parent
directwrefs = None #will be populated on first iteration
while isinstance(e, AbstractSpanAnnotation):
if directwrefs is None:
directwrefs = self.wrefs(recurse=False)
for wref in directwrefs:
try:
e.data.remove(wref)
except ValueError:
pass
e = e.parent
class AbstractAnnotationLayer(AbstractElement, AllowGenerateID, AllowCorrections):
"""Annotation layers for Span Annotation are derived from this abstract base class"""
def __init__(self, doc, *args, **kwargs):
if 'set' in kwargs:
self.set = kwargs['set']
elif self.ANNOTATIONTYPE in doc.annotationdefaults and len(doc.annotationdefaults[self.ANNOTATIONTYPE]) == 1:
self.set = list(doc.annotationdefaults[self.ANNOTATIONTYPE].keys())[0]
else:
self.set = False
# ok, let's not raise an error yet, we may still be able to derive a set from elements that are appended
super(AbstractAnnotationLayer,self).__init__(doc, *args, **kwargs)
def xml(self, attribs = None,elements = None, skipchildren = False):
"""See :meth:`AbstractElement.xml`"""
if self.set is False or self.set is None:
if len(self.data) == 0: #just skip if there are no children
return None
else:
raise ValueError("No set specified or derivable for annotation layer " + self.__class__.__name__)
return super(AbstractAnnotationLayer, self).xml(attribs, elements, skipchildren)
def append(self, child, *args, **kwargs):
"""See :meth:`AbstractElement.append`"""
#if no set is associated with the layer yet, we learn it from span annotation elements that are added
if self.set is False or self.set is None:
if inspect.isclass(child):
if issubclass(child,AbstractSpanAnnotation):
if 'set' in kwargs:
self.set = kwargs['set']
elif isinstance(child, AbstractSpanAnnotation):
if child.set:
self.set = child.set
elif isinstance(child, Correction):
#descend into corrections to find the proper set for this layer (derived from span annotation elements)
for e in itertools.chain( child.new(), child.original(), child.suggestions() ):
if isinstance(e, AbstractSpanAnnotation) and e.set:
self.set = e.set
break
return super(AbstractAnnotationLayer, self).append(child, *args, **kwargs)
def add(self, child, *args, **kwargs): #alias for append
return self.append(child, *args, **kwargs)
def annotations(self,Class,set=None):
"""Obtain annotations. Very similar to ``select()`` but raises an error if the annotation was not found.
Arguments:
* ``Class`` - The Class you want to retrieve (e.g. PosAnnotation)
* ``set`` - The set you want to retrieve (defaults to None, which selects regardless of set)
Yields:
elements
Raises:
``NoSuchAnnotation`` if the specified annotation does not exist.
"""
found = False
for e in self.select(Class,set,True,default_ignore_annotations):
found = True
yield e
if not found:
raise NoSuchAnnotation()
def hasannotation(self,Class,set=None):
"""Returns an integer indicating whether such as annotation exists, and if so, how many. See ``annotations()`` for a description of the parameters."""
return self.count(Class,set,True,default_ignore_annotations)
def annotation(self, type, set=None):
"""Will return a **single** annotation (even if there are multiple). Raises a ``NoSuchAnnotation`` exception if none was found"""
for e in self.select(type,set,True,default_ignore_annotations):
return e
raise NoSuchAnnotation()
def alternatives(self, Class=None, set=None):
"""Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set.
Arguments:
* ``Class`` - The Class you want to retrieve (e.g. PosAnnotation). Or set to None to select all alternatives regardless of what type they are.
* ``set`` - The set you want to retrieve (defaults to None, which selects regardless of set)
Returns:
Generator over Alternative elements
"""
for e in self.select(AlternativeLayers,None, True, ['Original','Suggestion']): #pylint: disable=too-many-nested-blocks
if Class is None:
yield e
elif len(e) >= 1: #child elements?
for e2 in e:
try:
if isinstance(e2, Class):
try:
if set is None or e2.set == set:
yield e #not e2
break #yield an alternative only once (in case there are multiple matches)
except AttributeError:
continue
except AttributeError:
continue
def findspan(self, *words):
"""Returns the span element which spans over the specified words or morphemes.
See also:
:meth:`Word.findspans`
"""
for span in self.select(AbstractSpanAnnotation,None,True):
if tuple(span.wrefs()) == words:
return span
raise NoSuchAnnotation
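#Example (a sketch; word1 and word2 are assumed to be folia.Word instances):
#   try:
#       entity = entitieslayer.findspan(word1, word2)
#   except folia.NoSuchAnnotation:
#       pass #no span annotation covers exactly these words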
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None, origclass = None):
"""Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)"""
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if not extraattribs:
extraattribs = []
extraattribs.append(E.optional(E.attribute(E.text(), name='set')) )
return AbstractElement.relaxng(includechildren, extraattribs, extraelements, cls)
def deepvalidation(self):
return True
# class AbstractSubtokenAnnotationLayer(AbstractElement, AllowGenerateID):
# """Annotation layers for Subtoken Annotation are derived from this abstract base class"""
# OPTIONAL_ATTRIBS = ()
# PRINTABLE = False
# def __init__(self, doc, *args, **kwargs):
# if 'set' in kwargs:
# self.set = kwargs['set']
# del kwargs['set']
# super(AbstractSubtokenAnnotationLayer,self).__init__(doc, *args, **kwargs)
class String(AbstractElement, AllowTokenAnnotation):
"""String"""
pass
class AbstractCorrectionChild(AbstractElement):
def generate_id(self, cls):
#Delegate ID generation to parent
return self.parent.generate_id(cls)
def deepvalidation(self):
return True
class Reference(AbstractStructureElement):
"""A structural element that denotes a reference, internal or external. Examples are references to footnotes, bibliographies, hyperlinks."""
def __init__(self, doc, *args, **kwargs):
if 'idref' in kwargs:
self.idref = kwargs['idref']
del kwargs['idref']
else:
self.idref = None
if 'type' in kwargs:
self.type = kwargs['type']
del kwargs['type']
else:
self.type = None
if 'format' in kwargs:
self.format = kwargs['format']
del kwargs['format']
else:
self.format = "text/folia+xml"
super(Reference,self).__init__(doc, *args, **kwargs)
def xml(self, attribs = None,elements = None, skipchildren = False):
if not attribs: attribs = {}
if self.idref:
attribs['id'] = self.idref
if self.type:
attribs['type'] = self.type
if self.format and self.format != "text/folia+xml":
attribs['format'] = self.format
return super(Reference,self).xml(attribs,elements, skipchildren)
def json(self, attribs=None, recurse=True, ignorelist=False):
if attribs is None: attribs = {}
if self.idref:
attribs['idref'] = self.idref
if self.type:
attribs['type'] = self.type
if self.format:
attribs['format'] = self.format
return super(Reference,self).json(attribs,recurse,ignorelist)
def resolve(self):
if self.idref:
return self.doc[self.idref]
else:
return self
@classmethod
def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument
if not kwargs: kwargs = {}
if 'id' in node.attrib:
kwargs['idref'] = node.attrib['id']
del node.attrib['id']
if 'type' in node.attrib:
kwargs['type'] = node.attrib['type']
del node.attrib['type']
if 'format' in node.attrib:
kwargs['format'] = node.attrib['format']
del node.attrib['format']
return super(Reference,Class).parsexml(node, doc, **kwargs)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if not extraattribs: extraattribs = []
extraattribs.append( E.attribute(name='id')) #id reference
extraattribs.append( E.optional(E.attribute(name='type' )))
extraattribs.append( E.optional(E.attribute(name='format' )))
return super(Reference, cls).relaxng(includechildren, extraattribs, extraelements)
class AlignReference(AbstractElement):
"""The AlignReference element is used to point to specific elements inside the aligned source.
It is used with :class:`Alignment` which is responsible for pointing to the external resource."""
def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called
#Special constructor, not calling super constructor
if 'id' not in kwargs:
raise Exception("ID required for AlignReference")
if 'type' in kwargs:
if isinstance(kwargs['type'], AbstractElement) or inspect.isclass(kwargs['type']):
self.type = kwargs['type'].XMLTAG
else:
self.type = kwargs['type']
else:
self.type = None
if 't' in kwargs:
self.t = kwargs['t']
else:
self.t = None
assert(isinstance(doc,Document))
self.doc = doc
self.id = kwargs['id']
self.annotator = None
self.annotatortype = None
self.confidence = None
self.n = None
self.datetime = None
self.auth = False
self.set = None
self.cls = None
self.data = []
@classmethod
def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument
assert Class is AlignReference or issubclass(Class, AlignReference)
#special handling for word references
if not kwargs: kwargs = {}
kwargs['id'] = node.attrib['id']
if 'type' not in node.attrib:
raise ValueError("No type in alignment reference")
if 't' in node.attrib:
kwargs['t'] = node.attrib['t']
kwargs['type'] = node.attrib['type'] #guaranteed present, checked above
return AlignReference(doc,**kwargs)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
return E.define( E.element(E.attribute(E.text(), name='id'), E.optional(E.attribute(E.text(), name='t')), E.optional(E.attribute(E.text(), name='type')), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA)
def resolve(self, alignmentcontext=None, documents=None):
if documents is None: documents = {} #avoid a shared mutable default argument
if not alignmentcontext or not hasattr(alignmentcontext, 'href') or not alignmentcontext.href:
#no target document, same document
return self.doc[self.id]
else:
#other document
if alignmentcontext.href in documents:
return documents[alignmentcontext.href][self.id]
else:
raise DocumentNotLoaded()
def xml(self, attribs = None,elements = None, skipchildren = False):
E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"})
if not attribs:
attribs = {}
attribs['id'] = self.id
if self.type:
attribs['type'] = self.type
if self.t: attribs['t'] = self.t
return E.aref( **attribs)
def json(self, attribs=None, recurse=True, ignorelist=False):
return {} #alignment not supported yet, TODO
class Alignment(AbstractElement):
"""
The Alignment element is a form of higher-order annotation that is used to point to an external resource.
It concerns references as annotation rather than references which are
explicitly part of the text, such as hyperlinks and :class:`Reference`.
Inside the Alignment element, the :class:`AlignReference` element may be used to point to specific elements (multiple denotes a span).
"""
def __init__(self, doc, *args, **kwargs):
if 'format' in kwargs:
self.format = kwargs['format']
del kwargs['format']
else:
self.format = "text/folia+xml"
super(Alignment,self).__init__(doc, *args, **kwargs)
@classmethod
def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument
if 'format' in node.attrib:
kwargs['format'] = node.attrib['format']
del node.attrib['format']
return super(Alignment,Class).parsexml(node, doc, **kwargs)
def xml(self, attribs = None,elements = None, skipchildren = False):
if not attribs: attribs = {}
if self.format and self.format != "text/folia+xml":
attribs['format'] = self.format
return super(Alignment,self).xml(attribs,elements, skipchildren)
def json(self, attribs =None, recurse=True, ignorelist=False):
return {} #alignment not supported yet, TODO
def resolve(self, documents=None):
if documents is None: documents = {}
#documents is a dictionary of urls to document instances, to aid in resolving cross-document alignments
for x in self.select(AlignReference,None,True,False):
yield x.resolve(self, documents)
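#Example (a sketch; cross-document alignments require the target documents to be preloaded):
#   docs = {'other.folia.xml': folia.Document(file='other.folia.xml')} #hypothetical target document
#   for target in alignment.resolve(docs):
#       print(target.id)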
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
if extraattribs is None: extraattribs = []
extraattribs.append(E.optional(E.attribute(name="format")))
return super(Alignment,cls).relaxng(includechildren, extraattribs, extraelements)
class ErrorDetection(AbstractExtendedTokenAnnotation):
"""The ErrorDetection element is used to signal the presence of errors in a structural element."""
pass
class Suggestion(AbstractCorrectionChild):
"""Suggestions are used in the context of :class:`Correction`, but rather than provide an authoritative correction, it instead offers a suggestion for correction."""
def __init__(self, doc, *args, **kwargs):
if 'split' in kwargs:
self.split = kwargs['split']
del kwargs['split']
else:
self.split = None
if 'merge' in kwargs:
self.merge = kwargs['merge']
del kwargs['merge']
else:
self.merge = None
super(Suggestion,self).__init__(doc, *args, **kwargs)
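#Example (a sketch; suggestions are usually added via correct() rather than instantiated directly; the set and class names are hypothetical):
#   word.correct(suggestion='colour', set='corrections', cls='spelling')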
@classmethod
def parsexml(Class, node, doc, **kwargs): #pylint: disable=bad-classmethod-argument
if not kwargs: kwargs = {}
if 'split' in node.attrib:
kwargs['split'] = node.attrib['split']
if 'merge' in node.attrib:
kwargs['merge'] = node.attrib['merge']
return super(Suggestion,Class).parsexml(node, doc, **kwargs)
def xml(self, attribs = None,elements = None, skipchildren = False):
if not attribs: attribs= {}
if self.split: attribs['split'] = self.split
if self.merge: attribs['merge'] = self.merge
return super(Suggestion, self).xml(attribs, elements, skipchildren)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" })
if not extraattribs: extraattribs = []
extraattribs.append( E.optional(E.attribute(name='split' )))
extraattribs.append( E.optional(E.attribute(name='merge' )))
return super(Suggestion, cls).relaxng(includechildren, extraattribs, extraelements)
def json(self, attribs = None, recurse=True,ignorelist=False):
if self.split:
if not attribs: attribs = {}
attribs['split'] = self.split
if self.merge:
if not attribs: attribs = {}
attribs['merge'] = self.merge
return super(Suggestion, self).json(attribs, recurse, ignorelist)
class New(AbstractCorrectionChild):
@classmethod
def addable(Class, parent, set=None, raiseexceptions=True):#pylint: disable=bad-classmethod-argument
if not super(New,Class).addable(parent,set,raiseexceptions): return False
if any( ( isinstance(c, Current) for c in parent ) ):
if raiseexceptions:
raise ValueError("Can't add New element to Correction if there is a Current item")
else:
return False
return True
def correct(self, **kwargs):
return self.parent.correct(**kwargs)
class Original(AbstractCorrectionChild):
"""Used in the context of :class:`Correction` to encapsulate the original annotations *prior* to correction."""
@classmethod
def addable(Class, parent, set=None, raiseexceptions=True):#pylint: disable=bad-classmethod-argument
if not super(Original,Class).addable(parent,set,raiseexceptions): return False
if any( ( isinstance(c, Current) for c in parent ) ):
if raiseexceptions:
raise Exception("Can't add Original item to Correction if there is a Current item")
else:
return False
return True
class Current(AbstractCorrectionChild):
"""Used in the context of :class:`Correction` to encapsulate the currently authoritative annotations.
Needed only when suggestions for correction are proposed (:class:`Suggestion`) for structural elements.
"""
@classmethod
def addable(Class, parent, set=None, raiseexceptions=True):
if not super(Current,Class).addable(parent,set,raiseexceptions): return False
if any( ( isinstance(c, New) or isinstance(c, Original) for c in parent ) ):
if raiseexceptions:
raise Exception("Can't add Current element to Correction if there is a New or Original element")
else:
return False
return True
def correct(self, **kwargs):
return self.parent.correct(**kwargs)
class Correction(AbstractElement, AllowGenerateID):
"""
Corrections are one of the most complex annotation types in FoLiA. Corrections
can be applied not just over text, but over any type of structure annotation,
token annotation or span annotation. Corrections explicitly preserve the
original, and recursively so if corrections are done over other corrections.
Despite their complexity, the library treats correction transparently. Whenever
you query for a particular element, and it is part of a correction, you get the
corrected version rather than the original. The original is always *non-authoritative*
and normal selection methods will ignore it.
This class takes four classes as children, that in turn encapsulate the actual annotations:
* :class:`New` - Encapsulates the newly corrected annotation(s)
* :class:`Original` - Encapsulates the old original annotation(s)
* :class:`Current` - Encapsulates the current authoritative annotation(s)
* :class:`Suggestion` - Encapsulates the annotation(s) that are a non-authoritative suggestion for correction
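Example (a minimal sketch; the set and class names are hypothetical)::
word.correct(new='corrected', set='corrections', cls='spelling')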
"""
def append(self, child, *args, **kwargs):
"""See ``AbstractElement.append()``"""
e = super(Correction,self).append(child, *args, **kwargs)
self._setmaxid(e)
return e
def hasnew(self,allowempty=False):
"""Does the correction define new corrected annotations?"""
for e in self.select(New,None,False, False):
if not allowempty and len(e) == 0: continue
return True
return False
def hasoriginal(self,allowempty=False):
"""Does the correction record the old annotations prior to correction?"""
for e in self.select(Original,None,False, False):
if not allowempty and len(e) == 0: continue
return True
return False
def hascurrent(self, allowempty=False):
"""Does the correction record the current authoritative annotation (needed only in a structural context when suggestions are proposed)"""
for e in self.select(Current,None,False, False):
if not allowempty and len(e) == 0: continue
return True
return False
def hassuggestions(self,allowempty=False):
"""Does the correction propose suggestions for correction?"""
for e in self.select(Suggestion,None,False, False):
if not allowempty and len(e) == 0: continue
return True
return False
def textcontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT):
"""See :meth:`AbstractElement.textcontent`"""
if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility
if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, New) or isinstance(e, Current):
return e.textcontent(cls,correctionhandling)
if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, Original):
return e.textcontent(cls,correctionhandling)
raise NoSuchText
def phoncontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT):
"""See :meth:`AbstractElement.phoncontent`"""
if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility
if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, New) or isinstance(e, Current):
return e.phoncontent(cls, correctionhandling)
if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, Original):
return e.phoncontent(cls, correctionhandling)
raise NoSuchPhon
def hastext(self, cls='current',strict=True, correctionhandling=CorrectionHandling.CURRENT):
"""See :meth:`AbstractElement.hastext`"""
if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility
if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, New) or isinstance(e, Current):
return e.hastext(cls,strict, correctionhandling)
if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, Original):
return e.hastext(cls,strict, correctionhandling)
return False
def text(self, cls = 'current', retaintokenisation=False, previousdelimiter="",strict=False, correctionhandling=CorrectionHandling.CURRENT, normalize_spaces=False):
"""See :meth:`AbstractElement.text`"""
if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility
if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, New) or isinstance(e, Current):
s = previousdelimiter + e.text(cls, retaintokenisation,"", strict, correctionhandling)
if normalize_spaces:
return norm_spaces(s)
else:
return s
if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, Original):
s = previousdelimiter + e.text(cls, retaintokenisation,"", strict, correctionhandling)
if normalize_spaces:
return norm_spaces(s)
else:
return s
raise NoSuchText
def hasphon(self, cls='current',strict=True, correctionhandling=CorrectionHandling.CURRENT):
"""See :meth:`AbstractElement.hasphon`"""
if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility
if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, New) or isinstance(e, Current):
return e.hasphon(cls,strict, correctionhandling)
if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, Original):
return e.hasphon(cls,strict, correctionhandling)
return False
def phon(self, cls = 'current', previousdelimiter="",strict=False, correctionhandling=CorrectionHandling.CURRENT):
"""See :meth:`AbstractElement.phon`"""
if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility
if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, New) or isinstance(e, Current):
return previousdelimiter + e.phon(cls, "", strict, correctionhandling)
if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER):
for e in self:
if isinstance(e, Original):
return previousdelimiter + e.phon(cls, "", strict, correctionhandling)
raise NoSuchPhon
def gettextdelimiter(self, retaintokenisation=False):
"""See :meth:`AbstractElement.gettextdelimiter`"""
for e in self:
if isinstance(e, New) or isinstance(e, Current):
return e.gettextdelimiter(retaintokenisation)
return ""
def new(self,index = None):
"""Get the new corrected annotation.
This returns only one annotation if multiple exist, use `index` to select another in the sequence.
Returns:
an annotation element (:class:`AbstractElement`)
Raises:
:class:`NoSuchAnnotation`
"""
if index is None:
try:
return next(self.select(New,None,False))
except StopIteration:
raise NoSuchAnnotation
else:
for e in self.select(New,None,False):
return e[index]
raise NoSuchAnnotation
def original(self,index=None):
"""Get the old annotation prior to correction.
This returns only one annotation if multiple exist, use `index` to select another in the sequence.
Returns:
an annotation element (:class:`AbstractElement`)
Raises:
:class:`NoSuchAnnotation`
"""
if index is None:
try:
return next(self.select(Original,None,False, False))
except StopIteration:
raise NoSuchAnnotation
else:
for e in self.select(Original,None,False, False):
return e[index]
raise NoSuchAnnotation
def current(self,index=None):
"""Get the current authoritative annotation (used with suggestions in a structural context)
This returns only one annotation if multiple exist, use `index` to select another in the sequence.
Returns:
an annotation element (:class:`AbstractElement`)
Raises:
:class:`NoSuchAnnotation`
"""
if index is None:
try:
return next(self.select(Current,None,False))
except StopIteration:
raise NoSuchAnnotation
else:
for e in self.select(Current,None,False):
return e[index]
raise NoSuchAnnotation
def suggestions(self,index=None):
"""Get suggestions for correction.
Yields:
:class:`Suggestion` element that encapsulate the suggested annotations (if index is ``None``, default)
Returns:
a :class:`Suggestion` element that encapsulate the suggested annotations (if index is set)
Raises:
:class:`IndexError`
"""
if index is None:
return self.select(Suggestion,None,False, False)
else:
for i, e in enumerate(self.select(Suggestion,None,False, False)):
if index == i:
return e
raise IndexError
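#Example (a sketch): iterate over all suggestions of a correction and inspect their text:
#   for suggestion in correction.suggestions():
#       print(suggestion.text())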
def __unicode__(self):
return str(self)
def __str__(self):
return self.text('current', False, "", False, CorrectionHandling.EITHER)
def correct(self, **kwargs):
if 'new' in kwargs:
if 'nooriginal' not in kwargs: #if not an insertion
kwargs['original'] = self
elif 'current' in kwargs:
kwargs['current'] = self
if 'insertindex' in kwargs:
#recompute insertindex
index = self.parent.getindex(self)
if index != -1:
kwargs['insertindex'] = index
if 'insertindex_offset' in kwargs:
kwargs['insertindex'] += kwargs['insertindex_offset']
del kwargs['insertindex_offset']
else:
raise Exception("Can't find insertion point for higher-order correction")
return self.parent.correct(**kwargs)
#obsolete
#def select(self, cls, set=None, recursive=True, ignorelist=[], node=None):
# """Select on Correction only descends in either "NEW" or "CURRENT" branch"""
# if ignorelist is False:
# #to override and go into all branches, set ignorelist explictly to False
# return super(Correction,self).select(cls,set,recursive, ignorelist, node)
# else:
# if ignorelist is True:
# ignorelist = copy(default_ignore)
# else:
# ignorelist = copy(ignorelist) #we don't want to alter a passed ignorelist (by ref)
# ignorelist.append(Original)
# ignorelist.append(Suggestion)
# return super(Correction,self).select(cls,set,recursive, ignorelist, node)
class Alternative(AbstractElement, AllowTokenAnnotation, AllowGenerateID):
"""Element grouping alternative token annotation(s).
Multiple alternative elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent.
A key feature of FoLiA is its ability to make explicit alternative
annotations; for token annotations, this class is used to that end.
Alternative annotations are embedded in this structure. This implies the
annotation is *not authoritative*, but is merely an alternative to the
actual annotation (if any). Alternatives may typically occur in larger
numbers, representing a distribution each with a confidence value (not
mandatory). Each alternative is wrapped in an instance of this class,
as multiple elements inside a single alternative are considered dependent
and part of the same alternative. Combining multiple annotations in one
alternative makes sense for mixed annotation types, where for instance a
pos tag alternative is tied to a particular lemma.
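Example (a sketch; set and class names are hypothetical)::
alternative = word.append(folia.Alternative)
alternative.append(folia.PosAnnotation, set='brown-tagset', cls='NN', confidence=0.3)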
"""
def deepvalidation(self):
return True
class AlternativeLayers(AbstractElement):
"""Element grouping alternative subtoken annotation(s). Multiple altlayers elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent."""
def deepvalidation(self):
return True
class External(AbstractElement):
def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called
#Special constructor, not calling super constructor
if 'source' not in kwargs:
raise Exception("Source required for External")
assert(isinstance(doc,Document))
self.doc = doc
self.id = None
self.source = kwargs['source']
if 'include' in kwargs and kwargs['include'] != 'no':
self.include = bool(kwargs['include'])
else:
self.include = False
self.annotator = None
self.annotatortype = None
self.confidence = None
self.n = None
self.datetime = None
self.auth = False
self.data = []
self.subdoc = None
if self.include:
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Loading subdocument for inclusion: " + self.source,file=stderr)
#load subdocument
#check if it is already loaded, if multiple references are made to the same doc we reuse the instance
if self.source in self.doc.subdocs:
self.subdoc = self.doc.subdocs[self.source]
elif self.source[:7] == 'http://' or self.source[:8] == 'https://':
#document is remote, download (in memory)
try:
f = urlopen(self.source)
except IOError: #also covers URLError
raise DeepValidationError("Unable to download subdocument for inclusion: " + self.source)
try:
content = u(f.read())
except IOError:
raise DeepValidationError("Unable to download subdocument for inclusion: " + self.source)
f.close()
self.subdoc = Document(string=content, parentdoc = self.doc, setdefinitions=self.doc.setdefinitions)
elif os.path.exists(self.source):
#document is on disk:
self.subdoc = Document(file=self.source, parentdoc = self.doc, setdefinitions=self.doc.setdefinitions)
else:
#document not found
raise DeepValidationError("Unable to find subdocument for inclusion: " + self.source)
self.subdoc.parentdoc = self.doc
self.doc.subdocs[self.source] = self.subdoc
#TODO: verify there are no clashes in declarations between parent and child
#TODO: check validity of elements under subdoc/text with respect to self.parent
@classmethod
def parsexml(Class, node, doc, **kwargs):
assert Class is External or issubclass(Class, External)
if not kwargs: kwargs = {}
#special handling for external
source = node.attrib['src']
if 'include' in node.attrib:
include = node.attrib['include']
else:
include = False
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found external",file=stderr)
return External(doc, source=source, include=include)
def xml(self, attribs = None,elements = None, skipchildren = False):
if not attribs:
attribs= {}
attribs['src'] = self.source
if self.include:
attribs['include'] = 'yes'
else:
attribs['include'] = 'no'
return super(External, self).xml(attribs, elements, skipchildren)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
return E.define( E.element(E.attribute(E.text(), name='src'), E.optional(E.attribute(E.text(), name='include')), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA)
def select(self, Class, set=None, recursive=True, ignore=True, node=None):
"""See :meth:`AbstractElement.select`"""
if self.include:
return self.subdoc.data[0].select(Class,set,recursive, ignore, node) #pass it on to the text node of the subdoc
else:
return iter([])
class WordReference(AbstractElement):
"""Word reference. Used to refer to words or morphemes from span annotation elements. The Python class will only be used when word reference can not be resolved, if they can, Word or Morpheme objects will be used"""
def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called
#Special constructor, not calling super constructor
if 'idref' not in kwargs and 'id' not in kwargs:
raise Exception("ID required for WordReference")
assert isinstance(doc,Document)
self.doc = doc
if 'idref' in kwargs:
self.id = kwargs['idref']
else:
self.id = kwargs['id']
self.annotator = None
self.annotatortype = None
self.confidence = None
self.n = None
self.datetime = None
self.data = []
self.set = None
self.cls = None
self.auth = True
@classmethod
def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument
assert Class is WordReference or issubclass(Class, WordReference)
#special handling for word references
id = node.attrib['id']
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found word reference",file=stderr)
try:
return doc[id]
except KeyError:
if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] ...Unresolvable!",file=stderr)
return WordReference(doc, id=id)
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
return E.define( E.element(E.attribute(E.text(), name='id'), E.optional(E.attribute(E.text(), name='t')), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA)
def xml(self, attribs = None,elements = None, skipchildren = False):
"""Serialises the FoLiA element to XML, by returning an XML Element (in lxml.etree) for this element and all its children. For string output, consider the xmlstring() method instead."""
E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"})
if not attribs: attribs = {}
if not elements: elements = []
if self.id:
attribs['id'] = self.id
try:
w = self.doc[self.id]
attribs['t'] = w.text()
except KeyError:
pass
e = makeelement(E, '{' + NSFOLIA + '}' + self.XMLTAG, **attribs)
return e
class SyntacticUnit(AbstractSpanAnnotation):
"""Syntactic Unit, span annotation element to be used in :class:`SyntaxLayer`"""
pass
class Chunk(AbstractSpanAnnotation):
"""Chunk element, span annotation element to be used in :class:`ChunkingLayer`"""
pass
class Entity(AbstractSpanAnnotation):
"""Entity element, for entities such as named entities, multi-word expressions, temporal entities. This is a span annotation element to be used in :class:`EntitiesLayer`"""
pass
class AbstractSpanRole(AbstractSpanAnnotation):
#TODO: span roles don't take classes, derived off spanannotation allows too much
pass
class Headspan(AbstractSpanRole): #generic head element
"""The headspan role is used to mark the head of a span annotation.
It can be used in various contexts, for instance to mark the head of a :class:`Dependency`.
It is allowed by most span annotations.
"""
DependencyHead = Headspan #alias, backwards compatibility with FoLiA 0.8
class DependencyDependent(AbstractSpanRole):
"""Span role element that marks the dependent in a dependency relation. Used in :class:`Dependency`.
:class:`Headspan` in turn is used to mark the head of a dependency relation."""
pass
class Source(AbstractSpanRole):
"""The source span role is used to mark the source in a :class:`Sentiment` or :class:`Statement` """
class Target(AbstractSpanRole):
"""The target span role is used to mark the target in a :class:`Sentiment` """
class Relation(AbstractSpanRole):
"""The relation span role is used to mark the relation between the content of a statement and its source in a :class:`Statement`"""
class Dependency(AbstractSpanAnnotation):
"""Span annotation element to encode dependency relations"""
def head(self):
"""Returns the head of the dependency relation. Instance of :class:`Headspan`"""
return next(self.select(Headspan))
def dependent(self):
"""Returns the dependent of the dependency relation. Instance of :class:`DependencyDependent`"""
return next(self.select(DependencyDependent))
class ModalityFeature(Feature):
"""Modality feature, to be used with coreferences"""
class TimeFeature(Feature):
"""Time feature, to be used with coreferences"""
class LevelFeature(Feature):
"""Level feature, to be used with coreferences"""
class CoreferenceLink(AbstractSpanRole):
"""Coreference link. Used in :class:`CoreferenceChain`"""
class CoreferenceChain(AbstractSpanAnnotation):
"""Coreference chain. Holds :class:`CoreferenceLink` instances."""
class SemanticRole(AbstractSpanAnnotation):
"""Semantic Role"""
class Predicate(AbstractSpanAnnotation):
"""Predicate, used within :class:`SemanticRolesLayer`, takes :class:`SemanticRole` annotations as children, but has its own annotation type and separate declaration"""
class Sentiment(AbstractSpanAnnotation):
"""Sentiment. Takes span roles :class:`Headspan`, :class:`Source` and :class:`Target` as children"""
class Statement(AbstractSpanAnnotation):
"""Statement. Takes span roles :class:`Headspan`, :class:`Source` and :class:`Relation` as children"""
class Observation(AbstractSpanAnnotation):
"""Observation."""
class ComplexAlignment(AbstractElement):
"""Complex Alignment"""
#same as for AbstractSpanAnnotation, which this technically is not (hence copy)
def hasannotation(self,Class,set=None):
"""Returns an integer indicating whether such as annotation exists, and if so, how many. See ``annotations()`` for a description of the parameters."""
return self.count(Class,set,True,default_ignore_annotations)
#same as for AbstractSpanAnnotation, which this technically is not (hence copy)
def annotation(self, type, set=None):
"""Will return a **single** annotation (even if there are multiple). Raises a ``NoSuchAnnotation`` exception if none was found"""
l = list(self.select(type,set,True,default_ignore_annotations))
if len(l) >= 1:
return l[0]
else:
raise NoSuchAnnotation()
class FunctionFeature(Feature):
"""Function feature, to be used with :class:`Morpheme`"""
class Morpheme(AbstractStructureElement):
"""Morpheme element, represents one morpheme in morphological analysis, subtoken annotation element to be used in :class:`MorphologyLayer`"""
def findspans(self, type,set=None):
"""Find span annotation of the specified type that include this word"""
if issubclass(type, AbstractAnnotationLayer):
layerclass = type
else:
layerclass = ANNOTATIONTYPE2LAYERCLASS[type.ANNOTATIONTYPE]
e = self
while True:
if not e.parent: break
e = e.parent
for layer in e.select(layerclass,set,False):
for e2 in layer:
if isinstance(e2, AbstractSpanAnnotation):
if self in e2.wrefs():
yield e2
def textvalidation(self, warnonly=None): #warnonly will change at some point in the future to be stricter
return True
class Phoneme(AbstractStructureElement):
"""Phone element, represents one phone in phonetic analysis, subtoken annotation element to be used in :class:`PhonologyLayer`"""
def findspans(self, type,set=None): #TODO: this is a copy of the methods in Morpheme in Word, abstract into separate class and inherit
"""Find span annotation of the specified type that include this phoneme.
See :meth:`Word.findspans` for usage.
"""
if issubclass(type, AbstractAnnotationLayer):
layerclass = type
else:
layerclass = ANNOTATIONTYPE2LAYERCLASS[type.ANNOTATIONTYPE]
e = self
while True:
if not e.parent: break
e = e.parent
for layer in e.select(layerclass,set,False):
for e2 in layer:
if isinstance(e2, AbstractSpanAnnotation):
if self in e2.wrefs():
yield e2
#class Subentity(AbstractSubtokenAnnotation):
# """Subentity element, for named entities within a single token, subtoken annotation element to be used in SubentitiesLayer"""
# ACCEPTED_DATA = (Feature,TextContent, Metric)
# ANNOTATIONTYPE = AnnotationType.SUBENTITY
# XMLTAG = 'subentity'
class SyntaxLayer(AbstractAnnotationLayer):
"""Syntax Layer: Annotation layer for :class:`SyntacticUnit` span annotation elements"""
class ChunkingLayer(AbstractAnnotationLayer):
"""Chunking Layer: Annotation layer for :class:`Chunk` span annotation elements"""
class EntitiesLayer(AbstractAnnotationLayer):
"""Entities Layer: Annotation layer for :class:`Entity` span annotation elements. For named entities."""
class DependenciesLayer(AbstractAnnotationLayer):
"""Dependencies Layer: Annotation layer for :class:`Dependency` span annotation elements. For dependency entities."""
class MorphologyLayer(AbstractAnnotationLayer):
"""Morphology Layer: Annotation layer for :class:`Morpheme` subtoken annotation elements. For morphological analysis."""
class PhonologyLayer(AbstractAnnotationLayer):
"""Phonology Layer: Annotation layer for :class:`Phoneme` subtoken annotation elements. For phonetic analysis."""
class CoreferenceLayer(AbstractAnnotationLayer):
"""Syntax Layer: Annotation layer for :class:`SyntacticUnit` span annotation elements"""
class SemanticRolesLayer(AbstractAnnotationLayer):
"""Syntax Layer: Annotation layer for :class:`SemanticRole` span annotation elements"""
class StatementLayer(AbstractAnnotationLayer):
"""Statement Layer: Annotation layer for :class:`Statement` span annotation elements, used for attribution annotation."""
class SentimentLayer(AbstractAnnotationLayer):
"""Sentiment Layer: Annotation layer for :class:`Sentiment` span annotation elements, used for sentiment analysis."""
class ObservationLayer(AbstractAnnotationLayer):
"""Observation Layer: Annotation layer for :class:`Observation` span annotation elements."""
class ComplexAlignmentLayer(AbstractAnnotationLayer):
"""Complex alignment layer"""
ACCEPTED_DATA = (ComplexAlignment,Description,Correction)
XMLTAG = 'complexalignments'
ANNOTATIONTYPE = AnnotationType.COMPLEXALIGNMENT
class HeadFeature(Feature):
"""Head feature, to be used within :class:`PosAnnotation`"""
class PosAnnotation(AbstractTokenAnnotation):
"""Part-of-Speech annotation: a token annotation element"""
class LemmaAnnotation(AbstractTokenAnnotation):
"""Lemma annotation: a token annotation element"""
class LangAnnotation(AbstractExtendedTokenAnnotation):
"""Language annotation: an extended token annotation element"""
#class PhonAnnotation(AbstractTokenAnnotation): #DEPRECATED in v0.9
# """Phonetic annotation: a token annotation element"""
# ANNOTATIONTYPE = AnnotationType.PHON
# ACCEPTED_DATA = (Feature,Description, Metric)
# XMLTAG = 'phon'
class DomainAnnotation(AbstractExtendedTokenAnnotation):
"""Domain annotation: an extended token annotation element"""
class SynsetFeature(Feature):
"""Synset feature, to be used within :class:`Sense`"""
class ActorFeature(Feature):
"""Actor feature, to be used within :class:`Event`"""
class PolarityFeature(Feature):
"""Polarity feature, to be used within :class:`Sentiment`"""
class StrengthFeature(Feature):
"""Strength feature, to be used within :class:`Sentiment`"""
class BegindatetimeFeature(Feature):
"""Begindatetime feature, to be used within :class:`Event`"""
class EnddatetimeFeature(Feature):
"""Enddatetime feature, to be used within :class:`Event`"""
class StyleFeature(Feature):
"""Style feature"""
class Note(AbstractStructureElement):
"""Element used for notes, such as footnotes or warnings or notice blocks."""
class Definition(AbstractStructureElement):
"""Element used in :class:`Entry` for the portion that provides a definition for the entry."""
class Term(AbstractStructureElement):
"""A term, often used in contect of :class:`Entry`"""
class Example(AbstractStructureElement):
"""Element that provides an example. Used for instance in the context of :class:`Entry`"""
class Entry(AbstractStructureElement):
"""Represents an entry in a glossary/lexicon/dictionary."""
class TimeSegment(AbstractSpanAnnotation):
"""A time segment"""
TimedEvent = TimeSegment #alias for FoLiA 0.8 compatibility
class TimingLayer(AbstractAnnotationLayer):
"""Timing layer: Annotation layer for :class:`TimeSegment` span annotation elements. """
class SenseAnnotation(AbstractTokenAnnotation):
"""Sense annotation: a token annotation element"""
class SubjectivityAnnotation(AbstractTokenAnnotation):
"""Subjectivity annotation/Sentiment analysis: a token annotation element"""
class Quote(AbstractStructureElement):
"""Quote: a structure element. For quotes/citations. May hold :class:`Word`, :class:`Sentence` or :class:`Paragraph` data."""
def __init__(self, doc, *args, **kwargs):
super(Quote,self).__init__(doc, *args, **kwargs)
def resolveword(self, id):
for child in self:
r = child.resolveword(id)
if r:
return r
return None
def append(self, child, *args, **kwargs):
#Quotes have some more complex ACCEPTED_DATA behaviour depending on what level they are used on
#Note that Sentences under quotes may occur if the parent of the quote is a sentence already
insentence = len(list(self.ancestors(Sentence))) > 0
inparagraph = len(list(self.ancestors(Paragraph))) > 0
if inspect.isclass(child):
if (insentence or inparagraph) and (child is Paragraph or child is Division):
raise Exception("Can't add paragraphs or divisions to a quote when the quote is in a sentence or paragraph!")
else:
if (insentence or inparagraph) and (isinstance(child, Paragraph) or isinstance(child, Division)):
raise Exception("Can't add paragraphs or divisions to a quote when the quote is in a sentence or paragraph!")
return super(Quote, self).append(child, *args, **kwargs)
def gettextdelimiter(self, retaintokenisation=False):
#no text delimiter of itself, recurse into children to inherit delimiter
for child in reversed(self):
if isinstance(child, Sentence):
return "" #if a quote ends in a sentence, we don't want any delimiter
else:
return child.gettextdelimiter(retaintokenisation)
return self.TEXTDELIMITER
class Sentence(AbstractStructureElement):
"""Sentence element. A structure element. Represents a sentence and holds all its words (:class:`Word`), and possibly other structure such as :class:`LineBreak`, :class:`Whitespace` and :class:`Quote`"""
def __init__(self, doc, *args, **kwargs):
"""
Example::
sentence = paragraph.append( folia.Sentence)
sentence.append( folia.Word, 'This')
sentence.append( folia.Word, 'is')
sentence.append( folia.Word, 'a')
sentence.append( folia.Word, 'test', space=False)
sentence.append( folia.Word, '.')
Example::
sentence = folia.Sentence( doc, folia.Word(doc, 'This'), folia.Word(doc, 'is'), folia.Word(doc, 'a'), folia.Word(doc, 'test', space=False), folia.Word(doc, '.') )
paragraph.append(sentence)
See also:
:meth:`AbstractElement.__init__`
"""
super(Sentence,self).__init__(doc, *args, **kwargs)
def resolveword(self, id):
for child in self:
r = child.resolveword(id)
if r:
return r
return None
def corrections(self):
"""Are there corrections in this sentence?
Returns:
bool
"""
return next(self.select(Correction), None) is not None
def paragraph(self):
"""Obtain the paragraph this sentence is a part of (None otherwise). Shortcut for :meth:`AbstractElement.ancestor`"""
return self.ancestor(Paragraph)
def division(self):
"""Obtain the division this sentence is a part of (None otherwise). Shortcut for :meth:`AbstractElement.ancestor`"""
return self.ancestor(Division)
def correctwords(self, originalwords, newwords, **kwargs):
"""Generic correction method for words. You most likely want to use the helper functions
:meth:`Sentence.splitword`, :meth:`Sentence.mergewords`, :meth:`Sentence.deleteword`, :meth:`Sentence.insertword` instead"""
for w in originalwords:
if not isinstance(w, Word):
raise Exception("Original word is not a Word instance: " + str(type(w)))
elif w.sentence() != self:
raise Exception("Original not found as member of sentence!")
for w in newwords:
if not isinstance(w, Word):
raise Exception("New word is not a Word instance: " + str(type(w)))
if 'suggest' in kwargs and kwargs['suggest']:
del kwargs['suggest']
return self.correct(suggestion=newwords,current=originalwords, **kwargs)
else:
return self.correct(original=originalwords, new=newwords, **kwargs)
def splitword(self, originalword, *newwords, **kwargs):
"""TODO: Write documentation"""
if isstring(originalword):
originalword = self.doc[u(originalword)]
return self.correctwords([originalword], newwords, **kwargs)
def mergewords(self, newword, *originalwords, **kwargs):
"""TODO: Write documentation"""
return self.correctwords(originalwords, [newword], **kwargs)
def deleteword(self, word, **kwargs):
"""TODO: Write documentation"""
if isstring(word):
word = self.doc[u(word)]
return self.correctwords([word], [], **kwargs)
def insertword(self, newword, prevword, **kwargs):
"""Inserts a word **as a correction** after an existing word.
This method automatically computes the index of insertion
and calls :meth:`AbstractElement.insert`
Arguments:
newword (:class:`Word`): The new word to insert
prevword (:class:`Word`): The word to insert after
Keyword Arguments:
suggest (bool): Do a suggestion for correction rather than the default authoritative correction
See also:
:meth:`AbstractElement.insert` and :meth:`AbstractElement.getindex` if you do not want to do this as a correction
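Example (a minimal sketch; the ID is hypothetical)::
prevword = doc['example.p.1.s.1.w.2']
sentence.insertword(folia.Word(doc, 'new', generate_id_in=sentence), prevword)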
"""
if prevword:
if isstring(prevword):
prevword = self.doc[u(prevword)]
if not prevword in self or not isinstance(prevword, Word):
raise Exception("Previous word not found or not instance of Word!")
if isinstance(newword, list) or isinstance(newword, tuple):
if not all([ isinstance(x, Word) for x in newword ]):
raise Exception("New word (iterable) constains non-Word instances!")
elif not isinstance(newword, Word):
raise Exception("New word no instance of Word!")
kwargs['insertindex'] = self.getindex(prevword) + 1
else:
kwargs['insertindex'] = 0
kwargs['nooriginal'] = True
if isinstance(newword, list) or isinstance(newword, tuple):
return self.correctwords([], newword, **kwargs)
else:
return self.correctwords([], [newword], **kwargs)
def insertwordleft(self, newword, nextword, **kwargs):
"""Inserts a word **as a correction** before an existing word.
Reverse of :meth:`Sentence.insertword`.
"""
if nextword:
if isstring(nextword):
nextword = self.doc[u(nextword)]
if not nextword in self or not isinstance(nextword, Word):
raise Exception("Next word not found or not instance of Word!")
if isinstance(newword, list) or isinstance(newword, tuple):
if not all([ isinstance(x, Word) for x in newword ]):
raise Exception("New word (iterable) constains non-Word instances!")
elif not isinstance(newword, Word):
raise Exception("New word no instance of Word!")
kwargs['insertindex'] = self.getindex(nextword)
else:
kwargs['insertindex'] = 0
kwargs['nooriginal'] = True
if isinstance(newword, list) or isinstance(newword, tuple):
return self.correctwords([], newword, **kwargs)
else:
return self.correctwords([], [newword], **kwargs)
def gettextdelimiter(self, retaintokenisation=False):
#no text delimiter of itself, recurse into children to inherit delimiter
for child in reversed(self):
if isinstance(child, (Linebreak, Whitespace)):
return "" #if a sentence ends in a linebreak, we don't want any delimiter
elif isinstance(child, Word) and not child.space:
return "" #if a sentence ends in a word with space=no, then we don't delimit either
elif isinstance(child, AbstractStructureElement):
#recurse? if the child is hidden in another element (part for instance?)
return child.gettextdelimiter(retaintokenisation) #recurse into the last structure element and inherit its delimiter
#TODO: what about corrections?
elif isinstance(child, (AbstractAnnotationLayer, AbstractTokenAnnotation) ):
continue #this never counts as the last element (issue #41), continue...
else:
break
return self.TEXTDELIMITER
class Utterance(AbstractStructureElement):
"""Utterance element. A structure element for speech annotation."""
class Event(AbstractStructureElement):
"""Structural element representing events, often used in new media contexts for things such as tweets,chat messages and forum posts."""
class Caption(AbstractStructureElement):
"""Element used for captions for :class:`Figure` or :class:`Table`"""
class Label(AbstractStructureElement):
"""Element used for labels. Mostly in within list item. Contains words."""
class ListItem(AbstractStructureElement):
"""Single element in a List. Structure element. Contained within :class:`List` element."""
class List(AbstractStructureElement):
"""Element for enumeration/itemisation. Structure element. Contains :class:`ListItem` elements."""
class Figure(AbstractStructureElement):
"""Element for the representation of a graphical figure. Structure element."""
def json(self, attribs = None, recurse=True,ignorelist=False):
if self.src:
if not attribs: attribs = {}
attribs['src'] = self.src
return super(Figure, self).json(attribs, recurse, ignorelist)
def caption(self):
"""Get the text of the figure's caption (:class:`Caption`). Raises :class:`NoSuchText` if the figure has no caption."""
try:
caption = next(self.select(Caption))
return caption.text()
except StopIteration:
raise NoSuchText
class Head(AbstractStructureElement):
"""Head element; a structure element that acts as the header/title of a :class:`Division`.
There may be only one per division. Often contains sentences (:class:`Sentence`) or Words (:class:`Word`)."""
class Paragraph(AbstractStructureElement):
"""Paragraph element. A structure element. Represents a paragraph and holds all its sentences (and possibly other structure Whitespace and Quotes)."""
class Cell(AbstractStructureElement):
"""A cell in a :class:`Row` in a :class:`Table`"""
pass
class Row(AbstractStructureElement):
"""A row in a :class:`Table`"""
pass
class TableHead(AbstractStructureElement):
"""Encapsulated the header of a table, contains :class:`Cell` elements"""
pass
class Table(AbstractStructureElement):
"""A table consisting of :class:`Row` elements that in turn consist of :class:`Cell` elements"""
pass
class Division(AbstractStructureElement):
"""Structure element representing some kind of division. Divisions may be nested at will, and may include almost all kinds of other structure elements."""
def head(self):
for e in self.data:
if isinstance(e, Head):
return e
raise NoSuchAnnotation()
class Speech(AbstractStructureElement):
"""A full speech. This is a high-level element. This element may contain :class:`Division`,:class:`Paragraph`, class:`Sentence`, etc.."""
# (both SPEAKABLE and PRINTABLE)
class Text(AbstractStructureElement):
"""A full text. This is a high-level element (not to be confused with TextContent!). This element may contain :class:`Division`,:class:`Paragraph`, class:`Sentence`, etc.."""
# (both SPEAKABLE and PRINTABLE)
class ForeignData(AbstractElement):
"""The ForeignData element encapsulated data that is not in FoLiA but in a different format.
Such data must use a different XML namespace and will be preserved as-is, that is the ``lxml.etree.Element`` instance is retained unmodified. No further interpretation takes place.
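Example (a minimal sketch; the namespace and content are hypothetical, ``doc`` is assumed to be a :class:`Document`)::
import lxml.etree
node = lxml.etree.fromstring('<mydata xmlns="urn:example">arbitrary non-FoLiA content</mydata>')
foreigndata = folia.ForeignData(doc, node=node)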
"""
def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called
self.data = []
if 'node' not in kwargs:
raise ValueError("Expected a node= keyword argument for foreign-data")
if not isinstance(kwargs['node'],ElementTree._Element):
raise ValueError("foreign-data node should be ElementTree.Element instance, got " + str(type(kwargs['node'])))
self.node = kwargs['node']
for subnode in self.node:
self._checknamespace(subnode)
self.doc = doc
self.id = None
self.auth = True
self.next = None #chains foreigndata
#do not call superconstructor
def _checknamespace(self, node):
#namespace must be foreign; elements in the FoLiA namespace are not allowed inside foreign-data
if node.tag and node.tag.startswith('{'+NSFOLIA+'}'):
raise ValueError("foreign-data may not include elements in the FoLiA namespace, a foreign XML namespace is mandatory")
for subnode in node:
self._checknamespace(subnode)
@classmethod
def parsexml(Class, node, doc, **kwargs):
return ForeignData(doc, node=node)
def select(self, Class, set=None, recursive=True, ignore=True, node=None): #pylint: disable=bad-classmethod-argument,redefined-builtin
"""This is a dummy method that returns an empty generator, select() does not work on ForeignData"""
#select can never descend into ForeignData, empty generator:
return
yield
def xml(self, attribs = None,elements = None, skipchildren = False):
"""Returns the XML node (an lxml.etree.Element) that holds the foreign data"""
return self.node
@classmethod
def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None):
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
return E.define( E.element(E.ref(name="any_content"), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA)
#===================================================================================================================
class Query(object):
"""An XPath query on one or more FoLiA documents"""
def __init__(self, files, expression):
if isstring(files):
self.files = [u(files)]
else:
assert hasattr(files,'__iter__')
self.files = files
self.expression = expression
def __iter__(self):
for filename in self.files:
doc = Document(file=filename, mode=Mode.XPATH)
for result in doc.xpath(self.expression):
yield result
class RegExp(object):
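"""Regular expression wrapper for use in :class:`Pattern`; a match is attempted whenever the instance is compared for equality against a string."""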
def __init__(self, regexp):
self.regexp = re.compile(regexp)
def __eq__(self, value):
return self.regexp.match(value)
class Pattern(object):
"""
This class describes a pattern over words to be searched for. The
:meth:`Document.findwords` method can subsequently be called with this pattern,
and it will return all the words that match. An example will best illustrate
this, first a trivial example of searching for one word::
for match in doc.findwords( folia.Pattern('house') ):
for word in match:
print word.id
print "----"
The same can be done for a sequence::
for match in doc.findwords( folia.Pattern('a','big', 'house') ):
for word in match:
print word.id
print "----"
The boolean value ``True`` acts as a wildcard, matching any word::
for match in doc.findwords( folia.Pattern('a',True,'house') ):
for word in match:
print word.id, word.text()
print "----"
Alternatively, and more constraining, you may also specify a tuple of alternatives::
for match in doc.findwords( folia.Pattern('a',('big','small'),'house') ):
for word in match:
print word.id, word.text()
print "----"
Or even a regular expression using the ``folia.RegExp`` class::
for match in doc.findwords( folia.Pattern('a', folia.RegExp('b?g'),'house') ):
for word in match:
print word.id, word.text()
print "----"
Rather than searching on the text content of the words, you can search on the
classes of any kind of token annotation using the keyword argument
``matchannotation=``::
for match in doc.findwords( folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
for word in match:
print word.id, word.text()
print "----"
The set can be restricted by adding the additional keyword argument
``matchannotationset=``. Case sensitivity, by default disabled, can be enabled by setting ``casesensitive=True``.
Things become even more interesting when different Patterns are combined. A
match will have to satisfy all patterns::
for match in doc.findwords( folia.Pattern('a', True, 'house'), folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
for word in match:
print word.id, word.text()
print "----"
The ``findwords()`` method can be instructed to also return left and/or right context for any match. This is done using the ``leftcontext=`` and ``rightcontext=`` keyword arguments, their values being an integer number of the number of context words to include in each match. For instance, we can look for the word house and return its immediate neighbours as follows::
for match in doc.findwords( folia.Pattern('house') , leftcontext=1, rightcontext=1):
for word in match:
print word.id
print "----"
A match here would thus always consist of three words instead of just one.
Last, ``Pattern`` also has support for variable-width gaps, the asterisk symbol
has special meaning to this end::
for match in doc.findwords( folia.Pattern('a','*','house') ):
for word in match:
print word.id
print "----"
Unlike the pattern ``('a',True,'house')``, which by definition is a pattern of
three words, the pattern in the example above will match gaps of any length (up
to a certain built-in maximum), so this might include matches such as *a very
nice house*.
Some remarks on these methods of querying are in order. These searches are
pretty exhaustive and are done by simply iterating over all the words in the
document. The entire document is loaded in memory and no special indices are involved.
For single documents this is okay, but when iterating over a corpus of
thousands of documents, this method is too slow, especially for real-time
applications. For huge corpora, clever indexing and database management systems
will be required. This however is beyond the scope of this library.
"""
def __init__(self, *args, **kwargs):
if not all( ( (x is True or isinstance(x,RegExp) or isstring(x) or isinstance(x, list) or isinstance(x, tuple)) for x in args )):
raise TypeError
self.sequence = args
if 'matchannotation' in kwargs:
self.matchannotation = kwargs['matchannotation']
del kwargs['matchannotation']
else:
self.matchannotation = None
if 'matchannotationset' in kwargs:
self.matchannotationset = kwargs['matchannotationset']
del kwargs['matchannotationset']
else:
self.matchannotationset = None
if 'casesensitive' in kwargs:
self.casesensitive = bool(kwargs['casesensitive'])
del kwargs['casesensitive']
else:
self.casesensitive = False
for key in kwargs.keys():
raise Exception("Unknown keyword parameter: " + key)
if not self.casesensitive:
if all( ( isstring(x) for x in self.sequence) ):
self.sequence = [ u(x).lower() for x in self.sequence ]
def __nonzero__(self): #Python 2.x
return True
def __bool__(self):
return True
def __len__(self):
return len(self.sequence)
def __getitem__(self, index):
return self.sequence[index]
def __getslice__(self, begin,end):
return self.sequence[begin:end]
def variablesize(self):
return ('*' in self.sequence)
def variablewildcards(self):
wildcards = []
for i,x in enumerate(self.sequence):
if x == '*':
wildcards.append(i)
return wildcards
def __repr__(self):
return repr(self.sequence)
def resolve(self,size, distribution):
"""Resolve a variable sized pattern to all patterns of a certain fixed size"""
if not self.variablesize():
raise Exception("Can only resize patterns with * wildcards")
nrofwildcards = 0
for x in self.sequence:
if x == '*':
nrofwildcards += 1
assert (len(distribution) == nrofwildcards)
wildcardnr = 0
newsequence = []
for x in self.sequence:
if x == '*':
newsequence += [True] * distribution[wildcardnr]
wildcardnr += 1
else:
newsequence.append(x)
d = { 'matchannotation':self.matchannotation, 'matchannotationset':self.matchannotationset, 'casesensitive':self.casesensitive }
yield Pattern(*newsequence, **d )
class ExternalMetaData(object):
def __init__(self, url):
self.url = url
class NativeMetaData(object):
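"""Native FoLiA metadata: a simple ordered key/value store in which all values are coerced to strings.
Example (a minimal sketch)::
metadata = NativeMetaData(title='Example')
metadata['language'] = 'nld'
for key, value in metadata.items():
print(key + ': ' + value)
"""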
def __init__(self, *args, **kwargs):
self.data = {}
self.order = []
for key, value in kwargs.items():
self[key] = value
def __setitem__(self, key, value):
exists = key in self.data
if sys.version < '3':
self.data[key] = unicode(value)
else:
self.data[key] = str(value)
if not exists: self.order.append(key)
def __iter__(self):
for x in self.order:
yield x
def __contains__(self, x):
return x in self.data
def items(self):
for key in self.order:
yield key, self.data[key]
def __len__(self):
return len(self.data)
def __getitem__(self, key):
return self.data[key]
def __delitem__(self,key):
del self.data[key]
self.order.remove(key)
class Document(object):
"""This is the FoLiA Document and holds all its data in memory.
All FoLiA elements have to be associated with a FoLiA document.
Besides holding elements, the document may hold metadata including declarations, and an index of all IDs."""
IDSEPARATOR = '.'
def __init__(self, *args, **kwargs):
"""Start/load a FoLiA document:
There are four sources of input for loading a FoLiA document::
1) Create a new document by specifying an *ID*::
doc = folia.Document(id='test')
2) Load a document from FoLiA or D-Coi XML file::
doc = folia.Document(file='/path/to/doc.xml')
3) Load a document from an XML string::
doc = folia.Document(string='....')
4) Load a document by passing a parsed XML tree (lxml.etree)::
doc = folia.Document(tree=xmltree)
Additionally, there are two modes that can be set with the ``mode=`` keyword argument:
* folia.Mode.MEMORY - The entire FoLiA Document will be loaded into memory. This is the default mode and the only mode in which documents can be manipulated and saved again.
* folia.Mode.XPATH - The full XML tree will still be loaded into memory, but conversion to FoLiA classes occurs only when queried. This mode can be used when the full power of XPath is required.
Keyword Arguments:
setdefinitions (dict): A dictionary of set definitions, the key corresponds to the set name, the value is a SetDefinition instance
loadsetdefinitions (bool): download and load set definitions (default: False)
deepvalidation (bool): Do deep validation of the document (default: False), implies ``loadsetdefinitions``
textvalidation (bool): Do validation of text consistency (default: False)
preparsexmlcallback (function): Callback for a function taking one argument (``node``, an lxml node). Will be called before an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort parsing this element (and all its children)
parsexmlcallback (function): Callback for a function taking one argument (``element``, a FoLiA element). Will be called whenever an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort adding this element (and all its children)
debug (bool): Boolean to enable/disable debug
"""
self.version = FOLIAVERSION
self.data = [] #will hold all texts (usually only one)
self.annotationdefaults = {}
self.annotations = [] #Ordered list of incorporated annotations ['token','pos', etc..]
#Add implicit declaration for TextContent
self.annotations.append( (AnnotationType.TEXT,'undefined') )
self.annotationdefaults[AnnotationType.TEXT] = {'undefined': {} }
#Add implicit declaration for PhonContent
self.annotations.append( (AnnotationType.PHON,'undefined') )
self.annotationdefaults[AnnotationType.PHON] = {'undefined': {} }
self.index = {} #all IDs go here
self.declareprocessed = False # Will be set to True when declarations have been processed
self.metadata = NativeMetaData() #will hold the document's metadata (native metadata by default)
self.metadatatype = "native"
self.submetadata = OrderedDict()
self.submetadatatype = {}
self.alias_set = {} #alias to set map (keyed by annotationtype first)
self.set_alias = {} #set to alias map (keyed by annotationtype first)
self.textclasses = set() #will contain the text classes found
self.autodeclare = False #Automatic declarations in case of undeclared elements (will be enabled for DCOI, since DCOI has no declarations)
self.sortspans = kwargs.get('sortspans', True) #sort references in span elements
if 'setdefinitions' in kwargs:
self.setdefinitions = kwargs['setdefinitions'] #to re-use a shared store
else:
self.setdefinitions = {} #key: set name, value: SetDefinition instance (only used when deepvalidation=True)
#The metadata fields FoLiA is directly aware of:
self._title = self._date = self._publisher = self._license = self._language = None
if 'debug' in kwargs:
self.debug = kwargs['debug']
else:
self.debug = False
if 'verbose' in kwargs:
self.verbose = kwargs['verbose']
else:
self.verbose = False
if 'mode' in kwargs:
self.mode = int(kwargs['mode'])
else:
self.mode = Mode.MEMORY #Load all in memory
if 'parentdoc' in kwargs: #for subdocuments
assert isinstance(kwargs['parentdoc'], Document)
self.parentdoc = kwargs['parentdoc']
else:
self.parentdoc = None
self.subdocs = {} #will hold all subdocs (sourcestring => document) , needed so the index can resolve IDs in subdocs
self.standoffdocs = {} #will hold all standoffdocs (type => set => sourcestring => document)
if 'external' in kwargs:
self.external = kwargs['external']
else:
self.external = False
if self.external and not self.parentdoc:
raise DeepValidationError("Document is marked as external and should not be loaded independently. However, no parentdoc= has been specified!")
if 'loadsetdefinitions' in kwargs:
self.loadsetdefinitions = bool(kwargs['loadsetdefinitions'])
else:
self.loadsetdefinitions = False
if 'deepvalidation' in kwargs:
self.deepvalidation = bool(kwargs['deepvalidation'])
else:
self.deepvalidation = False
if self.deepvalidation:
self.loadsetdefinitions = True
if 'textvalidation' in kwargs:
self.textvalidation = bool(kwargs['textvalidation'])
else:
self.textvalidation = False
self.textvalidationerrors = 0 #will count the number of text validation errors
self.offsetvalidationbuffer = [] #will hold (AbstractStructureElement, textclass pairs) that need to be validated still (if textvalidation == True), validation will be done when all parsing is complete and/or prior to serialisation
if 'allowadhocsets' in kwargs:
self.allowadhocsets = bool(kwargs['allowadhocsets'])
else:
if self.deepvalidation:
self.allowadhocsets = False
else:
self.allowadhocsets = True
if 'autodeclare' in kwargs:
self.autodeclare = bool(kwargs['autodeclare'])
if 'bypassleak' in kwargs:
self.bypassleak = False #obsolete now
if 'preparsexmlcallback' in kwargs:
self.preparsexmlcallback = kwargs['preparsexmlcallback']
else:
self.preparsexmlcallback = None
if 'parsexmlcallback' in kwargs:
self.parsexmlcallback = kwargs['parsexmlcallback']
else:
self.parsexmlcallback = None
if 'id' in kwargs:
isncname(kwargs['id'])
self.id = kwargs['id']
elif 'file' in kwargs:
self.filename = kwargs['file']
if self.filename[-4:].lower() == '.bz2':
f = bz2.BZ2File(self.filename)
contents = f.read()
f.close()
self.tree = xmltreefromstring(contents)
del contents
self.parsexml(self.tree.getroot())
elif self.filename[-3:].lower() == '.gz':
f = gzip.GzipFile(self.filename) #pylint: disable=redefined-variable-type
contents = f.read()
f.close()
self.tree = xmltreefromstring(contents)
del contents
self.parsexml(self.tree.getroot())
else:
self.load(self.filename)
elif 'string' in kwargs:
self.tree = xmltreefromstring(kwargs['string'])
del kwargs['string']
self.parsexml(self.tree.getroot())
if self.mode != Mode.XPATH:
#XML Tree is now obsolete (only needed when partially loaded for xpath queries)
self.tree = None
elif 'tree' in kwargs:
self.parsexml(kwargs['tree'])
else:
raise Exception("No ID, filename or tree specified")
if self.mode != Mode.XPATH:
#XML Tree is now obsolete (only needed when partially loaded for xpath queries), free memory
self.tree = None
#def __del__(self):
# del self.index
# for child in self.data:
# del child
# del self.data
def load(self, filename):
"""Load a FoLiA XML file.
Argument:
filename (str): The file to load
"""
#if LXE and self.mode != Mode.XPATH:
# #workaround for xml:id problem (disabled)
# #f = open(filename)
# #s = f.read().replace(' xml:id=', ' id=')
# #f.close()
# self.tree = ElementTree.parse(filename)
#else:
self.tree = xmltreefromfile(filename)
self.parsexml(self.tree.getroot())
if self.mode != Mode.XPATH:
#XML Tree is now obsolete (only needed when partially loaded for xpath queries)
self.tree = None
def items(self):
"""Returns a depth-first flat list of all items in the document"""
l = []
for e in self.data:
l += e.items()
return l
def xpath(self, query):
"""Run Xpath expression and parse the resulting elements. Don't forget to use the FoLiA namesapace in your expressions, using folia: or the short form f: """
for result in self.tree.xpath(query,namespaces={'f': 'http://ilk.uvt.nl/folia','folia': 'http://ilk.uvt.nl/folia' }):
yield self.parsexml(result)
def alias(self, annotationtype, set, fallback=False):
"""Return the alias for a set (if applicable, returns the unaltered set otherwise iff fallback is enabled)"""
if inspect.isclass(annotationtype): annotationtype = annotationtype.ANNOTATIONTYPE
if annotationtype in self.set_alias and set in self.set_alias[annotationtype]:
return self.set_alias[annotationtype][set]
elif fallback:
return set
else:
raise KeyError("No alias for set " + set)
def unalias(self, annotationtype, alias):
"""Return the set for an alias (if applicable, raises an exception otherwise)"""
if inspect.isclass(annotationtype): annotationtype = annotationtype.ANNOTATIONTYPE
return self.alias_set[annotationtype][alias]
def findwords(self, *args, **kwargs):
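"""Find words matching one or more :class:`Pattern` instances; yields each match as a list of :class:`Word` instances. See :class:`Pattern` for an explanation and extensive examples."""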
for x in findwords(self,self.words,*args,**kwargs):
yield x
def save(self, filename=None):
"""Save the document to file.
Arguments:
* filename (str): The filename to save to. If not set (``None``, default), saves to the same file as loaded from.
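Filenames ending in ``.gz`` or ``.bz2`` will be compressed automatically.
Example::
doc.save('/path/to/output.folia.xml')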
"""
if not filename:
filename = self.filename
if not filename:
raise Exception("No filename specified")
if filename[-4:].lower() == '.bz2':
f = bz2.BZ2File(filename,'wb')
f.write(self.xmlstring().encode('utf-8'))
f.close()
elif filename[-3:].lower() == '.gz':
f = gzip.GzipFile(filename,'wb') #pylint: disable=redefined-variable-type
f.write(self.xmlstring().encode('utf-8'))
f.close()
else:
f = io.open(filename,'w',encoding='utf-8')
f.write(self.xmlstring())
f.close()
def __len__(self):
return len(self.data)
def __nonzero__(self): #Python 2.x
return True
def __bool__(self):
return True
def __iter__(self):
for text in self.data:
yield text
def __contains__(self, key):
"""Tests if the specified element ID is in the document index"""
if key in self.index:
return True
elif self.subdocs:
for subdoc in self.subdocs.values():
if key in subdoc:
return True
return False
else:
return False
def __getitem__(self, key):
"""Obtain an element by ID from the document index.
Example::
word = doc['example.p.4.s.10.w.3']
"""
if isinstance(key, int):
return self.data[key]
else:
try:
return self.index[key]
except KeyError:
if self.subdocs: #perhaps the key is in one of our subdocs?
for subdoc in self.subdocs.values():
try:
return subdoc[key]
except KeyError:
pass
raise KeyError("No such key: " + key)
def append(self,text):
"""Add a text (or speech) to the document:
Example 1::
doc.append(folia.Text)
Example 2::
doc.append( folia.Text(doc, id='example.text') )
Example 3::
doc.append(folia.Speech)
"""
if text is Text:
text = Text(self, id=self.id + '.text.' + str(len(self.data)+1) )
elif text is Speech:
text = Speech(self, id=self.id + '.speech.' + str(len(self.data)+1) ) #pylint: disable=redefined-variable-type
else:
assert isinstance(text, Text) or isinstance(text, Speech)
self.data.append(text)
return text
def add(self,text):
"""Alias for :meth:`Document.append`"""
return self.append(text)
def create(self, Class, *args, **kwargs):
"""Create an element associated with this Document. This method may be obsolete and removed later."""
return Class(self, *args, **kwargs)
def xmldeclarations(self):
"""Internal method to generate XML nodes for all declarations"""
l = []
E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={None: "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
for annotationtype, set in self.annotations:
label = None
#Find the 'label' for the declarations dynamically (aka: AnnotationType --> String)
for key, value in vars(AnnotationType).items():
if value == annotationtype:
label = key
break
#gather attribs
if (annotationtype == AnnotationType.TEXT or annotationtype == AnnotationType.PHON) and set == 'undefined' and len(self.annotationdefaults[annotationtype][set]) == 0:
#this is the implicit TextContent declaration, no need to output it explicitly
continue
attribs = {}
if set and set != 'undefined':
attribs['{' + NSFOLIA + '}set'] = set
for key, value in self.annotationdefaults[annotationtype][set].items():
if key == 'annotatortype':
if value == AnnotatorType.MANUAL:
attribs['{' + NSFOLIA + '}' + key] = 'manual'
elif value == AnnotatorType.AUTO:
attribs['{' + NSFOLIA + '}' + key] = 'auto'
elif key == 'datetime':
attribs['{' + NSFOLIA + '}' + key] = value.strftime("%Y-%m-%dT%H:%M:%S") #proper iso-formatting
elif value:
attribs['{' + NSFOLIA + '}' + key] = value
if label:
l.append( makeelement(E,'{' + NSFOLIA + '}' + label.lower() + '-annotation', **attribs) )
else:
raise Exception("Invalid annotation type")
return l
def jsondeclarations(self):
"""Return all declarations in a form ready to be serialised to JSON.
Returns:
list of dict
"""
l = []
for annotationtype, set in self.annotations:
label = None
#Find the 'label' for the declarations dynamically (aka: AnnotationType --> String)
for key, value in vars(AnnotationType).items():
if value == annotationtype:
label = key
break
#gather attribs
if (annotationtype == AnnotationType.TEXT or annotationtype == AnnotationType.PHON) and set == 'undefined' and len(self.annotationdefaults[annotationtype][set]) == 0:
#this is the implicit TextContent declaration, no need to output it explicitly
continue
if not label:
raise Exception("Invalid annotation type")
jsonnode = {'annotationtype': label.lower()}
if set and set != 'undefined':
jsonnode['set'] = set
for key, value in self.annotationdefaults[annotationtype][set].items():
if key == 'annotatortype':
if value == AnnotatorType.MANUAL:
jsonnode[key] = 'manual'
elif value == AnnotatorType.AUTO:
jsonnode[key] = 'auto'
elif key == 'datetime':
jsonnode[key] = value.strftime("%Y-%m-%dT%H:%M:%S") #proper iso-formatting
elif value:
jsonnode[key] = value
l.append( jsonnode )
return l
def xml(self):
"""Serialise the document to XML.
Returns:
lxml.etree.Element
See also:
:meth:`Document.xmlstring`
"""
self.pendingvalidation()
E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={'xml' : "http://www.w3.org/XML/1998/namespace", 'xlink':"http://www.w3.org/1999/xlink"})
attribs = {}
attribs['{http://www.w3.org/XML/1998/namespace}id'] = self.id
#if self.version:
# attribs['version'] = self.version
#else:
attribs['version'] = FOLIAVERSION
attribs['generator'] = 'pynlpl.formats.folia-v' + LIBVERSION
metadataattribs = {}
metadataattribs['{' + NSFOLIA + '}type'] = self.metadatatype
if isinstance(self.metadata, ExternalMetaData):
metadataattribs['{' + NSFOLIA + '}src'] = self.metadata.url
e = E.FoLiA(
E.metadata(
E.annotations(
*self.xmldeclarations()
),
*self.xmlmetadata(),
**metadataattribs
)
, **attribs)
for text in self.data:
e.append(text.xml())
return e
def json(self):
"""Serialise the document to a ``dict`` ready for serialisation to JSON.
Example::
import json
jsondoc = json.dumps(doc.json())
"""
self.pendingvalidation()
jsondoc = {'id': self.id, 'children': [], 'declarations': self.jsondeclarations() }
if self.version:
jsondoc['version'] = self.version
else:
jsondoc['version'] = FOLIAVERSION
jsondoc['generator'] = 'pynlpl.formats.folia-v' + LIBVERSION
for text in self.data:
jsondoc['children'].append(text.json())
return jsondoc
def xmlmetadata(self):
"""Internal method to serialize metadata to XML"""
E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={None: "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"})
elements = []
if self.metadatatype == "native":
if isinstance(self.metadata, NativeMetaData):
for key, value in self.metadata.items():
elements.append(E.meta(value,id=key) )
else:
if isinstance(self.metadata, ForeignData):
#in-document
m = self.metadata
while m is not None:
elements.append(m.xml())
m = m.next
for metadata_id, submetadata in self.submetadata.items():
subelements = []
attribs = {
"{http://www.w3.org/XML/1998/namespace}id": metadata_id,
"type": self.submetadatatype[metadata_id] }
if isinstance(submetadata, NativeMetaData):
for key, value in submetadata.items():
subelements.append(E.meta(value,id=key) )
elif isinstance(submetadata, ExternalMetaData):
attribs['src'] = submetadata.url
elif isinstance(submetadata, ForeignData):
#in-document
m = submetadata
while m is not None:
subelements.append(m.xml())
m = m.next
elements.append( E.submetadata(*subelements, **attribs))
return elements
def parsexmldeclarations(self, node):
"""Internal method to parse XML declarations"""
if self.debug >= 1:
print("[PyNLPl FoLiA DEBUG] Processing Annotation Declarations",file=stderr)
self.declareprocessed = True
for subnode in node: #pylint: disable=too-many-nested-blocks
if not isinstance(subnode.tag, str): continue
if subnode.tag[:25] == '{' + NSFOLIA + '}' and subnode.tag[-11:] == '-annotation':
prefix = subnode.tag[25:][:-11]
type = None
if prefix.upper() in vars(AnnotationType):
type = vars(AnnotationType)[prefix.upper()]
else:
raise Exception("Unknown declaration: " + subnode.tag)
if 'set' in subnode.attrib and subnode.attrib['set']:
set = subnode.attrib['set']
else:
set = 'undefined'
if (type,set) in self.annotations:
if type == AnnotationType.TEXT:
#explicit Text declaration, remove the implicit declaration:
a = []
for t,s in self.annotations:
if not (t == AnnotationType.TEXT and s == 'undefined'):
a.append( (t,s) )
self.annotations = a
#raise ValueError("Double declaration of " + subnode.tag + ", set '" + set + "' + is already declared") //doubles are okay says Ko
else:
self.annotations.append( (type, set) )
#Load set definition
if set and self.loadsetdefinitions and set not in self.setdefinitions:
if set[:7] == "http://" or set[:8] == "https://" or set[:6] == "ftp://":
try:
self.setdefinitions[set] = SetDefinition(set,verbose=self.verbose) #will raise exception on error
except DeepValidationError:
print("WARNING: Set " + set + " could not be downloaded, ignoring!",file=sys.stderr) #warning and ignore
#Set defaults
if type in self.annotationdefaults and set in self.annotationdefaults[type]:
#handle duplicate. If ambiguous: remove defaults
if 'annotator' in subnode.attrib:
if not ('annotator' in self.annotationdefaults[type][set]):
self.annotationdefaults[type][set]['annotator'] = subnode.attrib['annotator']
elif self.annotationdefaults[type][set]['annotator'] != subnode.attrib['annotator']:
del self.annotationdefaults[type][set]['annotator']
if 'annotatortype' in subnode.attrib:
if not ('annotatortype' in self.annotationdefaults[type][set]):
self.annotationdefaults[type][set]['annotatortype'] = subnode.attrib['annotatortype']
elif self.annotationdefaults[type][set]['annotatortype'] != subnode.attrib['annotatortype']:
del self.annotationdefaults[type][set]['annotatortype']
else:
defaults = {}
if 'annotator' in subnode.attrib:
defaults['annotator'] = subnode.attrib['annotator']
if 'annotatortype' in subnode.attrib:
if subnode.attrib['annotatortype'] == 'auto':
defaults['annotatortype'] = AnnotatorType.AUTO
else:
defaults['annotatortype'] = AnnotatorType.MANUAL
if 'datetime' in subnode.attrib:
if isinstance(subnode.attrib['datetime'], datetime):
defaults['datetime'] = subnode.attrib['datetime']
else:
defaults['datetime'] = parse_datetime(subnode.attrib['datetime'])
if not type in self.annotationdefaults:
self.annotationdefaults[type] = {}
self.annotationdefaults[type][set] = defaults
if 'external' in subnode.attrib and subnode.attrib['external']:
if self.debug >= 1:
print("[PyNLPl FoLiA DEBUG] Loading external document: " + subnode.attrib['external'],file=stderr)
if not type in self.standoffdocs:
self.standoffdocs[type] = {}
self.standoffdocs[type][set] = {}
#check if it is already loaded, if multiple references are made to the same doc we reuse the instance
standoffdoc = None
for t in self.standoffdocs:
for s in self.standoffdocs[t]:
for source in self.standoffdocs[t][s]:
if source == subnode.attrib['external']:
standoffdoc = self.standoffdocs[t][s][source]
break
if standoffdoc: break
if standoffdoc: break
if not standoffdoc:
if subnode.attrib['external'][:7] == 'http://' or subnode.attrib['external'][:8] == 'https://':
#document is remote, download (in memory)
try:
f = urlopen(subnode.attrib['external'])
except Exception:
raise DeepValidationError("Unable to download standoff document: " + subnode.attrib['external'])
try:
content = u(f.read())
except IOError:
raise DeepValidationError("Unable to download standoff document: " + subnode.attrib['external'])
f.close()
standoffdoc = Document(string=content, parentdoc=self, setdefinitions=self.setdefinitions)
elif os.path.exists(subnode.attrib['external']):
#document is on disk:
standoffdoc = Document(file=subnode.attrib['external'], parentdoc=self, setdefinitions=self.setdefinitions)
else:
#document not found
raise DeepValidationError("Unable to find standoff document: " + subnode.attrib['external'])
self.standoffdocs[type][set][subnode.attrib['external']] = standoffdoc
standoffdoc.parentdoc = self
if self.debug >= 1:
print("[PyNLPl FoLiA DEBUG] Found declared annotation " + subnode.tag + ". Defaults: " + repr(defaults),file=stderr)
def setimdi(self, node): #OBSOLETE
"""OBSOLETE"""
ns = {'imdi': 'http://www.mpi.nl/IMDI/Schema/IMDI'}
self.metadatatype = MetaDataType.IMDI
if LXE:
self.metadata = ElementTree.tostring(node, xml_declaration=False, pretty_print=True, encoding='utf-8')
else:
self.metadata = ElementTree.tostring(node, encoding='utf-8')
n = node.xpath('imdi:Session/imdi:Title', namespaces=ns)
if n and n[0].text: self._title = n[0].text
n = node.xpath('imdi:Session/imdi:Date', namespaces=ns)
if n and n[0].text: self._date = n[0].text
n = node.xpath('//imdi:Source/imdi:Access/imdi:Publisher', namespaces=ns)
if n and n[0].text: self._publisher = n[0].text
n = node.xpath('//imdi:Source/imdi:Access/imdi:Availability', namespaces=ns)
if n and n[0].text: self._license = n[0].text
n = node.xpath('//imdi:Languages/imdi:Language/imdi:ID', namespaces=ns)
if n and n[0].text: self._language = n[0].text
def declare(self, annotationtype, set, **kwargs):
"""Declare a new annotation type to be used in the document.
Keyword arguments can be used to set defaults for any annotation of this type and set.
Arguments:
annotationtype: The type of annotation; this is conveyed by passing the corresponding annotation class (such as :class:`PosAnnotation`) or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``.
set (str): the set, should formally be a URL pointing to the set definition
Keyword Arguments:
annotator (str): Sets a default annotator
annotatortype: Should be either ``AnnotatorType.MANUAL`` or ``AnnotatorType.AUTO``, indicating whether the annotation was performed manually or by an automated process.
datetime (datetime.datetime): Sets the default datetime
alias (str): Defines an alias that may be used in the set attribute of elements instead of the full set name
Example::
doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', annotator="mytagger", annotatortype=folia.AnnotatorType.AUTO)
"""
if (sys.version > '3' and not isinstance(set,str)) or (sys.version < '3' and not isinstance(set,(str,unicode))):
raise ValueError("Set parameter for declare() must be a string")
if inspect.isclass(annotationtype):
annotationtype = annotationtype.ANNOTATIONTYPE
if annotationtype in self.alias_set and set in self.alias_set[annotationtype]:
raise ValueError("Set " + set + " conflicts with alias, may not be equal!")
if not (annotationtype, set) in self.annotations:
self.annotations.append( (annotationtype,set) )
if set and self.loadsetdefinitions and not set in self.setdefinitions:
if set[:7] == "http://" or set[:8] == "https://" or set[:6] == "ftp://":
self.setdefinitions[set] = SetDefinition(set,verbose=self.verbose) #will raise exception on error
if not annotationtype in self.annotationdefaults:
self.annotationdefaults[annotationtype] = {}
self.annotationdefaults[annotationtype][set] = kwargs
if 'alias' in kwargs:
if annotationtype in self.set_alias and set in self.set_alias[annotationtype] and self.set_alias[annotationtype][set] != kwargs['alias']:
raise ValueError("Redeclaring set " + set + " with another alias ('"+kwargs['alias']+"') is not allowed!")
if annotationtype in self.alias_set and kwargs['alias'] in self.alias_set[annotationtype] and self.alias_set[annotationtype][kwargs['alias']] != set:
raise ValueError("Redeclaring alias " + kwargs['alias'] + " with another set ('"+set+"') is not allowed!")
if annotationtype in self.set_alias and kwargs['alias'] in self.set_alias[annotationtype]:
raise ValueError("Alias " + kwargs['alias'] + " conflicts with set name, may not be equal!")
if annotationtype not in self.alias_set:
self.alias_set[annotationtype] = {}
if annotationtype not in self.set_alias:
self.set_alias[annotationtype] = {}
self.alias_set[annotationtype][kwargs['alias']] = set
self.set_alias[annotationtype][set] = kwargs['alias']
def declared(self, annotationtype, set):
"""Checks if the annotation type is present (i.e. declared) in the document.
Arguments:
annotationtype: The type of annotation; this is conveyed by passing the corresponding annotation class (such as :class:`PosAnnotation`) or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``.
set (str): the set, should formally be a URL pointing to the set definition (aliases are also supported)
Example::
if doc.declared(folia.PosAnnotation, 'http://some/path/brown-tag-set'):
..
Returns:
bool
"""
if inspect.isclass(annotationtype): annotationtype = annotationtype.ANNOTATIONTYPE
return ( (annotationtype,set) in self.annotations) or (annotationtype in self.alias_set and set in self.alias_set[annotationtype] and (annotationtype, self.alias_set[annotationtype][set]) in self.annotations )
def defaultset(self, annotationtype):
"""Obtain the default set for the specified annotation type.
Arguments:
annotationtype: The type of annotation, this is conveyed by passing the corresponding annotation class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``.
Returns:
the set (str)
Raises:
:class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)
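Example (a minimal sketch, assuming a single part-of-speech set was declared)::
posset = doc.defaultset(folia.AnnotationType.POS)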
"""
if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE
try:
return list(self.annotationdefaults[annotationtype].keys())[0]
except KeyError:
raise NoDefaultError
except IndexError:
raise NoDefaultError
def defaultannotator(self, annotationtype, set=None):
"""Obtain the default annotator for the specified annotation type and set.
Arguments:
annotationtype: The type of annotation, this is conveyed by passing the corresponding annotation class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``.
set (str): the set, should formally be a URL pointing to the set definition
Returns:
the annotator (str)
Raises:
:class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)
"""
if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE
if not set: set = self.defaultset(annotationtype)
try:
return self.annotationdefaults[annotationtype][set]['annotator']
except KeyError:
raise NoDefaultError
def defaultannotatortype(self, annotationtype,set=None):
"""Obtain the default annotator type for the specified annotation type and set.
Arguments:
annotationtype: The type of annotation, this is conveyed by passing the corresponding annotation class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``.
set (str): the set, should formally be a URL pointing to the set definition
Returns:
``AnnotatorType.AUTO`` or ``AnnotatorType.MANUAL``
Raises:
:class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)
"""
if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE
if not set: set = self.defaultset(annotationtype)
try:
return self.annotationdefaults[annotationtype][set]['annotatortype']
except KeyError:
raise NoDefaultError
def defaultdatetime(self, annotationtype,set=None):
"""Obtain the default datetime for the specified annotation type and set.
Arguments:
annotationtype: The type of annotation, this is conveyed by passing the corresponding annotation class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``.
set (str): the set, should formally be a URL pointing to the set definition
Returns:
the default datetime (datetime.datetime)
Raises:
:class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)
"""
if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE
if not set: set = self.defaultset(annotationtype)
try:
return self.annotationdefaults[annotationtype][set]['datetime']
except KeyError:
raise NoDefaultError
def title(self, value=None):
"""Get or set the document's title from/in the metadata
No arguments: Get the document's title from metadata
Argument: Set the document's title in metadata
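Example (the title value is hypothetical)::
doc.title("A Hypothetical Title") #setter
print(doc.title()) #getter
The same get/set pattern applies to :meth:`date`, :meth:`publisher`, :meth:`license` and :meth:`language`.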
"""
if not (value is None):
if (self.metadatatype == "native"):
self.metadata['title'] = value
else:
self._title = value
if (self.metadatatype == "native"):
if 'title' in self.metadata:
return self.metadata['title']
else:
return None
else:
return self._title
def date(self, value=None):
"""Get or set the document's date from/in the metadata.
No arguments: Get the document's date from metadata
Argument: Set the document's date in metadata
"""
if not (value is None):
if (self.metadatatype == "native"):
self.metadata['date'] = value
else:
self._date = value
if (self.metadatatype == "native"):
if 'date' in self.metadata:
return self.metadata['date']
else:
return None
else:
return self._date
def publisher(self, value=None):
"""No arguments: Get the document's publisher from metadata
Argument: Set the document's publisher in metadata
"""
if not (value is None):
if (self.metadatatype == "native"):
self.metadata['publisher'] = value
else:
self._publisher = value
if (self.metadatatype == "native"):
if 'publisher' in self.metadata:
return self.metadata['publisher']
else:
return None
else:
return self._publisher
def license(self, value=None):
"""No arguments: Get the document's license from metadata
Argument: Set the document's license in metadata
"""
if not (value is None):
if (self.metadatatype == "native"):
self.metadata['license'] = value
else:
self._license = value
if (self.metadatatype == "native"):
if 'license' in self.metadata:
return self.metadata['license']
else:
return None
else:
return self._license
def language(self, value=None):
"""No arguments: Get the document's language (ISO-639-3) from metadata
Argument: Set the document's language (ISO-639-3) in metadata
"""
if not (value is None):
if (self.metadatatype == "native"):
self.metadata['language'] = value
else:
self._language = value
if self.metadatatype == "native":
if 'language' in self.metadata:
return self.metadata['language']
else:
return None
else:
return self._language
def parsemetadata(self, node):
"""Internal method to parse metadata"""
if 'type' in node.attrib:
self.metadatatype = node.attrib['type']
else:
#no type specified, default to native
self.metadatatype = "native"
if 'src' in node.attrib:
self.metadata = ExternalMetaData(node.attrib['src'])
elif self.metadatatype == "native":
self.metadata = NativeMetaData()
else:
self.metadata = None #may be set below to ForeignData
for subnode in node:
if subnode.tag == '{' + NSFOLIA + '}annotations':
self.parsexmldeclarations(subnode)
elif subnode.tag == '{' + NSFOLIA + '}meta':
if self.metadatatype == "native":
if subnode.text:
self.metadata[subnode.attrib['id']] = subnode.text
else:
raise MetaDataError("Encountered a meta element but metadata type is not native!")
elif subnode.tag == '{' + NSFOLIA + '}provenance':
#forward compatibility with FoLiA 2.0; ignore provenance
print("WARNING: Ignoring provenance data. Use foliapy instead of pynlpl.formats.folia for FoLiA v2.0 compatibility!",file=sys.stderr)
elif subnode.tag == '{' + NSFOLIA + '}foreign-data':
if self.metadatatype == "native":
raise MetaDataError("Encountered a foreign-data element but metadata type is native!")
elif self.metadata is not None:
#multiple foreign-data elements, chain:
e = self.metadata
while e.next is not None:
e = e.next
e.next = ForeignData(self, node=subnode)
else:
self.metadata = ForeignData(self, node=subnode)
elif subnode.tag == '{' + NSFOLIA + '}submetadata':
self.parsesubmetadata(subnode)
elif subnode.tag == '{http://www.mpi.nl/IMDI/Schema/IMDI}METATRANSCRIPT': #backward-compatibility for old IMDI without foreign-key
self.metadatatype = "imdi"
self.metadata = ForeignData(self, node=subnode)
def parsesubmetadata(self, node):
if '{http://www.w3.org/XML/1998/namespace}id' not in node.attrib:
raise MetaDataError("Encountered a submetadata element without xml:id!")
else:
id = node.attrib['{http://www.w3.org/XML/1998/namespace}id']
if 'type' in node.attrib:
self.submetadatatype[id] = node.attrib['type']
else:
self.submetadatatype[id] = "native"
if 'src' in node.attrib:
self.submetadata[id] = ExternalMetaData(node.attrib['src'])
elif self.submetadatatype[id] == "native":
self.submetadata[id] = NativeMetaData()
else:
self.submetadata[id] = None
for subnode in node:
if subnode.tag == '{' + NSFOLIA + '}meta':
if self.submetadatatype[id] == "native":
if subnode.text:
self.submetadata[id][subnode.attrib['id']] = subnode.text
else:
raise MetaDataError("Encountered a meta element but metadata type is not native!")
elif subnode.tag == '{' + NSFOLIA + '}foreign-data':
if self.submetadatatype[id] == "native":
raise MetaDataError("Encountered a foreign-data element but metadata type is native!")
elif self.submetadata[id] is not None:
#multiple foreign-data elements, chain:
e = self.submetadata[id]
while e.next is not None:
e = e.next
e.next = ForeignData(self, node=subnode)
else:
self.submetadata[id] = ForeignData(self, node=subnode)
def parsexml(self, node, ParentClass = None):
"""Internal method.
This is the main XML parser, will invoke class-specific XML parsers."""
if (LXE and isinstance(node,ElementTree._ElementTree)) or (not LXE and isinstance(node, ElementTree.ElementTree)): #pylint: disable=protected-access
node = node.getroot()
elif isstring(node):
node = xmltreefromstring(node).getroot()
if node.tag.startswith('{' + NSFOLIA + '}'):
foliatag = node.tag[nslen:]
if foliatag == "FoLiA":
if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found FoLiA document",file=stderr)
try:
self.id = node.attrib['{http://www.w3.org/XML/1998/namespace}id']
except KeyError:
try:
self.id = node.attrib['XMLid']
except KeyError:
try:
self.id = node.attrib['id']
except KeyError:
raise Exception("FoLiA Document has no ID!")
if 'version' in node.attrib:
self.version = node.attrib['version']
if checkversion(self.version) > 0:
print("WARNING!!! Document uses a newer version of FoLiA than this library! (" + self.version + " vs " + FOLIAVERSION + "). Any possible subsequent failures in parsing or processing may probably be attributed to this. Upgrade to foliapy (https://github.com/proycon/foliapy) to remedy this.",file=sys.stderr)
else:
self.version = None
if 'external' in node.attrib:
self.external = (node.attrib['external'] == 'yes')
if self.external and not self.parentdoc:
raise DeepValidationError("Document is marked as external and should not be loaded independently. However, no parentdoc= has been specified!")
for subnode in node:
if subnode.tag == '{' + NSFOLIA + '}metadata':
self.parsemetadata(subnode)
elif (subnode.tag == '{' + NSFOLIA + '}text' or subnode.tag == '{' + NSFOLIA + '}speech') and self.mode == Mode.MEMORY:
if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found Text",file=stderr)
e = self.parsexml(subnode)
if e is not None:
self.data.append(e)
else:
#generic handling (FoLiA)
if not foliatag in XML2CLASS:
raise Exception("Unknown FoLiA XML tag: " + foliatag)
Class = XML2CLASS[foliatag]
return Class.parsexml(node,self)
elif node.tag == '{' + NSDCOI + '}DCOI':
if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found DCOI document",file=stderr)
self.autodeclare = True
try:
self.id = node.attrib['{http://www.w3.org/XML/1998/namespace}id']
except KeyError:
try:
self.id = node.attrib['id']
except KeyError:
try:
self.id = node.attrib['XMLid']
except KeyError:
raise Exception("D-Coi Document has no ID!")
for subnode in node:
if subnode.tag == '{http://www.mpi.nl/IMDI/Schema/IMDI}METATRANSCRIPT':
self.metadatatype = MetaDataType.IMDI
self.setimdi(subnode)
elif subnode.tag == '{' + NSDCOI + '}text':
if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found Text",file=stderr)
e = self.parsexml(subnode)
if e is not None:
self.data.append( e )
elif node.tag.startswith('{' + NSDCOI + '}'):
#generic handling (D-Coi)
if node.tag[nslendcoi:] in XML2CLASS:
Class = XML2CLASS[node.tag[nslendcoi:]]
return Class.parsexml(node,self)
elif node.tag[nslendcoi:][0:3] == 'div': #support for div0, div1, etc:
Class = Division
return Class.parsexml(node,self)
elif node.tag[nslendcoi:] == 'item': #support for listitem
Class = ListItem
return Class.parsexml(node,self)
elif node.tag[nslendcoi:] == 'figDesc': #support for description in figures
Class = Description
return Class.parsexml(node,self)
else:
raise Exception("Unknown DCOI XML tag: " + node.tag)
else:
raise Exception("Unknown FoLiA XML tag: " + node.tag)
self.pendingvalidation() #perform any pending offset validations (if applicable)
def pendingvalidation(self, warnonly=None):
"""Perform any pending validations
Arguments:
warnonly (bool): Warn only (True) or raise exceptions (False). If set to None, this value is determined by the document's FoLiA version (warn only for documents prior to FoLiA v1.5)
"""
if self.debug: print("[PyNLPl FoLiA DEBUG] Processing pending validations (if any)",file=stderr)
if warnonly is None and self and self.version:
warnonly = (checkversion(self.version, '1.5.0') < 0) #warn only for documents older than FoLiA v1.5
if self.textvalidation:
while self.offsetvalidationbuffer:
structureelement, textclass = self.offsetvalidationbuffer.pop()
if self.debug: print("[PyNLPl FoLiA DEBUG] Performing offset validation on " + repr(structureelement) + " textclass " + textclass,file=stderr)
#validate offsets
tc = structureelement.textcontent(textclass)
if tc.offset is not None:
try:
tc.getreference(validate=True)
except UnresolvableTextContent:
msg = "Text for " + structureelement.__class__.__name__ + ", ID " + str(structureelement.id) + ", textclass " + textclass + ", has incorrect offset " + str(tc.offset) + " or invalid reference"
print("TEXT VALIDATION ERROR: " + msg,file=sys.stderr)
if not warnonly:
raise
def select(self, Class, set=None, recursive=True, ignore=True):
"""See :meth:`AbstractElement.select`"""
if self.mode == Mode.MEMORY:
for t in self.data:
if Class.__name__ == 'Text':
yield t
else:
for e in t.select(Class,set,recursive,ignore):
yield e
def count(self, Class, set=None, recursive=True,ignore=True):
"""See :meth:`AbstractElement.count`"""
if self.mode == Mode.MEMORY:
s = 0
for t in self.data:
s += sum( 1 for e in t.select(Class,set,recursive,ignore) )
return s
def paragraphs(self, index = None):
"""Return a generator of all paragraphs found in the document.
If an index is specified, return the n'th paragraph only (starting at 0)"""
if index is None:
return self.select(Paragraph)
else:
if index < 0:
index = sum(t.count(Paragraph) for t in self.data) + index
count = 0
for t in self.data:
for e in t.select(Paragraph):
if count == index:
return e
count += 1
raise IndexError
def sentences(self, index = None):
"""Return a generator of all sentence found in the document. Except for sentences in quotes.
If an index is specified, return the n'th sentence only (starting at 0)"""
if index is None:
return self.select(Sentence,None,True,[Quote])
else:
if index < 0:
index = sum(t.count(Sentence,None,True,[Quote]) for t in self.data) + index
count = 0
for t in self.data:
for e in t.select(Sentence,None,True,[Quote]):
if count == index:
return e
count += 1
raise IndexError
def words(self, index = None):
"""Return a generator of all active words found in the document. Does not descend into annotation layers, alternatives, originals, suggestions.
If an index is specified, return the n'th word only (starting at 0)"""
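#Usage sketch: doc.words() iterates over all words; doc.words(0) returns the first word and doc.words(-1) the last. The same indexing scheme applies to paragraphs() and sentences() above.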
if index is None:
return self.select(Word,None,True,default_ignore_structure)
else:
if index < 0:
index = sum(t.count(Word,None,True,default_ignore_structure) for t in self.data) + index
count = 0
for t in self.data:
for e in t.select(Word,None,True,default_ignore_structure):
if count == index:
return e
count += 1
raise IndexError
def text(self, cls='current', retaintokenisation=False):
"""Returns the text of the entire document (returns a unicode instance)
See also:
:meth:`AbstractElement.text`
"""
#backward compatibility, old versions didn't have cls as first argument, so if a boolean is passed first we interpret it as the 2nd:
if cls is True or cls is False:
retaintokenisation = cls
cls = 'current'
s = ""
for c in self.data:
if s: s += "\n\n\n"
try:
s += c.text(cls, retaintokenisation)
except NoSuchText:
continue
return s
def xmlstring(self):
"""Return the XML representation of the document as a string."""
s = ElementTree.tostring(self.xml(), xml_declaration=True, pretty_print=True, encoding='utf-8')
if sys.version < '3':
if isinstance(s, str):
s = unicode(s,'utf-8') #pylint: disable=undefined-variable
else:
if isinstance(s,bytes):
s = str(s,'utf-8')
s = s.replace('ns0:','') #ugly patch to get rid of namespace prefix
s = s.replace(':ns0','')
return s
def __unicode__(self):
"""Returns the text of the entire document"""
return self.text()
def __str__(self):
"""Returns the text of the entire document"""
return self.text()
def __ne__(self, other):
return not (self == other)
def __eq__(self, other):
if len(self.data) != len(other.data):
if self.debug: print("[PyNLPl FoLiA DEBUG] Equality check - Documents have unequal amount of children",file=stderr)
return False
for e,e2 in zip(self.data,other.data):
if e != e2:
return False
return True
#==============================================================================
class Corpus:
"""A corpus of various FoLiA documents. Yields a Document on each iteration. Suitable for sequential processing."""
def __init__(self,corpusdir, extension = 'xml', restrict_to_collection = "", conditionf=lambda x: True, ignoreerrors=False, **kwargs):
self.corpusdir = corpusdir
self.extension = extension
self.restrict_to_collection = restrict_to_collection
self.conditionf = conditionf
self.ignoreerrors = ignoreerrors
self.kwargs = kwargs
def __iter__(self):
if not self.restrict_to_collection:
for f in glob.glob(os.path.join(self.corpusdir,"*." + self.extension)):
if self.conditionf(f):
try:
yield Document(file=f, **self.kwargs )
except Exception as e: #pylint: disable=broad-except
print("Error, unable to parse " + f + ": " + e.__class__.__name__ + " - " + str(e),file=stderr)
if not self.ignoreerrors:
raise
for d in glob.glob(os.path.join(self.corpusdir,"*")):
if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)):
for f in glob.glob(os.path.join(d ,"*." + self.extension)):
if self.conditionf(f):
try:
yield Document(file=f, **self.kwargs)
except Exception as e: #pylint: disable=broad-except
print("Error, unable to parse " + f + ": " + e.__class__.__name__ + " - " + str(e),file=stderr)
if not self.ignoreerrors:
raise
class CorpusFiles(Corpus):
"""A corpus of various FoLiA documents. Yields the filenames on each iteration."""
def __iter__(self):
if not self.restrict_to_collection:
for f in glob.glob(os.path.join(self.corpusdir,"*." + self.extension)):
if self.conditionf(f):
try:
yield f
except Exception as e: #pylint: disable=broad-except
print("Error, unable to parse " + f+ ": " + e.__class__.__name__ + " - " + str(e),file=stderr)
if not self.ignoreerrors:
raise
for d in glob.glob(os.path.join(self.corpusdir,"*")):
if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)):
for f in glob.glob(os.path.join(d, "*." + self.extension)):
if self.conditionf(f):
try:
yield f
except Exception as e: #pylint: disable=broad-except
print("Error, unable to parse " + f+ ": " + e.__class__.__name__ + " - " + str(e),file=stderr)
if not self.ignoreerrors:
raise
class CorpusProcessor(object):
"""Processes a corpus of various FoLiA documents using a parallel processing. Calls a user-defined function with the three-tuple (filename, args, kwargs) for each file in the corpus. The user-defined function is itself responsible for instantiating a FoLiA document! args and kwargs, as received by the custom function, are set through the run() method, which yields the result of the custom function on each iteration."""
def __init__(self,corpusdir, function, threads = None, extension = 'xml', restrict_to_collection = "", conditionf=lambda x: True, maxtasksperchild=100, preindex = False, ordered=True, chunksize = 1):
self.function = function
self.threads = threads #If set to None, will use all available cores by default
self.corpusdir = corpusdir
self.extension = extension
self.restrict_to_collection = restrict_to_collection
self.conditionf = conditionf
self.ignoreerrors = True
self.maxtasksperchild = maxtasksperchild #This should never be set too high due to lxml leaking memory!!!
self.preindex = preindex
self.ordered = ordered
self.chunksize = chunksize
if preindex:
self.index = list(CorpusFiles(self.corpusdir, self.extension, self.restrict_to_collection, self.conditionf, True))
self.index.sort()
def __len__(self):
if self.preindex:
return len(self.index)
else:
raise ValueError("Can only retrieve length if instantiated with preindex=True")
def execute(self):
for _ in self.run():
pass
def run(self, *args, **kwargs):
if not self.preindex:
self.index = CorpusFiles(self.corpusdir, self.extension, self.restrict_to_collection, self.conditionf, True) #generator
pool = multiprocessing.Pool(self.threads, maxtasksperchild=self.maxtasksperchild) #threads=None means all available cores
if self.ordered:
return pool.imap( self.function, ( (filename, args, kwargs) for filename in self.index), self.chunksize)
else:
return pool.imap_unordered( self.function, ( (filename, args, kwargs) for filename in self.index), self.chunksize)
#pool.close()
def __iter__(self):
return self.run()
def relaxng_declarations():
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"})
for key in vars(AnnotationType).keys():
if key[0] != '_':
yield E.element( E.optional( E.attribute(name='set')) , E.optional(E.attribute(name='annotator')) , E.optional( E.attribute(name='annotatortype') ) , E.optional( E.attribute(name='datetime') ) , name=key.lower() + '-annotation')
def relaxng(filename=None):
"""Generates a RelaxNG Schema for FoLiA. Optionally saves it to file.
Arguments:
filename (str): If specified, save the schema to this file
Returns:
lxml.ElementTree: The schema
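Example (compiling the schema for validation; ``ElementTree`` is this module's alias for ``lxml.etree``)::
schema = ElementTree.RelaxNG(relaxng())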
"""
E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"})
grammar = E.grammar( E.start( E.element( #FoLiA
E.attribute(name='id',ns="http://www.w3.org/XML/1998/namespace"),
E.optional( E.attribute(name='version') ),
E.optional( E.attribute(name='generator') ),
E.element( #metadata
E.optional(E.attribute(name='type')),
E.optional(E.attribute(name='src')),
E.element( E.zeroOrMore( E.choice( *relaxng_declarations() ) ) ,name='annotations'),
E.zeroOrMore(
E.element(E.attribute(name='id'), E.text(), name='meta'),
),
E.zeroOrMore(
E.ref(name="foreign-data"),
),
E.zeroOrMore(
E.element( #submetadata
E.attribute(name='id',ns="http://www.w3.org/XML/1998/namespace"),
E.optional(E.attribute(name='type')),
E.optional(E.attribute(name='src')),
E.zeroOrMore(
E.element(E.attribute(name='id'), E.text(), name='meta'),
),
E.zeroOrMore(
E.ref(name="foreign-data"),
),
name="submetadata"
)
),
#E.optional(
# E.ref(name='METATRANSCRIPT')
#),
name='metadata',
#ns=NSFOLIA,
),
E.interleave(
E.zeroOrMore(
E.ref(name='text'),
),
E.zeroOrMore(
E.ref(name='speech'),
),
),
name='FoLiA',
ns = NSFOLIA
) ),
#definitions needed for ForeignData (allow any content) - see http://www.microhowto.info/howto/match_arbitrary_content_using_relax_ng.html
E.define( E.interleave(E.zeroOrMore(E.ref(name="any_element")),E.text()), name="any_content"),
E.define( E.element(E.anyName(), E.zeroOrMore(E.ref(name="any_attribute")), E.zeroOrMore(E.ref(name="any_content"))), name="any_element"),
E.define( E.attribute(E.anyName()), name="any_attribute"),
#Definition for allowing alien-namespace attributes on any element
E.define( E.zeroOrMore(E.attribute(E.anyName(getattr(E,'except')(E.nsName(),E.nsName(ns=""),E.nsName(ns="http://www.w3.org/XML/1998/namespace"),E.nsName(ns="http://www.w3.org/1999/xlink"))))), name="allow_foreign_attributes"),
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes",
)
done = {}
for c in globals().values():
if 'relaxng' in dir(c):
if c.relaxng and c.XMLTAG and not c.XMLTAG in done:
done[c.XMLTAG] = True
definition = c.relaxng()
grammar.append( definition )
if c.XMLTAG == 'item': #nasty backward-compatibility hack to allow deprecated listitem element (actually called item)
definition_alias = c.relaxng()
definition_alias.set('name','listitem')
definition_alias[0].set('name','listitem')
grammar.append( definition_alias )
#for e in relaxng_imdi():
# grammar.append(e)
if filename:
if sys.version < '3':
f = io.open(filename,'w',encoding='utf-8')
else:
f = io.open(filename,'wb')
if LXE:
if sys.version < '3':
f.write( ElementTree.tostring(relaxng(),pretty_print=True).replace("</define>","</define>\n\n") )
else:
f.write( ElementTree.tostring(relaxng(),pretty_print=True).replace(b"</define>",b"</define>\n\n") )
else:
f.write( ElementTree.tostring(relaxng()).replace("</define>","</define>\n\n") )
f.close()
return grammar
def findwords(doc, worditerator, *args, **kwargs):
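#Usage sketch (hypothetical pattern; ``True`` matches any single token, ``'*'`` spans a variable number of tokens up to maxgapsize):
# for match in findwords(doc, doc.words, Pattern('de', True, 'bal'), leftcontext=1, rightcontext=1):
#     print([w.text() for w in match])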
if 'leftcontext' in kwargs:
leftcontext = int(kwargs['leftcontext'])
del kwargs['leftcontext']
else:
leftcontext = 0
if 'rightcontext' in kwargs:
rightcontext = int(kwargs['rightcontext'])
del kwargs['rightcontext']
else:
rightcontext = 0
if 'maxgapsize' in kwargs:
maxgapsize = int(kwargs['maxgapsize'])
del kwargs['maxgapsize']
else:
maxgapsize = 10
for key in kwargs.keys():
raise Exception("Unknown keyword parameter: " + key)
matchcursor = 0
#shortcut for when no Pattern is passed, make one on the fly (args is a tuple and therefore immutable, so rebuild it)
if len(args) == 1 and not isinstance(args[0], Pattern):
if not isinstance(args[0], (list, tuple)):
args = ([args[0]],)
args = (Pattern(*args[0]),)
unsetwildcards = False
variablewildcards = None
prevsize = -1
#sanity check
for i, pattern in enumerate(args):
if not isinstance(pattern, Pattern):
raise TypeError("You must pass instances of Sequence to findwords")
if prevsize > -1 and len(pattern) != prevsize:
raise Exception("If multiple patterns are provided, they must all have the same length!")
if pattern.variablesize():
if not variablewildcards and i > 0:
unsetwildcards = True
else:
if variablewildcards and pattern.variablewildcards() != variablewildcards:
raise Exception("If multiple patterns are provided with variable wildcards, then these wildcards must all be in the same positions!")
variablewildcards = pattern.variablewildcards()
elif variablewildcards:
unsetwildcards = True
prevsize = len(pattern)
if unsetwildcards:
#one pattern determines a fixed length whilst others are variable, rewrite all to fixed length
#converting multi-span * wildcards into single-span 'True' wildcards
for pattern in args:
if pattern.variablesize():
pattern.sequence = [ True if x == '*' else x for x in pattern.sequence ]
if variablewildcards: #pylint: disable=too-many-nested-blocks
#one or more items have a * wildcard, which may span multiple tokens. Resolve this to a wider range of simpler patterns
#we're not committed to a particular size, expand to various ones
for size in range(len(variablewildcards), maxgapsize+1):
for distribution in pynlpl.algorithms.sum_to_n(size, len(variablewildcards)): #gap distributions, (amount) of 'True' wildcards
patterns = []
for pattern in args:
if pattern.variablesize():
patterns += list(pattern.resolve(size,distribution))
else:
patterns.append( pattern )
for match in findwords(doc, worditerator,*patterns, **{'leftcontext':leftcontext,'rightcontext':rightcontext}):
yield match
else:
patterns = args #pylint: disable=redefined-variable-type
buffers = []
for word in worditerator():
buffers.append( [] ) #Add a new empty buffer for every word
match = [None] * len(buffers)
for pattern in patterns:
#find value to match against
if not pattern.matchannotation:
value = word.text()
else:
if pattern.matchannotationset:
items = list(word.select(pattern.matchannotation, pattern.matchannotationset, True, [Original, Suggestion, Alternative]))
else:
try:
set = doc.defaultset(pattern.matchannotation.ANNOTATIONTYPE)
items = list(word.select(pattern.matchannotation, set, True, [Original, Suggestion, Alternative] ))
except NoDefaultError:
continue
if len(items) == 1:
value = items[0].cls
else:
continue
if not pattern.casesensitive:
value = value.lower()
for i, buffer in enumerate(buffers):
if match[i] is False:
continue
matchcursor = len(buffer)
match[i] = (value == pattern.sequence[matchcursor] or pattern.sequence[matchcursor] is True or (isinstance(pattern.sequence[matchcursor], tuple) and value in pattern.sequence[matchcursor]))
for buffer, matches in list(zip(buffers, match)):
if matches:
buffer.append(word) #add the word
if len(buffer) == len(pattern.sequence):
yield buffer[0].leftcontext(leftcontext) + buffer + buffer[-1].rightcontext(rightcontext)
buffers.remove(buffer)
else:
buffers.remove(buffer) #remove buffer
class Reader(object):
"""Streaming FoLiA reader.
The reader allows you to read a FoLiA Document without holding the whole tree structure in memory. The document will be read and the elements you seek returned as they are found. If you are querying a corpus of large FoLiA documents for a specific structure, then it is strongly recommended to use the Reader rather than the standard Document!"""
def __init__(self, filename, target, *args, **kwargs):
"""Read a FoLiA document in a streaming fashion. You select a specific target element and all occurrences of this element, including all contents (so all elements within), will be returned.
Arguments:
* ``filename``: The filename of the document to read
* ``target``: The FoLiA element(s) you want to read (with everything contained in its scope). Passed as a class. For example: ``folia.Sentence``, or a tuple of multiple element classes. Can also be set to ``None`` to return all elements, but that would load the full tree structure into memory.
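Example (hypothetical filename)::
reader = folia.Reader("bigdocument.folia.xml", folia.Sentence)
for sentence in reader:
print(sentence.id)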
"""
self.target = target
if self.target is not None and not (isinstance(self.target, (tuple, list)) or issubclass(self.target, AbstractElement)):
raise ValueError("Target must be a FoLiA element class, a tuple/list of such classes, or None")
self.bypassleak = False #the former bypassleak option is obsolete and ignored
self.stream = io.open(filename,'rb')
self.initdoc()
def findwords(self, *args, **kwargs):
self.target = Word
for x in findwords(self.doc,self.__iter__,*args,**kwargs):
yield x
def initdoc(self):
self.doc = None
metadata = False
for action, node in ElementTree.iterparse(self.stream, events=("start","end")):
if action == "start" and node.tag == "{" + NSFOLIA + "}FoLiA":
if '{http://www.w3.org/XML/1998/namespace}id' in node.attrib:
id = node.attrib['{http://www.w3.org/XML/1998/namespace}id']
self.doc = Document(id=id)
if 'version' in node.attrib:
self.doc.version = node.attrib['version']
if action == "end" and node.tag == "{" + NSFOLIA + "}metadata":
if not self.doc:
raise MalformedXMLError("Metadata found, but no document? Impossible")
metadata = True
self.doc.parsemetadata(node)
break
if not self.doc:
raise MalformedXMLError("No FoLiA Document found!")
elif not metadata:
raise MalformedXMLError("No metadata found!")
self.stream.seek(0)
def __iter__(self):
"""Iterating over a Reader instance will cause the FoLiA document to be read. This is a generator yielding instances of the object you specified"""
if self.target is not None and not isinstance(self.target, (tuple, list)):
target = "{" + NSFOLIA + "}" + self.target.XMLTAG
Class = self.target
multitargets = False
else:
target = None #no tag filter, inspect all elements
multitargets = True
for action, node in ElementTree.iterparse(self.stream, events=("end",), tag=target):
if not multitargets or node.tag.startswith('{' + NSFOLIA + '}'):
if multitargets: Class = XML2CLASS[node.tag[nslen:]]
if not multitargets or self.target is None or Class in self.target:
element = Class.parsexml(node, self.doc)
node.clear() #clean up children
# Also eliminate now-empty references from the root node to
# elem (http://www.ibm.com/developerworks/xml/library/x-hiperfparse/)
#for ancestor in node.xpath('ancestor-or-self::*'):
while node.getprevious() is not None:
del node.getparent()[0] # clean up preceding siblings
yield element
def __del__(self):
self.stream.close()
def isncname(name):
#not entirely according to specs http://www.w3.org/TR/REC-xml/#NT-Name , but simplified:
for i, c in enumerate(name):
if i == 0:
if not c.isalpha() and c != '_':
raise ValueError('Invalid XML NCName identifier: ' + name + ' (at position ' + str(i+1)+')')
else:
if not c.isalnum() and not (c in ['-','_','.']):
raise ValueError('Invalid XML NCName identifier: ' + name + ' (at position ' + str(i+1)+')')
return True
def validate(filename,schema=None,deep=False):
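"""Validate the FoLiA document in the given file against the FoLiA RelaxNG schema, raising an exception if it is malformed or invalid. A precompiled schema may be passed to avoid regenerating it on every call; deep=True additionally performs deep validation against set definitions."""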
if not os.path.exists(filename):
raise IOError("No such file")
try:
try:
doc = ElementTree.parse(filename, ElementTree.XMLParser(collect_ids=False) )
except TypeError:
doc = ElementTree.parse(filename, ElementTree.XMLParser() ) #older lxml, may leak!
except Exception:
raise MalformedXMLError("Malformed XML!")
#See if there's inline IMDI and strip it off prior to validation (validator doesn't do IMDI)
m = doc.xpath('//folia:metadata', namespaces={'f': 'http://ilk.uvt.nl/folia','folia': 'http://ilk.uvt.nl/folia' })
if m:
metadata = m[0]
m = metadata.find('{http://www.mpi.nl/IMDI/Schema/IMDI}METATRANSCRIPT')
if m is not None:
metadata.remove(m)
if not schema:
schema = ElementTree.RelaxNG(relaxng())
try:
schema.assertValid(doc) #will raise exceptions
except Exception as e:
for error in schema.error_log:
print("Error on line " + str(error.line) + ": " + error.message, file=sys.stderr)
raise e
if deep:
doc = Document(tree=doc, deepvalidation=True)
#================================= FOLIA SPECIFICATION ==========================================================
#foliaspec:header
#This file was last updated according to the FoLiA specification for version 1.5.1 on 2017-11-21 13:18:02, using foliaspec.py
#Code blocks after a foliaspec comment (until the next newline) are automatically generated. **DO NOT EDIT THOSE** and **DO NOT REMOVE ANY FOLIASPEC COMMENTS** !!!
#foliaspec:structurescope:STRUCTURESCOPE
#Structure scope above the sentence level, used by next() and previous() methods
STRUCTURESCOPE = (Sentence, Paragraph, Division, ListItem, Text, Event, Caption, Head,)
#foliaspec:annotationtype_xml_map
#A mapping from annotation types to xml tags (strings)
ANNOTATIONTYPE2XML = {
AnnotationType.ALIGNMENT: "alignment" ,
AnnotationType.CHUNKING: "chunk" ,
AnnotationType.COMPLEXALIGNMENT: "complexalignment" ,
AnnotationType.COREFERENCE: "coreferencechain" ,
AnnotationType.CORRECTION: "correction" ,
AnnotationType.DEFINITION: "def" ,
AnnotationType.DEPENDENCY: "dependency" ,
AnnotationType.DIVISION: "div" ,
AnnotationType.DOMAIN: "domain" ,
AnnotationType.ENTITY: "entity" ,
AnnotationType.ENTRY: "entry" ,
AnnotationType.ERRORDETECTION: "errordetection" ,
AnnotationType.EVENT: "event" ,
AnnotationType.EXAMPLE: "ex" ,
AnnotationType.FIGURE: "figure" ,
AnnotationType.GAP: "gap" ,
AnnotationType.LANG: "lang" ,
AnnotationType.LEMMA: "lemma" ,
AnnotationType.LINEBREAK: "br" ,
AnnotationType.LIST: "list" ,
AnnotationType.METRIC: "metric" ,
AnnotationType.MORPHOLOGICAL: "morpheme" ,
AnnotationType.NOTE: "note" ,
AnnotationType.OBSERVATION: "observation" ,
AnnotationType.PARAGRAPH: "p" ,
AnnotationType.PART: "part" ,
AnnotationType.PHON: "ph" ,
AnnotationType.PHONOLOGICAL: "phoneme" ,
AnnotationType.POS: "pos" ,
AnnotationType.PREDICATE: "predicate" ,
AnnotationType.SEMROLE: "semrole" ,
AnnotationType.SENSE: "sense" ,
AnnotationType.SENTENCE: "s" ,
AnnotationType.SENTIMENT: "sentiment" ,
AnnotationType.STATEMENT: "statement" ,
AnnotationType.STRING: "str" ,
AnnotationType.SUBJECTIVITY: "subjectivity" ,
AnnotationType.SYNTAX: "su" ,
AnnotationType.TABLE: "table" ,
AnnotationType.TERM: "term" ,
AnnotationType.TEXT: "t" ,
AnnotationType.STYLE: "t-style" ,
AnnotationType.TIMESEGMENT: "timesegment" ,
AnnotationType.UTTERANCE: "utt" ,
AnnotationType.WHITESPACE: "whitespace" ,
AnnotationType.TOKEN: "w" ,
}
#foliaspec:string_class_map
XML2CLASS = {
"aref": AlignReference,
"alignment": Alignment,
"alt": Alternative,
"altlayers": AlternativeLayers,
"caption": Caption,
"cell": Cell,
"chunk": Chunk,
"chunking": ChunkingLayer,
"comment": Comment,
"complexalignment": ComplexAlignment,
"complexalignments": ComplexAlignmentLayer,
"content": Content,
"coreferencechain": CoreferenceChain,
"coreferences": CoreferenceLayer,
"coreferencelink": CoreferenceLink,
"correction": Correction,
"current": Current,
"def": Definition,
"dependencies": DependenciesLayer,
"dependency": Dependency,
"dep": DependencyDependent,
"desc": Description,
"div": Division,
"domain": DomainAnnotation,
"entities": EntitiesLayer,
"entity": Entity,
"entry": Entry,
"errordetection": ErrorDetection,
"event": Event,
"ex": Example,
"external": External,
"feat": Feature,
"figure": Figure,
"foreign-data": ForeignData,
"gap": Gap,
"head": Head,
"hd": Headspan,
"label": Label,
"lang": LangAnnotation,
"lemma": LemmaAnnotation,
"br": Linebreak,
"list": List,
"item": ListItem,
"metric": Metric,
"morpheme": Morpheme,
"morphology": MorphologyLayer,
"new": New,
"note": Note,
"observation": Observation,
"observations": ObservationLayer,
"original": Original,
"p": Paragraph,
"part": Part,
"ph": PhonContent,
"phoneme": Phoneme,
"phonology": PhonologyLayer,
"pos": PosAnnotation,
"predicate": Predicate,
"quote": Quote,
"ref": Reference,
"relation": Relation,
"row": Row,
"semrole": SemanticRole,
"semroles": SemanticRolesLayer,
"sense": SenseAnnotation,
"s": Sentence,
"sentiment": Sentiment,
"sentiments": SentimentLayer,
"source": Source,
"speech": Speech,
"statement": Statement,
"statements": StatementLayer,
"str": String,
"subjectivity": SubjectivityAnnotation,
"suggestion": Suggestion,
"su": SyntacticUnit,
"syntax": SyntaxLayer,
"table": Table,
"tablehead": TableHead,
"target": Target,
"term": Term,
"text": Text,
"t": TextContent,
"t-correction": TextMarkupCorrection,
"t-error": TextMarkupError,
"t-gap": TextMarkupGap,
"t-str": TextMarkupString,
"t-style": TextMarkupStyle,
"timesegment": TimeSegment,
"timing": TimingLayer,
"utt": Utterance,
"whitespace": Whitespace,
"w": Word,
"wref": WordReference,
}
XML2CLASS['listitem'] = ListItem #backward compatibility for erroneous old FoLiA versions (XML tag is 'item' now, consistent with manual)
#foliaspec:annotationtype_layerclass_map
ANNOTATIONTYPE2LAYERCLASS = {
AnnotationType.CHUNKING: ChunkingLayer ,
AnnotationType.COMPLEXALIGNMENT: ComplexAlignmentLayer ,
AnnotationType.COREFERENCE: CoreferenceLayer ,
AnnotationType.DEPENDENCY: DependenciesLayer ,
AnnotationType.ENTITY: EntitiesLayer ,
AnnotationType.MORPHOLOGICAL: MorphologyLayer ,
AnnotationType.OBSERVATION: ObservationLayer ,
AnnotationType.PHONOLOGICAL: PhonologyLayer ,
AnnotationType.SEMROLE: SemanticRolesLayer ,
AnnotationType.SENTIMENT: SentimentLayer ,
AnnotationType.STATEMENT: StatementLayer ,
AnnotationType.SYNTAX: SyntaxLayer ,
AnnotationType.TIMESEGMENT: TimingLayer ,
AnnotationType.PREDICATE: SemanticRolesLayer
}
#foliaspec:default_ignore
#Default ignore list for the select() method, do not descend into these
default_ignore = ( Original, Suggestion, Alternative, AlternativeLayers, ForeignData,)
#foliaspec:default_ignore_annotations
#Default ignore list for token annotation
default_ignore_annotations = ( Original, Suggestion, Alternative, AlternativeLayers, MorphologyLayer, PhonologyLayer,)
#foliaspec:default_ignore_structure
#Default ignore list for structure annotation
default_ignore_structure = ( Original, Suggestion, Alternative, AlternativeLayers, AbstractAnnotationLayer,)
#foliaspec:defaultproperties
#Default properties which all elements inherit
AbstractElement.ACCEPTED_DATA = (Description, Comment,)
AbstractElement.ANNOTATIONTYPE = None
AbstractElement.AUTH = True
AbstractElement.AUTO_GENERATE_ID = False
AbstractElement.OCCURRENCES = 0
AbstractElement.OCCURRENCES_PER_SET = 0
AbstractElement.OPTIONAL_ATTRIBS = None
AbstractElement.PHONCONTAINER = False
AbstractElement.PRIMARYELEMENT = True
AbstractElement.PRINTABLE = False
AbstractElement.REQUIRED_ATTRIBS = None
AbstractElement.REQUIRED_DATA = None
AbstractElement.SETONLY = False
AbstractElement.SPEAKABLE = False
AbstractElement.SUBSET = None
AbstractElement.TEXTCONTAINER = False
AbstractElement.TEXTDELIMITER = None
AbstractElement.XLINK = False
AbstractElement.XMLTAG = None
#foliaspec:setelementproperties
#Sets all element properties for all elements
#------ AbstractAnnotationLayer -------
AbstractAnnotationLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData,)
AbstractAnnotationLayer.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N, Attrib.TEXTCLASS, Attrib.METADATA,)
AbstractAnnotationLayer.PRINTABLE = False
AbstractAnnotationLayer.SETONLY = True
AbstractAnnotationLayer.SPEAKABLE = False
#------ AbstractCorrectionChild -------
AbstractCorrectionChild.ACCEPTED_DATA = (AbstractSpanAnnotation, AbstractStructureElement, AbstractTokenAnnotation, Comment, Correction, Description, ForeignData, Metric, PhonContent, String, TextContent,)
AbstractCorrectionChild.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N,)
AbstractCorrectionChild.PRINTABLE = True
AbstractCorrectionChild.SPEAKABLE = True
AbstractCorrectionChild.TEXTDELIMITER = None
#------ AbstractExtendedTokenAnnotation -------
#------ AbstractSpanAnnotation -------
AbstractSpanAnnotation.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, ForeignData, Metric,)
AbstractSpanAnnotation.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.TEXTCLASS, Attrib.METADATA,)
AbstractSpanAnnotation.PRINTABLE = True
AbstractSpanAnnotation.SPEAKABLE = True
#------ AbstractSpanRole -------
AbstractSpanRole.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,)
AbstractSpanRole.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.N, Attrib.DATETIME,)
#------ AbstractStructureElement -------
AbstractStructureElement.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part,)
AbstractStructureElement.AUTO_GENERATE_ID = True
AbstractStructureElement.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
AbstractStructureElement.PRINTABLE = True
AbstractStructureElement.REQUIRED_ATTRIBS = None
AbstractStructureElement.SPEAKABLE = True
AbstractStructureElement.TEXTDELIMITER = "\n\n"
#------ AbstractTextMarkup -------
AbstractTextMarkup.ACCEPTED_DATA = (AbstractTextMarkup, Comment, Description, Linebreak,)
AbstractTextMarkup.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
AbstractTextMarkup.PRIMARYELEMENT = False
AbstractTextMarkup.PRINTABLE = True
AbstractTextMarkup.TEXTCONTAINER = True
AbstractTextMarkup.TEXTDELIMITER = ""
AbstractTextMarkup.XLINK = True
#------ AbstractTokenAnnotation -------
AbstractTokenAnnotation.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, Metric,)
AbstractTokenAnnotation.OCCURRENCES_PER_SET = 1
AbstractTokenAnnotation.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.TEXTCLASS, Attrib.METADATA,)
AbstractTokenAnnotation.REQUIRED_ATTRIBS = (Attrib.CLASS,)
#------ ActorFeature -------
ActorFeature.SUBSET = "actor"
ActorFeature.XMLTAG = None
#------ AlignReference -------
AlignReference.XMLTAG = "aref"
#------ Alignment -------
Alignment.ACCEPTED_DATA = (AlignReference, Comment, Description, Feature, ForeignData, Metric,)
Alignment.ANNOTATIONTYPE = AnnotationType.ALIGNMENT
Alignment.LABEL = "Alignment"
Alignment.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
Alignment.PRINTABLE = False
Alignment.REQUIRED_ATTRIBS = None
Alignment.SPEAKABLE = False
Alignment.XLINK = True
Alignment.XMLTAG = "alignment"
#------ Alternative -------
Alternative.ACCEPTED_DATA = (AbstractTokenAnnotation, Comment, Correction, Description, ForeignData, MorphologyLayer, PhonologyLayer,)
Alternative.AUTH = False
Alternative.LABEL = "Alternative"
Alternative.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
Alternative.PRINTABLE = False
Alternative.REQUIRED_ATTRIBS = None
Alternative.SPEAKABLE = False
Alternative.XMLTAG = "alt"
#------ AlternativeLayers -------
AlternativeLayers.ACCEPTED_DATA = (AbstractAnnotationLayer, Comment, Description, ForeignData,)
AlternativeLayers.AUTH = False
AlternativeLayers.LABEL = "Alternative Layers"
AlternativeLayers.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
AlternativeLayers.PRINTABLE = False
AlternativeLayers.REQUIRED_ATTRIBS = None
AlternativeLayers.SPEAKABLE = False
AlternativeLayers.XMLTAG = "altlayers"
#------ BegindatetimeFeature -------
BegindatetimeFeature.SUBSET = "begindatetime"
BegindatetimeFeature.XMLTAG = None
#------ Caption -------
Caption.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Gap, Linebreak, Metric, Part, PhonContent, Reference, Sentence, String, TextContent, Whitespace,)
Caption.LABEL = "Caption"
Caption.OCCURRENCES = 1
Caption.XMLTAG = "caption"
#------ Cell -------
Cell.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Entry, Event, Example, Feature, ForeignData, Gap, Head, Linebreak, Metric, Note, Paragraph, Part, Reference, Sentence, String, TextContent, Whitespace, Word,)
Cell.LABEL = "Cell"
Cell.TEXTDELIMITER = " | "
Cell.XMLTAG = "cell"
#------ Chunk -------
Chunk.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,)
Chunk.ANNOTATIONTYPE = AnnotationType.CHUNKING
Chunk.LABEL = "Chunk"
Chunk.XMLTAG = "chunk"
#------ ChunkingLayer -------
ChunkingLayer.ACCEPTED_DATA = (Chunk, Comment, Correction, Description, ForeignData,)
ChunkingLayer.ANNOTATIONTYPE = AnnotationType.CHUNKING
ChunkingLayer.PRIMARYELEMENT = False
ChunkingLayer.XMLTAG = "chunking"
#------ Comment -------
Comment.LABEL = "Comment"
Comment.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N, Attrib.METADATA,)
Comment.PRINTABLE = False
Comment.XMLTAG = "comment"
#------ ComplexAlignment -------
ComplexAlignment.ACCEPTED_DATA = (Alignment, Comment, Description, Feature, ForeignData, Metric,)
ComplexAlignment.ANNOTATIONTYPE = AnnotationType.COMPLEXALIGNMENT
ComplexAlignment.LABEL = "Complex Alignment"
ComplexAlignment.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
ComplexAlignment.PRINTABLE = False
ComplexAlignment.REQUIRED_ATTRIBS = None
ComplexAlignment.SPEAKABLE = False
ComplexAlignment.XMLTAG = "complexalignment"
#------ ComplexAlignmentLayer -------
ComplexAlignmentLayer.ACCEPTED_DATA = (Comment, ComplexAlignment, Correction, Description, ForeignData,)
ComplexAlignmentLayer.ANNOTATIONTYPE = AnnotationType.COMPLEXALIGNMENT
ComplexAlignmentLayer.PRIMARYELEMENT = False
ComplexAlignmentLayer.XMLTAG = "complexalignments"
#------ Content -------
Content.LABEL = "Gap Content"
Content.OCCURRENCES = 1
Content.XMLTAG = "content"
#------ CoreferenceChain -------
CoreferenceChain.ACCEPTED_DATA = (AlignReference, Alignment, Comment, CoreferenceLink, Description, Feature, ForeignData, Metric,)
CoreferenceChain.ANNOTATIONTYPE = AnnotationType.COREFERENCE
CoreferenceChain.LABEL = "Coreference Chain"
CoreferenceChain.REQUIRED_DATA = (CoreferenceLink,)
CoreferenceChain.XMLTAG = "coreferencechain"
#------ CoreferenceLayer -------
CoreferenceLayer.ACCEPTED_DATA = (Comment, CoreferenceChain, Correction, Description, ForeignData,)
CoreferenceLayer.ANNOTATIONTYPE = AnnotationType.COREFERENCE
CoreferenceLayer.PRIMARYELEMENT = False
CoreferenceLayer.XMLTAG = "coreferences"
#------ CoreferenceLink -------
CoreferenceLink.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, LevelFeature, Metric, ModalityFeature, TimeFeature, WordReference,)
CoreferenceLink.ANNOTATIONTYPE = AnnotationType.COREFERENCE
CoreferenceLink.LABEL = "Coreference Link"
CoreferenceLink.PRIMARYELEMENT = False
CoreferenceLink.XMLTAG = "coreferencelink"
#------ Correction -------
Correction.ACCEPTED_DATA = (Comment, Current, Description, ErrorDetection, Feature, ForeignData, Metric, New, Original, Suggestion,)
Correction.ANNOTATIONTYPE = AnnotationType.CORRECTION
Correction.LABEL = "Correction"
Correction.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
Correction.PRINTABLE = True
Correction.SPEAKABLE = True
Correction.TEXTDELIMITER = None
Correction.XMLTAG = "correction"
#------ Current -------
Current.OCCURRENCES = 1
Current.OPTIONAL_ATTRIBS = None
Current.XMLTAG = "current"
#------ Definition -------
Definition.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, Figure, ForeignData, Linebreak, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,)
Definition.ANNOTATIONTYPE = AnnotationType.DEFINITION
Definition.LABEL = "Definition"
Definition.XMLTAG = "def"
#------ DependenciesLayer -------
DependenciesLayer.ACCEPTED_DATA = (Comment, Correction, Dependency, Description, ForeignData,)
DependenciesLayer.ANNOTATIONTYPE = AnnotationType.DEPENDENCY
DependenciesLayer.PRIMARYELEMENT = False
DependenciesLayer.XMLTAG = "dependencies"
#------ Dependency -------
Dependency.ACCEPTED_DATA = (AlignReference, Alignment, Comment, DependencyDependent, Description, Feature, ForeignData, Headspan, Metric,)
Dependency.ANNOTATIONTYPE = AnnotationType.DEPENDENCY
Dependency.LABEL = "Dependency"
Dependency.REQUIRED_DATA = (DependencyDependent, Headspan,)
Dependency.XMLTAG = "dependency"
#------ DependencyDependent -------
DependencyDependent.LABEL = "Dependent"
DependencyDependent.OCCURRENCES = 1
DependencyDependent.XMLTAG = "dep"
#------ Description -------
Description.LABEL = "Description"
Description.OCCURRENCES = 1
Description.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N, Attrib.METADATA,)
Description.XMLTAG = "desc"
#------ Division -------
Division.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Entry, Event, Example, Feature, Figure, ForeignData, Gap, Head, Linebreak, List, Metric, Note, Paragraph, Part, PhonContent, Quote, Reference, Sentence, Table, TextContent, Utterance, Whitespace,)
Division.ANNOTATIONTYPE = AnnotationType.DIVISION
Division.LABEL = "Division"
Division.TEXTDELIMITER = "\n\n\n"
Division.XMLTAG = "div"
#------ DomainAnnotation -------
DomainAnnotation.ANNOTATIONTYPE = AnnotationType.DOMAIN
DomainAnnotation.LABEL = "Domain"
DomainAnnotation.OCCURRENCES_PER_SET = 0
DomainAnnotation.XMLTAG = "domain"
#------ EnddatetimeFeature -------
EnddatetimeFeature.SUBSET = "enddatetime"
EnddatetimeFeature.XMLTAG = None
#------ EntitiesLayer -------
EntitiesLayer.ACCEPTED_DATA = (Comment, Correction, Description, Entity, ForeignData,)
EntitiesLayer.ANNOTATIONTYPE = AnnotationType.ENTITY
EntitiesLayer.PRIMARYELEMENT = False
EntitiesLayer.XMLTAG = "entities"
#------ Entity -------
Entity.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,)
Entity.ANNOTATIONTYPE = AnnotationType.ENTITY
Entity.LABEL = "Entity"
Entity.XMLTAG = "entity"
#------ Entry -------
Entry.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Definition, Description, Example, Feature, ForeignData, Metric, Part, Term,)
Entry.ANNOTATIONTYPE = AnnotationType.ENTRY
Entry.LABEL = "Entry"
Entry.XMLTAG = "entry"
#------ ErrorDetection -------
ErrorDetection.ANNOTATIONTYPE = AnnotationType.ERRORDETECTION
ErrorDetection.LABEL = "Error Detection"
ErrorDetection.OCCURRENCES_PER_SET = 0
ErrorDetection.XMLTAG = "errordetection"
#------ Event -------
Event.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, ActorFeature, Alignment, Alternative, AlternativeLayers, BegindatetimeFeature, Comment, Correction, Description, Division, EnddatetimeFeature, Entry, Event, Example, Feature, Figure, ForeignData, Gap, Head, Linebreak, List, Metric, Note, Paragraph, Part, PhonContent, Quote, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,)
Event.ANNOTATIONTYPE = AnnotationType.EVENT
Event.LABEL = "Event"
Event.XMLTAG = "event"
#------ Example -------
Example.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, Figure, ForeignData, Linebreak, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,)
Example.ANNOTATIONTYPE = AnnotationType.EXAMPLE
Example.LABEL = "Example"
Example.XMLTAG = "ex"
#------ External -------
External.ACCEPTED_DATA = (Comment, Description,)
External.AUTH = True
External.LABEL = "External"
External.OPTIONAL_ATTRIBS = None
External.PRINTABLE = True
External.REQUIRED_ATTRIBS = (Attrib.SRC,)
External.SPEAKABLE = False
External.XMLTAG = "external"
#------ Feature -------
Feature.LABEL = "Feature"
Feature.XMLTAG = "feat"
#------ Figure -------
Figure.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Caption, Comment, Correction, Description, Feature, ForeignData, Metric, Part, Sentence, String, TextContent,)
Figure.ANNOTATIONTYPE = AnnotationType.FIGURE
Figure.LABEL = "Figure"
Figure.SPEAKABLE = False
Figure.TEXTDELIMITER = "\n\n"
Figure.XMLTAG = "figure"
#------ ForeignData -------
ForeignData.XMLTAG = "foreign-data"
#------ FunctionFeature -------
FunctionFeature.SUBSET = "function"
FunctionFeature.XMLTAG = None
#------ Gap -------
Gap.ACCEPTED_DATA = (Comment, Content, Description, Feature, ForeignData, Metric, Part,)
Gap.ANNOTATIONTYPE = AnnotationType.GAP
Gap.LABEL = "Gap"
Gap.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.METADATA,)
Gap.XMLTAG = "gap"
#------ Head -------
Head.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Event, Feature, ForeignData, Gap, Linebreak, Metric, Part, PhonContent, Reference, Sentence, String, TextContent, Whitespace, Word,)
Head.LABEL = "Head"
Head.OCCURRENCES = 1
Head.TEXTDELIMITER = "\n\n"
Head.XMLTAG = "head"
#------ HeadFeature -------
HeadFeature.SUBSET = "head"
HeadFeature.XMLTAG = None
#------ Headspan -------
Headspan.LABEL = "Head"
Headspan.OCCURRENCES = 1
Headspan.XMLTAG = "hd"
#------ Label -------
Label.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Linebreak, Metric, Part, PhonContent, Reference, String, TextContent, Whitespace, Word,)
Label.LABEL = "Label"
Label.XMLTAG = "label"
#------ LangAnnotation -------
LangAnnotation.ANNOTATIONTYPE = AnnotationType.LANG
LangAnnotation.LABEL = "Language"
LangAnnotation.XMLTAG = "lang"
#------ LemmaAnnotation -------
LemmaAnnotation.ANNOTATIONTYPE = AnnotationType.LEMMA
LemmaAnnotation.LABEL = "Lemma"
LemmaAnnotation.XMLTAG = "lemma"
#------ LevelFeature -------
LevelFeature.SUBSET = "level"
LevelFeature.XMLTAG = None
#------ Linebreak -------
Linebreak.ANNOTATIONTYPE = AnnotationType.LINEBREAK
Linebreak.LABEL = "Linebreak"
Linebreak.TEXTDELIMITER = ""
Linebreak.XLINK = True
Linebreak.XMLTAG = "br"
#------ List -------
List.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Caption, Comment, Correction, Description, Event, Feature, ForeignData, ListItem, Metric, Note, Part, PhonContent, Reference, String, TextContent,)
List.ANNOTATIONTYPE = AnnotationType.LIST
List.LABEL = "List"
List.TEXTDELIMITER = "\n\n"
List.XMLTAG = "list"
#------ ListItem -------
ListItem.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Event, Feature, ForeignData, Gap, Label, Linebreak, List, Metric, Note, Paragraph, Part, PhonContent, Reference, Sentence, String, TextContent, Whitespace, Word,)
ListItem.LABEL = "List Item"
ListItem.TEXTDELIMITER = "\n"
ListItem.XMLTAG = "item"
#------ Metric -------
Metric.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, ValueFeature,)
Metric.ANNOTATIONTYPE = AnnotationType.METRIC
Metric.LABEL = "Metric"
Metric.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.METADATA,)
Metric.XMLTAG = "metric"
#------ ModalityFeature -------
ModalityFeature.SUBSET = "modality"
ModalityFeature.XMLTAG = None
#------ Morpheme -------
Morpheme.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, FunctionFeature, Metric, Morpheme, Part, PhonContent, String, TextContent,)
Morpheme.ANNOTATIONTYPE = AnnotationType.MORPHOLOGICAL
Morpheme.LABEL = "Morpheme"
Morpheme.TEXTDELIMITER = ""
Morpheme.XMLTAG = "morpheme"
#------ MorphologyLayer -------
MorphologyLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Morpheme,)
MorphologyLayer.ANNOTATIONTYPE = AnnotationType.MORPHOLOGICAL
MorphologyLayer.PRIMARYELEMENT = False
MorphologyLayer.XMLTAG = "morphology"
#------ New -------
New.OCCURRENCES = 1
New.OPTIONAL_ATTRIBS = None
New.XMLTAG = "new"
#------ Note -------
Note.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Example, Feature, Figure, ForeignData, Head, Linebreak, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,)
Note.ANNOTATIONTYPE = AnnotationType.NOTE
Note.LABEL = "Note"
Note.XMLTAG = "note"
#------ Observation -------
Observation.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,)
Observation.ANNOTATIONTYPE = AnnotationType.OBSERVATION
Observation.LABEL = "Observation"
Observation.XMLTAG = "observation"
#------ ObservationLayer -------
ObservationLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Observation,)
ObservationLayer.ANNOTATIONTYPE = AnnotationType.OBSERVATION
ObservationLayer.PRIMARYELEMENT = False
ObservationLayer.XMLTAG = "observations"
#------ Original -------
Original.AUTH = False
Original.OCCURRENCES = 1
Original.OPTIONAL_ATTRIBS = None
Original.XMLTAG = "original"
#------ Paragraph -------
Paragraph.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Entry, Event, Example, Feature, Figure, ForeignData, Gap, Head, Linebreak, List, Metric, Note, Part, PhonContent, Quote, Reference, Sentence, String, TextContent, Whitespace, Word,)
Paragraph.ANNOTATIONTYPE = AnnotationType.PARAGRAPH
Paragraph.LABEL = "Paragraph"
Paragraph.TEXTDELIMITER = "\n\n"
Paragraph.XMLTAG = "p"
#------ Part -------
Part.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, AbstractStructureElement, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, PhonContent, TextContent,)
Part.ANNOTATIONTYPE = AnnotationType.PART
Part.LABEL = "Part"
Part.TEXTDELIMITER = None
Part.XMLTAG = "part"
#------ PhonContent -------
PhonContent.ACCEPTED_DATA = (Comment, Description,)
PhonContent.ANNOTATIONTYPE = AnnotationType.PHON
PhonContent.LABEL = "Phonetic Content"
PhonContent.OCCURRENCES = 0
PhonContent.OPTIONAL_ATTRIBS = (Attrib.CLASS, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.METADATA,)
PhonContent.PHONCONTAINER = True
PhonContent.PRINTABLE = False
PhonContent.SPEAKABLE = True
PhonContent.XMLTAG = "ph"
#------ Phoneme -------
Phoneme.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, FunctionFeature, Metric, Part, PhonContent, Phoneme, String, TextContent,)
Phoneme.ANNOTATIONTYPE = AnnotationType.PHONOLOGICAL
Phoneme.LABEL = "Phoneme"
Phoneme.TEXTDELIMITER = ""
Phoneme.XMLTAG = "phoneme"
#------ PhonologyLayer -------
PhonologyLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Phoneme,)
PhonologyLayer.ANNOTATIONTYPE = AnnotationType.PHONOLOGICAL
PhonologyLayer.PRIMARYELEMENT = False
PhonologyLayer.XMLTAG = "phonology"
#------ PolarityFeature -------
PolarityFeature.SUBSET = "polarity"
PolarityFeature.XMLTAG = None
#------ PosAnnotation -------
PosAnnotation.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, HeadFeature, Metric,)
PosAnnotation.ANNOTATIONTYPE = AnnotationType.POS
PosAnnotation.LABEL = "Part-of-Speech"
PosAnnotation.XMLTAG = "pos"
#------ Predicate -------
Predicate.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, SemanticRole, WordReference,)
Predicate.ANNOTATIONTYPE = AnnotationType.PREDICATE
Predicate.LABEL = "Predicate"
Predicate.XMLTAG = "predicate"
#------ Quote -------
Quote.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Feature, ForeignData, Gap, Metric, Paragraph, Part, Quote, Reference, Sentence, String, TextContent, Utterance, Word,)
Quote.LABEL = "Quote"
Quote.XMLTAG = "quote"
#------ Reference -------
Reference.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Linebreak, Metric, Paragraph, Part, PhonContent, Quote, Sentence, String, TextContent, Utterance, Whitespace, Word,)
Reference.LABEL = "Reference"
Reference.TEXTDELIMITER = None
Reference.XMLTAG = "ref"
#------ Relation -------
Relation.LABEL = "Relation"
Relation.OCCURRENCES = 1
Relation.XMLTAG = "relation"
#------ Row -------
Row.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Cell, Comment, Correction, Description, Feature, ForeignData, Metric, Part,)
Row.LABEL = "Table Row"
Row.TEXTDELIMITER = "\n"
Row.XMLTAG = "row"
#------ SemanticRole -------
SemanticRole.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, Metric, WordReference,)
SemanticRole.ANNOTATIONTYPE = AnnotationType.SEMROLE
SemanticRole.LABEL = "Semantic Role"
SemanticRole.REQUIRED_ATTRIBS = (Attrib.CLASS,)
SemanticRole.XMLTAG = "semrole"
#------ SemanticRolesLayer -------
SemanticRolesLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Predicate, SemanticRole,)
SemanticRolesLayer.ANNOTATIONTYPE = AnnotationType.SEMROLE
SemanticRolesLayer.PRIMARYELEMENT = False
SemanticRolesLayer.XMLTAG = "semroles"
#------ SenseAnnotation -------
SenseAnnotation.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, Metric, SynsetFeature,)
SenseAnnotation.ANNOTATIONTYPE = AnnotationType.SENSE
SenseAnnotation.LABEL = "Semantic Sense"
SenseAnnotation.OCCURRENCES_PER_SET = 0
SenseAnnotation.XMLTAG = "sense"
#------ Sentence -------
Sentence.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Entry, Event, Example, Feature, ForeignData, Gap, Linebreak, Metric, Note, Part, PhonContent, Quote, Reference, String, TextContent, Whitespace, Word,)
Sentence.ANNOTATIONTYPE = AnnotationType.SENTENCE
Sentence.LABEL = "Sentence"
Sentence.TEXTDELIMITER = " "
Sentence.XMLTAG = "s"
#------ Sentiment -------
Sentiment.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, Metric, PolarityFeature, Source, StrengthFeature, Target, WordReference,)
Sentiment.ANNOTATIONTYPE = AnnotationType.SENTIMENT
Sentiment.LABEL = "Sentiment"
Sentiment.XMLTAG = "sentiment"
#------ SentimentLayer -------
SentimentLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Sentiment,)
SentimentLayer.ANNOTATIONTYPE = AnnotationType.SENTIMENT
SentimentLayer.PRIMARYELEMENT = False
SentimentLayer.XMLTAG = "sentiments"
#------ Source -------
Source.LABEL = "Source"
Source.OCCURRENCES = 1
Source.XMLTAG = "source"
#------ Speech -------
Speech.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Entry, Event, Example, External, Feature, ForeignData, Gap, List, Metric, Note, Paragraph, Part, PhonContent, Quote, Reference, Sentence, String, TextContent, Utterance, Word,)
Speech.LABEL = "Speech Body"
Speech.TEXTDELIMITER = "\n\n\n"
Speech.XMLTAG = "speech"
#------ Statement -------
Statement.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, Metric, Relation, Source, WordReference,)
Statement.ANNOTATIONTYPE = AnnotationType.STATEMENT
Statement.LABEL = "Statement"
Statement.XMLTAG = "statement"
#------ StatementLayer -------
StatementLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Statement,)
StatementLayer.ANNOTATIONTYPE = AnnotationType.STATEMENT
StatementLayer.PRIMARYELEMENT = False
StatementLayer.XMLTAG = "statements"
#------ StrengthFeature -------
StrengthFeature.SUBSET = "strength"
StrengthFeature.XMLTAG = None
#------ String -------
String.ACCEPTED_DATA = (AbstractExtendedTokenAnnotation, Alignment, Comment, Correction, Description, Feature, ForeignData, Metric, PhonContent, TextContent,)
String.ANNOTATIONTYPE = AnnotationType.STRING
String.LABEL = "String"
String.OCCURRENCES = 0
String.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.METADATA,)
String.PRINTABLE = True
String.XMLTAG = "str"
#------ StyleFeature -------
StyleFeature.SUBSET = "style"
StyleFeature.XMLTAG = None
#------ SubjectivityAnnotation -------
SubjectivityAnnotation.ANNOTATIONTYPE = AnnotationType.SUBJECTIVITY
SubjectivityAnnotation.LABEL = "Subjectivity/Sentiment"
SubjectivityAnnotation.XMLTAG = "subjectivity"
#------ Suggestion -------
Suggestion.AUTH = False
Suggestion.OCCURRENCES = 0
Suggestion.XMLTAG = "suggestion"
#------ SynsetFeature -------
SynsetFeature.SUBSET = "synset"
SynsetFeature.XMLTAG = None
#------ SyntacticUnit -------
SyntacticUnit.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, SyntacticUnit, WordReference,)
SyntacticUnit.ANNOTATIONTYPE = AnnotationType.SYNTAX
SyntacticUnit.LABEL = "Syntactic Unit"
SyntacticUnit.XMLTAG = "su"
#------ SyntaxLayer -------
SyntaxLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, SyntacticUnit,)
SyntaxLayer.ANNOTATIONTYPE = AnnotationType.SYNTAX
SyntaxLayer.PRIMARYELEMENT = False
SyntaxLayer.XMLTAG = "syntax"
#------ Table -------
Table.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, Row, TableHead,)
Table.ANNOTATIONTYPE = AnnotationType.TABLE
Table.LABEL = "Table"
Table.XMLTAG = "table"
#------ TableHead -------
TableHead.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, Row,)
TableHead.LABEL = "Table Header"
TableHead.XMLTAG = "tablehead"
#------ Target -------
Target.LABEL = "Target"
Target.OCCURRENCES = 1
Target.XMLTAG = "target"
#------ Term -------
Term.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Event, Feature, Figure, ForeignData, Gap, Linebreak, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,)
Term.ANNOTATIONTYPE = AnnotationType.TERM
Term.LABEL = "Term"
Term.XMLTAG = "term"
#------ Text -------
Text.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Entry, Event, Example, External, Feature, Figure, ForeignData, Gap, Linebreak, List, Metric, Note, Paragraph, Part, PhonContent, Quote, Reference, Sentence, String, Table, TextContent, Whitespace, Word,)
Text.LABEL = "Text Body"
Text.TEXTDELIMITER = "\n\n\n"
Text.XMLTAG = "text"
#------ TextContent -------
TextContent.ACCEPTED_DATA = (AbstractTextMarkup, Comment, Description, Linebreak,)
TextContent.ANNOTATIONTYPE = AnnotationType.TEXT
TextContent.LABEL = "Text"
TextContent.OCCURRENCES = 0
TextContent.OPTIONAL_ATTRIBS = (Attrib.CLASS, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.METADATA,)
TextContent.PRINTABLE = True
TextContent.SPEAKABLE = False
TextContent.TEXTCONTAINER = True
TextContent.XLINK = True
TextContent.XMLTAG = "t"
#------ TextMarkupCorrection -------
TextMarkupCorrection.ANNOTATIONTYPE = AnnotationType.CORRECTION
TextMarkupCorrection.PRIMARYELEMENT = False
TextMarkupCorrection.XMLTAG = "t-correction"
#------ TextMarkupError -------
TextMarkupError.ANNOTATIONTYPE = AnnotationType.ERRORDETECTION
TextMarkupError.PRIMARYELEMENT = False
TextMarkupError.XMLTAG = "t-error"
#------ TextMarkupGap -------
TextMarkupGap.ANNOTATIONTYPE = AnnotationType.GAP
TextMarkupGap.PRIMARYELEMENT = False
TextMarkupGap.XMLTAG = "t-gap"
#------ TextMarkupString -------
TextMarkupString.ANNOTATIONTYPE = AnnotationType.STRING
TextMarkupString.PRIMARYELEMENT = False
TextMarkupString.XMLTAG = "t-str"
#------ TextMarkupStyle -------
TextMarkupStyle.ANNOTATIONTYPE = AnnotationType.STYLE
TextMarkupStyle.PRIMARYELEMENT = True
TextMarkupStyle.XMLTAG = "t-style"
#------ TimeFeature -------
TimeFeature.SUBSET = "time"
TimeFeature.XMLTAG = None
#------ TimeSegment -------
TimeSegment.ACCEPTED_DATA = (ActorFeature, AlignReference, Alignment, BegindatetimeFeature, Comment, Description, EnddatetimeFeature, Feature, ForeignData, Metric, WordReference,)
TimeSegment.ANNOTATIONTYPE = AnnotationType.TIMESEGMENT
TimeSegment.LABEL = "Time Segment"
TimeSegment.XMLTAG = "timesegment"
#------ TimingLayer -------
TimingLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, TimeSegment,)
TimingLayer.ANNOTATIONTYPE = AnnotationType.TIMESEGMENT
TimingLayer.PRIMARYELEMENT = False
TimingLayer.XMLTAG = "timing"
#------ Utterance -------
Utterance.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Gap, Metric, Note, Part, PhonContent, Quote, Reference, Sentence, String, TextContent, Word,)
Utterance.ANNOTATIONTYPE = AnnotationType.UTTERANCE
Utterance.LABEL = "Utterance"
Utterance.TEXTDELIMITER = " "
Utterance.XMLTAG = "utt"
#------ ValueFeature -------
ValueFeature.SUBSET = "value"
ValueFeature.XMLTAG = None
#------ Whitespace -------
Whitespace.ANNOTATIONTYPE = AnnotationType.WHITESPACE
Whitespace.LABEL = "Whitespace"
Whitespace.TEXTDELIMITER = ""
Whitespace.XMLTAG = "whitespace"
#------ Word -------
Word.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, PhonContent, Reference, String, TextContent,)
Word.ANNOTATIONTYPE = AnnotationType.TOKEN
Word.LABEL = "Word/Token"
Word.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER, Attrib.TEXTCLASS, Attrib.METADATA,)
Word.TEXTDELIMITER = " "
Word.XMLTAG = "w"
#------ WordReference -------
WordReference.XMLTAG = "wref"
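#The specification attributes above drive generic parsing and validation. An
#illustrative sketch (names as defined above; containment checks in the library
#proper also consider subclasses):
#
#   assert Word.XMLTAG == "w"                       #element <w/> maps to the Word class
#   assert Sentence in Paragraph.ACCEPTED_DATA      #a paragraph may contain sentences
#   assert Paragraph not in Sentence.ACCEPTED_DATA  #...but not the other way around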
#EOF
PyNLPl-1.2.9/pynlpl/formats/foliaset.py 0000644 0001750 0000144 00000051626 13344247576 020713 0 ustar proycon users 0000000 0000000 # -*- coding: utf-8 -*-
#----------------------------------------------------------------
# PyNLPl - FoLiA Set Definition Module
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
#
# https://proycon.github.io/folia
# https://github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Module for reading and processing FoLiA Set Definitions
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
#pylint: disable=redefined-builtin,trailing-whitespace,superfluous-parens,bad-classmethod-argument,wrong-import-order,wrong-import-position,ungrouped-imports
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import sys
import io
import rdflib
from lxml import etree as ElementTree
if sys.version < '3':
from StringIO import StringIO #pylint: disable=import-error,wrong-import-order
from urllib import urlopen #pylint: disable=no-name-in-module,wrong-import-order
from urllib2 import HTTPError
else:
from io import StringIO, BytesIO #pylint: disable=wrong-import-order,ungrouped-imports
from urllib.request import urlopen #pylint: disable=E0611,wrong-import-order,ungrouped-imports
from urllib.error import HTTPError
#foliaspec:namespace:NSFOLIA
#The FoLiA XML namespace
NSFOLIA = "http://ilk.uvt.nl/folia"
#foliaspec:setdefinitionnamespace:NSFOLIASETDEFINITION
NSFOLIASETDEFINITION = "http://folia.science.ru.nl/setdefinition"
NSSKOS = "http://www.w3.org/2004/02/skos/core"
class DeepValidationError(Exception):
pass
class SetDefinitionError(DeepValidationError):
pass
class SetType: #legacy only
CLOSED, OPEN, MIXED, EMPTY = range(4)
class LegacyClassDefinition(object):
def __init__(self,id, label, subclasses=None):
self.id = id
self.label = label
if subclasses:
self.subclasses = subclasses
else:
self.subclasses = []
@classmethod
def parsexml(Class, node):
if not node.tag == '{' + NSFOLIA + '}class':
raise Exception("Expected class tag for this xml node, got" + node.tag)
if 'label' in node.attrib:
label = node.attrib['label']
else:
label = ""
subclasses= []
for subnode in node:
if isinstance(subnode.tag, str) or (sys.version < '3' and isinstance(subnode.tag, unicode)): #pylint: disable=undefined-variable
if subnode.tag == '{' + NSFOLIA + '}class':
subclasses.append( LegacyClassDefinition.parsexml(subnode) )
elif subnode.tag[:len(NSFOLIA) +2] == '{' + NSFOLIA + '}':
raise Exception("Invalid tag in Class definition: " + subnode.tag)
if '{http://www.w3.org/XML/1998/namespace}id' in node.attrib:
idkey = '{http://www.w3.org/XML/1998/namespace}id'
else:
idkey = 'id'
return LegacyClassDefinition(node.attrib[idkey],label, subclasses)
def __iter__(self):
for c in self.subclasses:
yield c
def json(self):
jsonnode = {'id': self.id, 'label': self.label}
jsonnode['subclasses'] = []
for subclass in self.subclasses:
jsonnode['subclasses'].append(subclass.json())
return jsonnode
def rdf(self,graph, basens,parentseturi, parentclass=None, seqnr=None):
graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.RDF.type, rdflib.term.URIRef(NSSKOS + '#Concept')))
graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSSKOS + '#notation'), rdflib.term.Literal(self.id)))
graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSSKOS + '#prefLabel'), rdflib.term.Literal(self.label)))
graph.add((parentseturi , rdflib.term.URIRef(NSSKOS + '#member'), rdflib.term.URIRef(basens + '#' + self.id)))
if seqnr is not None:
graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSFOLIASETDEFINITION + '#sequenceNumber'), rdflib.term.Literal(seqnr) ))
if parentclass:
graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSSKOS + '#narrower'), rdflib.term.URIRef(basens + '#' + parentclass) ))
for subclass in self.subclasses:
subclass.rdf(graph,basens,parentseturi, self.id)
class LegacySetDefinition(object):
def __init__(self, id, type, classes=None, subsets=None, label=None):
self.id = id
self.type = type
self.label = label
if classes:
self.classes = classes
else:
self.classes = []
if subsets:
self.subsets = subsets
else:
self.subsets = []
@classmethod
def parsexml(Class, node):
issubset = node.tag == '{' + NSFOLIA + '}subset'
if not issubset:
assert node.tag == '{' + NSFOLIA + '}set'
classes = []
subsets= []
if 'type' in node.attrib:
if node.attrib['type'] == 'open':
type = SetType.OPEN
elif node.attrib['type'] == 'closed':
type = SetType.CLOSED
elif node.attrib['type'] == 'mixed':
type = SetType.MIXED
elif node.attrib['type'] == 'empty':
type = SetType.EMPTY
else:
raise Exception("Invalid set type: ", type)
else:
type = SetType.CLOSED
if 'label' in node.attrib:
label = node.attrib['label']
else:
label = None
for subnode in node:
if isinstance(subnode.tag, str) or (sys.version < '3' and isinstance(subnode.tag, unicode)): #pylint: disable=undefined-variable
if subnode.tag == '{' + NSFOLIA + '}class':
classes.append( LegacyClassDefinition.parsexml(subnode) )
elif not issubset and subnode.tag == '{' + NSFOLIA + '}subset':
subsets.append( LegacySetDefinition.parsexml(subnode) )
elif subnode.tag == '{' + NSFOLIA + '}constraint':
pass
elif subnode.tag[:len(NSFOLIA) +2] == '{' + NSFOLIA + '}':
raise SetDefinitionError("Invalid tag in Set definition: " + subnode.tag)
return LegacySetDefinition(node.attrib['{http://www.w3.org/XML/1998/namespace}id'],type,classes, subsets, label)
def json(self):
jsonnode = {'id': self.id}
if self.label:
jsonnode['label'] = self.label
if self.type == SetType.OPEN:
jsonnode['type'] = 'open'
elif self.type == SetType.CLOSED:
jsonnode['type'] = 'closed'
elif self.type == SetType.MIXED:
jsonnode['type'] = 'mixed'
elif self.type == SetType.EMPTY:
jsonnode['type'] = 'empty'
jsonnode['subsets'] = {}
for subset in self.subsets:
jsonnode['subsets'][subset.id] = subset.json()
jsonnode['classes'] = {}
jsonnode['classorder'] = []
for c in sorted(self.classes, key=lambda x: x.label):
jsonnode['classes'][c.id] = c.json()
jsonnode['classorder'].append( c.id )
return jsonnode
def rdf(self,graph, basens="",parenturi=None):
if not basens:
basens = NSFOLIASETDEFINITION + "/" + self.id
if not parenturi:
graph.bind( self.id, basens + '#', override=True ) #set a prefix for our namespace (does not use @base because of issue RDFLib/rdflib#559 )
seturi = rdflib.term.URIRef(basens + '#Set')
else:
seturi = rdflib.term.URIRef(basens + '#Subset.' + self.id)
graph.add((seturi, rdflib.RDF.type, rdflib.term.URIRef(NSSKOS + '#Collection')))
if self.id:
graph.add((seturi, rdflib.term.URIRef(NSSKOS + '#notation'), rdflib.term.Literal(self.id)))
if self.type == SetType.OPEN:
graph.add((seturi, rdflib.term.URIRef(NSFOLIASETDEFINITION + '#open'), rdflib.term.Literal(True)))
elif self.type == SetType.EMPTY:
graph.add((seturi, rdflib.term.URIRef(NSFOLIASETDEFINITION + '#empty'), rdflib.term.Literal(True)))
if self.label:
graph.add((seturi, rdflib.term.URIRef(NSSKOS + '#prefLabel'), rdflib.term.Literal(self.label)))
if parenturi:
graph.add((parenturi, rdflib.term.URIRef(NSSKOS + '#member'), seturi))
for i, c in enumerate(self.classes):
c.rdf(graph, basens, seturi, None, i+1)
for s in self.subsets:
s.rdf(graph, basens, seturi)
def xmltreefromstring(s):
"""Internal function, deals with different Python versions, unicode strings versus bytes, and with the leak bug in lxml"""
if sys.version < '3':
#Python 2
if isinstance(s,unicode): #pylint: disable=undefined-variable
s = s.encode('utf-8')
try:
return ElementTree.parse(StringIO(s), ElementTree.XMLParser(collect_ids=False))
except TypeError:
return ElementTree.parse(StringIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!!
else:
#Python 3
if isinstance(s,str):
s = s.encode('utf-8')
try:
return ElementTree.parse(BytesIO(s), ElementTree.XMLParser(collect_ids=False))
except TypeError:
return ElementTree.parse(BytesIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!!
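#An illustrative sketch of the helper above (the XML snippet is made up for the
#example, using the FoLiA namespace defined earlier):
#
#   xml = '<set xmlns="' + NSFOLIA + '" xml:id="exampleset" type="closed"></set>'
#   tree = xmltreefromstring(xml)
#   assert tree.getroot().tag == '{' + NSFOLIA + '}set'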
class SetDefinition(object):
def __init__(self, url, format=None, basens="",verbose=False):
self.graph = rdflib.Graph()
self.basens = basens
self.mainsetcache = {}
self.subsetcache = {}
self.set_id_uri_cache = {}
self.verbose = verbose
self.graph.bind( 'fsd', NSFOLIASETDEFINITION+'#', override=True)
self.graph.bind( 'skos', NSSKOS+'#', override=True)
if not format:
#try to guess format from URL
if url.endswith('.ttl'):
format = 'text/turtle'
elif url.endswith('.n3'):
format = 'text/n3'
elif url.endswith('.rdf.xml') or url.endswith('.rdf'):
format = 'application/rdf+xml'
elif url.endswith('.xml'): #other XML will be considered legacy
format = 'application/foliaset+xml' #legacy
if format in ('application/foliaset+xml','legacy',None):
#legacy format, has some checks and fallbacks if the format turns out to be RDF anyway
self.legacyset = None
if url[0] == '/' or url[0] == '.':
#local file
f = io.open(url,'r',encoding='utf-8')
else:
#remote URL
if not self.basens:
self.basens = url
try:
f = urlopen(url)
except:
raise DeepValidationError("Unable to download " + url)
try:
data = f.read()
except IOError:
raise DeepValidationError("Unable to download " + url)
finally:
f.close()
if data[0] in ('@',b'@',64):
#this is not gonna be valid XML, but looks like turtle/n3 RDF
self.graph.parse(location=url, format='text/turtle')
if self.verbose:
print("Loaded set " + url + " (" + str(len(self.graph)) + " triples)",file=sys.stderr)
return
tree = xmltreefromstring(data)
root = tree.getroot()
if root.tag != '{' + NSFOLIA + '}set':
if root.tag.lower().find('rdf') != -1:
#well, this is RDF after all...
self.graph.parse(location=url, format='application/rdf+xml')
return
else:
raise SetDefinitionError("Not a FoLiA Set Definition! Unexpected root tag:"+ root.tag)
legacyset = LegacySetDefinition.parsexml(root)
legacyset.rdf(self.graph, self.basens)
if self.verbose:
print("Loaded legacy set " + url + " (" + str(len(self.graph)) + " triples)",file=sys.stderr)
else:
try:
self.graph.parse(location=url, format=format)
except HTTPError:
raise DeepValidationError("Unable to download " + url)
if self.verbose:
print("Loaded set " + url + " (" + str(len(self.graph)) + " triples)",file=sys.stderr)
def testclass(self,cls):
"""Test for the presence of the class, returns the full URI or raises an exception"""
mainsetinfo = self.mainset()
if mainsetinfo['open']:
return cls #everything is okay
elif mainsetinfo['empty']:
if cls:
raise DeepValidationError("Expected an empty class, got \"" + cls + "\"")
else:
if not cls:
raise DeepValidationError("No class specified")
#closed set
set_uri = mainsetinfo['uri']
for row in self.graph.query("SELECT ?c WHERE { ?c rdf:type skos:Concept ; skos:notation \"" + cls + "\". <" + str(set_uri) + "> skos:member ?c }"):
return str(row.c)
raise DeepValidationError("Not a valid class: " + cls)
def testsubclass(self, cls, subset, subclass):
"""Test for the presence of a class in a subset (used with features), returns the full URI or raises an exception"""
subsetinfo = self.subset(subset)
if subsetinfo['open']:
return subclass #everything is okay
else:
subset_uri = subsetinfo['uri']
if not subset_uri:
raise DeepValidationError("Not a valid subset: " + subset)
query = "SELECT ?c WHERE { ?c rdf:type skos:Concept ; skos:notation \"" + subclass + "\" . <" + str(subset_uri) + "> skos:member ?c }"
for row in self.graph.query(query):
return str(row.c)
raise DeepValidationError("Not a valid class in subset " + subset + ": " + subclass)
def get_set_uri(self, set_id=None):
if set_id in self.set_id_uri_cache:
return self.set_id_uri_cache[set_id]
if set_id:
for row in self.graph.query("SELECT ?s WHERE { ?s rdf:type skos:Collection ; skos:notation \"" + set_id + "\" }"):
self.set_id_uri_cache[set_id] = row.s
return row.s
raise DeepValidationError("No such set: " + str(set_id))
else:
for row in self.graph.query("SELECT ?s WHERE { ?s rdf:type skos:Collection . FILTER NOT EXISTS { ?y rdf:type skos:Collection . ?y skos:member ?s } }"):
self.set_id_uri_cache[set_id] = row.s
return row.s
raise DeepValidationError("Main set not found")
def mainset(self):
"""Returns information regarding the set"""
if self.mainsetcache:
return self.mainsetcache
set_uri = self.get_set_uri()
for row in self.graph.query("SELECT ?seturi ?setid ?setlabel ?setopen ?setempty WHERE { ?seturi rdf:type skos:Collection . OPTIONAL { ?seturi skos:notation ?setid } OPTIONAL { ?seturi skos:prefLabel ?setlabel } OPTIONAL { ?seturi fsd:open ?setopen } OPTIONAL { ?seturi fsd:empty ?setempty } FILTER NOT EXISTS { ?y skos:member ?seturi . ?y rdf:type skos:Collection } }"):
self.mainsetcache = {'uri': str(row.seturi), 'id': str(row.setid), 'label': str(row.setlabel) if row.setlabel else "", 'open': bool(row.setopen), 'empty': bool(row.setempty) }
return self.mainsetcache
raise DeepValidationError("Unable to find main set (set_uri=" + str(set_uri)+"), this should not happen")
def subset(self, subset_id):
"""Returns information regarding the set"""
if subset_id in self.subsetcache:
return self.subsetcache[subset_id]
set_uri = self.get_set_uri(subset_id)
for row in self.graph.query("SELECT ?seturi ?setid ?setlabel ?setopen WHERE { ?seturi rdf:type skos:Collection . OPTIONAL { ?seturi skos:notation ?setid } OPTIONAL { ?seturi skos:prefLabel ?setlabel } OPTIONAL { ?seturi fsd:open ?setopen } FILTER (?seturi = <" + str(set_uri)+">) }"):
self.subsetcache[str(row.setid)] = {'uri': str(row.seturi), 'id': str(row.setid), 'label': str(row.setlabel) if row.setlabel else "", 'open': bool(row.setopen) }
return self.subsetcache[str(row.setid)]
raise DeepValidationError("Unable to find subset (set_uri=" + str(set_uri)+")")
def orderedclasses(self, set_uri_or_id=None, nestedhierarchy=False):
"""Higher-order generator function that yields class information in the right order, combines calls to :meth:`SetDefinition.classes` and :meth:`SetDefinition.classorder`"""
classes = self.classes(set_uri_or_id, nestedhierarchy)
for classid in self.classorder(classes):
yield classes[classid]
def __iter__(self):
"""Alias for :meth:`SetDefinition.orderedclasses`"""
return self.orderedclasses()
def classes(self, set_uri_or_id=None, nestedhierarchy=False):
"""Returns a dictionary of classes for the specified (sub)set (if None, default, the main set is selected)"""
if set_uri_or_id and set_uri_or_id.startswith(('http://','https://')):
set_uri = set_uri_or_id
else:
set_uri = self.get_set_uri(set_uri_or_id)
assert set_uri is not None
classes= {}
uri2idmap = {}
for row in self.graph.query("SELECT ?classuri ?classid ?classlabel ?parentclass ?seqnr WHERE { ?classuri rdf:type skos:Concept ; skos:notation ?classid. <" + str(set_uri) + "> skos:member ?classuri . OPTIONAL { ?classuri skos:prefLabel ?classlabel } OPTIONAL { ?classuri skos:narrower ?parentclass } OPTIONAL { ?classuri fsd:sequenceNumber ?seqnr } }"):
classinfo = {'uri': str(row.classuri), 'id': str(row.classid),'label': str(row.classlabel) if row.classlabel else "" }
if nestedhierarchy:
uri2idmap[str(row.classuri)] = str(row.classid)
if row.parentclass:
classinfo['parentclass'] = str(row.parentclass) #uri
if row.seqnr:
classinfo['seqnr'] = int(row.seqnr)
classes[str(row.classid)] = classinfo
if nestedhierarchy:
#build hierarchy
removekeys = []
for classid, classinfo in classes.items():
if 'parentclass' in classinfo:
removekeys.append(classid)
parentclassid = uri2idmap[classinfo['parentclass']]
if 'subclasses' not in classes[parentclassid]:
classes[parentclassid]['subclasses'] = {}
classes[parentclassid]['subclasses'][classid] = classinfo
for key in removekeys:
del classes[key]
return classes
def classorder(self,classes):
"""Return a list of class IDs in order for presentational purposes: order is determined first and foremost by explicit ordering, else alphabetically by label or as a last resort by class ID"""
return [ classid for classid, classitem in sorted( ((classid, classitem) for classid, classitem in classes.items() if 'seqnr' in classitem) , key=lambda pair: pair[1]['seqnr'] )] + \
[ classid for classid, classitem in sorted( ((classid, classitem) for classid, classitem in classes.items() if 'seqnr' not in classitem) , key=lambda pair: pair[1]['label'] if 'label' in pair[1] else pair[1]['id']) ]
def subsets(self, set_uri_or_id=None):
if set_uri_or_id and set_uri_or_id.startswith(('http://', 'https://')):
set_uri = set_uri_or_id
else:
set_uri = self.get_set_uri(set_uri_or_id)
assert set_uri is not None
for row in self.graph.query("SELECT ?seturi ?setid ?setlabel ?setopen WHERE { ?seturi rdf:type skos:Collection . <" + str(set_uri) + "> skos:member ?seturi . OPTIONAL { ?seturi skos:notation ?setid } OPTIONAL { ?seturi skos:prefLabel ?setlabel } OPTIONAL { ?seturi fsd:open ?setopen } }"):
yield {'uri': str(row.seturi), 'id': str(row.setid), 'label': str(row.setlabel) if row.setlabel else "", 'open': bool(row.setopen) }
def json(self):
data = {'subsets': {}}
setinfo = self.mainset()
#backward compatibility, set type:
if setinfo['open']:
setinfo['type'] = 'open'
else:
setinfo['type'] = 'closed'
data.update(setinfo)
classes = self.classes()
data['classes'] = classes
data['classorder'] = self.classorder(classes)
for subsetinfo in self.subsets():
#backward compatibility, set type:
if subsetinfo['open']:
subsetinfo['type'] = 'open'
else:
subsetinfo['type'] = 'closed'
data['subsets'][subsetinfo['id']] = subsetinfo
classes = self.classes(subsetinfo['uri'])
data['subsets'][subsetinfo['id']]['classes'] = classes
data['subsets'][subsetinfo['id']]['classorder'] = self.classorder(classes)
return data
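#A minimal usage sketch of SetDefinition (the URL below is hypothetical; the
#format is guessed from the extension, here Turtle):
#
#   setdefinition = SetDefinition("https://example.org/sets/pos.ttl")
#   uri = setdefinition.testclass("N")   #full concept URI, or DeepValidationError if invalid
#   for classinfo in setdefinition:      #yields class dicts in presentation order
#       print(classinfo['id'], classinfo['label'])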
PyNLPl-1.2.9/pynlpl/formats/fql.py 0000644 0001750 0000144 00000303405 13344247576 017662 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - FoLiA Query Language
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://proycon.github.com/folia
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Module for reading, editing and writing FoLiA XML using
# the FoLiA Query Language
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.formats import folia
from copy import copy
import json
import re
import sys
import random
import datetime
OPERATORS = ('=','==','!=','>','<','<=','>=','CONTAINS','NOTCONTAINS','MATCHES','NOTMATCHES')
MASK_NORMAL = 0
MASK_LITERAL = 1
MASK_EXPRESSION = 2
MAXEXPANSION = 99
FOLIAVERSION = '1.5.0'
FQLVERSION = '0.4.1'
class SyntaxError(Exception):
pass
class QueryError(Exception):
pass
def getrandomid(query,prefix=""):
randomid = ""
while not randomid or randomid in query.doc.index:
randomid = prefix + "%08x" % random.getrandbits(32) #generate a random ID
return randomid
class UnparsedQuery(object):
"""This class takes care of handling grouped blocks in parentheses and handling quoted values"""
def __init__(self, s, i=0):
self.q = []
self.mask = []
l = len(s)
begin = 0
while i < l:
c = s[i]
if c == " ":
#process previous word
if begin < i:
w = s[begin:i]
self.q.append(w)
self.mask.append(MASK_NORMAL)
begin = i + 1
elif i == l - 1:
#process last word
w = s[begin:]
self.q.append(w)
self.mask.append(MASK_NORMAL)
if c == '(': #groups
#find end quote and process block
level = 0
quoted = False
s2 = ""
for j in range(i+1,l):
c2 = s[j]
if c2 == '"':
if s[j-1] != "\\": #check it isn't escaped
quoted = not quoted
if not quoted:
if c2 == '(':
level += 1
elif c2 == ')':
if level == 0:
s2 = s[i+1:j]
break
else:
level -= 1
if s2:
self.q.append(UnparsedQuery(s2))
self.mask.append(MASK_EXPRESSION)
i = j
begin = i+1
else:
raise SyntaxError("Unmatched parenthesis at char " + str(i))
elif c == '"': #literals
if i == 0 or (i > 0 and s[i-1] != "\\"): #check it isn't escaped
#find end quote and process block
s2 = None
for j in range(i+1,l):
c2 = s[j]
if c2 == '"':
if s[j-1] != "\\": #check it isn't escaped
s2 = s[i+1:j]
break
if not s2 is None:
self.q.append(s2.replace('\\"','"').replace("\\n","\n")) #undo escaped quotes and newlines
self.mask.append(MASK_LITERAL)
i = j
begin = i+1
else:
raise SyntaxError("Unterminated string literal at char " + str(i))
i += 1
remove = []
#process shortcut notation
for i, (w,m) in enumerate(zip(self.q,self.mask)):
if m == MASK_NORMAL and w[0] == ':':
#we have shortcut notation for a HAS statement, rewrite:
self.q[i] = UnparsedQuery(w[1:] + " HAS class " + self.q[i+1] + " \"" + self.q[i+2] + "\"")
self.mask[i] = MASK_EXPRESSION
remove += [i+1,i+2]
if remove:
for index in reversed(remove):
del self.q[index]
del self.mask[index]
def __iter__(self):
for w in self.q:
yield w
def __len__(self):
return len(self.q)
def __getitem__(self, index):
try:
return self.q[index]
except:
return ""
def kw(self, index, value):
try:
if isinstance(value, tuple):
return self.q[index] in value and self.mask[index] == MASK_NORMAL
else:
return self.q[index] == value and self.mask[index] == MASK_NORMAL
except:
return False
def __contains__(self, keyword): #mask-aware keyword membership test ('in' operator)
for k,m in zip(self.q,self.mask):
if keyword == k and m == MASK_NORMAL:
return True
return False
def __setitem__(self, index, value):
self.q[index] = value
def __str__(self):
s = []
for w,m in zip(self.q,self.mask):
if m == MASK_NORMAL:
s.append(w)
elif m == MASK_LITERAL:
s.append('"' + w.replace('"','\\"') + '"')
elif m == MASK_EXPRESSION:
s.append('(' + str(w) + ')')
return " ".join(s)
class Filter(object): #WHERE ....
def __init__(self, filters, negation=False,disjunction=False):
self.filters = filters
self.negation = negation
self.disjunction = disjunction
@staticmethod
def parse(q, i=0):
filters = []
negation = False
logop = ""
l = len(q)
while i < l:
if q.kw(i, "NOT"):
negation = True
i += 1
elif isinstance(q[i], UnparsedQuery):
filter,_ = Filter.parse(q[i])
filters.append(filter)
i += 1
if q.kw(i,"AND") or q.kw(i, "OR"):
if logop and q[i] != logop:
raise SyntaxError("Mixed logical operators, use parentheses: " + str(q))
logop = q[i]
i += 1
else:
break #done
elif i == 0 and (q[i].startswith("PREVIOUS") or q[i].startswith("NEXT") or q.kw(i, ("LEFTCONTEXT","RIGHTCONTEXT","CONTEXT","PARENT","ANCESTOR","CHILD") )):
#we have a context expression, always occurring in its own subquery
modifier = q[i]
i += 1
selector,i = Selector.parse(q,i)
filters.append( (modifier, selector,None) )
break
elif q[i+1] in OPERATORS and q[i] and q[i+2]:
operator = q[i+1]
if q[i] == "class":
v = lambda x,y='cls': getattr(x,y)
elif q[i] in ("text","value","phon"):
v = lambda x,y='text': getattr(x,'value') if isinstance(x, (folia.Description, folia.Comment, folia.Content)) else getattr(x,'phon') if isinstance(x,folia.PhonContent) else getattr(x,'text')()
else:
v = lambda x,y=q[i]: getattr(x,y)
if q[i] == 'confidence':
cnv = float
else:
cnv = lambda x: x
if operator == '=' or operator == '==':
filters.append( lambda x,y=q[i+2],v=v : v(x) == y )
elif operator == '!=':
filters.append( lambda x,y=q[i+2],v=v : v(x) != y )
elif operator == '>':
filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) > y )
elif operator == '<':
filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) < y )
elif operator == '>=':
filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) >= y )
elif operator == '<=':
filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) <= y )
elif operator == 'CONTAINS':
filters.append( lambda x,y=q[i+2],v=v : v(x).find( y ) != -1 )
elif operator == 'NOTCONTAINS':
filters.append( lambda x,y=q[i+2],v=v : v(x).find( y ) == -1 )
elif operator == 'MATCHES':
filters.append( lambda x,y=re.compile(q[i+2]),v=v : y.search(v(x)) is not None )
elif operator == 'NOTMATCHES':
filters.append( lambda x,y=re.compile(q[i+2]),v=v : y.search(v(x)) is None )
if q.kw(i+3,("AND","OR")):
if logop and q[i+3] != logop:
raise SyntaxError("Mixed logical operators, use parentheses: " + str(q))
logop = q[i+3]
i += 4
else:
i += 3
break #done
elif 'HAS' in q[i:]:
#has statement (spans full UnparsedQuery by definition)
selector,i = Selector.parse(q,i)
if not q.kw(i,"HAS"):
raise SyntaxError("Expected HAS, got " + str(q[i]) + " at position " + str(i) + " in: " + str(q))
i += 1
subfilter,i = Filter.parse(q,i)
filters.append( ("CHILD",selector,subfilter) )
else:
raise SyntaxError("Expected comparison operator, got " + str(q[i+1]) + " in: " + str(q))
if negation and len(filters) > 1:
raise SyntaxError("Expecting parentheses when NOT is used with multiple conditions")
return Filter(filters, negation, logop == "OR"), i
def __call__(self, query, element, debug=False):
"""Tests the filter on the specified element, returns a boolean"""
match = True
if debug: print("[FQL EVALUATION DEBUG] Filter - Testing filter [" + str(self) + "] for ", repr(element),file=sys.stderr)
for filter in self.filters:
if isinstance(filter,tuple):
modifier, selector, subfilter = filter
if debug: print("[FQL EVALUATION DEBUG] Filter - Filter is a subfilter of type " + modifier + ", descending...",file=sys.stderr)
#we have a subfilter, i.e. a HAS statement on a subelement
match = False
if modifier == "CHILD":
for subelement,_ in selector(query, [element], True, debug): #if there are multiple subelements, they are always treated disjunctly
if not subfilter:
match = True
else:
match = subfilter(query, subelement, debug)
if match: break #only one subelement has to match by definition, then the HAS statement is matched
elif modifier == "PARENT":
match = selector.match(query, element.parent,debug)
elif modifier == "NEXT":
neighbour = element.next()
if neighbour:
match = selector.match(query, neighbour,debug)
elif modifier == "PREVIOUS":
neighbour = element.previous()
if neighbour:
match = selector.match(query, neighbour,debug)
else:
raise NotImplementedError("Context keyword " + modifier + " not implemented yet")
elif isinstance(filter, Filter):
#we have a nested filter (parentheses)
match = filter(query, element, debug)
else:
#we have a condition function we can evaluate
match = filter(element)
if self.negation:
match = not match
if match:
if self.disjunction:
if debug: print("[FQL EVALUATION DEBUG] Filter returns True",file=sys.stderr)
return True
else:
if not self.disjunction: #implies conjunction
if debug: print("[FQL EVALUATION DEBUG] Filter returns False",file=sys.stderr)
return False
if debug: print("[FQL EVALUATION DEBUG] Filter returns ", str(match),file=sys.stderr)
return match
def __str__(self):
q = ""
if self.negation:
q += "NOT "
for i, filter in enumerate(self.filters):
if i > 0:
if self.disjunction:
q += "OR "
else:
q += "AND "
if isinstance(filter, Filter):
q += "(" + str(filter) + ") "
elif isinstance(filter, tuple):
modifier,selector,subfilter = filter
q += "(" + modifier + " " + str(selector) + " HAS " + str(subfilter) + ") "
else:
#original filter can't be reconstructed, place dummy:
q += "...\"" + str(filter.__defaults__[0]) +"\""
return q.strip()
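#An illustrative sketch of Filter.parse; both conditions must hold (a conjunction),
#OR would make it a disjunction:
#
#   filter, i = Filter.parse(UnparsedQuery('class = "N" AND annotator = "john"'))
#   #filter(query, element) tests element.cls == "N" and element.annotator == "john"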
class SpanSet(list):
def select(self,*args):
raise QueryError("Got a span set for a non-span element")
def partof(self, collection):
"""Checks whether this exact span (same elements, same order) occurs in the collection"""
for e in collection:
if isinstance(e, SpanSet):
if len(e) != len(self):
continue
match = True
for c1,c2 in zip(e,self):
if c1 is not c2:
match = False
break
if match:
return True
return False
class Selector(object):
def __init__(self, Class, set=None, id=None, filter=None, nextselector=None, expansion=None):
self.Class = Class
self.set = set
self.id = id
self.filter = filter
self.nextselector = nextselector #selectors can be chained
self.expansion = expansion #{min,max} occurrence interval, allowed only in Span and evaluated there instead of here
def chain(self, targets):
assert targets[0] is self
selector = self
selector.nextselector = None
for target in targets[1:]:
selector.nextselector = target
selector = target
@staticmethod
def parse(q, i=0, allowexpansion=False):
l = len(q)
set = None
id = None
filter = None
expansion = None
if q[i] == "ID" and q[i+1]:
id = q[i+1]
Class = None
i += 2
else:
if q[i] == "ALL":
Class = "ALL"
else:
try:
Class = folia.XML2CLASS[q[i]]
except:
raise SyntaxError("Expected element type, got " + str(q[i]) + " in: " + str(q))
i += 1
if q[i] and q[i][0] == "{" and q[i][-1] == "}":
if not allowexpansion:
raise SyntaxError("Expansion expressions not allowed at this point, got one at position " + str(i) + " in: " + str(q))
expansion = q[i][1:-1]
expansion = expansion.split(',')
i += 1
try:
if len(expansion) == 1:
expansion = (int(expansion[0]), int(expansion[0]))
elif len(expansion) == 2 and expansion[0] == "":
expansion = (0,int(expansion[1]))
elif len(expansion) == 2 and expansion[1] == "":
expansion = (int(expansion[0]),MAXEXPANSION)
elif len(expansion) == 2:
expansion = tuple(int(x) for x in expansion if x)
else:
raise SyntaxError("Invalid expansion expression: " + ",".join(expansion))
except ValueError:
raise SyntaxError("Invalid expansion expression: " + ",".join(expansion))
while i < l:
if q.kw(i,"OF") and q[i+1]:
set = q[i+1]
i += 2
elif q.kw(i,"ID") and q[i+1]:
id = q[i+1]
i += 2
elif q.kw(i, "WHERE"):
#ok, big filter coming up!
filter, i = Filter.parse(q,i+1)
break
else:
#something we don't handle
break
return Selector(Class,set,id,filter, None, expansion), i
def __call__(self, query, contextselector, recurse=True, debug=False): #generator, lazy evaluation!
if isinstance(contextselector,tuple) and len(contextselector) == 2:
selection = contextselector[0](*contextselector[1])
else:
selection = contextselector
count = 0
for e in selection:
selector = self
while True: #will loop through the chain of selectors, only the first one is called explicitly
if debug: print("[FQL EVALUATION DEBUG] Select - Running selector [", str(self), "] on ", repr(e),file=sys.stderr)
if selector.id:
if debug: print("[FQL EVALUATION DEBUG] Select - Selecting ID " + selector.id,file=sys.stderr)
try:
candidate = query.doc[selector.id]
selector.Class = candidate.__class__
if not selector.filter or selector.filter(query,candidate, debug):
if debug: print("[FQL EVALUATION DEBUG] Select - Yielding (by ID) ", repr(candidate),file=sys.stderr)
yield candidate, e
except KeyError:
if debug: print("[FQL EVALUATION DEBUG] Select - Selecting by ID failed for ID " + selector.id,file=sys.stderr)
pass #silently ignore ID mismatches
elif selector.Class == "ALL":
for candidate in e:
if isinstance(candidate, folia.AbstractElement):
yield candidate, e
elif selector.Class:
if debug: print("[FQL EVALUATION DEBUG] Select - Selecting Class " + selector.Class.XMLTAG + " with set " + str(selector.set),file=sys.stderr)
if selector.Class.XMLTAG in query.defaultsets:
selector.set = query.defaultsets[selector.Class.XMLTAG]
isspan = issubclass(selector.Class, folia.AbstractSpanAnnotation)
if isinstance(e, tuple): e = e[0]
if isspan and (isinstance(e, folia.Word) or isinstance(e, folia.Morpheme)):
for candidate in e.findspans(selector.Class, selector.set):
if not selector.filter or selector.filter(query,candidate, debug):
if debug: print("[FQL EVALUATION DEBUG] Select - Yielding span, single reference: ", repr(candidate),file=sys.stderr)
yield candidate, e
elif isspan and isinstance(e, SpanSet):
#we take the first item of the span to find the candidates
for candidate in e[0].findspans(selector.Class, selector.set):
if not selector.filter or selector.filter(query,candidate, debug):
#test if all the other elements in the span are in this candidate
matched = True
spanelements = list(candidate.wrefs())
for e2 in e[1:]:
if e2 not in spanelements:
matched = False
break
if matched:
if debug: print("[FQL EVALUATION DEBUG] Select - Yielding span, multiple references: ", repr(candidate),file=sys.stderr)
yield candidate, e
elif isinstance(e, SpanSet):
yield e, e
else:
#print("DEBUG: doing select " + selector.Class.__name__ + " (recurse=" + str(recurse)+") on " + repr(e))
for candidate in e.select(selector.Class, selector.set, recurse):
try:
if candidate.changedbyquery is query:
#this candidate has been added/modified by the query, don't select it again
continue
except AttributeError:
pass
if not selector.filter or selector.filter(query,candidate, debug):
if debug: print("[FQL EVALUATION DEBUG] Select - Yielding ", repr(candidate), " in ", repr(e),file=sys.stderr)
yield candidate, e
if selector.nextselector is None:
if debug: print("[FQL EVALUATION DEBUG] Select - End of chain",file=sys.stderr)
break # end of chain
else:
if debug: print("[FQL EVALUATION DEBUG] Select - Selecting next in chain",file=sys.stderr)
selector = selector.nextselector
def match(self, query, candidate, debug = False):
if debug: print("[FQL EVALUATION DEBUG] Select - Matching selector [", str(self), "] on ", repr(candidate),file=sys.stderr)
if self.id:
if candidate.id != self.id:
return False
elif self.Class:
if not isinstance(candidate,self.Class):
return False
if self.filter and not self.filter(query,candidate, debug):
return False
if debug: print("[FQL EVALUATION DEBUG] Select - Selector matches! ", repr(candidate),file=sys.stderr)
return True
def autodeclare(self,doc):
if self.Class and self.set:
if not doc.declared(self.Class, self.set):
doc.declare(self.Class, self.set)
if self.nextselector:
self.nextselector.autodeclare(doc)
def __str__(self):
s = ""
if self.Class:
s += self.Class.XMLTAG + " "
if self.set:
s += "OF " + self.set + " "
if self.id:
s += "ID " + self.id + " "
if self.filter:
s += "WHERE " + str(self.filter)
if self.nextselector:
s += str(self.nextselector)
return s.strip()
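#An illustrative sketch of Selector.parse on a typical selector expression
#(the set name is hypothetical):
#
#   selector, i = Selector.parse(UnparsedQuery('w OF someset WHERE class = "N"'))
#   #selector.Class is folia.Word, selector.set == 'someset', selector.filter tests the class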
class Span(object):
def __init__(self, targets, intervals=None): #intervals parameter is reserved and currently unused
self.targets = targets #Selector instances making up the span
def __len__(self):
return len(self.targets)
@staticmethod
def parse(q, i=0):
targets = []
l = len(q)
while i < l:
if q.kw(i,"ID") or q[i] in folia.XML2CLASS:
target,i = Selector.parse(q,i, True)
targets.append(target)
elif q.kw(i,"&"):
#we're gonna have more targets
i += 1
elif q.kw(i,"NONE"):
#empty span
return Span([]), i+1
else:
break
if not targets:
raise SyntaxError("Expected one or more span targets, got " + str(q[i]) + " in: " + str(q))
return Span(targets), i
def __call__(self, query, contextselector, recurse=True,debug=False): #returns a list of element in a span
if debug: print("[FQL EVALUATION DEBUG] Span - Building span from target selectors (" + str(len(self.targets)) + ")",file=sys.stderr)
backtrack = []
l = len(self.targets)
if l == 0:
#span is explicitly empty, this is allowed in RESPAN context
if debug: print("[FQL EVALUATION DEBUG] Span - Yielding explicitly empty SpanSet",file=sys.stderr)
yield SpanSet()
else:
#find the first non-optional element, it will be our pivot:
pivotindex = None
for i, target in enumerate(self.targets):
if self.targets[i].id or not self.targets[i].expansion or self.targets[i].expansion[0] > 0:
pivotindex = i
break
if pivotindex is None:
raise QueryError("All parts in the SPAN expression are optional, at least one non-optional component is required")
#get first target
for element, target in self.targets[pivotindex](query, contextselector, recurse,debug):
if debug: print("[FQL EVALUATION DEBUG] Span - First item of span found (pivotindex=" + str(pivotindex) + ",l=" + str(l) + "," + str(repr(element)) + ")",file=sys.stderr)
spanset = SpanSet() #element is added later
match = True #we attempt to disprove this
#now see if consecutive elements match up
#--- matching prior to pivot -------
#match optional elements before pivotindex
i = pivotindex
currentelement = element
while i > 0:
i -= 1
if i < 0: break
selector = self.targets[i]
minmatches = selector.expansion[0]
assert minmatches == 0 #everything before pivot has to have minmatches 0
maxmatches = selector.expansion[1]
done = False
matches = 0
while True:
prevelement = element
element = element.previous(selector.Class, None)
if not element or (target and target not in element.ancestors()):
if debug: print("[FQL EVALUATION DEBUG] Span - Prior element not found or out of scope",file=sys.stderr)
done = True #no more elements left
break
elif element and not selector.match(query, element,debug):
if debug: print("[FQL EVALUATION DEBUG] Span - Prior element does not match filter",file=sys.stderr)
element = prevelement #reset
break
if debug: print("[FQL EVALUATION DEBUG] Span - Prior element matches",file=sys.stderr)
#we have a match
matches += 1
spanset.insert(0,element)
if matches >= maxmatches:
if debug: print("[FQL EVALUATION DEBUG] Span - Maximum threshold reached for span selector " + str(i) + ", breaking", file=sys.stderr)
break
if done:
break
#--- matching pivot and selectors after pivot -------
done = False #are we done with this selector?
element = currentelement
i = pivotindex - 1 #loop does +1 at the start of each iteration, we want to start with the pivotindex
while i < l:
i += 1
if i == l:
if debug: print("[FQL EVALUATION DEBUG] Span - No more selectors to try",i,l, file=sys.stderr)
break
selector = self.targets[i]
if selector.id: #selection by ID, don't care about consecutiveness
try:
element = query.doc[selector.id]
if debug: print("[FQL EVALUATION DEBUG] Span - Obtained subsequent span item from ID: ", repr(element), file=sys.stderr)
except KeyError:
if debug: print("[FQL EVALUATION DEBUG] Span - Obtained subsequent with specified ID does not exist ", file=sys.stderr)
match = False
break
if element and not selector.match(query, element,debug):
if debug: print("[FQL EVALUATION DEBUG] Span - Subsequent element does not match filter",file=sys.stderr)
else:
spanset.append(element)
else: #element must be consecutive
if selector.expansion:
minmatches = selector.expansion[0]
maxmatches = selector.expansion[1]
else:
minmatches = maxmatches = 1
if debug: print("[FQL EVALUATION DEBUG] Span - Preparing to match selector " + str(i) + " of span, expansion={" + str(minmatches) + "," + str(maxmatches) + "}", file=sys.stderr)
matches = 0
while True:
submatch = True #does the element currently under consideration match? (the match variable is reserved for the entire match)
done = False #are we done with this span selector?
holdelement = False #do not go to next element
if debug: print("[FQL EVALUATION DEBUG] Span - Processing element with span selector " + str(i) + ": ", repr(element), file=sys.stderr)
if not element or (target and target not in element.ancestors()):
if debug:
if not element:
print("[FQL EVALUATION DEBUG] Span - Element not found",file=sys.stderr)
elif target and not target in element.ancestors():
print("[FQL EVALUATION DEBUG] Span - Element out of scope",file=sys.stderr)
submatch = False
elif element and not selector.match(query, element,debug):
if debug: print("[FQL EVALUATION DEBUG] Span - Element does not match filter",file=sys.stderr)
submatch = False
if submatch:
matches += 1
if debug: print("[FQL EVALUATION DEBUG] Span - Element is a match, got " + str(matches) + " match(es) now", file=sys.stderr)
if matches > minmatches:
#check if the next selector(s) match too, then we have a point where we might branch two ways
#j = 1
#while i+j < len(self.targets):
# nextselector = self.targets[i+j]
# if nextselector.match(query, element,debug):
# #save this point for backtracking, when we get stuck, we'll roll back to this point
# backtrack.append( (i+j, prevelement, copy(spanset) ) ) #using prevelement, nextelement will be recomputed after backtracking, using different selector
# if not nextselector.expansion or nextselector.expansion[0] > 0:
# break
# j += 1
#TODO: implement
pass
elif matches < minmatches:
if debug: print("[FQL EVALUATION DEBUG] Span - Minimum threshold not reached yet for span selector " + str(i), file=sys.stderr)
spanset.append(element)
if matches >= maxmatches:
if debug: print("[FQL EVALUATION DEBUG] Span - Maximum threshold reached for span selector " + str(i) + ", breaking", file=sys.stderr)
done = True #done with this selector
else:
if matches < minmatches:
#can we backtrack?
if backtrack: #(not reached currently)
if debug: print("[FQL EVALUATION DEBUG] Span - Backtracking",file=sys.stderr)
index, element, spanset = backtrack.pop()
i = index - 1 #next iteration will do +1 again
match = True #default
continue
else:
#nope, all is lost, we have no match
if debug: print("[FQL EVALUATION DEBUG] Span - Minimum threshold could not be attained for span selector " + str(i), file=sys.stderr)
match = False
break
else:
if debug: print("[FQL EVALUATION DEBUG] Span - No match for span selector " + str(i) + ", but no problem since matching threshold was already reached", file=sys.stderr)
holdelement = True
done = True
break
if not holdelement:
prevelement = element
#get next element
element = element.next(selector.Class, None)
if debug: print("[FQL EVALUATION DEBUG] Span - Selecting next element for next round", repr(element), file=sys.stderr)
if done or not match:
if debug: print("[FQL EVALUATION DEBUG] Span - Done with span selector " + str(i), repr(element), file=sys.stderr)
break
if not match: break
if match:
if debug: print("[FQL EVALUATION DEBUG] Span - Span found, returning spanset (" + repr(spanset) + ")",file=sys.stderr)
yield spanset
else:
if debug: print("[FQL EVALUATION DEBUG] Span - Span not found",file=sys.stderr)
class Target(object): #FOR/IN... expression
def __init__(self, targets, strict=False, nested=None, start=None, end=None, endinclusive=True, repeat=False):
self.targets = targets #Selector instances
self.strict = strict #True for IN
self.nested = nested #nested in another target expression
self.start = start
self.end = end
self.endinclusive = endinclusive
self.repeat = repeat
@staticmethod
def parse(q, i=0):
if q.kw(i,'FOR'):
strict = False
elif q.kw(i,'IN'):
strict = True
else:
raise SyntaxError("Expected target expression, got " + str(q[i]) + " in: " + str(q))
i += 1
targets = []
nested = None
start = end = None
endinclusive = True
repeat = False
l = len(q)
while i < l:
if q.kw(i,'SPAN'):
target,i = Span.parse(q,i+1)
targets.append(target)
elif q.kw(i,"ID") or q[i] in folia.XML2CLASS or q[i] == "ALL":
target,i = Selector.parse(q,i)
targets.append(target)
elif q.kw(i,","):
#we're gonna have more targets
i += 1
elif q.kw(i, ('FOR','IN')):
nested,i = Selector.parse(q,i+1)
elif q.kw(i,"START"):
start,i = Selector.parse(q,i+1)
elif q.kw(i,("END","ENDAFTER")): #inclusive
end,i = Selector.parse(q,i+1)
endinclusive = True
elif q.kw(i,"ENDBEFORE"): #exclusive
end,i = Selector.parse(q,i+1)
endinclusive = False
elif q.kw(i,"REPEAT"):
repeat = True
i += 1
else:
break
if not targets:
raise SyntaxError("Expected one or more targets, got " + str(q[i]) + " in: " + str(q))
return Target(targets,strict,nested,start,end,endinclusive, repeat), i
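#Illustration (hypothetical fragment): "FOR w IN s" parses as a non-strict
#Target with a single w Selector whose evaluation is nested under a Selector
#for s; starting with "IN" instead of "FOR" would make the target strict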
def __call__(self, query, contextselector, recurse, debug=False): #generator, lazy evaluation!
if self.nested:
if debug: print("[FQL EVALUATION DEBUG] Target - Deferring to nested target first",file=sys.stderr)
contextselector = (self.nested, (query, contextselector, not self.strict))
if debug: print("[FQL EVALUATION DEBUG] Target - Chaining and calling target selectors (" + str(len(self.targets)) + ")",file=sys.stderr)
if self.targets:
if isinstance(self.targets[0], Span):
for span in self.targets:
if not isinstance(span, Span): raise QueryError("SPAN statement may not be mixed with non-span statements in a single selection")
if debug: print("[FQL EVALUATION DEBUG] Target - Evaluation span ",file=sys.stderr)
for spanset in span(query, contextselector, recurse, debug):
if debug: print("[FQL EVALUATION DEBUG] Target - Yielding spanset ",file=sys.stderr)
yield spanset
else:
selector = self.targets[0]
selector.chain(self.targets)
started = (self.start is None)
dobreak = False
for e,_ in selector(query, contextselector, recurse, debug):
if not started:
if self.start.match(query, e):
if debug: print("[FQL EVALUATION DEBUG] Target - Matched start! Starting from here...",e, file=sys.stderr)
started = True
if started:
if self.end:
if self.end.match(query, e):
if not self.endinclusive:
if debug: print("[FQL EVALUATION DEBUG] Target - Matched end! Breaking before yielding...",e, file=sys.stderr)
started = False
if self.repeat:
continue
else:
break
else:
if debug: print("[FQL EVALUATION DEBUG] Target - Matched end! Breaking after yielding...",e, file=sys.stderr)
started = False
dobreak = True
if debug: print("[FQL EVALUATION DEBUG] Target - Yielding ",repr(e), file=sys.stderr)
yield e
if dobreak and not self.repeat:
break
class Alternative(object): #AS ALTERNATIVE ... expression
def __init__(self, subassignments={},assignments={},filter=None, nextalternative=None):
self.subassignments = subassignments
self.assignments = assignments
self.filter = filter
self.nextalternative = nextalternative
@staticmethod
def parse(q,i=0):
if q.kw(i,'AS') and q[i+1] == "ALTERNATIVE":
i += 1
subassignments = {}
assignments = {}
filter = None
if q.kw(i,'ALTERNATIVE'):
i += 1
if not q.kw(i,'WITH'):
i = getassignments(q, i, subassignments)
if q.kw(i,'WITH'):
i = getassignments(q, i+1, assignments)
if q.kw(i,'WHERE'):
filter, i = Filter.parse(q, i+1)
else:
raise SyntaxError("Expected ALTERNATIVE, got " + str(q[i]) + " in: " + str(q))
if q.kw(i,'ALTERNATIVE'):
#we have another!
nextalternative,i = Alternative.parse(q,i)
else:
nextalternative = None
return Alternative(subassignments, assignments, filter, nextalternative), i
def __call__(self, query, action, focus, target,debug=False):
"""Action delegates to this function"""
isspan = issubclass(action.focus.Class, folia.AbstractSpanAnnotation) #Class is a class, not an instance
subassignments = {} #make a copy
for key, value in action.assignments.items():
subassignments[key] = value
for key, value in self.subassignments.items():
subassignments[key] = value
if action.action == "SELECT":
if not focus: raise QueryError("SELECT requires a focus element")
if not isspan:
for alternative in focus.alternatives(action.focus.Class, focus.set):
if not self.filter or (self.filter and self.filter.match(query, alternative, debug)):
yield alternative
else:
raise NotImplementedError("Selecting alternative span not implemented yet")
elif action.action == "EDIT" or action.action == "ADD":
if not isspan:
if focus:
parent = focus.ancestor(folia.AbstractStructureElement)
alternative = folia.Alternative( query.doc, action.focus.Class( query.doc , **subassignments), **self.assignments)
parent.append(alternative)
yield alternative
else:
alternative = folia.Alternative( query.doc, action.focus.Class( query.doc , **subassignments), **self.assignments)
target.append(alternative)
yield alternative
else:
raise NotImplementedError("Editing alternative span not implemented yet")
else:
raise QueryError("Alternative does not handle action " + action.action)
def autodeclare(self, doc):
pass #nothing to declare
def substitute(self, *args):
raise QueryError("SUBSTITUTE not supported with AS ALTERNATIVE")
class Correction(object): #AS CORRECTION/SUGGESTION expression...
def __init__(self, set,actionassignments={}, assignments={},filter=None,suggestions=[], bare=False):
self.set = set
self.actionassignments = actionassignments #the assignments in the action
self.assignments = assignments #the assignments for the correction
self.filter = filter
self.suggestions = suggestions # [ (subassignments, suggestionassignments) ]
self.bare = bare
@staticmethod
def parse(q,i, focus):
if q.kw(i,'AS') and q.kw(i+1,'CORRECTION'):
i += 1
bare = False
if q.kw(i,'AS') and q.kw(i+1,'BARE') and q.kw(i+2,'CORRECTION'):
bare = True
i += 2
set = None
actionassignments = {}
assignments = {}
filter = None
suggestions = []
if q.kw(i,'CORRECTION'):
i += 1
if q.kw(i,'OF') and q[i+1]:
set = q[i+1]
i += 2
if not q.kw(i,'WITH'):
i = getassignments(q, i, actionassignments, focus)
if q.kw(i,'WHERE'):
filter, i = Filter.parse(q, i+1)
if q.kw(i,'WITH'):
i = getassignments(q, i+1, assignments)
else:
raise SyntaxError("Expected CORRECTION, got " + str(q[i]) + " in: " + str(q))
l = len(q)
while i < l:
if q.kw(i,'SUGGESTION'):
i+= 1
suggestion = ( {}, {} ) #subassignments, suggestionassignments
if isinstance(q[i], UnparsedQuery):
if not q[i].kw(0,'SUBSTITUTE') and not q[i].kw(0,'ADD'):
raise SyntaxError("Subexpression after SUGGESTION, expected ADD or SUBSTITUTE, got " + str(q[i]))
Correction.parsesubstitute(q[i],suggestion)
i += 1
elif q.kw(i,'MERGE') or q.kw(i,'SPLIT'):
if q.kw(i,'MERGE'):
suggestion[1]['merge'] = True
else:
suggestion[1]['split'] = True
i+= 1
if q.kw(i,'DELETION'): #No need to do anything, DELETION is just to make things more explicit in the syntax, it will result in an empty suggestion
i+=1
elif isinstance(q[i], UnparsedQuery):
if not q[i].kw(0,'SUBSTITUTE') and not q[i].kw(0,'ADD'):
raise SyntaxError("Subexpression after SUGGESTION, expected ADD or SUBSTITUTE, got " + str(q[i]))
Correction.parsesubstitute(q[i],suggestion)
i += 1
elif not q.kw(i,'WITH'):
i = getassignments(q, i, suggestion[0], focus) #subassignments (the actual element in the suggestion)
elif not q.kw(i,'WITH'):
i = getassignments(q, i, suggestion[0], focus) #subassignments (the actual element in the suggestion)
if q.kw(i,'WITH'):
i = getassignments(q, i+1, suggestion[1]) #assignments for the suggestion
suggestions.append(suggestion)
else:
raise SyntaxError("Expected SUGGESTION or end of AS clause, got " + str(q[i]) + " in: " + str(q))
return Correction(set, actionassignments, assignments, filter, suggestions, bare), i
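#Illustration (hypothetical set and class names): an AS clause such as
#  AS CORRECTION OF "my-correction-set" WITH class "spellingerror" SUGGESTION text "house"
#parses into set="my-correction-set", assignments={'class': 'spellingerror'}
#and one suggestion whose subassignments are {'text': 'house'}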
@staticmethod
def parsesubstitute(q,suggestion):
suggestion[0]['substitute'],_ = Action.parse(q)
def __call__(self, query, action, focus, target,debug=False):
"""Action delegates to this function"""
if debug: print("[FQL EVALUATION DEBUG] Correction - Processing ", repr(focus),file=sys.stderr)
isspan = issubclass(action.focus.Class, folia.AbstractSpanAnnotation) #Class is a class, not an instance
actionassignments = {} #make a copy
for key, value in action.assignments.items():
if key == 'class': key = 'cls'
actionassignments[key] = value
for key, value in self.actionassignments.items():
if key == 'class': key = 'cls'
actionassignments[key] = value
if actionassignments:
if (not 'set' in actionassignments or actionassignments['set'] is None) and action.focus.Class:
try:
actionassignments['set'] = query.defaultsets[action.focus.Class.XMLTAG]
except KeyError:
actionassignments['set'] = query.doc.defaultset(action.focus.Class)
if action.focus.Class.REQUIRED_ATTRIBS and folia.Attrib.ID in action.focus.Class.REQUIRED_ATTRIBS:
actionassignments['id'] = getrandomid(query, "corrected." + action.focus.Class.XMLTAG + ".")
kwargs = {}
if self.set:
kwargs['set'] = self.set
for key, value in self.assignments.items():
if key == 'class': key = 'cls'
kwargs[key] = value
if action.action == "SELECT":
if not focus: raise QueryError("SELECT requires a focus element")
correction = focus.incorrection()
if correction:
if not self.filter or (self.filter and self.filter.match(query, correction, debug)):
yield correction
elif action.action in ("EDIT","ADD","PREPEND","APPEND"):
if focus:
correction = focus.incorrection()
else:
correction = False
inheritchildren = []
if focus and not self.bare: #copy all data within
inheritchildren = list(focus.copychildren(query.doc, True))
if action.action == "EDIT" and action.span: #respan
#delete all word references from the copy first, we will add new ones
inheritchildren = [ c for c in inheritchildren if not isinstance(c, folia.WordReference) ]
if not isinstance(focus, folia.AbstractSpanAnnotation): raise QueryError("Can only perform RESPAN on span annotation elements!")
contextselector = target if target else query.doc
spanset = next(action.span(query, contextselector, True, debug)) #there can be only one
for w in spanset:
inheritchildren.append(w)
if actionassignments:
kwargs['new'] = action.focus.Class(query.doc,*inheritchildren, **actionassignments)
if focus and action.action not in ('PREPEND','APPEND'):
kwargs['original'] = focus
#TODO: if not bare, fix all span annotation references to this element
elif focus and action.action not in ('PREPEND','APPEND'):
if isinstance(focus, folia.AbstractStructureElement):
kwargs['current'] = focus #current only needed for structure annotation
if correction and (not 'set' in kwargs or correction.set == kwargs['set']) and (not 'cls' in kwargs or correction.cls == kwargs['cls']): #reuse the existing correction element
print("Reusing " + correction.id,file=sys.stderr)
kwargs['reuse'] = correction
if action.action in ('PREPEND','APPEND'):
#get parent relative to target
parent = target.ancestor( (folia.AbstractStructureElement, folia.AbstractSpanAnnotation, folia.AbstractAnnotationLayer) )
elif focus:
if 'reuse' in kwargs and kwargs['reuse']:
parent = focus.ancestor( (folia.AbstractStructureElement, folia.AbstractSpanAnnotation, folia.AbstractAnnotationLayer) )
else:
parent = focus.ancestor( (folia.AbstractStructureElement, folia.AbstractSpanAnnotation, folia.AbstractAnnotationLayer, folia.Correction) )
else:
parent = target
if 'id' not in kwargs and 'reuse' not in kwargs:
kwargs['id'] = parent.generate_id(folia.Correction)
kwargs['suggestions'] = []
for subassignments, suggestionassignments in self.suggestions:
subassignments = copy(subassignments) #assignment for the element in the suggestion
for key, value in action.assignments.items():
if not key in subassignments:
if key == 'class': key = 'cls'
subassignments[key] = value
if (not 'set' in subassignments or subassignments['set'] is None) and action.focus.Class:
try:
subassignments['set'] = query.defaultsets[action.focus.Class.XMLTAG]
except KeyError:
subassignments['set'] = query.doc.defaultset(action.focus.Class)
if focus and not self.bare: #copy all data within (we have to do this again for each suggestion as it will generate different ID suffixes)
inheritchildren = list(focus.copychildren(query.doc, True))
if action.focus.Class.REQUIRED_ATTRIBS and folia.Attrib.ID in action.focus.Class.REQUIRED_ATTRIBS:
subassignments['id'] = getrandomid(query, "suggestion.")
kwargs['suggestions'].append( folia.Suggestion(query.doc, action.focus.Class(query.doc, *inheritchildren,**subassignments), **suggestionassignments ) )
if action.action == 'PREPEND':
index = parent.getindex(target,True) #recursive
if index == -1:
raise QueryError("Insertion point for PREPEND action not found")
kwargs['insertindex'] = index
kwargs['nooriginal'] = True
elif action.action == 'APPEND':
index = parent.getindex(target,True) #recursive
if index == -1:
raise QueryError("Insertion point for APPEND action not found")
kwargs['insertindex'] = index+1
kwargs['insertindex_offset'] = 1 #used by correct if it needs to recompute the index
kwargs['nooriginal'] = True
yield parent.correct(**kwargs) #generator
elif action.action == "DELETE":
if debug: print("[FQL EVALUATION DEBUG] Correction - Deleting ", repr(focus), " (in " + repr(focus.parent) + ")",file=sys.stderr)
if not focus: raise QueryError("DELETE AS CORRECTION did not find a focus to operate on")
kwargs['original'] = focus
kwargs['new'] = [] #empty new
c = focus.parent.correct(**kwargs) #generator
yield c
else:
raise QueryError("Correction does not handle action " + action.action)
def autodeclare(self,doc):
if self.set:
if not doc.declared(folia.Correction, self.set):
doc.declare(folia.Correction, self.set)
def prepend(self, query, content, contextselector, debug):
return self.insert(query, content, contextselector, 0, debug)
def append(self, query, content, contextselector, debug):
return self.insert(query, content, contextselector, 1, debug)
def insert(self, query, content, contextselector, offset, debug):
kwargs = {}
if self.set:
kwargs['set'] = self.set
for key, value in self.assignments.items():
if key == 'class': key = 'cls'
kwargs[key] = value
self.autodeclare(query.doc)
if not content:
#suggestions only, no substitution obtained from the main action yet, we still have to process it
if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Initialising for suggestions only",file=sys.stderr)
if isinstance(contextselector,tuple) and len(contextselector) == 2:
contextselector = contextselector[0](*contextselector[1])
target = list(contextselector)[0] #not a spanset
insertindex = 0
#find insertion index:
if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Finding insertion index for target ", repr(target), " in ", repr(target.parent),file=sys.stderr)
for i, e in enumerate(target.parent):
if e is target:
if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Target ", repr(target), " found in ", repr(target.parent), " at index ", i,file=sys.stderr)
insertindex = i
break
content = {'parent': target.parent,'new':[]}
kwargs['insertindex'] = insertindex + offset
else:
kwargs['insertindex'] = content['index'] + offset
if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Initialising correction",file=sys.stderr)
kwargs['new'] = [] #stuff will be appended
kwargs['nooriginal'] = True #this is an insertion, there is no original
kwargs = self.assemblesuggestions(query,content,debug,kwargs)
if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Applying and returning correction ", repr(kwargs),file=sys.stderr)
return content['parent'].correct(**kwargs)
def substitute(self, query, substitution, contextselector, debug):
kwargs = {}
if self.set:
kwargs['set'] = self.set
for key, value in self.assignments.items():
if key == 'class': key = 'cls'
kwargs[key] = value
self.autodeclare(query.doc)
if not substitution:
#suggestions only, no substitution obtained from the main action yet, we still have to process it
if debug: print("[FQL EVALUATION DEBUG] Correction.substitute - Initialising for suggestions only",file=sys.stderr)
if isinstance(contextselector,tuple) and len(contextselector) == 2:
contextselector = contextselector[0](*contextselector[1])
target = list(contextselector)[0]
if not isinstance(target, SpanSet):
raise QueryError("SUBSTITUTE expects target SPAN")
prev = target[0].parent
for e in target[1:]:
if e.parent != prev:
raise QueryError("SUBSTITUTE can only be performed when the target items share the same parent. First parent is " + repr(prev) + ", parent of " + repr(e) + " is " + repr(e.parent))
insertindex = 0
#find insertion index:
for i, e in enumerate(target[0].parent):
if e is target[0]:
insertindex = i
break
substitution = {'parent': target[0].parent,'new':[]}
kwargs['insertindex'] = insertindex
kwargs['current'] = target
else:
kwargs['insertindex'] = substitution['index']
kwargs['original'] = substitution['span']
if debug: print("[FQL EVALUATION DEBUG] Correction.substitute - Initialising correction",file=sys.stderr)
kwargs['new'] = [] #stuff will be appended
kwargs = self.assemblesuggestions(query,substitution,debug,kwargs)
if debug: print("[FQL EVALUATION DEBUG] Correction.substitute - Applying and returning correction",file=sys.stderr)
return substitution['parent'].correct(**kwargs)
def assemblesuggestions(self, query, substitution, debug, kwargs):
if self.suggestions:
kwargs['suggestions'] = [] #stuff will be appended
for i, (Class, actionassignments, subactions) in enumerate(substitution['new']):
if actionassignments:
if (not 'set' in actionassignments or actionassignments['set'] is None):
try:
actionassignments['set'] = query.defaultsets[Class.XMLTAG]
except KeyError:
actionassignments['set'] = query.doc.defaultset(Class)
actionassignments['id'] = "corrected.%08x" % random.getrandbits(32) #generate a random ID
e = Class(query.doc, **actionassignments)
if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Adding to new",file=sys.stderr)
kwargs['new'].append(e)
for subaction in subactions:
subaction.focus.autodeclare(query.doc)
if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Invoking subaction", subaction.action,file=sys.stderr)
subaction(query, [e], debug ) #note: results of subactions will be silently discarded
for subassignments, suggestionassignments in self.suggestions:
suggestionchildren = []
if 'substitute' in subassignments:
#SUBSTITUTE (or its synonym ADD)
action = subassignments['substitute']
del subassignments['substitute']
else:
#we have a suggested deletion
action = None
if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Adding suggestion",file=sys.stderr)
while action:
subassignments = copy(subassignments) #assignment for the element in the suggestion
if isinstance(action.focus, tuple) and len(action.focus) == 2:
action.focus = action.focus[0]
for key, value in action.assignments.items():
if key == 'class': key = 'cls'
subassignments[key] = value
if (not 'set' in subassignments or subassignments['set'] is None) and action.focus.Class:
try:
subassignments['set'] = query.defaultsets[action.focus.Class.XMLTAG]
except KeyError:
subassignments['set'] = query.doc.defaultset(action.focus.Class)
focus = action.focus
focus.autodeclare(query.doc)
if focus.Class.REQUIRED_ATTRIBS and folia.Attrib.ID in focus.Class.REQUIRED_ATTRIBS:
subassignments['id'] = getrandomid(query, "suggestion.")
suggestionchildren.append( focus.Class(query.doc, **subassignments))
action = action.nextaction
if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Suggestionchildren: ", len(suggestionchildren),file=sys.stderr)
if 'split' in suggestionassignments and suggestionassignments['split']:
nextitem = substitution['parent'].next(substitution['parent'].__class__, None)
if nextitem:
suggestionassignments['split'] = nextitem.id
else:
del suggestionassignments['split']
if 'merge' in suggestionassignments and suggestionassignments['merge']:
nextitem = substitution['parent'].next(substitution['parent'].__class__, None)
if nextitem:
suggestionassignments['merge'] = nextitem.id
else:
del suggestionassignments['merge']
kwargs['suggestions'].append( folia.Suggestion(query.doc,*suggestionchildren, **suggestionassignments ) )
return kwargs
def getassignments(q, i, assignments, focus=None):
l = len(q)
while i < l:
if q.kw(i, ('id','set','subset','annotator','class','n')):
if q[i+1] == 'NONE':
assignments[q[i]] = None
else:
assignments[q[i]] = q[i+1]
i+=2
elif q.kw(i,'confidence'):
if q[i+1] == 'NONE':
assignments[q[i]] = None
else:
try:
assignments[q[i]] = float(q[i+1])
except:
raise SyntaxError("Invalid value for confidence: " + str(q[i+1]))
i+=2
elif q.kw(i,'annotatortype'):
if q[i+1] == "auto":
assignments[q[i]] = folia.AnnotatorType.AUTO
elif q[i+1] == "manual":
assignments[q[i]] = folia.AnnotatorType.MANUAL
elif q[i+1] == "NONE":
assignments[q[i]] = None
else:
raise SyntaxError("Invalid value for annotatortype: " + str(q[i+1]))
i+=2
elif q.kw(i,('text','value','phon')):
if focus is not None and focus.Class in (folia.TextContent, folia.Description, folia.Comment):
key = 'value'
elif focus is not None and focus.Class is folia.PhonContent:
key = 'phon'
else:
key = 'text'
assignments[key] = q[i+1]
i+=2
elif q.kw(i, 'datetime'):
if q[i+1] == "now":
assignments[q[i]] = datetime.datetime.now()
elif q[i+1] == "NONE":
assignments[q[i]] = None
elif q[i+1].isdigit():
try:
assignments[q[i]] = datetime.datetime.fromtimestamp(int(q[i+1]))
except:
raise SyntaxError("Unable to parse datetime: " + str(q[i+1]))
else:
try:
assignments[q[i]] = datetime.datetime.strptime(q[i+1], "%Y-%m-%dT%H:%M:%S")
except:
raise SyntaxError("Unable to parse datetime: " + str(q[i+1]))
i += 2
else:
if not assignments:
raise SyntaxError("Expected assignments after WITH statement, but no valid attribute found, got " + str(q[i]) + " at position " + str(i) + " in: " + str(q))
break
return i
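#Illustration (hypothetical fragment): for the tokens following a WITH keyword,
#  class "np" confidence 0.8 annotator "john"
#getassignments() advances i past them and fills assignments as
#  {'class': 'np', 'confidence': 0.8, 'annotator': 'john'}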
class Action(object): #Action expression
def __init__(self, action, focus, assignments={}):
self.action = action
self.focus = focus #Selector
self.assignments = assignments
self.form = None
self.subactions = []
self.nextaction = None
self.span = None #encodes an extra SPAN/RESPAN action
@staticmethod
def parse(q,i=0):
if q.kw(i, ('SELECT','EDIT','DELETE','ADD','APPEND','PREPEND','SUBSTITUTE')):
action = q[i]
else:
raise SyntaxError("Expected action, got " + str(q[i]) + " in: " + str(q))
assignments = {}
i += 1
if (action in ('SUBSTITUTE','APPEND','PREPEND')) and (isinstance(q[i],UnparsedQuery)):
focus = None #We have a SUBSTITUTE/APPEND/PREPEND (AS CORRECTION) expression
elif (action == 'SELECT') and q.kw(i,('FOR','IN')): #select statement without focus, pure target
focus = None
else:
focus, i = Selector.parse(q,i)
if action == "ADD" and focus.filter:
raise SyntaxError("Focus has WHERE statement but ADD action does not support this")
if q.kw(i,"WITH"):
if action in ("SELECT", "DELETE"):
raise SyntaxError("Focus has WITH statement but " + action + " does not support this: " +str(q))
i += 1
i = getassignments(q,i ,assignments, focus)
#we have enough to set up the action now
action = Action(action, focus, assignments)
if action.action in ("EDIT","ADD", "APPEND","PREPEND") and q.kw(i,("RESPAN","SPAN")):
action.span, i = Span.parse(q,i+1)
if action.action == "DELETE" and q.kw(i,("RESTORE")):
action.restore = q[i+1].upper()
i += 2
else:
action.restore = None
done = False
while not done:
if isinstance(q[i], UnparsedQuery):
#we have a sub expression
if q[i].kw(0, ('EDIT','DELETE','ADD')):
#It's a sub-action!
if action.action in ("DELETE"):
raise SyntaxError("Subactions are not allowed for action " + action.action + ", in: " + str(q))
subaction, _ = Action.parse(q[i])
action.subactions.append( subaction )
elif q[i].kw(0, 'AS'):
if q[i].kw(1, "ALTERNATIVE"):
action.form,_ = Alternative.parse(q[i])
elif q[i].kw(1, "CORRECTION") or (q[i].kw(1,"BARE") and q[i].kw(2, "CORRECTION")):
action.form,_ = Correction.parse(q[i],0,action.focus)
else:
raise SyntaxError("Invalid keyword after AS: " + str(q[i][1]))
i+=1
else:
done = True
if q.kw(i, ('SELECT','EDIT','DELETE','ADD','APPEND','PREPEND','SUBSTITUTE')):
#We have another action!
action.nextaction, i = Action.parse(q,i)
return action, i
def __call__(self, query, contextselector, debug=False):
"""Returns a list focusselection after having performed the desired action on each element therein"""
#contextselector is a two-tuple function recipe (f,args), so we can reobtain the generator which it returns
#select all focus elements, not lazy because we are going to return them all by definition anyway
if debug: print("[FQL EVALUATION DEBUG] Action - Preparing to evaluate action chain starting with ", self.action,file=sys.stderr)
#handles all actions further in the chain, not just this one!!! This actual method is only called once
actions = [self]
a = self
while a.nextaction:
actions.append(a.nextaction)
a = a.nextaction
if len(actions) > 1:
#multiple actions to perform, apply contextselector once and load in memory (will be quicker at higher memory cost, proportionate to the target selection size)
if isinstance(contextselector, tuple) and len(contextselector) == 2:
contextselector = list(contextselector[0](*contextselector[1]))
focusselection_all = []
constrainedtargetselection_all = []
for action in actions:
if action.action != "SELECT" and action.focus:
#check if set is declared, if not, auto-declare
if debug: print("[FQL EVALUATION DEBUG] Action - Auto-declaring ",action.focus.Class.__name__, " of ", str(action.focus.set),file=sys.stderr)
action.focus.autodeclare(query.doc)
if action.form and isinstance(action.form, Correction) and action.focus:
if debug: print("[FQL EVALUATION DEBUG] Action - Auto-declaring ",action.focus.Class.__name__, " of ", str(action.focus.set),file=sys.stderr)
action.form.autodeclare(query.doc)
substitution = {}
if self.action == 'SUBSTITUTE' and not self.focus and self.form:
#we have a SUBSTITUTE (AS CORRECTION) statement with no correction but only suggestions
#defer substitute to form
result = self.form.substitute(query, None, contextselector, debug)
focusselection = [result]
constrainedtargetselection = []
#(no further chaining possible in this setup)
elif self.action == 'PREPEND' and not self.focus and self.form:
#we have a PREPEND (AS CORRECTION) statement with no correction but only suggestions
#defer prepend to form
result = self.form.prepend(query, None, contextselector, debug)
focusselection = [result]
constrainedtargetselection = []
#(no further chaining possible in this setup)
elif self.action == 'APPEND' and not self.focus and self.form:
#we have an APPEND (AS CORRECTION) statement with no correction but only suggestions
#defer append to form
result = self.form.append(query, None, contextselector, debug)
focusselection = [result]
constrainedtargetselection = []
#(no further chaining possible in this setup)
else:
for action in actions:
if debug: print("[FQL EVALUATION DEBUG] Action - Evaluating action ", action.action,file=sys.stderr)
focusselection = []
constrainedtargetselection = [] #selecting focus elements constrains the target selection
processed_form = []
if substitution and action.action != "SUBSTITUTE":
raise QueryError("SUBSTITUTE can not be chained with " + action.action)
if action.action == "SELECT" and not action.focus: #SELECT without focus, pure target-select
if isinstance(contextselector, tuple) and len(contextselector) == 2:
for e in contextselector[0](*contextselector[1]):
constrainedtargetselection.append(e)
focusselection.append(e)
else:
for e in contextselector:
constrainedtargetselection.append(e)
focusselection.append(e)
elif action.action not in ("ADD","APPEND","PREPEND"): #only for actions that operate on an existing focus
if contextselector is query.doc and action.focus.Class in ('ALL',folia.Text):
focusselector = ( (x,x) for x in query.doc ) #Patch to make root-level SELECT ALL work as intended
else:
strict = query.targets and query.targets.strict
focusselector = action.focus(query,contextselector, not strict, debug)
if debug: print("[FQL EVALUATION DEBUG] Action - Obtaining focus...",file=sys.stderr)
for focus, target in focusselector:
if target and action.action != "SUBSTITUTE":
if isinstance(target, SpanSet):
if not target.partof(constrainedtargetselection):
if debug: print("[FQL EVALUATION DEBUG] Action - Got target result (spanset), adding ", repr(target),file=sys.stderr)
constrainedtargetselection.append(target)
elif not any(x is target for x in constrainedtargetselection):
if debug: print("[FQL EVALUATION DEBUG] Action - Got target result, adding ", repr(target),file=sys.stderr)
constrainedtargetselection.append(target)
if action.form and action.action != "SUBSTITUTE":
#Delegate action to form (= correction or alternative)
if not any(x is focus for x in processed_form):
if debug: print("[FQL EVALUATION DEBUG] Action - Got focus result, processing using form ", repr(focus),file=sys.stderr)
processed_form.append(focus)
focusselection += list(action.form(query, action,focus,target,debug))
else:
if debug: print("[FQL EVALUATION DEBUG] Action - Focus result already obtained, skipping... ", repr(focus),file=sys.stderr)
continue
else:
if isinstance(focus,SpanSet):
if not focus.partof(focusselection):
if debug: print("[FQL EVALUATION DEBUG] Action - Got focus result (spanset), adding ", repr(target),file=sys.stderr)
focusselection.append(target)
else:
if debug: print("[FQL EVALUATION DEBUG] Action - Focus result (spanset) already obtained, skipping... ", repr(target),file=sys.stderr)
continue
elif not any(x is focus for x in focusselection):
if debug: print("[FQL EVALUATION DEBUG] Action - Got focus result, adding ", repr(focus),file=sys.stderr)
focusselection.append(focus)
else:
if debug: print("[FQL EVALUATION DEBUG] Action - Focus result already obtained, skipping... ", repr(focus),file=sys.stderr)
continue
if action.action == "EDIT":
if debug: print("[FQL EVALUATION DEBUG] Action - Applying EDIT to focus ", repr(focus),file=sys.stderr)
for attr, value in action.assignments.items():
if attr in ("text","value","phon"):
if isinstance(focus, (folia.Description, folia.Comment, folia.Content)):
if debug: print("[FQL EVALUATION DEBUG] Action - setting value ("+ value+ ") on focus ", repr(focus),file=sys.stderr)
focus.value = value
elif isinstance(focus, (folia.PhonContent)):
if debug: print("[FQL EVALUATION DEBUG] Action - setphon("+ value+ ") on focus ", repr(focus),file=sys.stderr)
focus.setphon(value)
else:
if debug: print("[FQL EVALUATION DEBUG] Action - settext("+ value+ ") on focus ", repr(focus),file=sys.stderr)
focus.settext(value)
elif attr == "class":
if debug: print("[FQL EVALUATION DEBUG] Action - " + attr + " = " + value + " on focus ", repr(focus),file=sys.stderr)
focus.cls = value
else:
if debug: print("[FQL EVALUATION DEBUG] Action - " + attr + " = " + value + " on focus ", repr(focus),file=sys.stderr)
setattr(focus, attr, value)
if action.span is not None: #respan
if not isinstance(focus, folia.AbstractSpanAnnotation): raise QueryError("Can only perform RESPAN on span annotation elements!")
spanset = next(action.span(query, contextselector, True, debug)) #there can be only one
focus.setspan(*spanset)
query._touch(focus)
elif action.action == "DELETE":
if debug: print("[FQL EVALUATION DEBUG] Action - Applying DELETE to focus ", repr(focus),file=sys.stderr)
p = focus.parent
if action.restore == "ORIGINAL":
index = p.getindex(focus, False, False)
if not isinstance(focus, folia.Correction):
raise QueryError("RESTORE ORIGINAL can only be performed when the focus is a correction")
#restore original
for original in reversed(focus.original()):
if debug: print("[FQL EVALUATION DEBUG] Action - Restoring original: ", repr(original),file=sys.stderr)
original.parent = p
p.insert(index, original)
p.remove(focus)
#we set the parent back on the element we return, so return types like ancestor-focus work
focus.parent = p
elif action.action == "SUBSTITUTE":
if debug: print("[FQL EVALUATION DEBUG] Action - Applying SUBSTITUTE to target ", repr(focus),file=sys.stderr)
if not isinstance(target,SpanSet) or not target: raise QueryError("SUBSTITUTE requires a target SPAN")
focusselection.remove(focus)
if not substitution:
#this is the first SUBSTITUTE in a chain
prev = target[0].parent
for e in target[1:]:
if e.parent != prev:
raise QueryError("SUBSTITUTE can only be performed when the target items share the same parent")
substitution['parent'] = target[0].parent
substitution['index'] = 0
substitution['span'] = target
substitution['new'] = []
#find insertion index:
for i, e in enumerate(target[0].parent):
if e is target[0]:
substitution['index'] = i
break
substitution['new'].append( (action.focus.Class, action.assignments, action.subactions) )
if action.action in ("ADD","APPEND","PREPEND") or (action.action == "EDIT" and not focusselection):
if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " to targets",file=sys.stderr)
if not action.focus.Class:
raise QueryError("Focus of action has no class!")
isspan = issubclass(action.focus.Class, folia.AbstractSpanAnnotation)
isspanrole = issubclass(action.focus.Class, folia.AbstractSpanRole)
if 'set' not in action.assignments and action.focus.Class not in (folia.Description, folia.Comment, folia.Feature) and not isspanrole:
if action.focus.set and action.focus.set != "undefined":
action.assignments['set'] = action.focus.set
elif action.focus.Class.XMLTAG in query.defaultsets:
action.assignments['set'] = action.focus.set = query.defaultsets[action.focus.Class.XMLTAG]
else:
action.assignments['set'] = action.focus.set = query.doc.defaultset(action.focus.Class)
if isinstance(contextselector, tuple) and len(contextselector) == 2:
targetselection = contextselector[0](*contextselector[1])
else:
targetselection = contextselector
for target in targetselection:
if action.form:
#Delegate action to form (= correction or alternative)
focusselection += list( action.form(query, action,None,target,debug) )
else:
if isinstance(target, SpanSet):
if action.action == "ADD" or action.action == "EDIT":
if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ + " to target spanset " + repr(target),file=sys.stderr)
if action.span is not None and len(action.span) == 0:
action.assignments['emptyspan'] = True
focusselection.append( target[0].add(action.focus.Class, *target, **action.assignments) ) #handles span annotation too
query._touch(focusselection[-1])
else:
if action.action == "ADD" or action.action == "EDIT":
if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ + " to target " + repr(target),file=sys.stderr)
focusselection.append( target.add(action.focus.Class, **action.assignments) ) #handles span annotation too
query._touch(focusselection[-1])
elif action.action == "APPEND":
if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ +" to target " + repr(target),file=sys.stderr)
index = target.parent.getindex(target)
if index == -1:
raise QueryError("Insertion point for APPEND action not found")
focusselection.append( target.parent.insert(index+1, action.focus.Class, **action.assignments) )
query._touch(focusselection[-1])
elif action.action == "PREPEND":
if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ +" to target " + repr(target),file=sys.stderr)
index = target.parent.getindex(target)
if index == -1:
raise QueryError("Insertion point for PREPEND action not found")
focusselection.append( target.parent.insert(index, action.focus.Class, **action.assignments) )
query._touch(focusselection[-1])
if isinstance(target, SpanSet):
if not target.partof(constrainedtargetselection):
constrainedtargetselection.append(target)
elif not any(x is target for x in constrainedtargetselection):
constrainedtargetselection.append(target)
if focusselection and action.span: #process SPAN keyword (ADD .. SPAN .. FOR .. rather than ADD ... FOR SPAN ..)
if not isspan: raise QueryError("Can only use SPAN with span annotation elements!")
for focus in focusselection:
spanset = next(action.span(query, contextselector, True, debug)) #there can be only one
focus.setspan(*spanset)
if focusselection and action.subactions and not substitution:
for subaction in action.subactions:
#check if set is declared, if not, auto-declare
if debug: print("[FQL EVALUATION DEBUG] Action - Auto-declaring ",action.focus.Class.__name__, " of ", str(action.focus.set),file=sys.stderr)
subaction.focus.autodeclare(query.doc)
if debug: print("[FQL EVALUATION DEBUG] Action - Invoking subaction ", subaction.action,file=sys.stderr)
subaction(query, focusselection, debug ) #note: results of subactions will be silently discarded, they can never select anything
if len(actions) > 1:
#consolidate results:
focusselection_all = []
for e in focusselection:
if isinstance(e, SpanSet):
if not e.partof(focusselection_all):
focusselection_all.append(e)
elif not any(x is e for x in focusselection_all):
focusselection_all.append(e)
constrainedtargetselection_all = []
for e in constrainedtargetselection:
if isinstance(e, SpanSet):
if not e.partof(constrainedtargetselection_all):
constrainedtargetselection_all.append(e)
elif not any(x is e for x in constrainedtargetselection_all):
constrainedtargetselection_all.append(e)
if substitution:
constrainedtargetselection_all = []
constrainedtargetselection = []
if action.form:
result = action.form.substitute(query, substitution, None, debug)
if len(actions) > 1:
focusselection_all.append(result)
else:
focusselection.append(result)
else:
if debug: print("[FQL EVALUATION DEBUG] Action - Substitution - Removing target",file=sys.stderr)
for e in substitution['span']:
substitution['parent'].remove(e)
for i, (Class, assignments, subactions) in enumerate(substitution['new']):
if debug: print("[FQL EVALUATION DEBUG] Action - Substitution - Inserting substitution",file=sys.stderr)
e = substitution['parent'].insert(substitution['index']+i, Class, **assignments)
for subaction in subactions:
subaction.focus.autodeclare(query.doc)
if debug: print("[FQL EVALUATION DEBUG] Action - Invoking subaction (in substitution) ", subaction.action,file=sys.stderr)
subaction(query, [e], debug ) #note: results of subactions will be silently discarded, they can never select anything
if len(actions) > 1:
focusselection_all.append(e)
else:
focusselection.append(e)
if len(actions) > 1:
return focusselection_all, constrainedtargetselection_all
else:
return focusselection, constrainedtargetselection
class Context(object):
def __init__(self):
self.format = "python"
self.returntype = "focus"
self.request = "all"
self.defaults = {}
self.defaultsets = {}
class Query(object):
"""This class represents an FQL query.
Selecting a word with a particular text is done as follows, where ``doc`` is an instance of :class:`pynlpl.formats.folia.Document`::
query = fql.Query('SELECT w WHERE text = "house"')
for word in query(doc):
print(word) #this will be an instance of folia.Word
Regular expression matching can be done using the ``MATCHES`` operator::
query = fql.Query('SELECT w WHERE text MATCHES "^house.*$"')
for word in query(doc):
print(word)
The classes of other annotation types can be easily queried as follows::
query = fql.Query('SELECT w WHERE :pos = "v" AND :lemma = "be"')
for word in query(doc):
print(word)
You can constrain your queries to a particular target selection using the ``FOR`` keyword::
query = fql.Query('SELECT w WHERE text MATCHES "^house.*$" FOR s WHERE text CONTAINS "sell"')
for word in query(doc):
print(word)
This construction also allows you to select the actual annotations. To select all people (a named entity) for words that are not John::
query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John"')
for entity in query(doc):
print(entity) #this will be an instance of folia.Entity
**FOR** statements may be chained, and explicit IDs can be passed using the ``ID`` keyword::
query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John" FOR div ID "section.21"')
for entity in query(doc):
print(entity)
Sets are specified using the **OF** keyword; it can be omitted if there is only one set for the annotation type, but is required otherwise::
query = fql.Query('SELECT su OF "http://some/syntax/set" WHERE class = "np"')
for su in query(doc):
print(su) #this will be an instance of folia.SyntacticUnit
We have covered only the **SELECT** keyword so far; FQL has other keywords for manipulating documents, such as **EDIT**, **ADD**, **APPEND** and **PREPEND**.
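As an illustrative sketch (the word forms here are hypothetical), an **EDIT** action assigns new values using the ``WITH`` keyword::
query = fql.Query('EDIT w WHERE text = "houze" WITH text "house"')
for word in query(doc):
print(word) #this will be the edited instance of folia.Word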
Note:
Consult the FQL documentation at https://github.com/proycon/foliadocserve/blob/master/README.rst for further documentation on the language.
"""
def __init__(self, q, context=Context()):
self.action = None
self.targets = None
self.declarations = []
self.format = context.format
self.returntype = context.returntype
self.request = copy(context.request)
self.defaults = copy(context.defaults)
self.defaultsets = copy(context.defaultsets)
self.parse(q)
def parse(self, q, i=0):
if not isinstance(q,UnparsedQuery):
q = UnparsedQuery(q)
l = len(q)
if q.kw(i,"DECLARE"):
try:
Class = folia.XML2CLASS[q[i+1]]
except:
raise SyntaxError("DECLARE statement expects a FoLiA element, got: " + str(q[i+1]))
if not Class.ANNOTATIONTYPE:
raise SyntaxError("DECLARE statement for undeclarable element type: " + str(q[i+1]))
i += 2
defaults = {}
decset = None
if q.kw(i,"OF") and q[i+1]:
i += 1
decset = q[i]
i += 1
if q.kw(i,"WITH"):
i = getassignments(q,i+1,defaults)
if not decset:
raise SyntaxError("DECLARE statement must state a set")
self.declarations.append( (Class, decset, defaults) )
if i < l:
self.action,i = Action.parse(q,i)
if q.kw(i,("FOR","IN")):
self.targets, i = Target.parse(q,i)
while i < l:
if q.kw(i,"RETURN"):
self.returntype = q[i+1]
i+=2
elif q.kw(i,"FORMAT"):
self.format = q[i+1]
i+=2
elif q.kw(i,"REQUEST"):
self.request = q[i+1].split(",")
i+=2
else:
raise SyntaxError("Unexpected " + str(q[i]) + " at position " + str(i) + " in: " + str(q))
if i != l:
raise SyntaxError("Expected end of query, got " + str(q[i]) + " in: " + str(q))
def __call__(self, doc, wrap=True,debug=False):
"""Execute the query on the specified document"""
self.doc = doc
if debug: print("[FQL EVALUATION DEBUG] Query - Starting on document ", doc.id,file=sys.stderr)
if self.declarations:
for Class, decset, defaults in self.declarations:
if debug: print("[FQL EVALUATION DEBUG] Processing declaration for ", Class.__name__, "of",str(decset),file=sys.stderr)
doc.declare(Class,decset,**defaults)
if self.action:
targetselector = doc
if self.targets and not (isinstance(self.targets.targets[0], Selector) and self.targets.targets[0].Class in ("ALL", folia.Text)):
targetselector = (self.targets, (self, targetselector, True, debug)) #function recipe to get the generator for the targets, (f, *args) (first is always recursive)
focusselection, targetselection = self.action(self, targetselector, debug) #selecting focus elements further constrains the target selection (if any), return values will be lists
if self.returntype == "nothing":
return ""
elif self.returntype == "focus":
responseselection = focusselection
elif self.returntype == "target" or self.returntype == "inner-target":
responseselection = []
for e in targetselection:
if not any(x is e for x in responseselection): #filter out duplicates
responseselection.append(e)
elif self.returntype == "outer-target":
raise NotImplementedError
elif self.returntype == "ancestor" or self.returntype == "ancestor-focus":
responseselection = []
try:
responseselection.append( next(folia.commonancestors(folia.AbstractStructureElement,*focusselection)) )
except StopIteration:
raise QueryError("No ancestors found for focus: " + str(repr(focusselection)))
elif self.returntype == "ancestor-target":
elems = []
for e in targetselection:
if isinstance(e, SpanSet):
elems += e
else:
elems.append(e)
responseselection = []
try:
responseselection.append( next(folia.commonancestors(folia.AbstractStructureElement,*elems)) )
except StopIteration:
raise QueryError("No ancestors found for targets: " + str(repr(targetselection)))
else:
raise QueryError("Invalid return type: " + self.returntype)
else:
responseselection = []
if self.returntype == "nothing": #we're done
return ""
#convert response selection to proper format and return
if self.format.startswith('single'):
if len(responseselection) > 1:
raise QueryError("A single response was expected, but multiple are returned")
if self.format == "single-xml":
if debug: print("[FQL EVALUATION DEBUG] Query - Returning single-xml",file=sys.stderr)
if not responseselection:
return ""
else:
if isinstance(responseselection[0], SpanSet):
r = "\n"
for e in responseselection[0]:
r += e.xmlstring(True)
r += "\n"
return r
else:
return responseselection[0].xmlstring(True)
elif self.format == "single-json":
if debug: print("[FQL EVALUATION DEBUG] Query - Returning single-json",file=sys.stderr)
if not responseselection:
return "null"
else:
return json.dumps(responseselection[0].json())
elif self.format == "single-python":
if debug: print("[FQL EVALUATION DEBUG] Query - Returning single-python",file=sys.stderr)
if not responseselection:
return None
else:
return responseselection[0]
else:
if self.format == "xml":
if debug: print("[FQL EVALUATION DEBUG] Query - Returning xml",file=sys.stderr)
if not responseselection:
if wrap:
return "<results></results>" #note: wrapper tag names in this block are assumed
else:
return ""
else:
if wrap:
r = "<results>\n"
else:
r = ""
for e in responseselection:
if isinstance(e, SpanSet):
r += "<spanset>\n"
for e2 in e:
r += "<result>" + e2.xmlstring(True) + "</result>\n"
r += "</spanset>\n"
else:
r += "<result>\n" + e.xmlstring(True) + "</result>\n"
if wrap:
r += "</results>\n"
return r
elif self.format == "json":
if debug: print("[FQL EVALUATION DEBUG] Query - Returning json",file=sys.stderr)
if not responseselection:
if wrap:
return "[]"
else:
return ""
else:
if wrap:
s = "[ "
else:
s = ""
for e in responseselection:
if isinstance(e, SpanSet):
s += json.dumps([ e2.json() for e2 in e ] ) + ", "
else:
s += json.dumps(e.json()) + ", "
s = s.strip(", ")
if wrap:
s += "]"
return s
else: #python and undefined formats
if debug: print("[FQL EVALUATION DEBUG] Query - Returning python",file=sys.stderr)
return responseselection
return QueryError("Invalid format: " + self.format)
def _touch(self, *args):
for e in args:
if isinstance(e, folia.AbstractElement):
e.changedbyquery = self
self._touch(*e.data)
PyNLPl-1.2.9/pynlpl/formats/giza.py 0000664 0001750 0000144 00000024463 12463674253 020035 0 ustar proycon users 0000000 0000000 # -*- coding: utf-8 -*-
###############################################################
# PyNLPl - WordAlignment Library for reading GIZA++ A3 files
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# In part using code by Sander Canisius
#
# Licensed under GPLv3
#
#
# This library reads GIZA++ A3 files. It contains three classes over which
# you can iterate to obtain (sourcewords,targetwords,alignment) pairs.
#
# - WordAlignment - Reads target-source.A3.final files, in which each source word is aligned to one target word
# - MultiWordAlignment - Reads source-target.A3.final files, in which each source word may be aligned to multiple target words
# - IntersectionAlignment - Computes the intersection between the above two alignments
#
#
###############################################################
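#
# Minimal usage sketch (illustrative; assumes a GIZA++ output file named
# "target-source.A3.final" is present):
#
#   from pynlpl.formats.giza import WordAlignment
#   for sourcewords, targetwords, alignment in WordAlignment("target-source.A3.final", encoding="utf-8"):
#       print(sourcewords, targetwords, alignment)
#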
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u
import bz2
import gzip
import copy
import io
from sys import stderr
class GizaSentenceAlignment(object):
def __init__(self, sourceline, targetline, index):
self.index = index
self.alignment = []
if sourceline:
self.source = self._parsesource(sourceline.strip())
else:
self.source = []
self.target = targetline.strip().split(' ')
def _parsesource(self, line):
cleanline = ""
inalignment = False
begin = 0
sourceindex = 0
for i in range(0,len(line)):
if line[i] == ' ' or i == len(line) - 1:
if i == len(line) - 1:
offset = 1
else:
offset = 0
word = line[begin:i+offset]
if word == '})':
inalignment = False
begin = i + 1
continue
elif word == "({":
inalignment = True
begin = i + 1
continue
if word.strip() and word != 'NULL':
if not inalignment:
sourceindex += 1
if cleanline: cleanline += " "
cleanline += word
else:
targetindex = int(word)
self.alignment.append( (sourceindex-1, targetindex-1) )
begin = i + 1
return cleanline.split(' ')
def intersect(self,other):
if other.target != self.source:
print("GizaSentenceAlignment.intersect(): Mismatch between self.source and other.target: " + repr(self.source) + " -- vs -- " + repr(other.target),file=stderr)
return None
intersection = copy.copy(self)
intersection.alignment = []
for sourceindex, targetindex in self.alignment:
for targetindex2, sourceindex2 in other.alignment:
if targetindex2 == targetindex and sourceindex2 == sourceindex:
intersection.alignment.append( (sourceindex, targetindex) )
return intersection
def __repr__(self):
s = " ".join(self.source)+ " ||| "
s += " ".join(self.target) + " ||| "
for S,T in sorted(self.alignment):
s += self.source[S] + "->" + self.target[T] + " ; "
return s
def getalignedtarget(self, index):
"""Returns target range only if source index aligns to a single consecutive range of target tokens."""
targetindices = []
target = None
foundindex = -1
for sourceindex, targetindex in self.alignment:
if sourceindex == index:
targetindices.append(targetindex)
if len(targetindices) > 1:
for i in range(1,len(targetindices)):
if abs(targetindices[i] - targetindices[i-1]) != 1:
break # not consecutive
foundindex = (min(targetindices), max(targetindices))
target = ' '.join(self.target[min(targetindices):max(targetindices)+1])
elif targetindices:
foundindex = targetindices[0]
target = self.target[foundindex]
return target, foundindex
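#Illustration (hypothetical data): if source "het huis" is aligned to target
#"the house" with alignment [(0,0), (1,1)], then getalignedtarget(1) returns
#("house", 1); a source word aligned to several consecutive target tokens
#returns the joined tokens and a (min, max) index tuple instead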
class GizaModel(object):
def __init__(self, filename, encoding= 'utf-8'):
if filename.split(".")[-1] == "bz2":
self.f = bz2.BZ2File(filename,'r')
elif filename.split(".")[-1] == "gz":
self.f = gzip.GzipFile(filename,'r')
else:
self.f = io.open(filename,'r',encoding=encoding)
self.nextlinebuffer = None
def __iter__(self):
self.f.seek(0)
nextlinebuffer = u(next(self.f))
sentenceindex = 0
done = False
while not done:
sentenceindex += 1
line = nextlinebuffer
if line[0] != '#':
raise Exception("Error parsing GIZA++ Alignment at sentence " + str(sentenceindex) + ", expected new fragment, found: " + repr(line))
targetline = u(next(self.f))
sourceline = u(next(self.f))
yield GizaSentenceAlignment(sourceline, targetline, sentenceindex)
try:
nextlinebuffer = u(next(self.f))
except StopIteration:
done = True
def __del__(self):
if self.f: self.f.close()
#------------------ OLD -------------------
def parseAlignment(tokens): #by Sander Canisius
assert tokens.pop(0) == "NULL"
while tokens.pop(0) != "})":
pass
while tokens:
word = tokens.pop(0)
assert tokens.pop(0) == "({"
positions = []
token = tokens.pop(0)
while token != "})":
positions.append(int(token))
token = tokens.pop(0)
yield word, positions
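#Illustration (hypothetical A3 alignment line):
#  parseAlignment('NULL ({ }) the ({ 1 }) house ({ 2 3 })'.split())
#yields ("the", [1]) and then ("house", [2, 3])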
class WordAlignment:
"""Target to Source alignment: reads target-source.A3.final files, in which each source word is aligned to one target word"""
def __init__(self,filename, encoding=False):
"""Open a target-source GIZA++ A3 file. The file may be bzip2 compressed. If an encoding is specified, proper unicode strings will be returned"""
if filename.split(".")[-1] == "bz2":
self.stream = bz2.BZ2File(filename,'r')
else:
self.stream = open(filename)
self.encoding = encoding
def __del__(self):
self.stream.close()
def __iter__(self): #by Sander Canisius
line = self.stream.readline()
while line:
assert line.startswith("#")
src = self.stream.readline().split()
trg = []
alignment = [None for i in range(len(src))]
for i, (targetWord, positions) in enumerate(parseAlignment(self.stream.readline().split())):
trg.append(targetWord)
for pos in positions:
assert alignment[pos - 1] is None
alignment[pos - 1] = i
if self.encoding:
yield [ u(w,self.encoding) for w in src ], [ u(w,self.encoding) for w in trg ], alignment
else:
yield src, trg, alignment
line = self.stream.readline()
def targetword(self, index, targetwords, alignment):
"""Return the aligned targetword for a specified index in the source words"""
if alignment[index] is not None:
return targetwords[alignment[index]]
else:
return None
def reset(self):
self.stream.seek(0)
class MultiWordAlignment:
"""Source to Target alignment: reads source-target.A3.final files, in which each source word may be aligned to multiple target words (adapted from code by Sander Canisius)"""
def __init__(self,filename, encoding = False):
"""Load a target-source GIZA++ A3 file. The file may be bzip2 compressed. If an encoding is specified, proper unicode strings will be returned"""
if filename.split(".")[-1] == "bz2":
self.stream = bz2.BZ2File(filename,'r')
else:
self.stream = open(filename)
self.encoding = encoding
def __del__(self):
self.stream.close()
def __iter__(self):
line = self.stream.readline()
while line:
assert line.startswith("#")
trg = self.stream.readline().split()
src = []
alignment = []
for i, (word, positions) in enumerate(parseAlignment(self.stream.readline().split())):
src.append(word)
alignment.append( [ p - 1 for p in positions ] )
if self.encoding:
yield [ u(w,self.encoding) for w in src ], [ u(w,self.encoding) for w in trg ], alignment
else:
yield src, trg, alignment
line = self.stream.readline()
def targetword(self, index, targetwords, alignment):
"""Return the aligned targeword for a specified index in the source words. Multiple words are concatenated together with a space in between"""
return " ".join(targetwords[alignment[index]])
def targetwords(self, index, targetwords, alignment):
"""Return the aligned targetwords for a specified index in the source words"""
return [ targetwords[x] for x in alignment[index] ]
def reset(self):
self.stream.seek(0)
class IntersectionAlignment:
def __init__(self,source2target,target2source,encoding=False):
self.s2t = MultiWordAlignment(source2target, encoding)
self.t2s = WordAlignment(target2source, encoding)
self.encoding = encoding
def __iter__(self):
for (src, trg, alignment), (revsrc, revtrg, revalignment) in zip(self.s2t,self.t2s): #will take unnecessary memory in Python 2.x, optimal in Python 3
if src != revsrc or trg != revtrg:
raise Exception("Files are not identical!")
else:
#keep only those alignments that are present in both
intersection = []
for i, x in enumerate(alignment):
if revalignment[i] in x:
intersection.append(revalignment[i])
else:
intersection.append(None)
yield src, trg, intersection
def reset(self):
self.s2t.reset()
self.t2s.reset()
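#Minimal usage sketch (illustrative file names): iterate over the intersection
#of both alignment directions; each iteration yields (src, trg, intersection)
#where intersection[i] is the target word index aligned to source word i, or
#None if the two directions disagree:
#
#   aligner = IntersectionAlignment("source-target.A3.final", "target-source.A3.final", encoding="utf-8")
#   for src, trg, intersection in aligner:
#       print(src, trg, intersection)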
PyNLPl-1.2.9/pynlpl/formats/imdi.py 0000664 0001750 0000144 00000174772 12201265173 020021 0 ustar proycon users 0000000 0000000 RELAXNG_IMDI = """
[The RELAXNG schema for IMDI metadata was mangled during extraction: all XML
markup was stripped, leaving only the schema's human-readable annotation
strings. The schema covers IMDI sessions, projects, content descriptions,
actors, media files, written resources, lexical resources, lexicon
components, sources, anonyms, external documentation, and the accompanying
vocabulary, date, age, language and quality value types.]
"""
PyNLPl-1.2.9/pynlpl/formats/moses.py 0000664 0001750 0000144 00000017312 12213555600 020207 0 ustar proycon users 0000000 0000000 ###############################################################
# PyNLPl - Moses formats
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Licensed under GPLv3
#
# This is a Python library with classes and functions for
# reading file-formats produced by Moses. Currently
# contains only a class for reading a Moses PhraseTable.
# (migrated to pynlpl from pbmbmt)
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u
import sys
import bz2
import gzip
import datetime
import socket
import io
try:
from twisted.internet import protocol, reactor #No Python 3 support yet :(
from twisted.protocols import basic
twistedimported = True
except ImportError:
print("WARNING: Twisted could not be imported",file=sys.stderr)
twistedimported = False
class PhraseTable(object):
def __init__(self,filename, quiet=False, reverse=False, delimiter="|||", score_column = 3, max_sourcen = 0,sourceencoder=None, targetencoder=None, scorefilter=None):
"""Load a phrase table from file into memory (memory intensive!)"""
self.phrasetable = {}
self.sourceencoder = sourceencoder
self.targetencoder = targetencoder
if filename.split(".")[-1] == "bz2":
f = bz2.BZ2File(filename,'r')
elif filename.split(".")[-1] == "gz":
f = gzip.GzipFile(filename,'r')
else:
f = io.open(filename,'r',encoding='utf-8')
linenum = 0
prevsource = None
targets = []
while True:
if not quiet:
linenum += 1
if (linenum % 100000) == 0:
print("Loading phrase-table: @%d" % linenum, "\t(" + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + ")",file=sys.stderr)
line = u(f.readline())
if not line:
break
#split into (trimmed) segments
segments = [ segment.strip() for segment in line.split(delimiter) ]
if len(segments) < 3:
print("Invalid line: ", line, file=sys.stderr)
continue
#Do we have a score associated?
if score_column > 0 and len(segments) >= score_column:
scores = tuple( ( float(x) for x in segments[score_column-1].strip().split() ) )
else:
scores = tuple()
#if align2_column > 0:
# try:
# null_alignments = segments[align2_column].count("()")
# except:
# null_alignments = 0
#else:
# null_alignments = 0
if scorefilter:
if not scorefilter(scores): continue
if reverse:
if max_sourcen > 0 and segments[1].count(' ') + 1 > max_sourcen:
continue
if self.sourceencoder:
source = self.sourceencoder(segments[1]) #tuple(segments[1].split(" "))
else:
source = segments[1]
if self.targetencoder:
target = self.targetencoder(segments[0]) #tuple(segments[0].split(" "))
else:
target = segments[0]
else:
if max_sourcen > 0 and segments[0].count(' ') + 1 > max_sourcen:
continue
if self.sourceencoder:
source = self.sourceencoder(segments[0]) #tuple(segments[0].split(" "))
else:
source = segments[0]
if self.targetencoder:
target = self.targetencoder(segments[1]) #tuple(segments[1].split(" "))
else:
target = segments[1]
if prevsource and source != prevsource and targets:
self.phrasetable[prevsource] = tuple(targets)
targets = []
targets.append( (target,scores) )
prevsource = source
#don't forget last one:
if prevsource and targets:
self.phrasetable[prevsource] = tuple(targets)
f.close()
def __contains__(self, phrase):
"""Query if a certain phrase exist in the phrase table"""
if self.sourceencoder: phrase = self.sourceencoder(phrase)
return (phrase in self.phrasetable)
#d = self.phrasetable
#for word in phrase:
# if not word in d:
# return False
# d = d[word
#return ("" in d)
def __iter__(self):
for phrase, targets in self.phrasetable.items():
yield phrase, targets
def __len__(self):
return len(self.phrasetable)
def __bool__(self):
return bool(self.phrasetable)
def __getitem__(self, phrase): #same as translations
"""Return a list of (translation, scores) tuples"""
if self.sourceencoder: phrase = self.sourceencoder(phrase)
return self.phrasetable[phrase]
#d = self.phrasetable
#for word in phrase:
# if not word in d:
# raise KeyError
# d = d[word]
#if "" in d:
# return d[""]
#else:
# raise KeyError
if twistedimported:
class PTProtocol(basic.LineReceiver):
def lineReceived(self, phrase):
try:
#entries are (target, scores) tuples as stored by PhraseTable above
for target, scores in self.factory.phrasetable[phrase]:
self.sendLine(target + "\t" + "\t".join(str(score) for score in scores))
except KeyError:
self.sendLine("NOTFOUND")
class PTFactory(protocol.ServerFactory):
protocol = PTProtocol
def __init__(self, phrasetable):
self.phrasetable = phrasetable
class PhraseTableServer(object):
def __init__(self, phrasetable, port=65432):
reactor.listenTCP(port, PTFactory(phrasetable))
reactor.run()
class PhraseTableClient(object):
def __init__(self,host= "localhost",port=65432):
self.BUFSIZE = 4048
self.socket = socket.socket(socket.AF_INET,socket.SOCK_STREAM) #Create the socket
self.socket.settimeout(120)
self.socket.connect((host, port)) #Connect to server
self.lastresponse = ""
self.lastquery = ""
def __getitem__(self, phrase):
solutions = []
if phrase != self.lastquery:
self.socket.send(phrase+ "\r\n")
data = b""
while not data or data[-1:] != b'\n': #slice keeps the comparison bytes-to-bytes on Python 3
data += self.socket.recv(self.BUFSIZE)
else:
data = self.lastresponse
data = u(data)
for line in data.split('\n'):
line = line.strip('\r\n')
if line == "NOTFOUND":
raise KeyError(phrase)
elif line:
fields = tuple(line.split("\t"))
if len(fields) == 4:
solutions.append( fields )
else:
print >>sys.stderr,"PHRASETABLECLIENT WARNING: Unable to parse response line"
self.lastresponse = data
self.lastquery = phrase
return solutions
def __contains__(self, phrase):
self.socket.send(phrase.encode('utf-8') + b"\r\n")
data = b""
while not data or data[-1:] != b'\n': #slice keeps the comparison bytes-to-bytes on Python 3
data += self.socket.recv(self.BUFSIZE)
data = u(data)
for line in data.split('\n'):
line = line.strip('\r\n')
if line == "NOTFOUND":
return False
self.lastresponse = data
self.lastquery = phrase
return True
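# Usage sketch (file name and phrase are hypothetical): load a Moses phrase
# table and query translations for a source phrase. Note that loading the
# whole table is memory intensive.
if __name__ == "__main__":
    table = PhraseTable("phrase-table.gz", quiet=True)
    if "het huis" in table:
        for translation, scores in table["het huis"]:
            print(translation, scores)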
PyNLPl-1.2.9/pynlpl/formats/sonar.py 0000664 0001750 0000144 00000023742 12441561134 020211 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - Simple Read library for D-Coi/SoNaR format
# by Maarten van Gompel, ILK, Universiteit van Tilburg
# http://ilk.uvt.nl/~mvgompel
# proycon AT anaproy DOT nl
#
# Licensed under GPLv3
#
# This library facilitates parsing and reading corpora in
# the SoNaR/D-Coi format.
#
#----------------------------------------------------------------
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import io
import re
import glob
import os.path
import sys
from lxml import etree as ElementTree
if sys.version < '3':
from StringIO import StringIO
else:
from io import StringIO
namespaces = {
'dcoi': "http://lands.let.ru.nl/projects/d-coi/ns/1.0",
'standalone':"http://ilk.uvt.nl/dutchsemcor-standalone",
'dsc':"http://ilk.uvt.nl/dutchsemcor",
'xml':"http://www.w3.org/XML/1998/namespace"
}
class CorpusDocument(object):
"""This class represent one document/text of the Corpus (read-only)"""
def __init__(self, filename, encoding = 'iso-8859-15'):
self.filename = filename
self.id = os.path.basename(filename).split(".")[0]
self.f = io.open(filename,'r', encoding=encoding)
self.metadata = {}
def _parseimdi(self,line):
r = re.compile('<imdi:Title>(.*)</imdi:Title>') #reconstructed: the tag markup was lost in extraction
matches = r.findall(line)
if matches:
self.metadata['title'] = matches[0]
if not 'date' in self.metadata:
r = re.compile('<imdi:Date>(.*)</imdi:Date>') #reconstructed: the tag markup was lost in extraction
matches = r.findall(line)
if matches:
self.metadata['date'] = matches[0]
def __iter__(self):
"""Iterate over all words, a four-tuple (word,id,pos,lemma), in the document"""
r = re.compile('<w.*?xml:id="([^"]*)"([^>]*)>(.*?)</w>') #reconstructed: the tag markup was lost in extraction
for line in self.f.readlines():
matches = r.findall(line)
for id, attribs, word in matches:
pos = lemma = None
m = re.findall('pos="([^"]+)"', attribs)
if m: pos = m[0]
m = re.findall('lemma="([^"]+)"', attribs)
if m: lemma = m[0]
yield word, id, pos, lemma
if line.find('imdi:') != -1:
self._parseimdi(line)
def words(self):
#alias
return iter(self)
def sentences(self):
"""Iterate over all sentences (sentence_id, sentence) in the document, sentence is a list of 4-tuples (word,id,pos,lemma)"""
prevp = 0
prevs = 0
sentence = []
sentence_id = ""
for word, id, pos, lemma in iter(self):
try:
doc_id, ptype, p, s, w = re.findall(r'([\w\d-]+)\.(p|head)\.(\d+)\.s\.(\d+)\.w\.(\d+)',id)[0]
if ((p != prevp) or (s != prevs)) and sentence:
yield sentence_id, sentence
sentence = []
sentence_id = doc_id + '.' + ptype + '.' + str(p) + '.s.' + str(s)
prevp = p
except IndexError:
doc_id, s, w = re.findall(r'([\w\d-]+)\.s\.(\d+)\.w\.(\d+)',id)[0]
if s != prevs and sentence:
yield sentence_id, sentence
sentence = []
sentence_id = doc_id + '.s.' + str(s)
sentence.append( (word,id,pos,lemma) )
prevs = s
if sentence:
yield sentence_id, sentence
def paragraphs(self, with_id = False):
"""Extracts paragraphs, returns list of plain-text(!) paragraphs"""
prevp = 0
partext = []
for word, id, pos, lemma in iter(self):
doc_id, ptype, p, s, w = re.findall(r'([\w\d-]+)\.(p|head)\.(\d+)\.s\.(\d+)\.w\.(\d+)',id)[0]
if prevp != p and partext:
yield ( doc_id + "." + ptype + "." + prevp , " ".join(partext) )
partext = []
partext.append(word)
prevp = p
if partext:
yield (doc_id + "." + ptype + "." + prevp, " ".join(partext) )
class Corpus:
def __init__(self,corpusdir, extension = 'pos', restrict_to_collection = "", conditionf=lambda x: True, ignoreerrors=False):
self.corpusdir = corpusdir
self.extension = extension
self.restrict_to_collection = restrict_to_collection
self.conditionf = conditionf
self.ignoreerrors = ignoreerrors
def __iter__(self):
if not self.restrict_to_collection:
for f in glob.glob(self.corpusdir+"/*." + self.extension):
if self.conditionf(f):
try:
yield CorpusDocument(f)
except Exception:
print("Error, unable to parse " + f,file=sys.stderr)
if not self.ignoreerrors:
raise
for d in glob.glob(self.corpusdir+"/*"):
if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)):
for f in glob.glob(d+ "/*." + self.extension):
if self.conditionf(f):
try:
yield CorpusDocument(f)
except Exception:
print("Error, unable to parse " + f,file=sys.stderr)
if not self.ignoreerrors:
raise
#######################################################
def ns(namespace):
"""Resolves the namespace identifier to a full URL"""
global namespaces
return '{'+namespaces[namespace]+'}'
class CorpusFiles(Corpus):
def __iter__(self):
if not self.restrict_to_collection:
for f in glob.glob(self.corpusdir+"/*." + self.extension):
if self.conditionf(f):
yield f
for d in glob.glob(self.corpusdir+"/*"):
if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)):
for f in glob.glob(d+ "/*." + self.extension):
if self.conditionf(f):
yield f
class CorpusX(Corpus):
def __iter__(self):
if not self.restrict_to_collection:
for f in glob.glob(self.corpusdir+"/*." + self.extension):
if self.conditionf(f):
try:
yield CorpusDocumentX(f)
except Exception:
print("Error, unable to parse " + f,file=sys.stderr)
if not self.ignoreerrors:
raise
for d in glob.glob(self.corpusdir+"/*"):
if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)):
for f in glob.glob(d+ "/*." + self.extension):
if self.conditionf(f):
try:
yield CorpusDocumentX(f)
except Exception:
print("Error, unable to parse " + f,file=sys.stderr)
if not self.ignoreerrors:
raise
class CorpusDocumentX:
"""This class represent one document/text of the Corpus, loaded into memory at once and retaining the full structure"""
def __init__(self, filename, tree = None, index=True ):
global namespaces
self.filename = filename
if not tree:
self.tree = ElementTree.parse(self.filename)
self.committed = True
elif isinstance(tree, ElementTree._Element):
self.tree = tree
self.committed = False
#Grab root element and determine if we run inline or standalone
self.root = self.xpath("/dcoi:DCOI")
if self.root:
self.root = self.root[0]
self.inline = True
else:
raise Exception("Not in DCOI/SoNaR format!")
#self.root = self.xpath("/standalone:text")
#self.inline = False
#if not self.root:
# raise FormatError()
#build an index
self.index = {}
if index:
self._index(self.root)
def _index(self,node):
if ns('xml') + 'id' in node.attrib:
self.index[node.attrib[ns('xml') + 'id']] = node
for subnode in node: #TODO: can we do this with xpath instead?
self._index(subnode)
def validate(self, formats_dir="../formats/"):
"""checks if the document is valid"""
#TODO: download XSD from web
if self.inline:
xmlschema = ElementTree.XMLSchema(ElementTree.parse(StringIO("\n".join(open(formats_dir+"dcoi-dsc.xsd").readlines()))))
xmlschema.assertValid(self.tree)
#return xmlschema.validate(self)
else:
xmlschema = ElementTree.XMLSchema(ElementTree.parse(StringIO("\n".join(open(formats_dir+"dutchsemcor-standalone.xsd").readlines()))))
xmlschema.assertValid(self.tree)
#return xmlschema.validate(self)
def xpath(self, expression):
"""Executes an xpath expression using the correct namespaces"""
global namespaces
return self.tree.xpath(expression, namespaces=namespaces)
def __contains__(self, id): #renamed from __exists__ so the 'in' operator actually works
return (id in self.index)
def __getitem__(self, id):
return self.index[id]
def paragraphs(self, node=None):
"""iterate over paragraphs"""
if node is None: node = self
return node.xpath("//dcoi:p")
def sentences(self, node=None):
"""iterate over sentences"""
if node is None: node = self
return node.xpath("//dcoi:s")
def words(self,node=None):
"""iterate over words"""
if node is None: node = self
return node.xpath("//dcoi:w")
def save(self, filename=None, encoding='iso-8859-15'):
if not filename: filename = self.filename
self.tree.write(filename, encoding=encoding, method='xml', pretty_print=True, xml_declaration=True)
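# Usage sketch (the corpus directory is hypothetical): iterate over the
# sentences of every document in a D-Coi/SoNaR corpus.
if __name__ == "__main__":
    for doc in Corpus("/path/to/corpus"):
        for sentence_id, sentence in doc.sentences():
            print(sentence_id, " ".join(word for word, _, _, _ in sentence))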
PyNLPl-1.2.9/pynlpl/formats/taggerdata.py 0000664 0001750 0000144 00000011504 12201265173 021161 0 ustar proycon users 0000000 0000000 #-*- coding:utf-8 -*-
###############################################################
# PyNLPl - Read tagger data
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Licensed under GPLv3
#
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import io
class Taggerdata(object):
def __init__(self,filename, encoding = 'utf-8', mode ='r'):
self.filename = filename
self.encoding = encoding
assert (mode == 'r' or mode == 'w')
self.mode = mode
self.reset()
self.firstiter = True
self.indexed = False
self.writeindex = 0
def __iter__(self):
words = []
lemmas = []
postags = []
for line in self.f:
line = line.strip()
if self.firstiter:
self.indexed = (line == "#0")
self.firstiter = False
if not line and not self.indexed:
yield (words, lemmas, postags)
words = []
lemmas = []
postags = []
elif self.indexed and len(line) > 1 and line[0] == '#' and line[1:].isdigit():
if line != "#0":
yield (words, lemmas, postags)
words = []
lemmas = []
postags = []
elif line:
try:
word, lemma, pos = line.split("\t")
except ValueError:
word = lemma = pos = "NONE"
if word == "NONE": word = None
if lemma == "NONE": lemma = None
if pos == "NONE": pos = None
words.append(word)
lemmas.append(lemma)
postags.append(pos)
if words:
yield (words, lemmas, postags)
def next(self):
words = []
lemmas = []
postags = []
while True:
try:
line = next(self.f).strip() #next() builtin works on both Python 2 and 3 file objects
except StopIteration:
if words:
return (words, lemmas, postags)
else:
raise
if self.firstiter:
self.indexed = (line == "#0")
self.firstiter = False
if not line and not self.indexed:
return (words, lemmas, postags)
elif self.indexed and len(line) > 1 and line[0] == '#' and line[1:].isdigit():
if line != "#0":
return (words, lemmas, postags)
elif line:
try:
word, lemma, pos = line.split("\t")
except ValueError:
word = lemma = pos = "NONE"
if word == "NONE": word = None
if lemma == "NONE": lemma = None
if pos == "NONE": pos = None
words.append(word)
lemmas.append(lemma)
postags.append(pos)
def align(self, referencewords, datatuple):
"""align the reference sentence with the tagged data"""
targetwords = []
for i, (word,lemma,postag) in enumerate(zip(datatuple[0],datatuple[1],datatuple[2])):
if word:
subwords = word.split("_")
for w in subwords: #split multiword expressions
targetwords.append( (w, lemma, postag, i, len(subwords) > 1 ) ) #word, lemma, pos, index, multiword?
referencewords = [ w.lower() for w in referencewords ]
alignment = []
for i, referenceword in enumerate(referencewords):
found = False
best = 0
distance = 999999
for j, (targetword, lemma, pos, index, multiword) in enumerate(targetwords):
if referenceword == targetword and abs(i-j) < distance:
found = True
best = j
distance = abs(i-j)
if found:
alignment.append(targetwords[best])
else:
alignment.append((None,None,None,None,False)) #no alignment found
return alignment
def reset(self):
self.f = io.open(self.filename,self.mode, encoding=self.encoding)
def write(self, sentence):
self.f.write("#" + str(self.writeindex)+"\n")
for word, lemma, pos in sentence:
if not word: word = "NONE"
if not lemma: lemma = "NONE"
if not pos: pos = "NONE"
self.f.write( word + "\t" + lemma + "\t" + pos + "\n" )
self.writeindex += 1
def close(self):
self.f.close()
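# Usage sketch (the file name is hypothetical): read tagger data and print
# each sentence as word/POS pairs; None values indicate missing annotations.
if __name__ == "__main__":
    taggerdata = Taggerdata("tagged.txt")
    for words, lemmas, postags in taggerdata:
        print(" ".join("%s/%s" % (word, pos) for word, pos in zip(words, postags)))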
PyNLPl-1.2.9/pynlpl/formats/timbl.py 0000644 0001750 0000144 00000010745 13022533070 020165 0 ustar proycon users 0000000 0000000 ###############################################################
# PyNLPl - Timbl Classifier Output Library
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Induction for Linguistic Knowledge Research Group
# Universiteit van Tilburg
#
# Derived from code by Sander Canisius
#
# Licensed under GPLv3
#
# This library offers a TimblOutput class for reading Timbl
# classifier output. It supports full distributions (+v+db) and comment (#)
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
from pynlpl.statistics import Distribution
class TimblOutput(object):
"""A class for reading Timbl classifier output, supports the +v+db option and ignores comments starting with #"""
def __init__(self, stream, delimiter = ' ', ignorecolumns = [], ignorevalues = []):
self.stream = stream
self.delimiter = delimiter
self.ignorecolumns = ignorecolumns #numbers, ignore the specified FEATURE columns: first column is 1
self.ignorevalues = ignorevalues #Ignore columns with the following values
def __iter__(self):
# Note: distance parsing (+v+di) works only if distributions (+v+db) are also enabled!
for line in self.stream:
endfvec = None
line = line.strip()
if line and line[0] != '#': #ignore empty lines and comments
segments = [ x for i, x in enumerate(line.split(self.delimiter)) if x not in self.ignorevalues and i+1 not in self.ignorecolumns ]
#segments = [ x for x in line.split() if x != "^" and not (len(x) == 3 and x[0:2] == "n=") ] #obtain segments, and filter null fields and "n=?" feature (in fixed-feature configuration)
if not endfvec:
try:
# Modified by Ruben. There are some cases where one of the features is a {, and then
# the module is not able to obtain the distribution of scores and senses
# We have to look for the last { in the vector, and due to there is no rindex method
# we obtain the reverse and then apply index.
aux=list(reversed(segments)).index("{")
endfvec=len(segments)-aux-1
#endfvec = segments.index("{")
except ValueError:
endfvec = None
if endfvec and endfvec > 2: # only for +v+db
try:
enddistr = segments.index('}',endfvec)
except ValueError:
raise
distribution = self.parseDistribution(segments, endfvec, enddistr)
if len(segments) > enddistr + 1:
distance = float(segments[-1])
else:
distance = None
else:
endfvec = len(segments)
distribution = None
distance = None
#features, referenceclass, predictedclass, distribution, distance
yield segments[:endfvec - 2], segments[endfvec - 2], segments[endfvec - 1], distribution, distance
def parseDistribution(self, instance, start,end= None):
dist = {}
i = start + 1
if not end:
end = len(instance) - 1
while i < end: #instance[i] != "}":
label = instance[i]
try:
score = float(instance[i+1].rstrip(","))
dist[label] = score
except:
print("ERROR: pynlpl.input.timbl.TimblOutput -- Could not fetch score for class '" + label + "', expected float, but found '"+instance[i+1].rstrip(",")+"'. Instance= " + " ".join(instance)+ ".. Attempting to compensate...",file=stderr)
i = i - 1
i += 2
if not dist:
print("ERROR: pynlpl.input.timbl.TimblOutput -- Did not find class distribution for ", instance,file=stderr)
return Distribution(dist)
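# Usage sketch (the output file name is hypothetical): parse Timbl +v+db
# output; each item yields the feature vector, reference class, predicted
# class, the class distribution and, if present, the distance.
if __name__ == "__main__":
    import io
    with io.open("timbl.out", "r", encoding="utf-8") as f:
        for features, reference, predicted, distribution, distance in TimblOutput(f):
            print(predicted, reference, distribution, distance)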
PyNLPl-1.2.9/pynlpl/fsa.py 0000644 0001750 0000144 00000011261 12526410603 016152 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - Finite State Automata
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://proycon.github.com/folia
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Partially based/inspired on code by Xiayun Sun (https://github.com/xysun/regex)
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
from __future__ import print_function, unicode_literals, division, absolute_import
import sys
class State(object):
def __init__(self, **kwargs):
if 'epsilon' in kwargs:
self.epsilon = kwargs['epsilon'] # epsilon-closure (list of states)
else:
self.epsilon = [] # epsilon-closure
if 'transitions' in kwargs:
self.transitions = kwargs['transitions']
else:
self.transitions = [] #(matchitem, matchfunction(value), state)
if 'final' in kwargs:
self.final = bool(kwargs['final']) # ending state
else:
self.final = False
self.transitioned = None #will be a tuple (state, matchitem) indicating how this state was reached
class NFA(object):
"""Non-deterministic finite state automaton. Can be used to model DFAs as well if your state transitions are not ambiguous and epsilon is empty."""
def __init__(self, initialstate):
self.initialstate = initialstate
def run(self, sequence, mustmatchall=False,debug=False):
def add(state, states):
"""add state and recursively add epsilon transitions"""
assert isinstance(state, State)
if state in states:
return
states.add(state)
for eps in state.epsilon: #recurse into epsilon transitions
add(eps, states)
current_states = set()
add(self.initialstate, current_states)
if debug: print("Starting run, current states: ", repr(current_states),file=sys.stderr)
for offset, value in enumerate(sequence):
if not current_states: break
if debug: print("Value: ", repr(value),file=sys.stderr)
next_states = set()
for state in current_states:
for matchitem, matchfunction, trans_state in state.transitions:
if matchfunction(value):
trans_state.transitioned = (state, matchitem)
add(trans_state, next_states)
current_states = next_states
if debug: print("Current states: ", repr(current_states),file=sys.stderr)
if not mustmatchall:
for s in current_states:
if s.final:
if debug: print("Final state reached",file=sys.stderr)
yield offset+1
if mustmatchall:
for s in current_states:
if s.final:
if debug: print("Final state reached",file=sys.stderr)
yield offset+1
def match(self, sequence):
try:
return next(self.run(sequence,True)) == len(sequence)
except StopIteration:
return False
def find(self, sequence, debug=False):
l = len(sequence)
for i in range(0,l):
for length in self.run(sequence[i:], False, debug):
yield sequence[i:i+length]
def __iter__(self):
return iter(self._states(self.initialstate))
def _states(self, state, processedstates=[]): #pylint: disable=dangerous-default-value
"""Iterate over all states in no particular order"""
processedstates.append(state)
for nextstate in state.epsilon:
if not nextstate in processedstates:
self._states(nextstate, processedstates)
for _, _, nextstate in state.transitions: #transitions are (matchitem, matchfunction, state) triples
if not nextstate in processedstates:
self._states(nextstate, processedstates)
return processedstates
def __repr__(self):
out = []
for state in self:
staterep = repr(state)
if state is self.initialstate:
staterep += " (INITIAL)"
for nextstate in state.epsilon:
nextstaterep = repr(nextstate)
if nextstate.final:
nextstaterep += " (FINAL)"
out.append( staterep + " -e-> " + nextstaterep )
for item, _, nextstate in state.transitions:
nextstaterep = repr(nextstate)
if nextstate.final:
nextstaterep += " (FINAL)"
out.append( staterep + " -(" + repr(item) + ")-> " + nextstaterep )
return "\n".join(out)
PyNLPl-1.2.9/pynlpl/lm/ 0000755 0001750 0000144 00000000000 13442242642 015442 5 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/lm/__init__.py 0000664 0001750 0000144 00000000157 12201265173 017554 0 ustar proycon users 0000000 0000000 """This package contains modules for Language Models, with a C++/Python module for SRILM by Sander Canisius"""
PyNLPl-1.2.9/pynlpl/lm/client.py 0000664 0001750 0000144 00000003270 12201265173 017272 0 ustar proycon users 0000000 0000000 #!/usr/bin/env python
#-*- coding:utf-8 -*-
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
import sys
import socket
class LMClient(object):
def __init__(self,host= "localhost",port=12346,n = 0):
self.BUFSIZE = 1024
self.socket = socket.socket(socket.AF_INET,socket.SOCK_STREAM) #Create the socket
self.socket.settimeout(120)
assert isinstance(port,int)
self.socket.connect((host, port)) #Connect to server
assert isinstance(n,int)
self.n = n
def scoresentence(self, sentence):
if self.n > 0:
raise Exception("This client instance has been set to send only " + str(self.n) + "-grams")
if isinstance(sentence,list) or isinstance(sentence,tuple):
sentence = " ".join(sentence)
self.socket.send(sentence+ "\r\n")
return float(self.socket.recv(self.BUFSIZE).strip())
def __getitem__(self, ngram):
if self.n == 0:
raise Exception("This client has been set to send only full sentence, not n-grams")
if isinstance(ngram,str) or isinstance(ngram,unicode):
ngram = ngram.split(" ")
if len(ngram) != self.n:
raise Exception("This client instance has been set to send only " + str(self.n) + "-grams.")
ngram = " ".join(ngram)
if (sys.version < '3' and isinstance(ngram,unicode)) or (sys.version >= '3' and isinstance(ngram,str)):
ngram = ngram.encode('utf-8')
self.socket.send(ngram + b"\r\n")
return float(self.socket.recv(self.BUFSIZE).decode('utf-8').strip())
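# Usage sketch (host and port are hypothetical, a server must be running):
# query a language model server for full-sentence scores. With n=0 the client
# sends whole sentences; with n>0 it only accepts n-grams of exactly that size.
if __name__ == "__main__":
    client = LMClient("localhost", 12346, n=0)
    print(client.scoresentence("this is a test"))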
PyNLPl-1.2.9/pynlpl/lm/lm.py 0000664 0001750 0000144 00000026576 12347545414 016454 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - Language Models
# by Maarten van Gompel, ILK, Universiteit van Tilburg
# http://ilk.uvt.nl/~mvgompel
# proycon AT anaproy DOT nl
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import io
import math
import sys
from pynlpl.statistics import FrequencyList, product
from pynlpl.textprocessors import Windower
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
class SimpleLanguageModel:
"""This is a simple unsmoothed language model. This class can both hold and compute the model."""
def __init__(self, n=2, casesensitive = True, beginmarker = "<begin>", endmarker = "<end>"): #default markers reconstructed: the angle-bracket tokens were lost in extraction
self.casesensitive = casesensitive
self.freqlistN = FrequencyList(None, self.casesensitive)
self.freqlistNm1 = FrequencyList(None, self.casesensitive)
assert isinstance(n,int) and n >= 2
self.n = n
self.beginmarker = beginmarker
self.endmarker = endmarker
self.sentences = 0
if self.beginmarker:
self._begingram = tuple([self.beginmarker] * (n-1))
if self.endmarker:
self._endgram = tuple([self.endmarker] * (n-1))
def append(self, sentence):
if isinstance(sentence, str) or (sys.version < '3' and isinstance(sentence, unicode)):
sentence = sentence.strip().split(' ')
self.sentences += 1
for ngram in Windower(sentence,self.n, self.beginmarker, self.endmarker):
self.freqlistN.count(ngram)
for ngram in Windower(sentence,self.n-1, self.beginmarker, self.endmarker):
self.freqlistNm1.count(ngram)
def load(self, filename):
self.freqlistN = FrequencyList(None, self.casesensitive)
self.freqlistNm1 = FrequencyList(None, self.casesensitive)
f = io.open(filename,'r',encoding='utf-8')
mode = False
for line in f.readlines():
line = line.strip()
if line:
if not mode:
if line != "[simplelanguagemodel]":
raise Exception("File is not a SimpleLanguageModel")
else:
mode = 1
elif mode == 1:
if line[:2] == 'n=':
self.n = int(line[2:])
elif line[:12] == 'beginmarker=':
self.beginmarker = line[12:]
elif line[:10] == 'endmarker=':
self.endmarker = line[10:]
elif line[:10] == 'sentences=':
self.sentences = int(line[10:])
elif line[:14] == 'casesensitive=':
self.casesensitive = bool(int(line[14:]))
self.freqlistN = FrequencyList(None, self.casesensitive)
self.freqlistNm1 = FrequencyList(None, self.casesensitive)
elif line == "[freqlistN]":
mode = 2
else:
raise Exception("Syntax error in language model file: ", line)
elif mode == 2:
if line == "[freqlistNm1]":
mode = 3
else:
try:
type, count = line.split("\t")
self.freqlistN.count(type.split(' '),int(count))
except:
print("Warning, could not parse line whilst loading frequency list: ", line,file=stderr)
elif mode == 3:
try:
type, count = line.split("\t")
self.freqlistNm1.count(type.split(' '),int(count))
except:
print("Warning, could not parse line whilst loading frequency list: ", line,file=stderr)
if self.beginmarker:
self._begingram = [self.beginmarker] * (self.n-1)
if self.endmarker:
self._endgram = [self.endmarker] * (self.n-1)
def save(self, filename):
f = io.open(filename,'w',encoding='utf-8')
f.write("[simplelanguagemodel]\n")
f.write("n="+str(self.n)+"\n")
f.write("sentences="+str(self.sentences)+"\n")
f.write("beginmarker="+self.beginmarker+"\n")
f.write("endmarker="+self.endmarker+"\n")
f.write("casesensitive="+str(int(self.casesensitive))+"\n")
f.write("\n")
f.write("[freqlistN]\n")
for line in self.freqlistN.output():
f.write(line+"\n")
f.write("[freqlistNm1]\n")
for line in self.freqlistNm1.output():
f.write(line+"\n")
f.close()
def scoresentence(self, sentence):
return product([self[x] for x in Windower(sentence, self.n, self.beginmarker, self.endmarker)])
def __getitem__(self, ngram):
assert len(ngram) == self.n
nm1gram = ngram[:-1]
if (self.beginmarker and nm1gram == self._begingram) or (self.endmarker and nm1gram == self._endgram):
return self.freqlistN[ngram] / float(self.sentences)
else:
return self.freqlistN[ngram] / float(self.freqlistNm1[nm1gram])
class ARPALanguageModel(object):
"""Full back-off language model, loaded from file in ARPA format.
This class does not build the model but allows you to use a pre-computed one.
You can use the tool ngram-count from for instance SRILM to actually build the model.
"""
class NgramsProbs(object):
"""Store Ngrams with their probabilities and backoffs.
This class is used in order to abstract the physical storage layout,
and enable memory/speed tradeoffs.
"""
def __init__(self, data, mode='simple', delim=' '):
"""Create an ngrams storage with the given method:
'simple' method is a Python dictionary (quick, takes much memory).
'trie' method is more space-efficient (~35% reduction) but slower.
data is a dictionary of ngram-tuple => (probability, backoff).
delim is the strings which converts ngrams between tuple and
unicode string (for saving in trie mode).
"""
self.delim = delim
self.mode = mode
if mode == 'simple':
self._data = data
elif mode == 'trie':
import marisa_trie
self._data = marisa_trie.RecordTrie("@dd", [(self.delim.join(k), v) for k, v in data.items()])
else:
raise ValueError("mode {} is not supported for NgramsProbs".format(mode))
def prob(self, ngram):
"""Return probability of given ngram tuple"""
return self._data[ngram][0] if self.mode == 'simple' else self._data[self.delim.join(ngram)][0][0]
def backoff(self, ngram):
"""Return backoff value of a given ngram tuple"""
return self._data[ngram][1] if self.mode == 'simple' else self._data[self.delim.join(ngram)][0][1]
def __len__(self):
return len(self._data)
def __init__(self, filename, encoding='utf-8', encoder=None, base_e=True, dounknown=True, debug=False, mode='simple'):
# parameters
self.encoder = (lambda x: x) if encoder is None else encoder
self.base_e = base_e
self.dounknown = dounknown
self.debug = debug
self.mode = mode
# other attributes
self.total = {}
data = {}
with io.open(filename, 'rt', encoding=encoding) as f:
order = None
for line in f:
line = line.strip()
if line == '\\data\\':
order = 0
elif line == '\\end\\':
break
elif line.startswith('\\') and line.endswith(':'):
for i in range(1, 10):
if line == '\\{}-grams:'.format(i):
order = i
break
else:
raise ValueError("Order of n-gram is not supported!")
elif line:
if order == 0: # still in \data\ section
if line.startswith('ngram'):
n = int(line[6])
v = int(line[8:])
self.total[n] = v
elif order > 0:
fields = line.split('\t')
logprob = float(fields[0])
if base_e: # * log(10) does log10 to log_e conversion
logprob *= math.log(10)
ngram = self.encoder(tuple(fields[1].split()))
if len(fields) > 2:
backoffprob = float(fields[2])
if base_e: # * log(10) does log10 to log_e conversion
backoffprob *= math.log(10)
if self.debug:
msg = "Adding to LM: {}\t{}\t{}"
print(msg.format(ngram, logprob, backoffprob), file=stderr)
else:
backoffprob = 0.0
if self.debug:
msg = "Adding to LM: {}\t{}"
print(msg.format(ngram, logprob), file=stderr)
data[ngram] = (logprob, backoffprob)
elif self.debug:
print("Unable to parse ARPA LM line: " + line, file=stderr)
self.order = order
self.ngrams = self.NgramsProbs(data, mode)
def score(self, data, history=None):
result = 0
for word in data:
result += self.scoreword(word, history)
if history:
history += (word,)
else:
history = (word,)
return result
def scoreword(self, word, history=None):
if isinstance(word, str) or (sys.version < '3' and isinstance(word, unicode)):
word = (word,)
if history:
lookup = history + word
else:
lookup = word
if len(lookup) > self.order:
lookup = lookup[-self.order:]
try:
return self.ngrams.prob(lookup)
except KeyError: # not found, back off
if not history:
if self.dounknown:
try:
return self.ngrams.prob(('<unk>',)) #reconstructed: the <unk> token was lost in extraction
except KeyError:
msg = "Word {} not found. And no history specified and model has no ."
raise KeyError(msg.format(word))
else:
msg = "Word {} not found. And no history specified."
raise KeyError(msg.format(word))
else:
try:
backoffweight = self.ngrams.backoff(history)
except KeyError:
backoffweight = 0 # backoff weight will be 0 if not found
return backoffweight + self.scoreword(word, history[1:])
def __len__(self):
return len(self.ngrams)
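# A minimal sketch of the unsmoothed bigram model on a toy corpus; the
# sentences are made up for illustration. Note that scoresentence() expects
# an already tokenised sequence.
if __name__ == "__main__":
    lm = SimpleLanguageModel(n=2)
    lm.append("to be or not to be")
    lm.append("to be is to do")
    print(lm.scoresentence("to be or not to be".split()))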
PyNLPl-1.2.9/pynlpl/lm/server.py 0000664 0001750 0000144 00000003317 12201265173 017324 0 ustar proycon users 0000000 0000000 #!/usr/bin/env python
#-*- coding:utf-8 -*-
#---------------------------------------------------------------
# PyNLPl - Language Models
# by Maarten van Gompel, ILK, Universiteit van Tilburg
# http://ilk.uvt.nl/~mvgompel
# proycon AT anaproy DOT nl
#
# Generic Server for Language Models
#
#----------------------------------------------------------------
#No Python 3 support for twisted yet...
from twisted.internet import protocol, reactor
from twisted.protocols import basic
class LMSentenceProtocol(basic.LineReceiver):
def lineReceived(self, sentence):
try:
score = self.factory.lm.scoresentence(sentence)
except:
score = 0.0
self.sendLine(str(score))
class LMSentenceFactory(protocol.ServerFactory):
protocol = LMSentenceProtocol
def __init__(self, lm):
self.lm = lm
class LMNGramProtocol(basic.LineReceiver):
def lineReceived(self, ngram):
ngram = ngram.split(" ")
try:
score = self.factory.lm[ngram]
except:
score = 0.0
self.sendLine(str(score))
class LMNGramFactory(protocol.ServerFactory):
protocol = LMNGramProtocol
def __init__(self, lm):
self.lm = lm
class LMServer:
"""Language Model Server"""
def __init__(self, lm, port=12346, n=0):
"""n indicates the n-gram size, if set to 0 (which is default), the server will expect to only receive whole sentence, if set to a particular value, it will only expect n-grams of that value"""
if n == 0:
reactor.listenTCP(port, LMSentenceFactory(lm))
else:
reactor.listenTCP(port, LMNGramFactory(lm))
reactor.run()
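# Usage sketch (the model file name is hypothetical): serve a previously
# saved SimpleLanguageModel over TCP; clients send one sentence per line and
# read back a score. This blocks in the twisted reactor (Python 2 only).
if __name__ == "__main__":
    from pynlpl.lm.lm import SimpleLanguageModel
    lm = SimpleLanguageModel()
    lm.load("model.slm")
    LMServer(lm, port=12346)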
PyNLPl-1.2.9/pynlpl/lm/srilm.py 0000644 0001750 0000144 00000004123 13022533066 017137 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - SRILM Language Model
# by Maarten van Gompel, ILK, Universiteit van Tilburg
# http://ilk.uvt.nl/~mvgompel
# proycon AT anaproy DOT nl
#
# Adapted from code by Sander Canisius
#
# Licensed under GPLv3
#
#
# This library enables using SRILM as language model
#
#----------------------------------------------------------------
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
try:
import srilmcc
except ImportError:
import warnings
warnings.warn("srilmcc module is not compiled")
srilmcc = None
from pynlpl.textprocessors import Windower
class SRILMException(Exception):
"""Base Exception for SRILM."""
class SRILM:
def __init__(self, filename, n):
if not srilmcc:
raise SRILMException(
"SRILM is not downloaded and compiled."
"Please follow the instructions in makesrilmcc")
self.model = srilmcc.LanguageModel(filename, n)
self.n = n
def scoresentence(self, sentence, unknownwordprob=-12):
score = 0
for ngram in Windower(sentence, self.n, "", ""):
try:
score += self.logscore(ngram)
except KeyError:
score += unknownwordprob
return 10**score
def __getitem__(self, ngram):
return 10**self.logscore(ngram)
def __contains__(self, key):
return self.model.exists( key )
def logscore(self, ngram):
#Bug work-around
#if "" in ngram or "_" in ngram or "__" in ngram:
# print >> sys.stderr, "WARNING: Invalid word in n-gram! Ignoring", ngram
# return -999.9
if len(ngram) == self.n:
if all( (self.model.exists(x) for x in ngram) ):
#no phrases, basic trigram, compute directly
return self.model.wordProb(*ngram)
else:
raise KeyError
else:
raise Exception("Not an " + str(self.n) + "-gram")
PyNLPl-1.2.9/pynlpl/mt/ 0000755 0001750 0000144 00000000000 13442242642 015452 5 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/mt/__init__.py 0000664 0001750 0000144 00000000000 12201265173 017547 0 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/mt/wordalign.py 0000664 0001750 0000144 00000006225 12201265173 020015 0 ustar proycon users 0000000 0000000 from pynlpl.statistics import FrequencyList, Distribution
class WordAlignment(object):
def __init__(self, casesensitive = False):
self.casesensitive = casesensitive
def train(self, sourcefile, targetfile):
sourcefile = open(sourcefile)
targetfile = open(targetfile)
self.sourcefreqlist = FrequencyList(None, self.casesensitive)
self.targetfreqlist = FrequencyList(None, self.casesensitive)
#frequency lists
self.source2target = {}
self.target2source = {}
for sourceline, targetline in zip(sourcefile, targetfile):
sourcetokens = sourceline.split()
targettokens = targetline.split()
self.sourcefreqlist.append(sourcetokens)
self.targetfreqlist.append(targettokens)
for sourcetoken in sourcetokens:
if not sourcetoken in self.source2target:
self.source2target[sourcetoken] = FrequencyList(targettokens,self.casesensitive)
else:
self.source2target[sourcetoken].append(targettokens)
for targettoken in targettokens:
if not targettoken in self.target2source:
self.target2source[targettoken] = FrequencyList(sourcetokens,self.casesensitive)
else:
self.target2source[targettoken].append(sourcetokens)
sourcefile.close()
targetfile.close()
def test(self, sourcefile, targetfile):
sourcefile = open(sourcefile)
targetfile = open(targetfile)
#stage 2
for sourceline, targetline in zip(sourcefile, targetfile):
sourcetokens = sourceline.split()
targettokens = targetline.split()
S2Talignment = []
T2Salignment = []
for sourcetoken in sourcetokens:
#which of the target-tokens is most frequent?
besttoken = None
bestscore = -1
for i, targettoken in enumerate(targettokens):
if targettoken in self.source2target[sourcetoken]:
score = self.source2target[sourcetoken][targettoken] / float(self.targetfreqlist[targettoken])
if score > bestscore:
bestscore = self.source2target[sourcetoken][targettoken]
besttoken = i
S2Talignment.append(besttoken) #TODO: multi-alignment?
for targettoken in targettokens:
besttoken = None
bestscore = -1
for i, sourcetoken in enumerate(sourcetokens):
if sourcetoken in self.target2source[targettoken]:
score = self.target2source[targettoken][sourcetoken] / float(self.sourcefreqlist[sourcetoken])
if score > bestscore:
bestscore = self.target2source[targettoken][sourcetoken]
besttoken = i
T2Salignment.append(besttoken) #TODO: multi-alignment?
yield sourcetokens, targettokens, S2Talignment, T2Salignment
sourcefile.close()
targetfile.close()
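# Usage sketch (the parallel corpus file names are hypothetical): train the
# co-occurrence based aligner on a parallel corpus and align that same data.
# This is a naive baseline, not a substitute for a proper aligner like GIZA++.
if __name__ == "__main__":
    aligner = WordAlignment()
    aligner.train("corpus.en", "corpus.nl")
    for source, target, s2t, t2s in aligner.test("corpus.en", "corpus.nl"):
        print(source, target, s2t, t2s)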
PyNLPl-1.2.9/pynlpl/net.py 0000644 0001750 0000144 00000011723 12770202266 016177 0 ustar proycon users 0000000 0000000 #-*- coding:utf-8 -*-
#---------------------------------------------------------------
# PyNLPl - Network utilities
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Generic Server for Language Models
#
#----------------------------------------------------------------
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u,b
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
from twisted.internet import protocol, reactor # will fail on Python 3 for now
from twisted.protocols import basic
import shlex
class GWSNetProtocol(basic.LineReceiver):
def connectionMade(self):
print("Client connected", file=stderr)
self.factory.connections += 1
if self.factory.connections > 1: #refuse additional clients; the wrapped tool handles one client at a time
self.transport.loseConnection()
else:
self.sendLine(b("READY"))
def lineReceived(self, line):
try:
if sys.version >= '3' and isinstance(line,bytes):
print("Client in: " + str(line,'utf-8'),file=stderr)
else:
print("Client in: " + line,file=stderr)
except UnicodeDecodeError:
print("Client in: (unicodeerror)",file=stderr)
if sys.version < '3':
if isinstance(line,unicode):
self.factory.processprotocol.transport.write(line.encode('utf-8'))
else:
self.factory.processprotocol.transport.write(line)
self.factory.processprotocol.transport.write(b('\n'))
else:
self.factory.processprotocol.transport.write(b(line) + b('\n'))
self.factory.processprotocol.currentclient = self
def connectionLost(self, reason):
self.factory.connections -= 1
if self.factory.processprotocol.currentclient == self:
self.factory.processprotocol.currentclient = None
class GWSFactory(protocol.ServerFactory):
protocol = GWSNetProtocol
def __init__(self, processprotocol):
self.connections = 0
self.processprotocol = processprotocol
class GWSProcessProtocol(protocol.ProcessProtocol):
def __init__(self, printstderr=True, sendstderr= False, filterout = None, filtererr = None):
self.currentclient = None
self.printstderr = printstderr
self.sendstderr = sendstderr
if not filterout:
self.filterout = lambda x: x
else:
self.filterout = filterout
if not filtererr:
self.filtererr = lambda x: x
else:
self.filtererr = filtererr
def connectionMade(self):
pass
def outReceived(self, data):
try:
if sys.version >= '3' and isinstance(data,bytes):
print("Process out " + str(data, 'utf-8'),file=stderr)
else:
print("Process out " + data,file=stderr)
except UnicodeDecodeError:
print("Process out (unicodeerror)",file=stderr)
print("DEBUG:", repr(b(data).strip().split(b('\n'))))
for line in b(data).strip().split(b('\n')):
line = self.filterout(line.strip())
if self.currentclient and line:
self.currentclient.sendLine(b(line))
def errReceived(self, data):
try:
if sys.version >= '3' and isinstance(data,bytes):
print("Process err " + str(data,'utf-8'), file=sys.stderr)
else:
print("Process err " + data,file=stderr)
except UnicodeDecodeError:
print("Process out (unicodeerror)",file=stderr)
if self.printstderr and data:
print(data.strip(),file=stderr)
for line in b(data).strip().split(b('\n')):
line = self.filtererr(line.strip())
if self.sendstderr and self.currentclient and line:
self.currentclient.sendLine(b(line))
def processExited(self, reason):
print("Process exited",file=stderr)
def processEnded(self, reason):
print("Process ended",file=stderr)
if self.currentclient:
self.currentclient.transport.loseConnection()
reactor.stop()
class GenericWrapperServer:
"""Generic Server around a stdin/stdout based CLI tool. Only accepts one client at a time to prevent concurrency issues !!!!!"""
def __init__(self, cmdline, port, printstderr= True, sendstderr= False, filterout = None, filtererr = None):
gwsprocessprotocol = GWSProcessProtocol(printstderr, sendstderr, filterout, filtererr)
cmdline = shlex.split(cmdline)
reactor.spawnProcess(gwsprocessprotocol, cmdline[0], cmdline)
gwsfactory = GWSFactory(gwsprocessprotocol)
reactor.listenTCP(port, gwsfactory)
reactor.run()
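# Usage sketch (the command and port are hypothetical): expose a line-based
# CLI tool over TCP. Each line a client sends is written to the tool's stdin
# and the tool's stdout is sent back to that client. Runs on Python 2 only
# (twisted reactor).
if __name__ == "__main__":
    GenericWrapperServer("tr a-z A-Z", 12345)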
PyNLPl-1.2.9/pynlpl/search.py 0000664 0001750 0000144 00000053756 12201265173 016667 0 ustar proycon users 0000000 0000000 #---------------------------------------------------------------
# PyNLPl - Search Algorithms
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Licensed under GPLv3
#
#----------------------------------------------------------------
"""This module contains various search algorithms."""
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
#from pynlpl.common import u
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
from pynlpl.datatypes import FIFOQueue, PriorityQueue
from collections import deque
from bisect import bisect_left
class AbstractSearchState(object):
def __init__(self, parent = None, cost = 0):
self.parent = parent
self.cost = cost
def test(self, goalstates = None):
"""Checks whether this state is a valid goal state, returns a boolean. If no goalstate is defined, then all states will test positively, this is what you usually want for optimisation problems."""
if goalstates:
return (self in goalstates)
else:
return True
#raise Exception("Classes derived from AbstractSearchState must define a test() method!")
def score(self):
"""Should return a heuristic value. This needs to be set if you plan to used an informed search algorithm."""
raise Exception("Classes derived from AbstractSearchState must define a score() method if used in informed search algorithms!")
def expand(self):
"""Generates successor states, implement your custom operators in the derived method."""
raise Exception("Classes derived from AbstractSearchState must define an expand() method!")
def __eq__(self, other):
"""Implement an equality test in the derived method, based only on the state's content (not its path etc!)"""
raise Exception("Classes derived from AbstractSearchState must define an __eq__() method!")
def __lt__(self, other):
assert isinstance(other, AbstractSearchState)
return self.score() < other.score()
def __gt__(self, other):
assert isinstance(other, AbstractSearchState)
return self.score() > other.score()
def __hash__(self):
"""Return a unique hash for this state, based on its ID"""
raise Exception("Classes derived from AbstractSearchState must define a __hash__() method if the search space is a graph and visited nodes to be are stored in memory!")
def depth(self):
if not self.parent:
return 0
else:
return self.parent.depth() + 1
#def __len__(self):
# return len(self.path())
def path(self):
if not self.parent:
return [self]
else:
return self.parent.path() + [self]
def pathcost(self):
if not self.parent:
return self.cost
else:
return self.parent.pathcost() + self.cost
#def __cmp__(self, other):
# if self.score < other.score:
# return -1
# elif self.score > other.score:
# return 1
# else:
# return 0
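#Example (illustrative sketch, not part of the original module): a minimal
#concrete state implementing the contract above. It searches for sequences
#over the moves 1, 2 and 3 that sum to a target of 6; all names here are
#made up for demonstration purposes.
#
# class SumState(AbstractSearchState):
#     def __init__(self, seq=(), parent=None):
#         self.seq = seq
#         super(SumState, self).__init__(parent)
#     def expand(self):
#         for move in (1, 2, 3): #each successor appends one move
#             yield SumState(self.seq + (move,), self)
#     def test(self, goalstates=None):
#         return sum(self.seq) == 6
#     def score(self):
#         return sum(self.seq) #heuristic for informed search
#     def __eq__(self, other):
#         return self.seq == other.seq
#     def __hash__(self):
#         return hash(self.seq)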
class AbstractSearch(object): #not a real search, just a base class for DFS and BFS
def __init__(self, **kwargs):
"""For graph-searches graph=True is required (default), otherwise the search may loop forever. For tree-searches, set tree=True for better performance"""
self.usememory = True
self.poll = lambda x: x.pop
self.maxdepth = False #unlimited
self.minimize = False #minimize rather than maximize the score function? default: no
self.keeptraversal = False
self.goalstates = None
self.exhaustive = False #only some subclasses use this
self.traversed = 0 #Count of number of nodes visited
self.solutions = 0 #Counts the number of solutions
self.debug = 0
for key, value in kwargs.items():
if key == 'graph':
self.usememory = value #search space is a graph? memory required to keep visited states
elif key == 'tree':
self.usememory = not value #search space is a tree? memory not required
elif key == 'poll':
self.poll = value #function
elif key == 'maxdepth':
self.maxdepth = value
elif key == 'minimize':
self.minimize = value
elif key == 'maximize':
self.minimize = not value
elif key == 'keeptraversal': #remember entire traversal?
self.keeptraversal = value
elif key == 'goal' or key == 'goals':
if isinstance(value, list) or isinstance(value, tuple):
self.goalstates = value
else:
self.goalstates = [value]
elif key == 'exhaustive':
self.exhaustive = value
elif key == 'debug':
self.debug = value
self._visited = {}
self._traversal = []
self.incomplete = False
self.traversed = 0
def reset(self):
self._visited = {}
self._traversal = []
self.incomplete = False
self.traversed = 0 #Count of all visited nodes
self.solutions = 0 #Counts the number of solutions found
def traversal(self):
"""Returns all visited states (only when keeptraversal=True), note that this is not equal to the path, but contains all states that were checked!"""
if self.keeptraversal:
return self._traversal
else:
raise Exception("No traversal available, algorithm not started with keeptraversal=True!")
def traversalsize(self):
"""Returns the number of nodes visited (also when keeptravel=False). Note that this is not equal to the path, but contains all states that were checked!"""
return self.traversed
def visited(self, state):
if self.usememory:
return (hash(state) in self._visited)
else:
raise Exception("No memory kept, algorithm not started with graph=True!")
def __iter__(self):
"""Generator yielding *all* valid goalstates it can find,"""
n = 0
while len(self.fringe) > 0:
n += 1
if self.debug: print("\t[pynlpl debug] *************** ITERATION #" + str(n) + " ****************",file=stderr)
if self.debug: print("\t[pynlpl debug] FRINGE: ", self.fringe,file=stderr)
state = self.poll(self.fringe)()
if self.debug:
try:
print("\t[pynlpl debug] CURRENT STATE (depth " + str(state.depth()) + "): " + str(state),end="",file=stderr)
except AttributeError:
print("\t[pynlpl debug] CURRENT STATE: " + str(state),end="",file=stderr)
print(" hash="+str(hash(state)),file=stderr)
try:
print(" score="+str(state.score()),file=stderr)
except:
pass
#If node not visited before (or no memory kept):
if not self.usememory or (self.usememory and not hash(state) in self._visited):
#Evaluate the current state
self.traversed += 1
if state.test(self.goalstates):
if self.debug: print("\t[pynlpl debug] Valid goalstate, yielding",file=stderr)
self.solutions += 1 #counts the number of solutions
yield state
elif self.debug:
print("\t[pynlpl debug] (no goalstate, not yielding)",file=stderr)
#Expand the specified state and add to the fringe
#if self.debug: print >>stderr,"\t[pynlpl debug] EXPANDING:"
statecount = 0
for i, s in enumerate(state.expand()):
statecount += 1
if self.debug >= 2:
print("\t[pynlpl debug] (Iteration #" + str(n) +") Expanded state #" + str(i+1) + ", adding to fringe: " + str(s),end="",file=stderr)
try:
print(s.score(),file=stderr)
except:
print("ERROR SCORING!",file=stderr)
pass
if not self.maxdepth or s.depth() <= self.maxdepth:
self.fringe.append(s)
else:
if self.debug: print("\t[pynlpl debug] (Iteration #" + str(n) +") Not adding to fringe, maxdepth exceeded",file=stderr)
self.incomplete = True
if self.debug:
print("\t[pynlpl debug] Expanded " + str(statecount) + " states, offered to fringe",file=stderr)
if self.keeptraversal: self._traversal.append(state)
if self.usememory: self._visited[hash(state)] = True
self.prune(state) #calls prune method
else:
if self.debug:
print("\t[pynlpl debug] State already visited before, not expanding again...(hash="+str(hash(state))+")",file=stderr)
if self.debug:
print("\t[pynlpl debug] Search complete: " + str(self.solutions) + " solution(s), " + str(self.traversed) + " states traversed in " + str(n) + " rounds",file=stderr)
def searchfirst(self):
"""Returns the very first result (regardless of it being the best or not!)"""
for solution in self:
return solution
def searchall(self):
"""Returns a list of all solutions"""
return list(iter(self))
def searchbest(self):
"""Returns the single best result (if multiple have the same score, the first match is returned)"""
finalsolution = None
bestscore = None
for solution in self:
if bestscore is None:
bestscore = solution.score()
finalsolution = solution
elif self.minimize:
score = solution.score()
if score < bestscore:
bestscore = score
finalsolution = solution
elif not self.minimize:
score = solution.score()
if score > bestscore:
bestscore = score
finalsolution = solution
return finalsolution
def searchtop(self,n=10):
"""Return the top n best resulta (or possibly less if not enough is found)"""
solutions = PriorityQueue([], lambda x: x.score, self.minimize, length=n, blockworse=False, blockequal=False,duplicates=False)
for solution in self:
solutions.append(solution)
return solutions
def searchlast(self,n=10):
"""Return the last n results (or possibly less if not found). Note that the last results are not necessarily the best ones! Depending on the search type."""
solutions = deque([], n)
for solution in self:
solutions.append(solution)
return solutions
def prune(self, state):
"""Pruning method is called AFTER expansion of each node"""
#pruning nothing by default
pass
class DepthFirstSearch(AbstractSearch):
def __init__(self, state, **kwargs):
assert isinstance(state, AbstractSearchState)
self.fringe = [ state ]
super(DepthFirstSearch,self).__init__(**kwargs)
class BreadthFirstSearch(AbstractSearch):
def __init__(self, state, **kwargs):
assert isinstance(state, AbstractSearchState)
self.fringe = FIFOQueue([state])
super(BreadthFirstSearch,self).__init__(**kwargs)
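#Example (illustrative, reusing the hypothetical SumState sketched above):
#
# search = BreadthFirstSearch(SumState(), graph=True, maxdepth=6)
# for solution in search:
#     print(solution.path())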
class IterativeDeepening(AbstractSearch):
def __init__(self, state, **kwargs):
assert isinstance(state, AbstractSearchState)
self.state = state
self.kwargs = kwargs
self.traversed = 0
def __iter__(self):
self.traversed = 0
d = 0
while not 'maxdepth' in self.kwargs or d <= self.kwargs['maxdepth']:
kwargs = dict(self.kwargs)
kwargs['maxdepth'] = d #limit the depth-first search to the current depth
dfs = DepthFirstSearch(self.state, **kwargs)
for match in dfs:
yield match
self.traversed += dfs.traversalsize() #count after the search has actually run
if dfs.incomplete:
d += 1
else:
break
def traversal(self):
#TODO: add
raise Exception("not implemented yet")
def traversalsize(self):
return self.traversed
class BestFirstSearch(AbstractSearch):
def __init__(self, state, **kwargs):
super(BestFirstSearch,self).__init__(**kwargs)
assert isinstance(state, AbstractSearchState)
self.fringe = PriorityQueue([state], lambda x: x.score, self.minimize, length=0, blockworse=False, blockequal=False,duplicates=False)
class BeamSearch(AbstractSearch):
"""Local beam search algorithm"""
def __init__(self, states, beamsize, **kwargs):
if isinstance(states, AbstractSearchState):
states = [states]
else:
assert all( ( isinstance(x, AbstractSearchState) for x in states) )
self.beamsize = beamsize
if 'eager' in kwargs:
self.eager = kwargs['eager']
else:
self.eager = False
super(BeamSearch,self).__init__(**kwargs)
self.incomplete = True
self.duplicates = kwargs['duplicates'] if 'duplicates' in kwargs else False
self.fringe = PriorityQueue(states, lambda x: x.score, self.minimize, length=0, blockworse=False, blockequal=False,duplicates= self.duplicates)
def __iter__(self):
"""Generator yielding *all* valid goalstates it can find"""
i = 0
while len(self.fringe) > 0:
i +=1
if self.debug: print("\t[pynlpl debug] *************** STARTING ROUND #" + str(i) + " ****************",file=stderr)
b = 0
#Create a new empty fixed-length priority queue (this implies there will be pruning if more items are offered than it can hold!)
successors = PriorityQueue([], lambda x: x.score, self.minimize, length=self.beamsize, blockworse=False, blockequal=False,duplicates= self.duplicates)
while len(self.fringe) > 0:
b += 1
if self.debug: print("\t[pynlpl debug] *************** ROUND #" + str(i) + " BEAM# " + str(b) + " ****************",file=stderr)
#if self.debug: print >>stderr,"\t[pynlpl debug] FRINGE: ", self.fringe
state = self.poll(self.fringe)()
if self.debug:
try:
print("\t[pynlpl debug] CURRENT STATE (depth " + str(state.depth()) + "): " + str(state),end="",file=stderr)
except AttributeError:
print("\t[pynlpl debug] CURRENT STATE: " + str(state),end="",file=stderr)
print(" hash="+str(hash(state)),file=stderr)
try:
print(" score="+str(state.score()),file=stderr)
except:
pass
if not self.usememory or (self.usememory and not hash(state) in self._visited):
self.traversed += 1
#Evaluate state
if state.test(self.goalstates):
if self.debug: print("\t[pynlpl debug] Valid goalstate, yielding",file=stderr)
self.solutions += 1 #counts the number of solutions
yield state
elif self.debug:
print("\t[pynlpl debug] (no goalstate, not yielding)",file=stderr)
if self.eager:
score = state.score()
#Expand the specified state and offer to the fringe
statecount = offers = 0
for j, s in enumerate(state.expand()):
statecount += 1
if self.debug >= 2:
print("\t[pynlpl debug] (Round #" + str(i) +" Beam #" + str(b) + ") Expanded state #" + str(j+1) + ", offering to successor pool: " + str(s),end="",file=stderr)
try:
print(s.score(),end="",file=stderr)
except:
print("ERROR SCORING!",end="",file=stderr)
pass
if not self.maxdepth or s.depth() <= self.maxdepth:
if not self.eager:
#use all successors (even worse ones than the current state)
offers += 1
accepted = successors.append(s)
else:
#use only equal or better successors
if s.score() >= score:
offers += 1
accepted = successors.append(s)
else:
accepted = False
if self.debug >= 2:
if accepted:
print(" ACCEPTED",file=stderr)
else:
print(" REJECTED",file=stderr)
else:
if self.debug >= 2:
print(" REJECTED, MAXDEPTH EXCEEDED.",file=stderr)
elif self.debug:
print("\t[pynlpl debug] Not offered to successor pool, maxdepth exceeded",file=stderr)
if self.debug:
print("\t[pynlpl debug] Expanded " + str(statecount) + " states, " + str(offers) + " offered to successor pool",file=stderr)
if self.keeptraversal: self._traversal.append(state)
if self.usememory: self._visited[hash(state)] = True
self.prune(state) #calls prune method (does nothing by default in this search!!!)
else:
if self.debug:
print("\t[pynlpl debug] State already visited before, not expanding again... (hash=" + str(hash(state)) +")",file=stderr)
#AFTER EXPANDING ALL NODES IN THE FRINGE/BEAM:
#set fringe for next round
self.fringe = successors
#Pruning is implicit, successors was a fixed-size priority queue
if self.debug:
print("\t[pynlpl debug] (Round #" + str(i) + ") Implicitly pruned with beamsize " + str(self.beamsize) + "...",file=stderr)
#self.fringe.prune(self.beamsize)
if self.debug: print(" (" + str(offers) + " to " + str(len(self.fringe)) + " items)",file=stderr)
if self.debug:
print("\t[pynlpl debug] Search complete: " + str(self.solutions) + " solution(s), " + str(self.traversed) + " states traversed in " + str(i) + " rounds with " + str(b) + " beams",file=stderr)
class EarlyEagerBeamSearch(AbstractSearch):
"""A beam search that prunes early (after each state expansion) and eagerly (weeding out worse successors)"""
def __init__(self, state, beamsize, **kwargs):
assert isinstance(state, AbstractSearchState)
self.beamsize = beamsize
super(EarlyEagerBeamSearch,self).__init__(**kwargs)
self.fringe = PriorityQueue([state], lambda x: x.score, self.minimize, length=0, blockworse=False, blockequal=False, duplicates=kwargs['duplicates'] if 'duplicates' in kwargs else False)
self.incomplete = True
def prune(self, state):
if self.debug:
l = len(self.fringe)
print("\t[pynlpl debug] pruning with beamsize " + str(self.beamsize) + "...",end="",file=stderr)
self.fringe.prunebyscore(state.score(), retainequalscore=True)
self.fringe.prune(self.beamsize)
if self.debug: print(" (" + str(l) + " to " + str(len(self.fringe)) + " items)",file=stderr)
class BeamedBestFirstSearch(BeamSearch):
"""Best first search with a beamsize (non-optimal!)"""
def prune(self, state):
if self.debug:
l = len(self.fringe)
print("\t[pynlpl debug] pruning with beamsize " + str(self.beamsize) + "...",end="",file=stderr)
self.fringe.prune(self.beamsize)
if self.debug: print(" (" + str(l) + " to " + str(len(self.fringe)) + " items)",file=stderr)
class StochasticBeamSearch(BeamSearch):
def prune(self, state):
if self.debug:
l = len(self.fringe)
print("\t[pynlpl debug] pruning with beamsize " + str(self.beamsize) + "...",end="",file=stderr)
if not self.exhaustive:
self.fringe.prunebyscore(state.score(), retainequalscore=True)
self.fringe.stochasticprune(self.beamsize)
if self.debug: print(" (" + str(l) + " to " + str(len(self.fringe)) + " items)",file=stderr)
class HillClimbingSearch(AbstractSearch): #TODO: TEST
"""(identical to beamsearch with beam 1, but implemented differently)"""
def __init__(self, state, **kwargs):
assert isinstance(state, AbstractSearchState)
super(HillClimbingSearch,self).__init__(**kwargs)
self.fringe = PriorityQueue([state], lambda x: x.score, self.minimize, length=0, blockworse=True, blockequal=False,duplicates=False)
#From http://stackoverflow.com/questions/212358/binary-search-in-python
def binary_search(a, x, lo=0, hi=None): # can't use a to specify default for hi
hi = hi if hi is not None else len(a) # hi defaults to len(a)
pos = bisect_left(a,x,lo,hi) # find insertion position
return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end
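#Example:
# binary_search([1, 3, 5, 7], 5) #returns 2
# binary_search([1, 3, 5, 7], 4) #returns -1 (not present)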
PyNLPl-1.2.9/pynlpl/statistics.py 0000664 0001750 0000144 00000053233 12463714547 017620 0 ustar proycon users 0000000 0000000 ###############################################################
# PyNLPl - Statistics & Information Theory Library
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Also contains MIT licensed code from
# AI: A Modern Approach : http://aima.cs.berkeley.edu/python/utils.html
# Peter Norvig
#
# Licensed under GPLv3
#
###############################################################
"""This is a Python library containing classes for Statistic and Information Theoretical computations. It also contains some code from Peter Norvig, AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html"""
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u, isstring
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
import io
import math
import random
import operator
from collections import Counter
class FrequencyList(object):
"""A frequency list (implemented using dictionaries)"""
def __init__(self, tokens = None, casesensitive = True, dovalidation = True):
self._count = Counter()
self._ranked = {}
self.total = 0 #number of tokens
self.casesensitive = casesensitive
self.dovalidation = dovalidation
if tokens: self.append(tokens)
def load(self, filename):
"""Load a frequency list from file (in the format produced by the save method)"""
f = io.open(filename,'r',encoding='utf-8')
for line in f:
data = line.strip().split("\t")
type, count = data[:2]
self.count(type, int(count)) #counts are stored as strings in the file
f.close()
def save(self, filename, addnormalised=False):
"""Save a frequency list to file, can be loaded later using the load method"""
f = io.open(filename,'w',encoding='utf-8')
for line in self.output("\t", addnormalised):
f.write(line + '\n')
f.close()
def _validate(self,type):
if isinstance(type,list):
type = tuple(type)
if isinstance(type,tuple):
if not self.casesensitive:
return tuple([x.lower() for x in type])
else:
return type
else:
if not self.casesensitive:
return type.lower()
else:
return type
def append(self,tokens):
"""Add a list of tokens to the frequencylist. This method will count them for you."""
for token in tokens:
self.count(token)
def count(self, type, amount = 1):
"""Count a certain type. The counter will increase by the amount specified (defaults to one)"""
if self.dovalidation: type = self._validate(type)
if self._ranked: self._ranked = None
if type in self._count:
self._count[type] += amount
else:
self._count[type] = amount
self.total += amount
def sum(self):
"""Returns the total amount of tokens"""
return self.total
def _rank(self):
if not self._ranked: self._ranked = self._count.most_common()
def __iter__(self):
"""Iterate over the frequency lists, in order (frequent to rare). This is a generator that yields (type, count) pairs. The first time you iterate over the FrequencyList, the ranking will be computed. For subsequent calls it will be available immediately, unless the frequency list changed in the meantime."""
self._rank()
for type, count in self._ranked:
yield type, count
def items(self):
"""Returns an *unranked* list of (type, count) pairs. Use this only if you are not interested in the order."""
for type, count in self._count.items():
yield type, count
def __getitem__(self, type):
if self.dovalidation: type = self._validate(type)
try:
return self._count[type]
except KeyError:
return 0
def __setitem__(self, type, value):
"""alias for count, but can only be called once"""
if self.dovalidation: type = self._validate(type)
if not type in self._count:
self.count(type,value)
else:
raise ValueError("This type is already set!")
def __delitem__(self, type):
if self.dovalidation: type = self._validate(type)
del self._count[type]
if self._ranked: self._ranked = None
def typetokenratio(self):
"""Computes the type/token ratio"""
return len(self._count) / float(self.total)
def __len__(self):
"""Returns the total amount of types"""
return len(self._count)
def tokens(self):
"""Returns the total amount of tokens"""
return self.total
def mode(self):
"""Returns the type that occurs the most frequently in the frequency list"""
self._rank()
return self._ranked[0][0]
def p(self, type):
"""Returns the probability (relative frequency) of the token"""
if self.dovalidation: type = self._validate(type)
return self._count[type] / float(self.total)
def __eq__(self, otherfreqlist):
return (self.total == otherfreqlist.total and self._count == otherfreqlist._count)
def __contains__(self, type):
"""Checks if the specified type is in the frequency list"""
if self.dovalidation: type = self._validate(type)
return type in self._count
def __add__(self, otherfreqlist):
"""Multiple frequency lists can be added together"""
assert isinstance(otherfreqlist,FrequencyList)
product = FrequencyList(None,)
for type, count in self.items():
product.count(type,count)
for type, count in otherfreqlist.items():
product.count(type,count)
return product
def output(self,delimiter = '\t', addnormalised=False):
"""Print a representation of the frequency list"""
for type, count in self:
if isinstance(type,tuple) or isinstance(type,list):
if addnormalised:
yield " ".join((u(x) for x in type)) + delimiter + str(count) + delimiter + str(count/self.total)
else:
yield " ".join((u(x) for x in type)) + delimiter + str(count)
elif isstring(type):
if addnormalised:
yield type + delimiter + str(count) + delimiter + str(count/self.total)
else:
yield type + delimiter + str(count)
else:
if addnormalised:
yield str(type) + delimiter + str(count) + delimiter + str(count/self.total)
else:
yield str(type) + delimiter + str(count)
def __repr__(self):
return repr(self._count)
def __unicode__(self): #Python 2
return str(self)
def __str__(self):
return "\n".join(self.output())
def values(self):
return self._count.values()
def dict(self):
return self._count
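#Example (illustrative):
#
# freqlist = FrequencyList("to be or not to be".split(" "))
# freqlist['to'] #returns 2
# freqlist.p('to') #returns 2/6
# freqlist.save('freqlist.tsv') #filename is a placeholder; reload with load()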
#class FrequencyTrie:
# def __init__(self):
# self.data = Tree()
#
# def count(self, sequence):
#
#
# self.data.append( Tree(item) )
class Distribution(object):
"""A distribution can be created over a FrequencyList or a plain dictionary with numeric values. It will be normalized automatically. This implemtation uses dictionaries/hashing"""
def __init__(self, data, base = 2):
self.base = base #logarithmic base: can be set to 2, 10 or math.e (or anything else). when set to None, it's set to e automatically
self._dist = {}
if isinstance(data, FrequencyList):
for type, count in data.items():
self._dist[type] = count / data.total
elif isinstance(data, dict) or isinstance(data, list):
if isinstance(data, list):
self._dist = {}
for key,value in data:
self._dist[key] = float(value)
else:
self._dist = data
total = sum(self._dist.values())
if total < 0.999 or total > 1.000:
#normalize again
for key, value in self._dist.items():
self._dist[key] = value / total
else:
raise Exception("Can't create distribution")
self._ranked = None
def _rank(self):
if not self._ranked: self._ranked = sorted(self._dist.items(),key=lambda x: x[1], reverse=True )
def information(self, type):
"""Computes the information content of the specified type: -log_e(p(X))"""
if not self.base:
return -math.log(self._dist[type])
else:
return -math.log(self._dist[type], self.base)
def poslog(self, type):
"""alias for information content"""
return self.information(type)
def entropy(self, base = 2):
"""Compute the entropy of the distribution"""
entropy = 0
if not base and self.base: base = self.base
for type in self._dist:
if not base:
entropy += self._dist[type] * -math.log(self._dist[type])
else:
entropy += self._dist[type] * -math.log(self._dist[type], base)
return entropy
def perplexity(self, base=2):
return base ** self.entropy(base)
def mode(self):
"""Returns the type that occurs the most frequently in the probability distribution"""
self._rank()
return self._ranked[0][0]
def maxentropy(self, base = 2):
"""Compute the maximum entropy of the distribution: log_e(N)"""
if not base and self.base: base = self.base
if not base:
return math.log(len(self._dist))
else:
return math.log(len(self._dist), base)
def __len__(self):
"""Returns the number of types"""
return len(self._dist)
def __getitem__(self, type):
"""Return the probability for this type"""
return self._dist[type]
def __iter__(self):
"""Iterate over the *ranked* distribution, returns (type, probability) pairs"""
self._rank()
for type, p in self._ranked:
yield type, p
def items(self):
"""Returns an *unranked* list of (type, prob) pairs. Use this only if you are not interested in the order."""
for type, count in self._dist.items():
yield type, count
def output(self,delimiter = '\t', freqlist = None):
"""Generator yielding formatted strings expressing the time and probabily for each item in the distribution"""
for type, prob in self:
if freqlist:
if isinstance(type,list) or isinstance(type, tuple):
yield " ".join(type) + delimiter + str(freqlist[type]) + delimiter + str(prob)
else:
yield type + delimiter + str(freqlist[type]) + delimiter + str(prob)
else:
if isinstance(type,list) or isinstance(type, tuple):
yield " ".join(type) + delimiter + str(prob)
else:
yield type + delimiter + str(prob)
def __unicode__(self):
return str(self)
def __str__(self):
return "\n".join(self.output())
def __repr__(self):
return repr(self._dist)
def keys(self):
return self._dist.keys()
def values(self):
return self._dist.values()
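#Example (illustrative, continuing the FrequencyList example above):
#
# dist = Distribution(freqlist) #probabilities normalised from the counts
# dist['to'] #returns 1/3
# dist.information('or') #-log2(1/6), about 2.58 bits
# dist.entropy() #about 1.92 bits for this distribution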
class MarkovChain(object):
def __init__(self, startstate, endstate = None):
self.nodes = set()
self.edges_out = {}
self.startstate = startstate
self.endstate = endstate
def settransitions(self, state, distribution):
self.nodes.add(state)
if not isinstance(distribution, Distribution):
distribution = Distribution(distribution)
self.edges_out[state] = distribution
self.nodes.update(distribution.keys())
def __iter__(self):
for state, distribution in self.edges_out.items():
yield state, distribution
def __getitem__(self, state):
for distribution in self.edges_out[state]:
yield distribution
def size(self):
return len(self.nodes)
def accessible(self,fromstate, tostate):
"""Is state tonode directly accessible (in one step) from state fromnode? (i.e. is there an edge between the nodes). If so, return the probability, else zero"""
if (not (fromstate in self.nodes)) or (not (tostate in self.nodes)) or not (fromstate in self.edges_out):
return 0
if tostate in self.edges_out[fromstate]:
return self.edges_out[fromstate][tostate]
else:
return 0
def communicates(self,fromstate, tostate, maxlength=999999):
"""See if a node communicates (directly or indirectly) with another. Returns the probability of the *shortest* path (probably, but not necessarily the highest probability)"""
if (not (fromstate in self.nodes)) or (not (tostate in self.nodes)):
return 0
assert (fromstate != tostate)
def _test(node,length,prob):
if length > maxlength:
return 0
if node == tostate:
return prob #already reached the target state
for child in self.edges_out[node].keys():
if not child in visited:
visited.add(child)
if child == tostate:
return prob * self.edges_out[node][tostate]
else:
r = _test(child, length+1, prob * self.edges_out[node][tostate])
if r:
return r
return 0
visited = set([fromstate]) #note: set(fromstate) would wrongly iterate over the state itself
return _test(fromstate,1,1)
def p(self, sequence, subsequence=True):
"""Returns the probability of the given sequence or subsequence (if subsequence=True, default)."""
if sequence[0] != self.startstate:
if isinstance(sequence, tuple):
sequence = (self.startstate,) + sequence
else:
sequence = (self.startstate,) + tuple(sequence)
if self.endstate:
if sequence[-1] != self.endstate:
if isinstance(sequence, tuple):
sequence = sequence + (self.endstate,)
else:
sequence = tuple(sequence) + (self.endstate,)
prevnode = None
prob = 1
for node in sequence:
if prevnode:
try:
prob *= self.edges_out[prevnode][node]
except KeyError:
return 0
prevnode = node
return prob
def __contains__(self, sequence):
"""Is the given sequence generated by the markov model? Does not work for subsequences!"""
return bool(self.p(sequence,False))
def reducible(self):
#TODO: implement
raise NotImplementedError
class HiddenMarkovModel(MarkovChain):
def __init__(self, startstate, endstate = None):
self.observablenodes = set()
self.edges_toobservables = {}
super(HiddenMarkovModel, self).__init__(startstate,endstate)
def setemission(self, state, distribution):
self.nodes.add(state)
if not isinstance(distribution, Distribution):
distribution = Distribution(distribution)
self.edges_toobservables[state] = distribution
self.observablenodes.update(distribution.keys())
def print_dptable(self, V):
print(" ",end="",file=stdout)
for i in range(len(V)): print("%7s" % ("%d" % i),end="",file=stdout)
print(file=stdout)
for y in V[0].keys():
print("%.5s: " % y, end="",file=stdout)
for t in range(len(V)):
print("%.7s" % ("%f" % V[t][y]),end="",file=stdout)
print(file=stdout)
#Adapted from: http://en.wikipedia.org/wiki/Viterbi_algorithm
def viterbi(self,observations, doprint=False):
#states, start_p, trans_p, emit_p):
V = [{}] #Viterbi matrix
path = {}
# Initialize base cases (t == 0)
for node in self.edges_out[self.startstate].keys():
try:
V[0][node] = self.edges_out[self.startstate][node] * self.edges_toobservables[node][observations[0]]
path[node] = [node]
except KeyError:
pass #will be 0, don't store
# Run Viterbi for t > 0
for t in range(1,len(observations)):
V.append({})
newpath = {}
for node in self.nodes:
column = []
for prevnode in V[t-1].keys():
try:
column.append( (V[t-1][prevnode] * self.edges_out[prevnode][node] * self.edges_toobservables[node][observations[t]], prevnode ) )
except KeyError:
pass #will be 0
if column:
(prob, state) = max(column)
V[t][node] = prob
newpath[node] = path[state] + [node]
# Don't need to remember the old paths
path = newpath
if doprint: self.print_dptable(V)
if not V[len(observations) - 1]:
return (0,[])
else:
(prob, state) = max([(V[len(observations) - 1][node], node) for node in V[len(observations) - 1].keys()])
return (prob, path[state])
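#Example (illustrative sketch): the classic weather HMM often used to
#demonstrate Viterbi decoding; all states and probabilities are made up.
#
# hmm = HiddenMarkovModel('start')
# hmm.settransitions('start', {'rainy': 0.6, 'sunny': 0.4})
# hmm.settransitions('rainy', {'rainy': 0.7, 'sunny': 0.3})
# hmm.settransitions('sunny', {'rainy': 0.4, 'sunny': 0.6})
# hmm.setemission('rainy', {'walk': 0.1, 'shop': 0.4, 'clean': 0.5})
# hmm.setemission('sunny', {'walk': 0.6, 'shop': 0.3, 'clean': 0.1})
# prob, path = hmm.viterbi(['walk', 'shop', 'clean'])
# #path is ['sunny', 'rainy', 'rainy'] for these parameters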
# ********************* Common Functions ******************************
def product(seq):
"""Return the product of a sequence of numerical values.
>>> product([1,2,6])
12
"""
if len(seq) == 0:
return 0
else:
product = 1
for x in seq:
product *= x
return product
# All below functions are mathematical functions from AI: A Modern Approach, see: http://aima.cs.berkeley.edu/python/utils.html
def histogram(values, mode=0, bin_function=None): #from AI: A Modern Approach
"""Return a list of (value, count) pairs, summarizing the input values.
Sorted by increasing value, or if mode=1, by decreasing count.
If bin_function is given, map it over values first."""
if bin_function: values = map(bin_function, values)
bins = {}
for val in values:
bins[val] = bins.get(val, 0) + 1
if mode:
return sorted(bins.items(), key=lambda v: v[1], reverse=True)
else:
return sorted(bins.items())
def log2(x): #from AI: A Modern Approach
"""Base 2 logarithm.
>>> log2(1024)
10.0
"""
return math.log(x, 2)
def mode(values): #from AI: A Modern Approach
"""Return the most common value in the list of values.
>>> mode([1, 2, 3, 2])
2
"""
return histogram(values, mode=1)[0][0]
def median(values): #from AI: A Modern Approach
"""Return the middle value, when the values are sorted.
If there are an even number of elements, try to average the middle two.
If they can't be averaged (e.g. they are strings), choose one at random.
>>> median([10, 100, 11])
11
>>> median([1, 2, 3, 4])
2.5
"""
n = len(values)
values = sorted(values)
if n % 2 == 1:
return values[n//2]
else:
middle2 = values[(n//2)-1:(n//2)+1]
try:
return mean(middle2)
except TypeError:
return random.choice(middle2)
def mean(values): #from AI: A Modern Approach
"""Return the arithmetic average of the values."""
return sum(values) / len(values)
def stddev(values, meanval=None): #from AI: A Modern Approach
"""The standard deviation of a set of values.
Pass in the mean if you already know it."""
if meanval is None: meanval = mean(values)
return math.sqrt( sum([(x - meanval)**2 for x in values]) / (len(values)-1) )
def dotproduct(X, Y): #from AI: A Modern Approach
"""Return the sum of the element-wise product of vectors X and Y.
>>> dotproduct([1, 2, 3], [1000, 100, 10])
1230
"""
return sum([x * y for x, y in zip(X, Y)])
def vector_add(a, b): #from AI: A Modern Approach
"""Component-wise addition of two vectors.
>>> vector_add((0, 1), (8, 9))
(8, 10)
"""
return tuple(map(operator.add, a, b))
def normalize(numbers, total=1.0): #from AI: A Modern Approach
"""Multiply each number by a constant such that the sum is 1.0 (or total).
>>> normalize([1,2,1])
[0.25, 0.5, 0.25]
"""
k = total / sum(numbers)
return [k * n for n in numbers]
###########################################################################################
def levenshtein(s1, s2, maxdistance=9999):
"""Computes the levenshtein distance between two strings. Adapted from: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python"""
l1 = len(s1)
l2 = len(s2)
if l1 < l2:
return levenshtein(s2, s1, maxdistance) #make sure maxdistance is passed along on the swap
if not s1:
return len(s2)
#If the words differ too much in length (given a low maxdistance), we needn't bother computing the distance:
if l1 > l2 + maxdistance:
return maxdistance+1
previous_row = list(range(l2 + 1))
for i, c1 in enumerate(s1):
current_row = [i + 1]
for j, c2 in enumerate(s2):
insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
deletions = current_row[j] + 1 # than s2
substitutions = previous_row[j] + (c1 != c2)
current_row.append(min(insertions, deletions, substitutions))
if current_row[-1] > maxdistance:
return current_row[-1]
previous_row = current_row
return previous_row[-1]
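#Example:
# levenshtein("kitten", "sitting") #returns 3
# levenshtein("a", "abcdef", maxdistance=2) #returns 3, i.e. maxdistance+1,
# #signalling only that the threshold was exceeded (the true distance is 5)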
PyNLPl-1.2.9/pynlpl/tagger.py 0000775 0001750 0000144 00000032236 12201265173 016664 0 ustar proycon users 0000000 0000000 #! /usr/bin/env python
# -*- coding: utf8 -*-
###############################################################
# PyNLPl - Generic Tagger Library
# by Maarten van Gompel (proycon)
# http://ilk.uvt.nl/~mvgompel
# Radboud University Nijmegen
#
# Licensed under GPLv3
#
# Generic Tagger interface for PoS-tagging and lemmatisation,
# offers an interface to various software
#
###############################################################
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import
from pynlpl.common import u
import sys
if sys.version < '3':
from codecs import getwriter
stderr = getwriter('utf-8')(sys.stderr)
stdout = getwriter('utf-8')(sys.stdout)
else:
stderr = sys.stderr
stdout = sys.stdout
import io
import codecs
import json
import getopt
import subprocess
class Tagger(object):
def __init__(self, *args):
global WSDDIR
self.tagger = None
self.mode = args[0]
if args[0] == "file":
if len(args) != 2:
raise Exception("Syntax: file:[filename]")
self.tagger = codecs.open(args[1],'r','utf-8')
elif args[0] == "frog":
if len(args) != 3:
raise Exception("Syntax: frog:[host]:[port]")
from pynlpl.clients.frogclient import FrogClient
port = int(args[2])
self.tagger = FrogClient(args[1],port)
elif args[0] == "freeling":
if len(args) != 3:
raise Exception("Syntax: freeling:[host]:[port]")
from pynlpl.clients.freeling import FreeLingClient
host = args[1]
port = int(args[2])
self.tagger = FreeLingClient(host,port)
elif args[0] == "corenlp":
if len(args) != 1:
raise Exception("Syntax: corenlp")
import corenlp
print("Initialising Stanford Core NLP",file=stderr)
self.tagger = corenlp.StanfordCoreNLP()
elif args[0] == 'treetagger':
if not len(args) == 2:
raise Exception("Syntax: treetagger:[treetagger-bin]")
self.tagger = args[1]
elif args[0] == "durmlex":
if not len(args) == 2:
raise Exception("Syntax: durmlex:[filename]")
print("Reading durm lexicon: ", args[1],file=stderr)
self.mode = "lookup"
self.tagger = {}
f = codecs.open(args[1],'r','utf-8')
for line in f:
fields = line.split('\t')
wordform = fields[0].lower()
lemma = fields[4].split('.')[0]
self.tagger[wordform] = (lemma, 'n')
f.close()
print("Loaded ", len(self.tagger), " wordforms",file=stderr)
elif args[0] == "oldlex":
if not len(args) == 2:
raise Exception("Syntax: oldlex:[filename]")
print("Reading OLDLexique: ", args[1],file=stderr)
self.mode = "lookup"
self.tagger = {}
f = codecs.open(args[1],'r','utf-8')
for line in f:
fields = line.split('\t')
wordform = fields[0].lower()
lemma = fields[1]
if lemma == '=':
lemma = fields[0]
pos = fields[2][0].lower()
self.tagger[wordform] = (lemma, pos)
print("Loaded ", len(self.tagger), " wordforms",file=stderr)
f.close()
else:
raise Exception("Invalid mode: " + args[0])
def __iter__(self):
if self.mode != 'file':
raise Exception("Iteration only possible in file mode")
for line in self.tagger: #iterate over all lines rather than a single one
newwords = []
postags = []
lemmas = []
for item in line.split(' '): #split into word|lemma|pos items
if item.strip():
word,lemma,pos = item.split('|')
newwords.append(word)
postags.append(pos)
lemmas.append(lemma)
yield newwords, postags, lemmas
def reset(self):
if self.mode == 'file':
self.tagger.seek(0)
def process(self, words, debug=False):
if self.mode == 'file':
line = next(self.tagger) #Python 2/3 compatible
newwords = []
postags = []
lemmas = []
for item in line.split(' '):
if item.strip():
try:
word,lemma,pos = item.split('|')
except:
raise Exception("Unable to parse word|lemma|pos in " + item)
newwords.append(word)
postags.append(pos)
lemmas.append(lemma)
return newwords, postags, lemmas
elif self.mode == "frog":
newwords = []
postags = []
lemmas = []
for fields in self.tagger.process(' '.join(words)):
word,lemma,morph,pos = fields[:4]
newwords.append(word)
postags.append(pos)
lemmas.append(lemma)
return newwords, postags, lemmas
elif self.mode == "freeling":
postags = []
lemmas = []
for fields in self.tagger.process(words, debug):
word, lemma,pos = fields[:3]
postags.append(pos)
lemmas.append(lemma)
return words, postags, lemmas
elif self.mode == "corenlp":
data = json.loads(self.tagger.parse(" ".join(words)))
words = []
postags = []
lemmas = []
for sentence in data['sentences']:
for word, worddata in sentence['words']:
words.append(word)
lemmas.append(worddata['Lemma'])
postags.append(worddata['PartOfSpeech'])
return words, postags, lemmas
elif self.mode == 'lookup':
postags = []
lemmas = []
for word in words:
try:
lemma, pos = self.tagger[word.lower()]
lemmas.append(lemma)
postags.append(pos)
except KeyError:
lemmas.append(word)
postags.append('?')
return words, postags, lemmas
elif self.mode == 'treetagger':
s = " ".join(words)
s = u(s)
p = subprocess.Popen([self.tagger], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(out, err) = p.communicate(s.encode('utf-8'))
newwords = []
postags = []
lemmas = []
for line in out.decode('utf-8').split('\n'): #output is bytes; decode first
line = line.strip()
if line:
fields = line.split('\t')
newwords.append( fields[0] )
postags.append( fields[1] )
lemmas.append( fields[2] )
if p.returncode != 0:
print(err,file=stderr)
raise OSError('TreeTagger failed')
return newwords, postags, lemmas
else:
raise Exception("Unknown mode")
def treetagger_tag(self, f_in, f_out,oneperline=False, debug=False):
def flush(sentences):
if sentences:
print("Processing " + str(len(sentences)) + " lines",file=stderr)
for sentence in sentences:
out = ""
p = subprocess.Popen([self.tagger], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(results, err) = p.communicate("\n".join(sentences).encode('utf-8'))
for line in results.decode('utf-8').split('\n'): #output is bytes; decode first
line = line.strip()
if line:
fields = line.split('\t')
word = fields[0]
pos = fields[1]
lemma = fields[2]
if oneperline:
if out: out += "\n"
out += word + "\t" + lemma + "\t" + pos
else:
if out: out += " "
if '|' in word:
word = word.replace('|','_')
if '|' in lemma:
lemma = lemma.replace('|','_')
if '|' in pos:
pos = pos.replace('|','_')
out += word + "|" + lemma + "|" + pos
if pos[0] == '$':
out = u(out)
f_out.write(out + "\n")
if oneperline: f_out.write("\n")
out = ""
if out:
out = u(out)
f_out.write(out + "\n")
if oneperline: f_out.write("\n")
#buffered tagging
sentences = []
linenum = 0
for line in f_in:
linenum += 1
print(" Buffering input @" + str(linenum),file=stderr)
line = line.strip()
if not line or ('.' in line[:-1] or '?' in line[:-1] or '!' in line[:-1]) or (line[-1] != '.' and line[-1] != '?' and line[-1] != '!'):
flush(sentences)
sentences = []
if not line.strip():
f_out.write("\n")
if oneperline: f_out.write("\n")
sentences.append(line)
flush(sentences)
def tag(self, f_in, f_out,oneperline=False, debug=False):
if self.mode == 'treetagger':
self.treetagger_tag(f_in, f_out, oneperline, debug) #pass the actual arguments through
else:
linenum = 0
for line in f_in:
linenum += 1
print(" Tagger input @" + str(linenum),file=stderr)
if line.strip():
words = line.strip().split(' ')
words, postags, lemmas = self.process(words, debug)
out = ""
for word, pos, lemma in zip(words,postags, lemmas):
if word is None: word = ""
if lemma is None: lemma = "?"
if pos is None: pos = "?"
if oneperline:
if out: out += "\n"
out += word + "\t" + lemma + "\t" + pos
else:
if out: out += " "
if '|' in word:
word = word.replace('|','_')
if '|' in lemma:
lemma = lemma.replace('|','_')
if '|' in pos:
pos = pos.replace('|','_')
out += word + "|" + lemma + "|" + pos
out = u(out) #ensure unicode in both Python 2 and 3
f_out.write(out + "\n")
if oneperline:
f_out.write("\n")
else:
f_out.write("\n")
def usage():
print("tagger.py -c [conf] -f [input-filename] -o [output-filename]",file=stderr)
if __name__ == "__main__":
try:
opts, args = getopt.getopt(sys.argv[1:], "f:c:o:D")
except getopt.GetoptError as err:
# print help information and exit:
print(str(err),file=stderr)
usage()
sys.exit(2)
taggerconf = None
filename = None
outfilename = None
oneperline = False
debug = False
for o, a in opts:
if o == "-c":
taggerconf = a
elif o == "-f":
filename = a
elif o == '-o':
outfilename =a
elif o == '-l':
oneperline = True
elif o == '-D':
debug = True
else:
print >>sys.stderr,"Unknown option: ", o
sys.exit(2)
if not taggerconf:
print("ERROR: Specify a tagger configuration with -c",file=stderr)
sys.exit(2)
if not filename:
print("ERROR: Specify a filename with -f",file=stderr)
sys.exit(2)
if outfilename:
f_out = io.open(outfilename,'w',encoding='utf-8')
else:
f_out = stdout
f_in = io.open(filename,'r',encoding='utf-8')
tagger = Tagger(*taggerconf.split(':'))
tagger.tag(f_in, f_out, oneperline, debug)
f_in.close()
if outfilename:
f_out.close()
PyNLPl-1.2.9/pynlpl/tests/ 0000755 0001750 0000144 00000000000 13442242642 016174 5 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/tests/FoLiA/ 0000755 0001750 0000144 00000000000 13442242642 017126 5 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/tests/FoLiA/foliatools/ 0000755 0001750 0000144 00000000000 13442242642 021301 5 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/tests/FoLiA/foliatools/__init__.py 0000644 0001750 0000144 00000000000 13145013073 023372 0 ustar proycon users 0000000 0000000 PyNLPl-1.2.9/pynlpl/tests/FoLiA/foliatools/alpino2folia.py 0000755 0001750 0000144 00000017007 13145013073 024234 0 ustar proycon users 0000000 0000000 #! /usr/bin/env python3
# -*- coding: utf8 -*-
from __future__ import print_function, unicode_literals, division, absolute_import
import lxml
import getopt
import sys
import os
try:
from pynlpl.formats import folia
except ImportError:
print("ERROR: pynlpl not found, please obtain PyNLPL from the Python Package Manager ($ sudo easy_install pynlpl) or directly from github: $ git clone git://github.com/proycon/pynlpl.git", file=sys.stderr)
sys.exit(2)
def usage():
print("alpino2folia",file=sys.stderr)
print(" by Maarten van Gompel (proycon)",file=sys.stderr)
print(" Centre for Language and Speech Technology, Radboud University Nijmegen",file=sys.stderr)
print(" 2012-2016 - Licensed under GPLv3",file=sys.stderr)
print("",file=sys.stderr)
print("This conversion script reads an Alpino XML document and converts",file=sys.stderr)
print("it to FoLiA. If multiple input files are specified, and/or the output FoLiA document already exists, then the",file=sys.stderr)
print("converter will append it.",file=sys.stderr)
print("",file=sys.stderr)
print("Usage: alpino2folia [options] alpino-input [alpino-input 2..] folia-output" ,file=sys.stderr)
def extract_syntax(alpinonode, folianode, foliasentence, alpinoroot):
for node in alpinonode:
if 'word' in node.attrib:
folianode.append(folia.SyntacticUnit, foliasentence[int(node.attrib['begin'])], cls=node.attrib['pos'],id=foliasentence.id+'.alpinonode.'+node.attrib['id'] )
elif 'cat' in node.attrib:
su = folianode.append(folia.SyntacticUnit, cls=node.attrib['cat'],id=foliasentence.id+'.alpinonode.'+node.attrib['id'] )
extract_syntax(node, su, foliasentence,alpinoroot)
else:
print("SYNTAX: Don't know what to do with node...", repr(node.attrib) ,file=sys.stderr)
def extract_dependencies(alpinonode, deplayer, foliasentence):
deps = []
head = None
for node in alpinonode:
#print("DEP:", node,file=sys.stderr)
if not 'word' in node.attrib:
extract_dependencies(node, deplayer, foliasentence )
if 'rel' in node.attrib:
if node.attrib['rel'] == 'hd':
head = folia.DependencyHead(deplayer.doc, foliasentence[int(node.attrib['begin'])])
else:
deps.append( (node.attrib['rel'], folia.DependencyDependent(deplayer.doc, foliasentence[int(node.attrib['begin'])]) ) )
if head:
for cls, dep in deps:
deplayer.append( folia.Dependency, head, dep, cls=cls)
def makefoliadoc(outputfile):
baseid = os.path.basename(outputfile).replace('.folia.xml','').replace('.xml','')
foliadoc = folia.Document(id=baseid)
foliadoc.append(folia.Text(foliadoc, id=baseid+'.text'))
if not foliadoc.declared(folia.AnnotationType.TOKEN, 'alpino-tokens'):
foliadoc.declare(folia.AnnotationType.TOKEN, 'alpino-tokens')
if not foliadoc.declared(folia.LemmaAnnotation, 'alpino-lemmas'):
foliadoc.declare(folia.LemmaAnnotation, 'alpino-lemmas')
if not foliadoc.declared(folia.SenseAnnotation, 'alpino-sense'):
foliadoc.declare(folia.SenseAnnotation, 'alpino-sense')
if not foliadoc.declared(folia.PosAnnotation, 'alpino-pos'):
foliadoc.declare(folia.PosAnnotation, 'alpino-pos')
if not foliadoc.declared(folia.AnnotationType.DEPENDENCY, 'alpino-dependency'):
foliadoc.declare(folia.AnnotationType.DEPENDENCY, 'alpino-dependency')
if not foliadoc.declared(folia.AnnotationType.SYNTAX, 'alpino-syntax'):
foliadoc.declare(folia.AnnotationType.SYNTAX, 'alpino-syntax')
if not foliadoc.declared(folia.AnnotationType.MORPHOLOGICAL, 'alpino-morphology'):
foliadoc.declare(folia.AnnotationType.MORPHOLOGICAL, 'alpino-morphology')
return foliadoc
def alpino2folia(alpinofile, foliadoc):
tree = lxml.etree.parse(alpinofile)
alpinoroot = tree.getroot()
if alpinoroot.tag != 'alpino_ds':
raise Exception("source file is not an alpino file")
sentencenode = alpinoroot.xpath('//sentence')[0]
foliatextbody = foliadoc[-1]
foliasentence = foliatextbody.append(folia.Sentence)
#first pass, extract words
alpinowords = sentencenode.text.split(' ')
for alpinoword in alpinowords:
foliasentence.append(folia.Word,alpinoword.strip())
#loop over lexical nodes
for node in alpinoroot.xpath('//node'):
if 'word' in node.attrib and 'pos' in node.attrib:
index = int(node.attrib['begin'])
if alpinowords[index].strip() != node.attrib['word'].strip():
raise Exception("Inconsistency in Alpino XML! Node@begin refers to word index " + str(index) + ", which has value \"" + alpinowords[index] + "\" and does not correspond with node@word \"" + node.attrib['word'] + "\"")
foliaword = foliasentence[index]
if 'lemma' in node.attrib:
foliaword.append(folia.LemmaAnnotation, cls=node.attrib['lemma'])
if 'sense' in node.attrib:
foliaword.append(folia.SenseAnnotation, cls=node.attrib['sense'])
if 'root' in node.attrib:
layer = foliaword.append(folia.MorphologyLayer)
layer.append(folia.Morpheme, folia.TextContent(foliadoc, node.attrib['root']), cls='root')
if 'postag' in node.attrib and 'pt' in node.attrib:
foliapos = foliaword.append(folia.PosAnnotation, cls=node.attrib['postag'], head=node.attrib['pt'])
elif 'frame' in node.attrib:
foliaword.append(folia.PosAnnotation, cls=node.attrib['frame'], head=node.attrib['pos'])
else:
foliaword.append(folia.PosAnnotation, cls=node.attrib['pos'])
#gather pos features
for key, value in node.attrib.items():
if key in ('wh','per','num','gen','case','def','infl','sc','buiging','refl','tense','comparative','positie','pvagr','pvtijd','graad','pdtype','wvorm','ntype','vwtype','getal','status','naamval','persoon','genus'):
foliapos.append(folia.Feature, subset=key, cls=value)
elif not key in ('sense','pos','rel','postag','pt','frame','root','lemma','id','begin','end','word','index'):
print("WARNING: Ignored attribute " + key + "=\"" + value + "\" on node ",file=sys.stderr)
foliasyntaxlayer = foliasentence.append(folia.SyntaxLayer)
foliasyntaxtop = foliasyntaxlayer.append(folia.SyntacticUnit, cls='top')
#extract syntax
extract_syntax(alpinoroot[0], foliasyntaxtop, foliasentence, alpinoroot)
foliadeplayer = foliasentence.append(folia.DependenciesLayer)
#extract dependencies:
extract_dependencies(alpinoroot[0], foliadeplayer, foliasentence)
return foliadoc
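#Example (illustrative; filenames are placeholders): convert two Alpino
#parses into a single (appended) FoLiA document, mirroring what main() does:
#
# doc = makefoliadoc("example.folia.xml")
# for alpinofile in ("parse1.xml", "parse2.xml"):
#     doc = alpino2folia(alpinofile, doc)
# doc.save("example.folia.xml")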
def main():
try:
opts, args = getopt.getopt(sys.argv[1:], "-h", ["help"])
except getopt.GetoptError as err:
print(str(err),file=sys.stderr)
usage()
sys.exit(2)
for o, a in opts:
if o == '-h' or o == '--help':
usage()
sys.exit(0)
else:
raise Exception("No such option: " + o)
if len(args) < 2:
usage()
sys.exit(2)
else:
alpinofiles = []
for i, arg in enumerate(args):
if i < len(args) - 1:
alpinofiles.append(arg)
foliafile = args[-1]
if os.path.exists(foliafile):
doc = folia.Document(file=foliafile)
else:
doc = makefoliadoc(foliafile)
for alpinofile in alpinofiles:
doc = alpino2folia(alpinofile, doc)
doc.save(foliafile)
if __name__ == "__main__":
main()
PyNLPl-1.2.9/pynlpl/tests/FoLiA/foliatools/cgn2folia.py 0000755 0001750 0000144 00000005603 13145013073 023520 0 ustar proycon users 0000000 0000000 #!/usr/bin/env python
#-*- coding:utf-8 -*-
#---------------------------------------------------------------
# CGN to FoLiA Converter
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# proycon AT anaproy DOT nl
#
# Licensed under GPLv3
#
# This script converts CGN to FoLiA format. (Note that paragraph information
# is not available in CGN and therefore not stored in FoLiA format either.)
#
#----------------------------------------------------------------
from __future__ import print_function, unicode_literals, division, absolute_import
import sys
import glob
import gzip
import os
from pynlpl.formats import folia
CGN_ENCODING = 'iso-8859-15' #used when decoding the gzipped input below
if len(sys.argv) != 3:
print("SYNTAX: ./cgn2folia.py cgnrootdir outputdir", file=sys.stderr)
sys.exit(1)
cgndir = sys.argv[1]
outdir = sys.argv[2]
plkdir = os.path.join(cgndir,"data","annot","text","plk")
for compdir in glob.glob(os.path.join(plkdir, "comp-*")):
collection_id = "CGN-" + os.path.basename(compdir)
print(collection_id)
try:
os.mkdir(os.path.join(outdir, collection_id))
except OSError:
pass #directory may already exist
files = list(glob.glob(os.path.join(compdir,"nl","*.gz"))) + list(glob.glob(os.path.join(compdir, "vl","*.gz")))
for path in files:
text_id = os.path.basename(path).split(".")[0]
print("\t" + text_id)
full_id = collection_id + "_" + text_id
au_id = None
sentence = None
doc = folia.Document(id=full_id)
doc.metadatatype = folia.MetaDataType.IMDI
doc.metadatafile = text_id + ".imdi"
textbody = doc.append(folia.Text(doc, id=full_id+"."+text_id))
doc.declare(folia.PosAnnotation, set="hdl:1839/00-SCHM-0000-0000-000B-9")
doc.declare(folia.LemmaAnnotation, set="hdl:1839/00-SCHM-0000-0000-000E-3")
fin = gzip.open(path,'r')
for line in fin:
line = line.decode(CGN_ENCODING) #works in both Python 2 and 3
if line:
if line[0:3] == '