PyNLPl-1.1.2/0000755000175000001440000000000013024723552013504 5ustar proyconusers00000000000000PyNLPl-1.1.2/setup.cfg0000644000175000001440000000031213024723552015321 0ustar proyconusers00000000000000[build_sphinx] source-dir = ../docs/ build-dir = ../docs/build all_files = 1 [upload_sphinx] upload-dir = ../docs/build/html [easy_install] [egg_info] tag_build = tag_date = 0 tag_svn_revision = 0 PyNLPl-1.1.2/LICENSE0000644000175000001440000010451313024723552014515 0ustar proyconusers00000000000000 GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007 Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The GNU General Public License is a free, copyleft license for software and other kinds of works. The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others. For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. 
Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. 
The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. 
You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. 
c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). 
The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. 
If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. 
For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. 
You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Also add information on how to contact you by electronic and paper mail. If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: Copyright (C) This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box". You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see . 
The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read . PyNLPl-1.1.2/setup.py0000755000175000001440000000537213024723325015226 0ustar proyconusers00000000000000#! /usr/bin/env python # -*- coding: utf8 -*- from __future__ import print_function import os import sys from setuptools import setup, find_packages if os.path.dirname(__file__) != "": os.chdir(os.path.dirname(__file__)) if not os.path.exists('pynlpl'): print("Preparing build",file=sys.stderr) if os.path.exists('build'): os.system('rm -Rf build') os.mkdir('build') os.chdir('build') if not os.path.exists('pynlpl'): os.mkdir('pynlpl') os.system('cp -Rpf ../* pynlpl/ 2> /dev/null') os.system('mv -f pynlpl/setup.py pynlpl/setup.cfg .') os.system('cp -f pynlpl/README.rst .') os.system('cp -f pynlpl/LICENSE .') os.system('cp -f pynlpl/MANIFEST.in .') #Do not include unfininished WIP modules: os.system('rm -f pynlpl/formats/colibri.py pynlpl/formats/alpino.py pynlpl/foliaprocessing.py pynlpl/grammar.py') def read(fname): return open(os.path.join(os.path.dirname(__file__), fname)).read() entry_points = {} if sys.version > '3': entry_points = { 'console_scripts': [ 'pynlpl-computepmi = pynlpl.tools.computepmi:main', 'pynlpl-sampler = pynlpl.tools.sampler:main', 'pynlpl-makefreqlist = pynlpl.tools.freqlist:main', ] } setup( name = "PyNLPl", version = "1.1.2", #edit version in __init__.py as well and ensure tests/folia.py FOLIARELEASE points to the right version! author = "Maarten van Gompel", author_email = "proycon@anaproy.nl", description = ("PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl contains modules for basic tasks, clients for interfacting with server, and modules for parsing several file formats common in NLP, most notably FoLiA."), license = "GPL", keywords = "nlp computational_linguistics search ngrams language_models linguistics toolkit", url = "https://github.com/proycon/pynlpl", packages=['pynlpl','pynlpl.clients','pynlpl.lm','pynlpl.formats','pynlpl.mt','pynlpl.tools','pynlpl.tests'], long_description=read('README.rst'), classifiers=[ "Development Status :: 5 - Production/Stable", "Topic :: Text Processing :: Linguistic", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Operating System :: POSIX", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", ], zip_safe=False, include_package_data=True, package_data = {'pynlpl': ['tests/test.sh', 'tests/evaluation_timbl/*'] }, install_requires=['lxml >= 2.2','httplib2 >= 0.6','rdflib'], entry_points = entry_points ) PyNLPl-1.1.2/PKG-INFO0000644000175000001440000001112213024723552014576 0ustar proyconusers00000000000000Metadata-Version: 1.1 Name: PyNLPl Version: 1.1.2 Summary: PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl contains modules for basic tasks, clients for interfacting with server, and modules for parsing several file formats common in NLP, most notably FoLiA. 
Home-page: https://github.com/proycon/pynlpl
Author: Maarten van Gompel
Author-email: proycon@anaproy.nl
License: GPL
Description: PyNLPl - Python Natural Language Processing Library
=====================================================

.. image:: https://travis-ci.org/proycon/pynlpl.svg?branch=master
    :target: https://travis-ci.org/proycon/pynlpl

.. image:: http://readthedocs.org/projects/pynlpl/badge/?version=latest
    :target: http://pynlpl.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: http://applejack.science.ru.nl/lamabadge.php/pynlpl
    :target: http://applejack.science.ru.nl/languagemachines/

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

The library is divided into several packages and modules. It works on Python 2.7, as well as Python 3. The following modules are available:

- ``pynlpl.datatypes`` - Extra datatypes (priority queues, patterns, tries)
- ``pynlpl.evaluation`` - Evaluation & experiment classes (parameter search, wrapped progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool)
- ``pynlpl.formats.cgn`` - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags
- ``pynlpl.formats.folia`` - Extensive library for reading and manipulating documents in `FoLiA `_ format (Format for Linguistic Annotation).
- ``pynlpl.formats.fql`` - Extensive library for the FoLiA Query Language (FQL), built on top of ``pynlpl.formats.folia``. FQL is currently documented `here `__.
- ``pynlpl.formats.cql`` - Parser for the Corpus Query Language (CQL), as also used by Corpus Workbench and Sketch Engine. Contains a convertor to FQL.
- ``pynlpl.formats.giza`` - Module for reading GIZA++ word alignment data
- ``pynlpl.formats.moses`` - Module for reading Moses phrase-translation tables.
- ``pynlpl.formats.sonar`` - Largely obsolete module for pre-releases of the SoNaR corpus, use ``pynlpl.formats.folia`` instead.
- ``pynlpl.formats.timbl`` - Module for reading Timbl output (consider using `python-timbl `_ instead though)
- ``pynlpl.lm.lm`` - Module for simple language models, including a reader for ARPA language model data (as used by SRILM).
- ``pynlpl.search`` - Various search algorithms (breadth-first, depth-first, beam search, hill climbing, A star, various variants of each)
- ``pynlpl.statistics`` - Frequency lists, Levenshtein distance, common statistics and information theory functions
- ``pynlpl.textprocessors`` - Simple tokeniser, n-gram extraction

API Documentation can be found `here `__.
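The following is a brief editor-added usage sketch, based only on the module descriptions above and on how these classes are used elsewhere in this package (``Windower``, ``FrequencyList.count`` and ``FrequencyList.output``); the default sentence-boundary markers inserted by ``Windower`` are an assumption and may differ per release::

    from pynlpl.textprocessors import Windower
    from pynlpl.statistics import FrequencyList

    tokens = "to be or not to be".split()   # a pre-tokenised sentence
    freqlist = FrequencyList(None)          # start with an empty frequency list
    for bigram in Windower(tokens, 2):      # sliding window of size n=2
        freqlist.count(bigram)              # count each bigram (a tuple of tokens)
    for line in freqlist.output():          # tab-separated "ngram<TAB>count" lines
        print(line)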
Keywords: nlp computational_linguistics search ngrams language_models linguistics toolkit Platform: UNKNOWN Classifier: Development Status :: 5 - Production/Stable Classifier: Topic :: Text Processing :: Linguistic Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 Classifier: Operating System :: POSIX Classifier: Intended Audience :: Developers Classifier: Intended Audience :: Science/Research Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) PyNLPl-1.1.2/MANIFEST.in0000644000175000001440000000023713024723552015244 0ustar proyconusers00000000000000include README.rst include LICENSE include requirements.txt recursive-include pynlpl *.py include pynlpl/tests/test.sh include pynlpl/tests/evaluation_timbl/* PyNLPl-1.1.2/PyNLPl.egg-info/0000755000175000001440000000000013024723552016314 5ustar proyconusers00000000000000PyNLPl-1.1.2/PyNLPl.egg-info/entry_points.txt0000644000175000001440000000024013024723552021606 0ustar proyconusers00000000000000[console_scripts] pynlpl-computepmi = pynlpl.tools.computepmi:main pynlpl-makefreqlist = pynlpl.tools.freqlist:main pynlpl-sampler = pynlpl.tools.sampler:main PyNLPl-1.1.2/PyNLPl.egg-info/SOURCES.txt0000644000175000001440000000350613024723552020204 0ustar proyconusers00000000000000LICENSE MANIFEST.in README.rst setup.cfg setup.py PyNLPl.egg-info/PKG-INFO PyNLPl.egg-info/SOURCES.txt PyNLPl.egg-info/dependency_links.txt PyNLPl.egg-info/entry_points.txt PyNLPl.egg-info/not-zip-safe PyNLPl.egg-info/requires.txt PyNLPl.egg-info/top_level.txt pynlpl/__init__.py pynlpl/algorithms.py pynlpl/common.py pynlpl/datatypes.py pynlpl/evaluation.py pynlpl/fsa.py pynlpl/net.py pynlpl/search.py pynlpl/statistics.py pynlpl/tagger.py pynlpl/textprocessors.py pynlpl/build/pynlpl/algorithms.py pynlpl/clients/__init__.py pynlpl/clients/cornetto.py pynlpl/clients/freeling.py pynlpl/clients/frogclient.py pynlpl/docs/conf.py pynlpl/formats/__init__.py pynlpl/formats/cgn.py pynlpl/formats/cql.py pynlpl/formats/dutchsemcor.py pynlpl/formats/folia.py pynlpl/formats/foliaset.py pynlpl/formats/fql.py pynlpl/formats/giza.py pynlpl/formats/imdi.py pynlpl/formats/moses.py pynlpl/formats/sonar.py pynlpl/formats/taggerdata.py pynlpl/formats/timbl.py pynlpl/lm/__init__.py pynlpl/lm/client.py pynlpl/lm/lm.py pynlpl/lm/server.py pynlpl/lm/srilm.py pynlpl/mt/__init__.py pynlpl/mt/wordalign.py pynlpl/tests/__init__.py pynlpl/tests/cgn.py pynlpl/tests/cql.py pynlpl/tests/datatypes.py pynlpl/tests/evaluation.py pynlpl/tests/folia.py pynlpl/tests/folia_benchmark.py pynlpl/tests/formats.py pynlpl/tests/fql.py pynlpl/tests/search.py pynlpl/tests/statistics.py pynlpl/tests/test.sh pynlpl/tests/textprocessors.py pynlpl/tests/evaluation_timbl/test pynlpl/tests/evaluation_timbl/test.IB1.O.gr.k1.out pynlpl/tests/evaluation_timbl/timbltest.sh pynlpl/tests/evaluation_timbl/train pynlpl/tools/__init__.py pynlpl/tools/computepmi.py pynlpl/tools/foliasplitcgnpostags.py pynlpl/tools/freqlist.py pynlpl/tools/frogwrapper.py pynlpl/tools/phrasetableserver.py pynlpl/tools/reflow.py pynlpl/tools/sampler.py pynlpl/tools/sonar2folia.py pynlpl/tools/sonarlemmafreqlist.pyPyNLPl-1.1.2/PyNLPl.egg-info/PKG-INFO0000644000175000001440000001112213024723552017406 0ustar proyconusers00000000000000Metadata-Version: 1.1 Name: PyNLPl Version: 1.1.2 Summary: PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. 
PyNLPl contains modules for basic tasks, clients for interfacting with server, and modules for parsing several file formats common in NLP, most notably FoLiA. Home-page: https://github.com/proycon/pynlpl Author: Maarten van Gompel Author-email: proycon@anaproy.nl License: GPL Description: PyNLPl - Python Natural Language Processing Library ===================================================== .. image:: https://travis-ci.org/proycon/pynlpl.svg?branch=master :target: https://travis-ci.org/proycon/pynlpl .. image:: http://readthedocs.org/projects/pynlpl/badge/?version=latest :target: http://pynlpl.readthedocs.io/en/latest/?badge=latest :alt: Documentation Status .. image:: http://applejack.science.ru.nl/lamabadge.php/pynlpl :target: http://applejack.science.ru.nl/languagemachines/ PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotatation). The library is a divided into several packages and modules. It works on Python 2.7, as well as Python 3. The following modules are available: - ``pynlpl.datatypes`` - Extra datatypes (priority queues, patterns, tries) - ``pynlpl.evaluation`` - Evaluation & experiment classes (parameter search, wrapped progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool) - ``pynlpl.formats.cgn`` - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags - ``pynlpl.formats.folia`` - Extensive library for reading and manipulating the documents in `FoLiA `_ format (Format for Linguistic Annotation). - ``pynlpl.formats.fql`` - Extensive library for the FoLiA Query Language (FQL), built on top of ``pynlpl.formats.folia``. FQL is currently documented `here `__. - ``pynlpl.formats.cql`` - Parser for the Corpus Query Language (CQL), as also used by Corpus Workbench and Sketch Engine. Contains a convertor to FQL. - ``pynlpl.formats.giza`` - Module for reading GIZA++ word alignment data - ``pynlpl.formats.moses`` - Module for reading Moses phrase-translation tables. - ``pynlpl.formats.sonar`` - Largely obsolete module for pre-releases of the SoNaR corpus, use ``pynlpl.formats.folia`` instead. - ``pynlpl.formats.timbl`` - Module for reading Timbl output (consider using `python-timbl `_ instead though) - ``pynlpl.lm.lm`` - Module for simple language model and reader for ARPA language model data as well (used by SRILM). - ``pynlpl.search`` - Various search algorithms (Breadth-first, depth-first, beam-search, hill climbing, A star, various variants of each) - ``pynlpl.statistics`` - Frequency lists, Levenshtein, common statistics and information theory functions - ``pynlpl.textprocessors`` - Simple tokeniser, n-gram extraction API Documentation can be found `here `__. 
Keywords: nlp computational_linguistics search ngrams language_models linguistics toolkit
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: POSIX
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)

PyNLPl-1.1.2/PyNLPl.egg-info/dependency_links.txt

PyNLPl-1.1.2/PyNLPl.egg-info/not-zip-safe

PyNLPl-1.1.2/PyNLPl.egg-info/top_level.txt
pynlpl

PyNLPl-1.1.2/PyNLPl.egg-info/requires.txt
lxml >= 2.2
httplib2 >= 0.6
rdflib

PyNLPl-1.1.2/pynlpl/fsa.py
#---------------------------------------------------------------
# PyNLPl - Finite State Automata
# by Maarten van Gompel
# Centre for Language Studies
# Radboud University Nijmegen
# http://proycon.github.com/folia
# http://www.github.com/proycon/pynlpl
# proycon AT anaproy DOT nl
#
# Partially based/inspired on code by Xiayun Sun (https://github.com/xysun/regex)
#
# Licensed under GPLv3
#
#----------------------------------------------------------------

from __future__ import print_function, unicode_literals, division, absolute_import

import sys

class State(object):
    def __init__(self, **kwargs):
        if 'epsilon' in kwargs:
            self.epsilon = kwargs['epsilon'] # epsilon-closure (list of states)
        else:
            self.epsilon = [] # epsilon-closure
        if 'transitions' in kwargs:
            self.transitions = kwargs['transitions']
        else:
            self.transitions = [] #(matchitem, matchfunction(value), state)
        if 'final' in kwargs:
            self.final = bool(kwargs['final']) # ending state
        else:
            self.final = False
        self.transitioned = None #will be a tuple (state, matchitem) indicating how this state was reached

class NFA(object):
    """Non-deterministic finite state automaton.
    Can be used to model DFAs as well if your state transitions are not ambiguous and epsilon is empty."""

    def __init__(self, initialstate):
        self.initialstate = initialstate

    def run(self, sequence, mustmatchall=False, debug=False):
        def add(state, states):
            """add state and recursively add epsilon transitions"""
            assert isinstance(state, State)
            if state in states:
                return
            states.add(state)
            for eps in state.epsilon: #recurse into epsilon transitions
                add(eps, states)

        current_states = set()
        add(self.initialstate, current_states)
        if debug: print("Starting run, current states: ", repr(current_states), file=sys.stderr)

        for offset, value in enumerate(sequence):
            if not current_states: break
            if debug: print("Value: ", repr(value), file=sys.stderr)
            next_states = set()
            for state in current_states:
                for matchitem, matchfunction, trans_state in state.transitions:
                    if matchfunction(value):
                        trans_state.transitioned = (state, matchitem)
                        add(trans_state, next_states)

            current_states = next_states
            if debug: print("Current states: ", repr(current_states), file=sys.stderr)
            if not mustmatchall:
                for s in current_states:
                    if s.final:
                        if debug: print("Final state reached", file=sys.stderr)
                        yield offset+1

        if mustmatchall:
            for s in current_states:
                if s.final:
                    if debug: print("Final state reached", file=sys.stderr)
                    yield offset+1

    def match(self, sequence):
        try:
            return next(self.run(sequence, True)) == len(sequence)
        except StopIteration:
            return False

    def find(self, sequence, debug=False):
        l = len(sequence)
        for i in range(0, l):
            for length in self.run(sequence[i:], False, debug):
                yield sequence[i:i+length]

    def __iter__(self):
        return iter(self._states(self.initialstate))

    def _states(self, state, processedstates=[]): #pylint: disable=dangerous-default-value
        """Iterate over all states in no particular order"""
        processedstates.append(state)
        for nextstate in state.epsilon:
            if not nextstate in processedstates:
                self._states(nextstate, processedstates)
        for _, _, nextstate in state.transitions: #transitions are (matchitem, matchfunction, state) triples
            if not nextstate in processedstates:
                self._states(nextstate, processedstates)
        return processedstates

    def __repr__(self):
        out = []
        for state in self:
            staterep = repr(state)
            if state is self.initialstate:
                staterep += " (INITIAL)"
            for nextstate in state.epsilon:
                nextstaterep = repr(nextstate)
                if nextstate.final:
                    nextstaterep += " (FINAL)"
                out.append( staterep + " -e-> " + nextstaterep )
            for item, _, nextstate in state.transitions:
                nextstaterep = repr(nextstate)
                if nextstate.final:
                    nextstaterep += " (FINAL)"
                out.append( staterep + " -(" + repr(item) + ")-> " + nextstaterep )
        return "\n".join(out)
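
#---------------------------------------------------------------
# Editor-added illustrative sketch (not part of the original module): a minimal
# usage example for the State and NFA classes defined above. It builds, by hand,
# an automaton for the token pattern "a b+"; the state names, the lambda match
# functions and the test sequences below are assumptions for illustration only.
#---------------------------------------------------------------

if __name__ == '__main__':
    s2 = State(final=True)                                    # final state, reached after at least one "b"
    s1 = State(transitions=[("b", lambda v: v == "b", s2)])   # after the initial "a", expect a "b"
    s2.transitions.append(("b", lambda v: v == "b", s2))      # allow "b" to repeat
    s0 = State(transitions=[("a", lambda v: v == "a", s1)])   # initial state, expects "a" first

    automaton = NFA(s0)
    print(automaton.match(["a", "b", "b"]))              # True: the whole sequence matches a b+
    print(automaton.match(["a"]))                         # False: at least one "b" is required
    print(list(automaton.find(["x", "a", "b", "y"])))     # [['a', 'b']]: find() yields matching subsequences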
PyNLPl-1.1.2/pynlpl/lm/server.py
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#---------------------------------------------------------------
# PyNLPl - Language Models
# by Maarten van Gompel, ILK, Universiteit van Tilburg
# http://ilk.uvt.nl/~mvgompel
# proycon AT anaproy DOT nl
#
# Generic Server for Language Models
#
#----------------------------------------------------------------

#No Python 3 support for twisted yet...

from twisted.internet import protocol, reactor
from twisted.protocols import basic

class LMSentenceProtocol(basic.LineReceiver):
    def lineReceived(self, sentence):
        try:
            score = self.factory.lm.scoresentence(sentence)
        except:
            score = 0.0
        self.sendLine(str(score))

class LMSentenceFactory(protocol.ServerFactory):
    protocol = LMSentenceProtocol

    def __init__(self, lm):
        self.lm = lm

class LMNGramProtocol(basic.LineReceiver):
    def lineReceived(self, ngram):
        ngram = ngram.split(" ")
        try:
            score = self.factory.lm[ngram]
        except:
            score = 0.0
        self.sendLine(str(score))

class LMNGramFactory(protocol.ServerFactory):
    protocol = LMNGramProtocol

    def __init__(self, lm):
        self.lm = lm

class LMServer:
    """Language Model Server"""
    def __init__(self, lm, port=12346, n=0):
        """n indicates the n-gram size; if set to 0 (the default), the server expects to receive only whole sentences; if set to a particular value, it will only accept n-grams of that size"""
        if n == 0:
            reactor.listenTCP(port, LMSentenceFactory(lm))
        else:
            reactor.listenTCP(port, LMNGramFactory(lm))
        reactor.run()

PyNLPl-1.1.2/pynlpl/lm/client.py
#!/usr/bin/env python
#-*- coding:utf-8 -*-

from __future__ import print_function
from __future__ import unicode_literals
from __future__ import division
from __future__ import absolute_import

import sys
import socket

class LMClient(object):

    def __init__(self, host="localhost", port=12346, n=0):
        self.BUFSIZE = 1024
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) #Create the socket
        self.socket.settimeout(120)
        assert isinstance(port, int)
        self.socket.connect((host, port)) #Connect to server
        assert isinstance(n, int)
        self.n = n

    def scoresentence(self, sentence):
        if self.n > 0:
            raise Exception("This client instance has been set to send only " + str(self.n) + "-grams")
        if isinstance(sentence, list) or isinstance(sentence, tuple):
            sentence = " ".join(sentence)
        if not isinstance(sentence, bytes): #ensure bytes before sending over the socket
            sentence = sentence.encode('utf-8')
        self.socket.send(sentence + b"\r\n")
        return float(self.socket.recv(self.BUFSIZE).strip())

    def __getitem__(self, ngram):
        if self.n == 0:
            raise Exception("This client instance has been set to send only full sentences, not n-grams")
        if isinstance(ngram, str) or (sys.version < '3' and isinstance(ngram, unicode)): #unicode only exists on Python 2, guarded by the version check
            ngram = ngram.split(" ")
        if len(ngram) != self.n:
            raise Exception("This client instance has been set to send only " + str(self.n) + "-grams.")
        ngram = " ".join(ngram)
        if not isinstance(ngram, bytes): #ensure bytes before sending over the socket
            ngram = ngram.encode('utf-8')
        self.socket.send(ngram + b"\r\n")
        return float(self.socket.recv(self.BUFSIZE).strip())

PyNLPl-1.1.2/pynlpl/lm/lm.py
#---------------------------------------------------------------
# PyNLPl - Language Models
# by Maarten van Gompel, ILK, Universiteit van Tilburg
# http://ilk.uvt.nl/~mvgompel
# proycon AT anaproy DOT nl
#
# Licensed under GPLv3
#
#----------------------------------------------------------------

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import io
import math
import sys

from pynlpl.statistics import FrequencyList, product
from pynlpl.textprocessors import Windower

if sys.version < '3':
    from codecs import getwriter
    stderr = getwriter('utf-8')(sys.stderr)
    stdout = getwriter('utf-8')(sys.stdout)
else:
    stderr = sys.stderr
    stdout = sys.stdout

class SimpleLanguageModel:
    """This is a simple unsmoothed language model.
This class can both hold and compute the model.""" def __init__(self, n=2, casesensitive = True, beginmarker = "", endmarker = ""): self.casesensitive = casesensitive self.freqlistN = FrequencyList(None, self.casesensitive) self.freqlistNm1 = FrequencyList(None, self.casesensitive) assert isinstance(n,int) and n >= 2 self.n = n self.beginmarker = beginmarker self.endmarker = endmarker self.sentences = 0 if self.beginmarker: self._begingram = tuple([self.beginmarker] * (n-1)) if self.endmarker: self._endgram = tuple([self.endmarker] * (n-1)) def append(self, sentence): if isinstance(sentence, str) or isinstance(sentence, unicode): sentence = sentence.strip().split(' ') self.sentences += 1 for ngram in Windower(sentence,self.n, self.beginmarker, self.endmarker): self.freqlistN.count(ngram) for ngram in Windower(sentence,self.n-1, self.beginmarker, self.endmarker): self.freqlistNm1.count(ngram) def load(self, filename): self.freqlistN = FrequencyList(None, self.casesensitive) self.freqlistNm1 = FrequencyList(None, self.casesensitive) f = io.open(filename,'r',encoding='utf-8') mode = False for line in f.readlines(): line = line.strip() if line: if not mode: if line != "[simplelanguagemodel]": raise Exception("File is not a SimpleLanguageModel") else: mode = 1 elif mode == 1: if line[:2] == 'n=': self.n = int(line[2:]) elif line[:12] == 'beginmarker=': self.beginmarker = line[12:] elif line[:10] == 'endmarker=': self.endmarker = line[10:] elif line[:10] == 'sentences=': self.sentences = int(line[10:]) elif line[:14] == 'casesensitive=': self.casesensitive = bool(int(line[14:])) self.freqlistN = FrequencyList(None, self.casesensitive) self.freqlistNm1 = FrequencyList(None, self.casesensitive) elif line == "[freqlistN]": mode = 2 else: raise Exception("Syntax error in language model file: ", line) elif mode == 2: if line == "[freqlistNm1]": mode = 3 else: try: type, count = line.split("\t") self.freqlistN.count(type.split(' '),int(count)) except: print("Warning, could not parse line whilst loading frequency list: ", line,file=stderr) elif mode == 3: try: type, count = line.split("\t") self.freqlistNm1.count(type.split(' '),int(count)) except: print("Warning, could not parse line whilst loading frequency list: ", line,file=stderr) if self.beginmarker: self._begingram = [self.beginmarker] * (self.n-1) if self.endmarker: self._endgram = [self.endmarker] * (self.n-1) def save(self, filename): f = io.open(filename,'w',encoding='utf-8') f.write("[simplelanguagemodel]\n") f.write("n="+str(self.n)+"\n") f.write("sentences="+str(self.sentences)+"\n") f.write("beginmarker="+self.beginmarker+"\n") f.write("endmarker="+self.endmarker+"\n") f.write("casesensitive="+str(int(self.casesensitive))+"\n") f.write("\n") f.write("[freqlistN]\n") for line in self.freqlistN.output(): f.write(line+"\n") f.write("[freqlistNm1]\n") for line in self.freqlistNm1.output(): f.write(line+"\n") f.close() def scoresentence(self, sentence): return product([self[x] for x in Windower(sentence, self.n, self.beginmarker, self.endmarker)]) def __getitem__(self, ngram): assert len(ngram) == self.n nm1gram = ngram[:-1] if (self.beginmarker and nm1gram == self._begingram) or (self.endmarker and nm1gram == self._endgram): return self.freqlistN[ngram] / float(self.sentences) else: return self.freqlistN[ngram] / float(self.freqlistNm1[nm1gram]) class ARPALanguageModel(object): """Full back-off language model, loaded from file in ARPA format. This class does not build the model but allows you to use a pre-computed one. 
You can use the tool ngram-count from for instance SRILM to actually build the model. """ class NgramsProbs(object): """Store Ngrams with their probabilities and backoffs. This class is used in order to abstract the physical storage layout, and enable memory/speed tradeoffs. """ def __init__(self, data, mode='simple', delim=' '): """Create an ngrams storage with the given method: 'simple' method is a Python dictionary (quick, takes much memory). 'trie' method is more space-efficient (~35% reduction) but slower. data is a dictionary of ngram-tuple => (probability, backoff). delim is the strings which converts ngrams between tuple and unicode string (for saving in trie mode). """ self.delim = delim self.mode = mode if mode == 'simple': self._data = data elif mode == 'trie': import marisa_trie self._data = marisa_trie.RecordTrie("@dd", [(self.delim.join(k), v) for k, v in data.items()]) else: raise ValueError("mode {} is not supported for NgramsProbs".format(mode)) def prob(self, ngram): """Return probability of given ngram tuple""" return self._data[ngram][0] if self.mode == 'simple' else self._data[self.delim.join(ngram)][0][0] def backoff(self, ngram): """Return backoff value of a given ngram tuple""" return self._data[ngram][1] if self.mode == 'simple' else self._data[self.delim.join(ngram)][0][1] def __len__(self): return len(self._data) def __init__(self, filename, encoding='utf-8', encoder=None, base_e=True, dounknown=True, debug=False, mode='simple'): # parameters self.encoder = (lambda x: x) if encoder is None else encoder self.base_e = base_e self.dounknown = dounknown self.debug = debug self.mode = mode # other attributes self.total = {} data = {} with io.open(filename, 'rt', encoding=encoding) as f: order = None for line in f: line = line.strip() if line == '\\data\\': order = 0 elif line == '\\end\\': break elif line.startswith('\\') and line.endswith(':'): for i in range(1, 10): if line == '\\{}-grams:'.format(i): order = i break else: raise ValueError("Order of n-gram is not supported!") elif line: if order == 0: # still in \data\ section if line.startswith('ngram'): n = int(line[6]) v = int(line[8:]) self.total[n] = v elif order > 0: fields = line.split('\t') logprob = float(fields[0]) if base_e: # * log(10) does log10 to log_e conversion logprob *= math.log(10) ngram = self.encoder(tuple(fields[1].split())) if len(fields) > 2: backoffprob = float(fields[2]) if base_e: # * log(10) does log10 to log_e conversion backoffprob *= math.log(10) if self.debug: msg = "Adding to LM: {}\t{}\t{}" print(msg.format(ngram, logprob, backoffprob), file=stderr) else: backoffprob = 0.0 if self.debug: msg = "Adding to LM: {}\t{}" print(msg.format(ngram, logprob), file=stderr) data[ngram] = (logprob, backoffprob) elif self.debug: print("Unable to parse ARPA LM line: " + line, file=stderr) self.order = order self.ngrams = self.NgramsProbs(data, mode) def score(self, data, history=None): result = 0 for word in data: result += self.scoreword(word, history) if history: history += (word,) else: history = (word,) return result def scoreword(self, word, history=None): if isinstance(word, str) or (sys.version < '3' and isinstance(word, unicode)): word = (word,) if history: lookup = history + word else: lookup = word if len(lookup) > self.order: lookup = lookup[-self.order:] try: return self.ngrams.prob(lookup) except KeyError: # not found, back off if not history: if self.dounknown: try: return self.ngrams.prob(('',)) except KeyError: msg = "Word {} not found. And no history specified and model has no ." 
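# Note on the base_e conversion used above: it relies on the identity
# log_e(p) = log_10(p) * log_e(10), so multiplying an ARPA log10 value by
# math.log(10) (~2.302585) yields the natural-log equivalent. For example a
# log10 probability of -1.5 becomes -1.5 * 2.302585... = -3.4539..., and
# math.exp(-3.4539...) equals 10 ** -1.5 (~0.0316).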
raise KeyError(msg.format(word)) else: msg = "Word {} not found. And no history specified." raise KeyError(msg.format(word)) else: try: backoffweight = self.ngrams.backoff(history) except KeyError: backoffweight = 0 # backoff weight will be 0 if not found return backoffweight + self.scoreword(word, history[1:]) def __len__(self): return len(self.ngrams) PyNLPl-1.1.2/pynlpl/lm/__init__.py0000644000175000001440000000015712445064173017551 0ustar proyconusers00000000000000"""This package contains modules for Language Models, with a C++/Python module for SRILM by Sander Canisius""" PyNLPl-1.1.2/pynlpl/lm/srilm.py0000644000175000001440000000412313024723323017126 0ustar proyconusers00000000000000#--------------------------------------------------------------- # PyNLPl - SRILM Language Model # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Adapted from code by Sander Canisius # # Licensed under GPLv3 # # # This library enables using SRILM as language model # #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import try: import srilmcc except ImportError: import warnings warnings.warn("srilmcc module is not compiled") srilmcc = None from pynlpl.textprocessors import Windower class SRILMException(Exception): """Base Exception for SRILM.""" class SRILM: def __init__(self, filename, n): if not srilmcc: raise SRILMException( "SRILM is not downloaded and compiled." "Please follow the instructions in makesrilmcc") self.model = srilmcc.LanguageModel(filename, n) self.n = n def scoresentence(self, sentence, unknownwordprob=-12): score = 0 for ngram in Windower(sentence, self.n, "", ""): try: score += self.logscore(ngram) except KeyError: score += unknownwordprob return 10**score def __getitem__(self, ngram): return 10**self.logscore(ngram) def __contains__(self, key): return self.model.exists( key ) def logscore(self, ngram): #Bug work-around #if "" in ngram or "_" in ngram or "__" in ngram: # print >> sys.stderr, "WARNING: Invalid word in n-gram! Ignoring", ngram # return -999.9 if len(ngram) == self.n: if all( (self.model.exists(x) for x in ngram) ): #no phrases, basic trigram, compute directly return self.model.wordProb(*ngram) else: raise KeyError else: raise Exception("Not an " + str(self.n) + "-gram") PyNLPl-1.1.2/pynlpl/clients/0000755000175000001440000000000013024723552016463 5ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/clients/freeling.py0000644000175000001440000000755612445064173020650 0ustar proyconusers00000000000000############################################################### # PyNLPl - FreeLing Library # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Radboud University Nijmegen # # Licensed under GPLv3 # # This is a Python library for on-the-fly communication with # a FreeLing server. Allowing on-the-fly lemmatisation and # PoS-tagging. It is recommended to pass your data on a # sentence-by-sentence basis to FreeLingClient.process() # # Make sure to start Freeling (analyzer) with the --server # and --flush flags !!!!! 
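# Minimal usage sketch for the SRILM wrapper defined above. It requires the srilmcc
# extension to be compiled (see the import-time warning); the model path, order and
# tokens are illustrative.
from pynlpl.lm.srilm import SRILM

lm = SRILM("/path/to/model.srilm", 3)               # hypothetical trained SRILM model
print(lm.scoresentence("dit is een zin".split()))   # product of 10**logprob per trigram,
                                                    # falling back to unknownwordprob=-12
print(lm[("dit", "is", "een")])                     # probability of one trigram (KeyError if unseen)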
# ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import socket import sys class FreeLingClient(object): def __init__(self, host, port, encoding='utf-8', timeout=120.0): """Initialise the client, set channel to the path and filename where the server's .in and .out pipes are (without extension)""" self.encoding = encoding self.BUFSIZE = 10240 self.socket = socket.socket(socket.AF_INET,socket.SOCK_STREAM) self.socket.settimeout(timeout) self.socket.connect( (host,int(port)) ) self.encoding = encoding self.socket.sendall('RESET_STATS\0') r = self.socket.recv(self.BUFSIZE) if not r.strip('\0') == 'FL-SERVER-READY': raise Exception("Server not ready") def process(self, sourcewords, debug=False): """Process a list of words, passing it to the server and realigning the output with the original words""" if isinstance( sourcewords, list ) or isinstance( sourcewords, tuple ): sourcewords_s = " ".join(sourcewords) else: sourcewords_s = sourcewords sourcewords = sourcewords.split(' ') self.socket.sendall(sourcewords_s.encode(self.encoding) +'\n\0') if debug: print("Sent:",sourcewords_s.encode(self.encoding),file=sys.stderr) results = [] done = False while not done: data = b"" while not data: buffer = self.socket.recv(self.BUFSIZE) if debug: print("Buffer: ["+repr(buffer)+"]",file=sys.stderr) if buffer[-1] == '\0': data += buffer[:-1] done = True break else: data += buffer data = u(data,self.encoding) if debug: print("Received:",data,file=sys.stderr) for i, line in enumerate(data.strip(' \t\0\r\n').split('\n')): if not line.strip(): done = True break else: cols = line.split(" ") subwords = cols[0].lower().split("_") if len(cols) > 2: #this seems a bit odd? for word in subwords: #split multiword expressions results.append( (word, cols[1], cols[2], i, len(subwords) > 1 ) ) #word, lemma, pos, index, multiword? sourcewords = [ w.lower() for w in sourcewords ] alignment = [] for i, sourceword in enumerate(sourcewords): found = False best = 0 distance = 999999 for j, (targetword, lemma, pos, index, multiword) in enumerate(results): if sourceword == targetword and abs(i-j) < distance: found = True best = j distance = abs(i-j) if found: alignment.append(results[best]) else: alignment.append((None,None,None,None,False)) #no alignment found return alignment PyNLPl-1.1.2/pynlpl/clients/__init__.py0000644000175000001440000000011512445064173020574 0ustar proyconusers00000000000000"""This packages contains clients for communicating with specific servers""" PyNLPl-1.1.2/pynlpl/clients/cornetto.py0000644000175000001440000012472112445064173020704 0ustar proyconusers00000000000000# -*- coding: utf-8 -*- ############################################################### # PyNLPl - Remote Cornetto Client # Adapted from code by Fons Laan (ILPS-ISLA, UvA) # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # # This is a Python library for connecting to a Cornetto database. # Originally coded by Fons Laan (ILPS-ISLA, University of Amsterdam) # for DutchSemCor. # # The library currently has only a minimal set of functionality compared # to the original. It will be extended on a need-to basis. 
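# Minimal usage sketch for the FreeLingClient defined above. It assumes a FreeLing
# analyzer has been started with --server and --flush (as the header instructs);
# host, port and the sentence are illustrative. As written the client exchanges
# plain str over the socket, so it is Python-2 oriented.
from pynlpl.clients.freeling import FreeLingClient

client = FreeLingClient("localhost", 50005)
for word, lemma, pos, index, multiword in client.process("de kat zat op de mat".split()):
    print(word, lemma, pos)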
# ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys import httplib2 # version 0.6.0+ if sys.version < '3': import urlparse import httplib else: from urllib import parse as urlparse # renamed to urllib.parse in Python 3.0 import http.client as httplib #renamed in Python 3 import urllib, base64 from sys import stderr #import pickle printf = lambda x: sys.stdout.write(x+ "\n") from lxml import etree class CornettoClient(object): def __init__(self, user='gast',password='gast',host='debvisdic.let.vu.nl', port=9002, path = '/doc', scheme='https',debug=False): self.host = host self.port = port self.path = path self.scheme = scheme self.debug = debug self.userid = user self.passwd = password def connect(self): if self.debug: printf( "cornettodb/views/remote_open()" ) # permission denied on cornetto with apache # http = httplib2.Http( ".cache" ) try: http = httplib2.Http(disable_ssl_certificate_validation=True) except TypeError: print >>stderr, "[CornettoClient] WARNING: Older version of httplib2! Can not disable_ssl_certificate_validation" http = httplib2.Http() #for older httplib2 # VU DEBVisDic authentication http.add_credentials( self.userid, self.passwd ) params = "" # query = "action=init" # obsolete query = "action=connect" fragment = "" db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), self.path, params, query, fragment ) db_url = urlparse.urlunparse( db_url_tuple ) if self.debug: printf( "db_url: %s" % db_url ) printf( "http.request()..." ); try: resp, content = http.request( db_url, "GET" ) if self.debug: printf( "resp:\n%s" % resp ) printf( "content:\n%s" % content ) except: printf( "...failed." 
); # when CORNETTO_HOST is off-line, we do not have a response resp = None content = None return http, resp, content def get_syn_ids_by_lemma(self, lemma): """Returns a list of synset IDs based on a lemma""" if not isinstance(lemma,unicode): lemma = unicode(lemma,'utf-8') http, resp, content = self.connect() params = "" fragment = "" path = "cdb_syn" if self.debug: printf( "cornettodb/views/query_remote_syn_lemma: db_opt: %s" % path ) query_opt = "dict_search" if self.debug: printf( "cornettodb/views/query_remote_syn_lemma: query_opt: %s" % query_opt ) qdict = {} qdict[ "action" ] = "queryList" qdict[ "word" ] = lemma.encode('utf-8') query = urllib.urlencode( qdict ) db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), path, params, query, fragment ) db_url = urlparse.urlunparse( db_url_tuple ) if self.debug: printf( "db_url: %s" % db_url ) resp, content = http.request( db_url, "GET" ) if self.debug: printf( "resp:\n%s" % resp ) printf( "content:\n%s" % content ) # printf( "content is of type: %s" % type( content ) ) dict_list = [] dict_list = eval( content ) # string to list synsets = [] items = len( dict_list ) if self.debug: printf( "items: %d" % items ) # syn dict: like lu dict, but without pos: part-of-speech for dict in dict_list: if self.debug: printf( dict ) seq_nr = dict[ "seq_nr" ] # sense number value = dict[ "value" ] # lexical unit identifier form = dict[ "form" ] # lemma label = dict[ "label" ] # label to be shown if self.debug: printf( "seq_nr: %s" % seq_nr ) printf( "value: %s" % value ) printf( "form: %s" % form ) printf( "label: %s" % label ) if value != "": synsets.append( value ) return synsets def get_lu_ids_by_lemma(self, lemma, targetpos = None): """Returns a list of lexical unit IDs based on a lemma and a pos tag""" if not isinstance(lemma,unicode): lemma = unicode(lemma,'utf-8') http, resp, content = self.connect() params = "" fragment = "" path = "cdb_lu" query_opt = "dict_search" qdict = {} qdict[ "action" ] = "queryList" qdict[ "word" ] = lemma.encode('utf-8') query = urllib.urlencode( qdict ) db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), path, params, query, fragment ) db_url = urlparse.urlunparse( db_url_tuple ) if self.debug: printf( "db_url: %s" % db_url ) resp, content = http.request( db_url, "GET" ) if self.debug: printf( "resp:\n%s" % resp ) printf( "content:\n%s" % content ) # printf( "content is of type: %s" % type( content ) ) dict_list = [] dict_list = eval( content ) # string to list ids = [] items = len( dict_list ) if self.debug: printf( "items: %d" % items ) for d in dict_list: if self.debug: printf( d ) seq_nr = d[ "seq_nr" ] # sense number value = d[ "value" ] # lexical unit identifier form = d[ "form" ] # lemma label = d[ "label" ] # label to be shown pos = d[ "pos" ] # label to be shown if self.debug: printf( "seq_nr: %s" % seq_nr ) printf( "value: %s" % value ) printf( "form: %s" % form ) printf( "label: %s" % label ) if value != "" and ((not targetpos) or (targetpos and pos == targetpos)): ids.append( value ) return ids def get_synset_xml(self,syn_id): """ call cdb_syn with synset identifier -> returns the synset xml; """ http, resp, content = self.connect() params = "" fragment = "" path = "cdb_syn" if self.debug: printf( "cornettodb/views/query_remote_syn_id: db_opt: %s" % path ) # output_opt: plain, html, xml # 'xml' is actually xhtml (with markup), but it is not valid xml! 
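# Minimal usage sketch for the lookup methods above. It assumes network access to the
# DebVisDic/Cornetto server configured in the constructor defaults and valid
# credentials; the lemma and part-of-speech label are illustrative. As written the
# client targets Python 2 (it uses unicode() and urllib.urlencode directly).
from pynlpl.clients.cornetto import CornettoClient

client = CornettoClient(user="gast", password="gast")
syn_ids = client.get_syn_ids_by_lemma("huis")         # synset identifiers for a lemma
lu_ids = client.get_lu_ids_by_lemma("huis", "noun")   # lexical units, filtered on an assumed pos label
print(syn_ids, lu_ids)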
# 'plain' is actually valid xml (without markup) output_opt = "plain" if self.debug: printf( "cornettodb/views/query_remote_syn_id: output_opt: %s" % output_opt ) action = "runQuery" if self.debug: printf( "cornettodb/views/query_remote_syn_id: action: %s" % action ) printf( "cornettodb/views/query_remote_syn_id: query: %s" % syn_id ) qdict = {} qdict[ "action" ] = action qdict[ "query" ] = syn_id qdict[ "outtype" ] = output_opt query = urllib.urlencode( qdict ) db_url_tuple = ( self.scheme, self.host + ':' + str(self.port), path, params, query, fragment ) db_url = urlparse.urlunparse( db_url_tuple ) if self.debug: printf( "db_url: %s" % db_url ) resp, content = http.request( db_url, "GET" ) if self.debug: printf( "resp:\n%s" % resp ) # printf( "content:\n%s" % content ) # printf( "content is of type: %s" % type( content ) ) # xml_data = eval( content ) return etree.fromstring( xml_data ) def get_lus_from_synset(self, syn_id): """Returns a list of (word, lu_id) tuples given a synset ID""" root = self.get_synset_xml(syn_id) elem_synonyms = root.find( ".//synonyms" ) lus = [] for elem_synonym in elem_synonyms: synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute # synonym_str ends with ":" synonym = synonym_str.split( ':' )[ 0 ].strip() lus.append( (synonym, elem_synonym.get( "c_lu_id") ) ) return lus def get_lu_from_synset(self, syn_id, lemma = None): """Returns (lu_id, synonyms=[(word, lu_id)] ) tuple given a synset ID and a lemma""" if not lemma: return self.get_lus_from_synset(syn_id) #alias if not isinstance(lemma,unicode): lemma = unicode(lemma,'utf-8') root = self.get_synset_xml(syn_id) elem_synonyms = root.find( ".//synonyms" ) lu_id = None synonyms = [] for elem_synonym in elem_synonyms: synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute # synonym_str ends with ":" synonym = synonym_str.split( ':' )[ 0 ].strip() if synonym != lemma: synonyms.append( (synonym, elem_synonym.get("c_lu_id")) ) if self.debug: printf( "synonym add: %s" % synonym ) else: lu_id = elem_synonym.get( "c_lu_id" ) # get "c_lu_id" attribute if self.debug: printf( "lu_id: %s" % lu_id ) printf( "synonym skip lemma: %s" % synonym ) return lu_id, synonyms ################################################################################################################## # ORIGINAL AND AS-OF-YET UNUSED CODE (included for later porting) ################################################################################################################## """ -------------------------------------------------------------------------------- Original Author: Fons Laan, ILPS-ISLA, University of Amsterdam Original Project: DutchSemCor Original Name: cornettodb/views.py Original Version: 0.2 Goal: Cornetto views definitions Original functions: index( request ) local_open() remote_open( self.debug ) search( request ) search_local( dict_in, search_query ) search_remote( dict_in, search_query ) cornet_check_lusyn( utf8_lemma ) query_remote_lusyn_id( syn_id_self.debug, http, utf8_lemma, syn_id ) query_cornet( keyword, category ) query_remote_syn_lemma( self.debug, http, utf8_lemma ) query_remote_syn_id( self.debug, http, utf8_lemma, syn_id, domains_abbrev ) query_remote_lu_lemma( self.debug, http, utf8_lemma ) query_remote_lu_id( self.debug, http, lu_id ) FL-04-Sep-2009: Created FL-03-Nov-2009: Removed http global: sometimes it was None; missed initialization? 
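# Continuing the sketch above: resolving the lexical units of one returned synset
# (syn_ids as obtained from get_syn_ids_by_lemma; the lemma is illustrative).
if syn_ids:
    print(client.get_lus_from_synset(syn_ids[0]))              # [(word, lu_id), ...]
    lu_id, synonyms = client.get_lu_from_synset(syn_ids[0], "huis")
    print(lu_id, synonyms)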
FL-01-Feb-2010: Added Category filtering FL-15-Feb-2010: Tag counts -> separate qx query FL-07-Apr-2010: Merge canonical + textual examples FL-10-Jun-2010: Latest Change MvG-29-Sep-2010: Turned into minimal CornettoClient class, some new functions added, many old ones disabled until necessary """ # def query_remote(self, dict_in, search_query ): # if self.debug: printf( "cornettodb/views/query_remote" ) # http, resp, content = self.remote_open() # if resp is None: # raise Exception("No response") # status = int( resp.get( "status" ) ) # if self.debug: printf( "status: %d" % status ) # if status != 200: # # e.g. 400: Bad Request, 404: Not Found # raise Exception("Error in request") # path = dict_in[ 'dbopt' ] # if self.debug: printf( "cornettodb/views/query_remote: db_opt: %s" % path ) # output_opt = dict_in[ 'outputopt' ] # if self.debug: printf( "cornettodb/views/query_remote: output_opt: %s" % output_opt ) # query_opt = dict_in[ 'queryopt' ] # if self.debug: printf( "cornettodb/views/query_remote: query_opt: %s" % query_opt ) # params = "" # fragment = "" # qdict = {} # if query_opt == "dict_search": # # query = "action=queryList&word=" + search_query # qdict[ "action" ] = "queryList" # qdict[ "word" ] = search_query # elif query_opt == "query_entry": # # query = "action=runQuery&query=" + search_query # # query += "&outtype=" + output_opt # qdict[ "action" ] = "runQuery" # qdict[ "query" ] = search_query # qdict[ "outtype" ] = output_opt # # instead of "subtree" there is also "tree" and "full subtree" # elif query_opt == "subtree_entry": # # query = "action=subtree&query=" + search_query # # query += "&arg=ILR" # ILR = Internal Language Relations, RILR = Reversed ... # # query += "&outtype=" + output_opt # qdict[ "action" ] = "subtree" # qdict[ "query" ] = search_query # qdict[ "arg" ] = "ILR" # ILR = Internal Language Relations, RILR = Reversed ... 
# qdict[ "outtype" ] = output_opt # # More functions, see DEBVisDic docu: # # Save entry # # Delete entry # # Next sense number # # "Translate" synsets # query = urllib.urlencode( qdict ) # db_url_tuple = ( self.scheme, self.host+ ':' + str(self.post), self.path, params, query, fragment ) # db_url = urlparse.urlunparse( db_url_tuple ) # if self.debug: printf( "db_url: %s" % db_url ) # resp, content = http.request( db_url, "GET" ) # printf( "resp:\n%s" % resp ) # if self.debug: printf( "content:\n%s" % content ) # if content.startswith( '[' ) and content.endswith( ']' ): # reply = eval( content ) # string -> list # islist = True # else: # reply = content # islist = False # return reply # def cornet_check_lusyn( self, utf8_lemma ): # http, resp, content = remote_open( self.debug ) # # get the raw (unfiltered) lexical unit identifiers for this lemma # lu_ids_lemma = query_remote_lu_lemma( http, utf8_lemma ) # # get the synset identifiers for this lemma # syn_ids_lemma = query_remote_syn_lemma( http, utf8_lemma ) # lu_ids_syn = [] # for syn_id in syn_ids_lemma: # lu_id = query_remote_lusyn_id( http, utf8_lemma, syn_id ) # lu_ids_syn.append( lu_id ) # return lu_ids_lemma, syn_ids_lemma, lu_ids_syn # def query_remote_lusyn_id( http, utf8_lemma, syn_id ): # """ # query_remote_lusyn_id\ # call cdb_syn with synset identifier -> synset xml -> lu_id lemma # """ # scheme = settings.CORNETTO_PROTOCOL # netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT ) # params = "" # fragment = "" # path = "cdb_syn" # if self.debug: # printf( "cornettodb/views/query_remote_lusyn_id: db_opt: %s" % path ) # # output_opt: plain, html, xml # # 'xml' is actually xhtml (with markup), but it is not valid xml! # # 'plain' is actually valid xml (without markup) # output_opt = "plain" # if self.debug: # printf( "cornettodb/views/query_remote_lusyn_id: output_opt: %s" % output_opt ) # action = "runQuery" # if self.debug: # printf( "cornettodb/views/query_remote_lusyn_id: action: %s" % action ) # printf( "cornettodb/views/query_remote_lusyn_id: query: %s" % syn_id ) # # qdict = {} # qdict[ "action" ] = action # qdict[ "query" ] = syn_id # qdict[ "outtype" ] = output_opt # query = urllib.urlencode( qdict ) # db_url_tuple = ( scheme, netloc, path, params, query, fragment ) # db_url = urlparse.urlunparse( db_url_tuple ) # if self.debug: # printf( "db_url: %s" % db_url ) # resp, content = http.request( db_url, "GET" ) # if self.debug: # printf( "resp:\n%s" % resp ) # # printf( "content:\n%s" % content ) # # printf( "content is of type: %s" % type( content ) ) # # xml_data = eval( content ) # root = etree.fromstring( xml_data ) # synonyms = [] # # find anywhere in the tree # elem_synonyms = root.find( ".//synonyms" ) # for elem_synonym in elem_synonyms: # synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute # # synonym_str ends with ":" # synonym = synonym_str.split( ':' )[ 0 ].strip() # utf8_synonym = synonym.encode( 'utf-8' ) # if utf8_synonym != utf8_lemma: # synonyms.append( synonym ) # if self.debug: # printf( "synonym add: %s" % synonym ) # else: # lu_id = elem_synonym.get( "c_lu_id" ) # get "c_lu_id" attribute # if self.debug: # printf( "lu_id: %s" % lu_id ) # printf( "synonym skip lemma: %s" % synonym ) # return lu_id # def query_cornet( annotator_id, utf8_lemma, category ): # """\ # cornet_query() # A variant of query_remote(), combining several queries for the dutchsemcor GUI # -1- call cdb_syn with lemma -> syn_ids; # -2- for each syn_id, call cdb_syn ->synset xml 
# -3- for each synset xml, find lu_id # -4- for each lu_id, call cdb_lu ->lu xml # -5- collect required info from lu & syn xml # """ # self.debug = False # this function # printf( "cornettodb/views/cornet_query()" ) # if utf8_lemma is None or utf8_lemma == "": # printf( "No lemma" ) # return # else: # printf( "lemma: %s" % utf8_lemma.decode( 'utf-8' ).encode( 'latin-1' ) ) # printf( "category: %s" % category ) # http, resp, content = remote_open( self.debug ) # if resp is None: # template = "cornettodb/error.html" # dictionary = { 'DSC_HOME' : settings.DSC_HOME } # return template, dictionary # status = int( resp.get( "status" ) ) # printf( "status: %d" % status ) # if status != 200: # # e.g. 400: Bad Request, 404: Not Found # printf( "status: %d\nreason: %s" % ( resp.status, resp.reason ) ) # dict_err = \ # { # "status" : settings.CORNETTO_HOST + " error: " + str(status), # "msg" : resp.reason # } # return dict_err # # read the domain cvs file, and return the dictionaries # domains_dutch, domains_abbrev = get_domains() # syn_ids = [] # used syn_ids, skipping filtered # lu_ids = [] # used lu_ids, skipping filtered # lu_ids_syn = [] # lu_ids derived from syn_ids, unfiltered # # get the raw (unfiltered) synset identifiers for this lemma # syn_lemma_self.debug = False # syn_ids_raw = query_remote_syn_lemma( syn_lemma_self.debug, http, utf8_lemma ) # # get the raw (unfiltered) lexical unit identifiers for this lemma # lu_lemma_self.debug = False # lu_ids_raw = query_remote_lu_lemma( lu_lemma_self.debug, http, utf8_lemma ) # # required lu info from the lu xml: # resumes_lu = [] # morphos_lu = [] # examplestext_lulist = [] # list-of-lists # examplestype_lulist = [] # list-of-lists # examplessubtype_lulist = [] # list-of-lists # # required syn info from the synset xml: # definitions_syn = [] # list # differentiaes_syn = [] # list # synonyms_synlist = [] # list-of-lists # relations_synlist = [] # list-of-lists # hyperonyms_synlist = [] # list-of-lists # hyponyms_synlist = [] # list-of-lists # relations_synlist = [] # list-of-lists # relnames_synlist = [] # list-of-lists # domains_synlist = [] # list-of-lists # remained = 0 # maybe less than lu_ids because of category filtering # for syn_id in syn_ids_raw: # if self.debug: # printf( "syn_id: %s" % syn_id ) # syn_id_self.debug = False # lu_id, definition, differentiae, synonyms, hyperonyms, hyponyms, relations, relnames, domains = \ # query_remote_syn_id( syn_id_self.debug, http, utf8_lemma, syn_id, domains_abbrev ) # lu_ids_syn.append( lu_id ) # lui_id_self.debug = False # if self.debug: # printf( "lu_id: %s" % lu_id ) # formcat, morpho, resume, examples_text, examples_type, examples_subtype = \ # query_remote_lu_id( lui_id_self.debug, http, lu_id ) # if not ( \ # ( category == '?' 
) or \ # ( category == 'a' and formcat == 'adj' ) or \ # ( category == 'n' and formcat == 'noun' ) or \ # ( category == 'v' and formcat == 'verb' ) ): # if self.debug: # printf( "filtered category: formcat=%s, lu_id=%s" % (formcat, lu_id) ) # continue # # collect all information # syn_ids.append( syn_id ) # lu_ids.append( lu_id ) # definitions_syn.append( definition ) # differentiaes_syn.append( differentiae ) # synonyms_synlist.append( synonyms ) # hyperonyms_synlist.append( hyperonyms ) # relations_synlist.append( relations ) # relnames_synlist.append( relnames ) # hyponyms_synlist.append( hyponyms ) # domains_synlist.append( domains ) # resumes_lu.append( resume ) # morphos_lu.append( morpho ) # examplestext_lulist.append( examples_text ) # examplestype_lulist.append( examples_type ) # examplessubtype_lulist.append( examples_subtype ) # if self.debug: # printf( "morpho: %s\nresume: %s\nexamples:" % (morpho, resume) ) # for canoexample in canoexamples: # printf( canoexample.encode('latin-1') ) # otherwise fails with non-ascii chars # for textexample in textexamples: # printf( textexample.encode('latin-1') ) # otherwise fails with non-ascii chars # lusyn_mismatch = False # assume no problem # # Compare number of lu ids with syn_ids # if len( lu_ids_raw ) != len( syn_ids_raw): # length mismatch # lusyn_mismatch = True # printf( "query_cornet: %d lu ids, %d syn ids: NO MATCH" % (len(lu_ids_raw), len(syn_ids_raw) ) ) # # Check lu_ids from syn to lu_ids_raw (from lemma) # for i in range( len( lu_ids_raw ) ): # lu_id_raw = lu_ids_raw[ i ] # try: # idx = lu_ids_syn.index( lu_id_raw ) # if lu_ids_syn.count( lu_id_raw ) != 1: # lusyn_mismatch = True # printf( "query_cornet: %s lu id: DUPLICATES" % lu_id_raw ) # except: # lusyn_mismatch = True # printf( "query_cornet: %s lu id: NOT FOUND" % lu_id_raw ) # dictlist = [] # for i in range( len( syn_ids ) ): # # printf( "i: %d" % i ) # dict = {} # dict[ "no" ] = i # lu_id = lu_ids[ i ] # dict[ "lu_id" ] = lu_id # syn_id = syn_ids[ i ] # dict[ "syn_id" ] = syn_id # dict[ "tag_count" ] = '?' 
# resume = resumes_lu[ i ] # dict[ "resume" ] = resume # morpho = morphos_lu[ i ] # dict[ "morpho" ] = morpho # examplestext = examplestext_lulist[ i ] # dict[ "examplestext"] = examplestext # examplestype = examplestype_lulist[ i ] # dict[ "examplestype"] = examplestype # examplessubtype = examplessubtype_lulist[ i ] # dict[ "examplessubtype"] = examplessubtype # definition = definitions_syn[ i ] # dict[ "definition" ] = definition # differentiae = differentiaes_syn[ i ] # dict[ "differentiae" ] = differentiae # synonyms = synonyms_synlist[ i ] # dict[ "synonyms"] = synonyms # hyperonyms = hyperonyms_synlist[ i ] # dict[ "hyperonyms"] = hyperonyms # hyponyms = hyponyms_synlist[ i ] # dict[ "hyponyms"] = hyponyms # relations = relations_synlist[ i ] # dict[ "relations"] = relations # relnames = relnames_synlist[ i ] # dict[ "relnames"] = relnames # domains = domains_synlist[ i ] # dict[ "domains"] = domains # dictlist.append( dict ) # # pack in "superdict" # result = \ # { # "status" : "ok", # "source" : "cornetto", # "lusyn_mismatch" : lusyn_mismatch, # "lusyn_retrieved" : len( syn_ids_raw ), # "lusyn_remained" : len( syn_ids ), # "lists_data" : dictlist # } # return result # def query_remote_lu_lemma( utf8_lemma ): # """\ # call cdb_lu with lemma -> yields lexical units # """ # scheme = settings.CORNETTO_PROTOCOL # netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT ) # params = "" # fragment = "" # path = "cdb_lu" # if self.debug: # printf( "cornettodb/views/query_remote_lu_lemma: db_opt: %s" % path ) # action = "queryList" # if self.debug: # printf( "cornettodb/views/query_remote_lu_lemma: action: %s" % action ) # qdict = {} # qdict[ "action" ] = action # qdict[ "word" ] = utf8_lemma # query = urllib.urlencode( qdict ) # db_url_tuple = ( scheme, netloc, path, params, query, fragment ) # db_url = urlparse.urlunparse( db_url_tuple ) # if self.debug: # printf( "db_url: %s" % db_url ) # resp, content = http.request( db_url, "GET" ) # if self.debug: # printf( "resp:\n%s" % resp ) # printf( "content:\n%s" % content ) # # printf( "content is of type: %s" % type( content ) ) # dict_list = [] # dict_list = eval( content ) # string to list # ids = [] # items = len( dict_list ) # if self.debug: # printf( "items: %d" % items ) # # lu dict: like syn dict, but with pos: part-of-speech # for dict in dict_list: # if self.debug: # printf( dict ) # seq_nr = dict[ "seq_nr" ] # sense number # value = dict[ "value" ] # lexical unit identifier # form = dict[ "form" ] # lemma # pos = dict[ "pos" ] # part of speech # label = dict[ "label" ] # label to be shown # if self.debug: # printf( "seq_nr: %s" % seq_nr ) # printf( "value: %s" % value ) # printf( "form: %s" % form ) # printf( "pos: %s" % pos ) # printf( "label: %s" % label ) # if value != "": # ids.append( value ) # return ids # def lemma2formcats( utf8_lemma ): # """\ # get the form-cats for this lemma. # """ # self.debug = False # http, resp, content = remote_open( self.debug ) # if resp is None: # template = "cornettodb/error.html" # dictionary = { 'DSC_HOME' : settings.DSC_HOME } # return template, dictionary # status = int( resp.get( "status" ) ) # if status != 200: # # e.g. 
400: Bad Request, 404: Not Found # printf( "status: %d\nreason: %s" % ( resp.status, resp.reason ) ) # template = "cornettodb/error.html" # message = "Cornetto " + _( "initialization" ) # dict = \ # { # 'DSC_HOME': settings.DSC_HOME, # 'message': message, # 'status': resp.status, # 'reason': resp.reason, \ # } # return template, dictionary # # get the lexical unit identifiers for this lemma # lu_ids = query_remote_lu_lemma( self.debug, http, utf8_lemma ) # scheme = settings.CORNETTO_PROTOCOL # netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT ) # params = "" # fragment = "" # path = "cdb_lu" # if self.debug: # printf( "cornettodb/views/query_remote_lu_id_formcat: db_opt: %s" % path ) # output_opt = "plain" # if self.debug: # printf( "cornettodb/views/query_remote_lu_id_formcat: output_opt: %s" % output_opt ) # action = "runQuery" # if self.debug: # printf( "cornettodb/views/query_remote_lu_id_formcat: action: %s" % action ) # formcats = [] # for lu_id in lu_ids: # if self.debug: # printf( "cornettodb/views/query_remote_lu_id_formcat: query: %s" % lu_id ) # qdict = {} # qdict[ "action" ] = action # qdict[ "query" ] = lu_id # qdict[ "outtype" ] = output_opt # query = urllib.urlencode( qdict ) # db_url_tuple = ( scheme, netloc, path, params, query, fragment ) # db_url = urlparse.urlunparse( db_url_tuple ) # if self.debug: # printf( "db_url: %s" % db_url ) # resp, content = http.request( db_url, "GET" ) # if self.debug: # printf( "resp:\n%s" % resp ) # xml_data = eval( content ) # root = etree.fromstring( xml_data ) # # morpho # morpho = "" # elem_form = root.find( ".//form" ) # if elem_form is not None: # formcat = elem_form.get( "form-cat" ) # get "form-cat" attribute # if formcat is None: # formcat = '?' # count = formcats.count( formcat ) # if count == 0: # formcats.append( formcat ) # return formcats # def query_remote_lu_id(lu_id ): # """\ # call cdb_lu with lexical unit identifier -> yields the lexical unit xml; # from the xml collect the morpho-syntax, resumes+definitions, examples. # """ # scheme = settings.CORNETTO_PROTOCOL # netloc = settings.CORNETTO_HOST + ':' + str( settings.CORNETTO_PORT ) # params = "" # fragment = "" # path = "cdb_lu" # if self.debug: # printf( "cornettodb/views/query_remote_lu_id: db_opt: %s" % path ) # # output_opt: plain, html, xml # # 'xml' is actually xhtml (with markup), but it is not valid xml! 
# # 'plain' is actually valid xml (without markup) # output_opt = "plain" # if self.debug: # printf( "cornettodb/views/query_remote_lu_id: output_opt: %s" % output_opt ) # action = "runQuery" # if self.debug: # printf( "cornettodb/views/query_remote_lu_id: action: %s" % action ) # printf( "cornettodb/views/query_remote_lu_id: query: %s" % lu_id ) # # qdict = {} # qdict[ "action" ] = action # qdict[ "query" ] = lu_id # qdict[ "outtype" ] = output_opt # query = urllib.urlencode( qdict ) # db_url_tuple = ( scheme, netloc, path, params, query, fragment ) # db_url = urlparse.urlunparse( db_url_tuple ) # if self.debug: # printf( "db_url: %s" % db_url ) # resp, content = http.request( db_url, "GET" ) # if self.debug: # printf( "resp:\n%s" % resp ) # # printf( "content:\n%s" % content ) # # printf( "content is of type: %s" % type( content ) ) # # xml_data = eval( content ) # root = etree.fromstring( xml_data ) # # morpho # morpho = "" # elem_form = root.find( ".//form" ) # if elem_form is not None: # formcat = elem_form.get( "form-cat" ) # get "form-cat" attribute # if formcat is not None: # if formcat == "adj": # morpho = 'a' # elif formcat == "noun": # morpho = 'n' # elem_article = root.find( ".//sy-article" ) # if elem_article is not None and elem_article.text is not None: # article = elem_article.text # lidwoord # morpho += "-" + article # elem_count = root.find( ".//sem-countability" ) # if elem_count is not None and elem_count.text is not None: # countability = elem_count.text # if countability == "count": # morpho += "-t" # elif countability == "uncount": # morpho += "-nt" # elif formcat == "verb": # morpho = 'v' # elem_trans = root.find( ".//sy-trans" ) # if elem_trans is not None and elem_trans.text is not None: # transitivity = elem_trans.text # if transitivity == "tran": # morpho += "-tr" # elif transitivity == "intr": # morpho += "-intr" # else: # should not occur # morpho += "-" # morpho += transitivity # elem_separ = root.find( ".//sy-separ" ) # if elem_separ is not None and elem_separ.text is not None: # separability = elem_separ.text # if separability == "sch": # morpho += "-sch" # elif separability == "onsch": # morpho += "-onsch" # else: # should not occur # morpho += "-" # morpho += separability # elem_reflexiv = root.find( ".//sy-reflexiv" ) # if elem_reflexiv is not None and elem_reflexiv.text is not None: # reflexivity = elem_reflexiv.text # if reflexivity == "refl": # morpho += "-refl" # elif reflexivity == "nrefl": # morpho += "-nrefl" # else: # should not occur # morpho += "-" # morpho += reflexivity # elif formcat == "adverb": # morpho = 'd' # else: # morpho = '?' 
# # find anywhere in the tree # elem_resume = root.find( ".//sem-resume" ) # if elem_resume is not None: # resume = elem_resume.text # else: # resume = "" # examples_text = [] # examples_type = [] # examples_subtype = [] # # find anywhere in the tree # examples = root.findall( ".//example" ) # for example in examples: # example_id = example.get( "r_ex_id" ) # elem_type = example.find( "syntax_example/sy-type" ) # if elem_type is not None: # type_text = elem_type.text # if type_text is None: # type_text = "" # else: # type_text = "" # elem_subtype = example.find( "syntax_example/sy-subtype" ) # if elem_subtype is not None: # subtype_text = elem_subtype.text # if subtype_text is None: # subtype_text = "" # else: # subtype_text = "" # # there can be a canonical and/or textual example, # # they share the type and subtype # elem_canonical = example.find( "form_example/canonicalform" ) # find child # if elem_canonical is not None and elem_canonical.text is not None: # example_text = elem_canonical.text # example_out = example_text.encode( "iso-8859-1", "replace" ) # if self.debug: # printf( "subtype, r_ex_id: %s: %s" % ( example_id, example_out ) ) # if subtype_text != "idiom": # examples_text.append( example_text ) # examples_type.append( type_text ) # examples_subtype.append( subtype_text ) # else: # if self.debug: # printf( "filter idiom: %s" % example_out) # # elem_textual = example.find( "form_example/textualform" ) # find child # if elem_textual is not None and elem_textual.text is not None: # example_text = elem_textual.text # example_out = example_text.encode( "iso-8859-1", "replace" ) # if self.debug: # printf( "subtype r_ex_id: %s: %s" % ( example_id, example_out ) ) # if subtype_text != "idiom": # examples_text.append( example_text ) # examples_type.append( type_text ) # examples_subtype.append( subtype_text ) # else: # if self.debug: # printf( "filter idiom: %s" % example_out) # return formcat, morpho, resume, examples_text, examples_type, examples_subtype # def get_synset(self, syn_id, utf8_lemma, domains_abbrev ): # """Parse synset data""" # root = self.get_synset_xml(syn_id) # synonyms = [] # # find anywhere in the tree # elem_synonyms = root.find( ".//synonyms" ) # for elem_synonym in elem_synonyms: # synonym_str = elem_synonym.get( "c_lu_id-previewtext" ) # get "c_lu_id-previewtext" attribute # # synonym_str ends with ":" # synonym = synonym_str.split( ':' )[ 0 ].strip() # utf8_synonym = synonym.encode( 'utf-8' ) # if utf8_synonym != utf8_lemma: # synonyms.append( synonym ) # if self.debug: # printf( "synonym add: %s" % synonym ) # else: # lu_id = elem_synonym.get( "c_lu_id" ) # get "c_lu_id" attribute # if self.debug: # printf( "lu_id: %s" % lu_id ) # printf( "synonym skip lemma: %s" % synonym ) # definition = "" # elem_definition = root.find( ".//definition" ) # if elem_definition is not None and elem_definition.text is not None: # definition = elem_definition.text # differentiae = "" # elem_differentiae = root.find( "./differentiae/" ) # if elem_differentiae is not None and elem_differentiae.text is not None: # differentiae = elem_differentiae.text # if self.debug: # print( "definition: %s" % definition.encode( 'utf-8' ) ) # print( "differentiae: %s" % differentiae.encode( 'utf-8' ) ) # hyperonyms = [] # hyponyms = [] # relations_all = [] # relnames_all = [] # # find internal anywhere in the tree # elem_intrelations = root.find( ".//wn_internal_relations" ) # for elem_relation in elem_intrelations: # relations = [] # relation_str = elem_relation.get( "target-previewtext" ) # 
get "target-previewtext" attribute # name = elem_relation.get( "relation_name" ) # target = elem_relation.get( "targer" ) # relation_list = relation_str.split( ',' ) # for relation_str in relation_list: # relation = relation_str.split( ':' )[ 0 ].strip() # relations.append( relation ) # relations_all.append( relation ) # relnames_all.append( name ) # if name == "HAS_HYPERONYM": # if self.debug: # printf( "target: %s" % target ) # hyperonyms.append( relations ) # elif name == "HAS_HYPONYM": # if self.debug: # printf( "target: %s" % target ) # hyponyms.append( relations ) # # we could keep the relation sub-lists separate on the basis of their "target" attribute # # but for now we flatten the lists # hyperonyms = flatten( hyperonyms ) # hyponyms = flatten( hyponyms ) # if self.debug: # printf( "hyperonyms: %s" % hyperonyms ) # printf( "hyponyms: %s" % hyponyms ) # domains = [] # # find anywhere in the tree # wn_domains = root.find( ".//wn_domains" ) # for dom_relation in wn_domains: # domains_en = dom_relation.get( "term" ) # get "term" attribute # if self.debug: # if domains_en: # printf( "domain: %s" % domains_en ) # # # use dutch domain name[s], abbreviated # domain_list = domains_en.split( ' ' ) # for domain_en in domain_list: # try: # domain_nl = domains_abbrev[ domain_en ] # if domain_nl.endswith( '.' ): # remove trailing '.' # domain_nl = domain_nl[ : -1] # remove last character # except: # printf( "failed to convert domain: %s" % domain_en ) # domain_nl = domain_en # if domains.count( domain_nl ) == 0: # append if new # domains.append( domain_nl ) # return lu_id, definition, differentiae, synonyms, hyperonyms, hyponyms, relations_all, relnames_all, domains PyNLPl-1.1.2/pynlpl/clients/frogclient.py0000644000175000001440000001243112670325003021165 0ustar proyconusers00000000000000############################################################### # PyNLPl - Frog Client - Version 1.4.1 # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Derived from code by Rogier Kraf # # Licensed under GPLv3 # # This is a Python library for on-the-fly communication with # a Frog/Tadpole Server. Allowing on-the-fly lemmatisation and # PoS-tagging. It is recommended to pass your data on a # sentence-by-sentence basis to FrogClient.process() # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import socket class FrogClient: def __init__(self,host="localhost",port=12345, server_encoding="utf-8", returnall=False, timeout=120.0, ner=False): """Create a client connecting to a Frog or Tadpole server.""" self.BUFSIZE = 4096 self.socket = socket.socket(socket.AF_INET,socket.SOCK_STREAM) self.socket.settimeout(timeout) self.socket.connect( (host,int(port)) ) self.server_encoding = server_encoding self.returnall = returnall def process(self,input_data, source_encoding="utf-8", return_unicode = True, oldfrog=False): """Receives input_data in the form of a str or unicode object, passes this to the server, with proper consideration for the encodings, and returns the Frog output as a list of tuples: (word,pos,lemma,morphology), each of these is a proper unicode object unless return_unicode is set to False, in which case raw strings will be returned. 
Return_unicode is no longer optional, it is fixed to True, parameter is still there only for backwards-compatibility.""" if isinstance(input_data, list) or isinstance(input_data, tuple): input_data = " ".join(input_data) input_data = u(input_data, source_encoding) #decode (or preferably do this in an earlier stage) input_data = input_data.strip(' \t\n') s = input_data.encode(self.server_encoding) +b'\r\n' if not oldfrog: s += b'EOT\r\n' self.socket.sendall(s) #send to socket in desired encoding output = [] done = False while not done: data = b"" while not data.endswith(b'\n'): moredata = self.socket.recv(self.BUFSIZE) if not moredata: break data += moredata data = u(data,self.server_encoding) for line in data.strip(' \t\r\n').split('\n'): if line == "READY": done = True break elif line: line = line.split('\t') #split on tab if len(line) > 4 and line[0].isdigit(): #first column is token number if line[0] == '1' and output: if self.returnall: output.append( (None,None,None,None, None,None,None, None) ) else: output.append( (None,None,None,None) ) fields = line[1:] parse1=parse2=ner=chunk="" word,lemma,morph,pos = fields[0:4] if len(fields) > 5: ner = fields[5] if len(fields) > 6: chunk = fields[6] if len(fields) < 5: raise Exception("Can't process response line from Frog: ", repr(line), " got unexpected number of fields ", str(len(fields) + 1)) if self.returnall: output.append( (word,lemma,morph,pos,ner,chunk,parse1,parse2) ) else: output.append( (word,lemma,morph,pos) ) return output def process_aligned(self,input_data, source_encoding="utf-8", return_unicode = True): output = self.process(input_data, source_encoding, return_unicode) outputwords = [ x[0] for x in output ] inputwords = input_data.strip(' \t\n').split(' ') alignment = self.align(inputwords, outputwords) for i, _ in enumerate(inputwords): targetindex = alignment[i] if targetindex == None: if self.returnall: yield (None,None,None,None,None,None,None,None) else: yield (None,None,None,None) else: yield output[targetindex] def align(self,inputwords, outputwords): """For each inputword, provides the index of the outputword""" alignment = [] cursor = 0 for inputword in inputwords: if len(outputwords) > cursor and outputwords[cursor] == inputword: alignment.append(cursor) cursor += 1 elif len(outputwords) > cursor+1 and outputwords[cursor+1] == inputword: alignment.append(cursor+1) cursor += 2 else: alignment.append(None) cursor += 1 return alignment def __del__(self): self.socket.close() PyNLPl-1.1.2/pynlpl/algorithms.py0000644000175000001440000000320212445064173017545 0ustar proyconusers00000000000000 ###############################################################9 # PyNLPl - Algorithms # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Licensed under GPLv3 # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import def sum_to_n(n, size, limit=None): #from http://stackoverflow.com/questions/2065553/python-get-all-numbers-that-add-up-to-a-number """Produce all lists of `size` positive integers in decreasing order that add up to `n`.""" if size == 1: yield [n] return if limit is None: limit = n start = (n + size - 1) // size stop = min(limit, n - size + 1) + 1 for i in range(start, stop): for tail in sum_to_n(n - i, size - 1, i): yield [i] + tail def consecutivegaps(n, leftmargin 
= 0, rightmargin = 0): """Compute all possible single consecutive gaps in any sequence of the specified length. Returns (beginindex, length) tuples. Runs in O(n(n+1) / 2) time. Argument is the length of the sequence rather than the sequence itself""" begin = leftmargin while begin < n: length = (n - rightmargin) - begin while length > 0: yield (begin, length) length -= 1 begin += 1 def bytesize(n): """Return the required size in bytes to encode the specified integer""" for i in range(1, 1000): if n < 2**(8*i): return i PyNLPl-1.1.2/pynlpl/mt/0000755000175000001440000000000013024723552015442 5ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/mt/wordalign.py0000644000175000001440000000622512445064173020012 0ustar proyconusers00000000000000from pynlpl.statistics import FrequencyList, Distribution class WordAlignment(object): def __init__(self, casesensitive = False): self.casesensitive = casesensitive def train(self, sourcefile, targetfile): sourcefile = open(sourcefile) targetfile = open(targetfile) self.sourcefreqlist = FrequencyList(None, self.casesensitive) self.targetfreqlist = FrequencyList(None, self.casesensitive) #frequency lists self.source2target = {} self.target2source = {} for sourceline, targetline in zip(sourcefile, targetfile): sourcetokens = sourceline.split() targettokens = targetline.split() self.sourcefreqlist.append(sourcetokens) self.targetfreqlist.append(targettokens) for sourcetoken in sourcetokens: if not sourcetoken in self.source2target: self.source2target[sourcetoken] = FrequencyList(targettokens,self.casesensitive) else: self.source2target[sourcetoken].append(targettokens) for targettoken in targettokens: if not targettoken in self.target2source: self.target2source[targettoken] = FrequencyList(sourcetokens,self.casesensitive) else: self.target2source[targettoken].append(sourcetokens) sourcefile.close() targetfile.close() def test(self, sourcefile, targetfile): sourcefile = open(sourcefile) targetfile = open(targetfile) #stage 2 for sourceline, targetline in zip(sourcefile, targetfile): sourcetokens = sourceline.split() targettokens = targetline.split() S2Talignment = [] T2Salignment = [] for sourcetoken in sourcetokens: #which of the target-tokens is most frequent? besttoken = None bestscore = -1 for i, targettoken in enumerate(targettokens): if targettoken in self.source2target[sourcetoken]: score = self.source2target[sourcetoken][targettoken] / float(self.targetfreqlist[targettoken]) if score > bestscore: bestscore = self.source2target[sourcetoken][targettoken] besttoken = i S2Talignment.append(besttoken) #TODO: multi-alignment? for targettoken in targettokens: besttoken = None bestscore = -1 for i, sourcetoken in enumerate(sourcetokens): if sourcetoken in self.target2source[targettoken]: score = self.target2source[targettoken][sourcetoken] / float(self.sourcefreqlist[sourcetoken]) if score > bestscore: bestscore = self.target2source[targettoken][sourcetoken] besttoken = i T2Salignment.append(besttoken) #TODO: multi-alignment? 
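# Worked examples for the helpers in pynlpl/algorithms.py above, with their expected
# output, plus a sketch of the WordAlignment class in this file (corpus file names
# are hypothetical; train/test expect sentence-aligned, whitespace-tokenised files).
from pynlpl.algorithms import sum_to_n, consecutivegaps, bytesize

print(list(sum_to_n(5, 2)))          # [[3, 2], [4, 1]] -- decreasing lists summing to 5
print(list(consecutivegaps(3)))      # [(0, 3), (0, 2), (0, 1), (1, 2), (1, 1), (2, 1)]
print(bytesize(255), bytesize(256))  # 1 2

# aligner = WordAlignment(casesensitive=False)
# aligner.train("corpus.nl", "corpus.en")
# for srctokens, tgttokens, s2t, t2s in aligner.test("test.nl", "test.en"):
#     print(srctokens, s2t)   # s2t[i] is the index of the target token aligned to source token i (or None)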
yield sourcetokens, targettokens, S2Talignment, T2Salignment sourcefile.close() targetfile.close() PyNLPl-1.1.2/pynlpl/mt/__init__.py0000644000175000001440000000000012445064173017544 0ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/tests/0000755000175000001440000000000013024723552016164 5ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/tests/fql.py0000755000175000001440000013567713024723325017344 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for FoLiA Query Language # by Maarten van Gompel, Radboud University Nijmegen # proycon AT anaproy DOT nl # # Licensed under GPLv3 #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u, isstring import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout import sys import os import unittest import io from pynlpl.formats import fql, folia, cql Q1 = 'SELECT pos WHERE class = "n" FOR w WHERE text = "house" AND class != "punct" RETURN focus' Q2 = 'ADD w WITH text "house" (ADD pos WITH class "n") FOR ID sentence' Qselect_focus = "SELECT lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" FOR w RETURN focus" Qselect_target = "SELECT lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" FOR w RETURN target" Qselect_singlefocus = "SELECT lemma OF \"lemmas-nl\" WHERE class = \"hoofdletter\" FOR w RETURN focus FORMAT single-python" Qselect_singletarget = "SELECT lemma OF \"lemmas-nl\" WHERE class = \"hoofdletter\" FOR w RETURN target FORMAT single-python" Qselect_multitarget_focus = "SELECT lemma OF \"lemmas-nl\" FOR ID \"WR-P-E-J-0000000001.p.1.s.4.w.4\" , ID \"WR-P-E-J-0000000001.p.1.s.4.w.5\"" Qselect_multitarget = "SELECT lemma OF \"lemmas-nl\" FOR ID \"WR-P-E-J-0000000001.p.1.s.4.w.4\" , ID \"WR-P-E-J-0000000001.p.1.s.4.w.5\" RETURN target" Qselect_nestedtargets = "SELECT lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" FOR w FOR s ID \"WR-P-E-J-0000000001.p.1.s.2\" RETURN target FORMAT single-python" Qselect_startend = "SELECT FOR w START ID \"WR-P-E-J-0000000001.p.1.s.2.w.2\" END ID \"WR-P-E-J-0000000001.p.1.s.2.w.4\"" #inclusive Qselect_startend2 = "SELECT FOR w START ID \"WR-P-E-J-0000000001.p.1.s.2.w.2\" ENDBEFORE ID \"WR-P-E-J-0000000001.p.1.s.2.w.4\"" #exclusive Qin = "SELECT ph IN w" Qin2 = "SELECT ph IN term" Qin2ref = "SELECT ph FOR term" Qedit = "EDIT lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" WITH class \"blah\" FOR w FOR s ID \"WR-P-E-J-0000000001.p.1.s.2\"" Qeditconfidence = "EDIT lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" WITH class \"blah\" confidence 0.5 FOR w FOR s ID \"WR-P-E-J-0000000001.p.1.s.2\"" Qeditconfidence2 = "EDIT lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" WITH class \"blah\" confidence NONE FOR w FOR s ID \"WR-P-E-J-0000000001.p.1.s.2\"" Qadd = "ADD lemma OF \"lemmas-nl\" WITH class \"hebben\" FOR w ID \"WR-P-E-J-0000000001.sandbox.2.s.1.w.3\"" Qeditadd = "EDIT lemma OF \"lemmas-nl\" WITH class \"hebben\" FOR w ID \"WR-P-E-J-0000000001.sandbox.2.s.1.w.3\"" Qdelete = "DELETE lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" FOR w" Qdelete_target = "DELETE lemma OF \"lemmas-nl\" WHERE class = \"stamboom\" FOR w RETURN target" Qcomplexadd = "APPEND w (ADD t WITH text \"gisteren\" 
ADD lemma OF \"lemmas-nl\" WITH class \"gisteren\") FOR ID \"WR-P-E-J-0000000001.sandbox.2.s.1.w.3\"" Qedittext = "EDIT w WHERE text = \"terweil\" WITH text \"terwijl\"" Qedittext2 = "EDIT t WITH text \"terwijl\" FOR w WHERE text = \"terweil\" RETURN target" Qedittext3 = "EDIT t WITH text \"de\" FOR w ID \"WR-P-E-J-0000000001.p.1.s.8.w.10\" RETURN target" Qedittext4 = "EDIT t WITH text \"ter\nwijl\" FOR w WHERE text = \"terweil\" RETURN target" Qhas = "SELECT w WHERE (pos HAS class = \"LET()\")" Qhas_shortcut = "SELECT w WHERE :pos = \"LET()\"" Qboolean = "SELECT w WHERE (pos HAS class = \"LET()\") AND ((lemma HAS class = \".\") OR (lemma HAS class = \",\"))" Qcontext = "SELECT w WHERE (PREVIOUS w WHERE text = \"de\")" Qcontext2 = "SELECT FOR SPAN w WHERE (pos HAS class CONTAINS \"LID(\") & w WHERE (pos HAS class CONTAINS \"ADJ(\") & w WHERE (pos HAS class CONTAINS \"N(\")" Qselect_span = "SELECT entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WHERE class = \"per\" FOR ID \"example.table.1.w.3\"" Qselect_span2 = "SELECT entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WHERE class = \"per\" FOR SPAN ID \"example.table.1.w.3\" & ID \"example.table.1.w.4\" & ID \"example.table.1.w.5\"" Qselect_span2_returntarget = "SELECT entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WHERE class = \"per\" FOR SPAN ID \"example.table.1.w.3\" & ID \"example.table.1.w.4\" & ID \"example.table.1.w.5\" RETURN target" Qadd_span = "ADD entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WITH class \"misc\" FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.4.w.2\" & ID \"WR-P-E-J-0000000001.p.1.s.4.w.3\"" Qadd_span_returntarget = "ADD entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WITH class \"misc\" FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.4.w.2\" & ID \"WR-P-E-J-0000000001.p.1.s.4.w.3\" RETURN target" Qadd_span_returnancestortarget = "ADD entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WITH class \"misc\" FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.4.w.2\" & ID \"WR-P-E-J-0000000001.p.1.s.4.w.3\" RETURN ancestor-target" Qadd_span2 = "ADD entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WITH class \"misc\" SPAN ID \"WR-P-E-J-0000000001.p.1.s.4.w.2\" & ID \"WR-P-E-J-0000000001.p.1.s.4.w.3\" FOR ID \"WR-P-E-J-0000000001.p.1.s.4\"" Qadd_span3 = "ADD entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WITH class \"misc\" RESPAN ID \"WR-P-E-J-0000000001.p.1.s.4.w.3\" FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.4.w.2\" & ID \"WR-P-E-J-0000000001.p.1.s.4.w.3\"" Qadd_span4 = "ADD entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WITH class \"misc\" RESPAN NONE FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.4.w.2\" & ID \"WR-P-E-J-0000000001.p.1.s.4.w.3\"" Qadd_span_subqueries = "ADD dependency OF alpino-set WITH class \"test\" RESPAN NONE (ADD dep SPAN ID WR-P-E-J-0000000001.p.1.s.2.w.6) (ADD hd SPAN ID WR-P-E-J-0000000001.p.1.s.2.w.7) FOR SPAN ID WR-P-E-J-0000000001.p.1.s.2.w.6 & ID WR-P-E-J-0000000001.p.1.s.2.w.7 RETURN focus" Qedit_spanrole = "EDIT hd SPAN ID \"WR-P-E-J-0000000001.p.1.s.1.w.3\" & ID \"WR-P-E-J-0000000001.p.1.s.1.w.4\" & ID \"WR-P-E-J-0000000001.p.1.s.1.w.5\" FOR dependency ID 
\"WR-P-E-J-0000000001.p.1.s.1.dep.2\" RETURN target" Qedit_spanrole_id = "EDIT hd ID \"test\" SPAN ID \"WR-P-E-J-0000000001.p.1.s.1.w.3\" & ID \"WR-P-E-J-0000000001.p.1.s.1.w.4\" & ID \"WR-P-E-J-0000000001.p.1.s.1.w.5\" FOR dependency ID \"WR-P-E-J-0000000001.p.1.s.1.dep.2\" RETURN target" Qadd_nested_span = "ADD su OF \"syntax-set\" WITH class \"np\" SPAN ID \"WR-P-E-J-0000000001.p.1.s.1.w.4\" & ID \"WR-P-E-J-0000000001.p.1.s.1.w.5\" FOR ID \"WR-P-E-J-0000000001.p.1.s.1.su.0\"" Qalt = "EDIT lemma WHERE class = \"terweil\" WITH class \"terwijl\" (AS ALTERNATIVE WITH confidence 0.9)" Qdeclare = "DECLARE correction OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH annotator \"me\" annotatortype \"manual\"" #implicitly tests auto-declaration: Qcorrect1 = "EDIT lemma WHERE class = \"terweil\" WITH class \"terwijl\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"nonworderror\" confidence 0.9)" Qcorrect2 = "EDIT lemma WHERE class = \"terweil\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" class \"terwijl\" WITH class \"nonworderror\" confidence 0.9)" Qsuggest1 = "EDIT lemma WHERE class = \"terweil\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"nonworderror\" SUGGESTION class \"terwijl\" WITH confidence 0.9 SUGGESTION class \"gedurende\" WITH confidence 0.1)" Qcorrectsuggest = "EDIT lemma WHERE class = \"terweil\" WITH class \"terwijl\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"nonworderror\" confidence 0.9 SUGGESTION class \"gedurende\" WITH confidence 0.1)" Qcorrect_text = "EDIT t WHERE text = \"terweil\" WITH text \"terwijl\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"nonworderror\" confidence 0.9)" Qsuggest_text = "EDIT t WHERE text = \"terweil\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"nonworderror\" SUGGESTION text \"terwijl\" WITH confidence 0.9 SUGGESTION text \"gedurende\" WITH confidence 0.1)" Qcorrect_span = "EDIT entity OF \"http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml\" WHERE class = \"per\" WITH class \"misc\" (AS CORRECTION OF \"https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/namedentitycorrection.foliaset.xml\" WITH class \"wrongclass\" confidence 0.2) FOR ID \"example.table.1.w.3\"" Qrespan = "EDIT semrole WHERE class = \"actor\" RESPAN ID \"WR-P-E-J-0000000001.p.1.s.7.w.2\" & ID \"WR-P-E-J-0000000001.p.1.s.7.w.3\" FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.7.w.3\"" Qmerge = "SUBSTITUTE w WITH text \"weertegeven\" FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.2.w.26\" & ID \"WR-P-E-J-0000000001.p.1.s.2.w.27\" & ID \"WR-P-E-J-0000000001.p.1.s.2.w.28\"" Qsplit = "SUBSTITUTE w WITH text \"weer\" SUBSTITUTE w WITH text \"gegeven\" FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.6.w.20\"" Qcorrect_merge = "SUBSTITUTE w WITH text \"weertegeven\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"spliterror\") FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.2.w.26\" & ID \"WR-P-E-J-0000000001.p.1.s.2.w.27\" & ID \"WR-P-E-J-0000000001.p.1.s.2.w.28\"" 
Qcorrect_split = "SUBSTITUTE w WITH text \"weer\" SUBSTITUTE w WITH text \"gegeven\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"runonerror\") FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.6.w.20\"" Qsuggest_split = "SUBSTITUTE (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"runonerror\" SUGGESTION (SUBSTITUTE w WITH text \"weer\" SUBSTITUTE w WITH text \"gegeven\")) FOR SPAN ID \"WR-P-E-J-0000000001.p.1.s.6.w.20\"" Qprepend = "PREPEND w WITH text \"heel\" FOR ID \"WR-P-E-J-0000000001.p.1.s.1.w.4\"" Qcorrect_prepend = "PREPEND w WITH text \"heel\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"insertion\") FOR ID \"WR-P-E-J-0000000001.p.1.s.1.w.4\"" Qcorrect_delete = "DELETE w ID \"WR-P-E-J-0000000001.p.1.s.8.w.6\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"redundantword\")" Qcql_context = '"de" [ tag="ADJ\(.*" ] [ tag="N\(.*" & lemma!="blah" ]' Qcql_context2 = '[ pos = "LID\(.*" ]? [ pos = "ADJ\(.*" ]* [ pos = "N\(.*" ]' Qcql_context3 = '[ pos = "N\(.*" ]{2}' Qcql_context4 = '[ pos = "WW\(.*" ]+ [] [ pos = "WW\(.*" ]+' Qcql_context5 = '[ pos = "VG\(.*" ] [ pos = "WW\(.*" ]* []?' Qcql_context6 = '[ pos = "VG\(.*|VZ\.*" ]' #test 4: advanced corrections (higher order corrections): Qsplit2 = "SUBSTITUTE w WITH text \"Ik\" SUBSTITUTE w WITH text \"hoor\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"runonerror\") FOR SPAN ID \"correctionexample.s.4.w.1\"" Qmerge2 = "SUBSTITUTE w WITH text \"onweer\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"spliterror\") FOR SPAN ID \"correctionexample.s.4.w.2\" & ID \"correctionexample.s.4.w.3\"" Qdeletion2 = "DELETE w ID \"correctionexample.s.8.w.3\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"redundantword\")" #Qdeletion2b = "SUBSTITUTE w ID \"correctionexample.s.8.w.3\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"redundantword\") FOR SPAN ID \"correctionexample.s.8.correction.1\"" #insertions when there is an existing suggestion, SUBSTITUTE insead of APPEND/PREPEND: Qinsertion2 = "SUBSTITUTE w WITH text \".\" (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"missingpunctuation\") FOR SPAN ID \"correctionexample.s.9.correction.1\"" Qsuggest_insertion = "PREPEND (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"insertion\" SUGGESTION (ADD w WITH text \"heel\")) FOR ID \"WR-P-E-J-0000000001.p.1.s.1.w.4\"" Qsuggest_insertion2 = "APPEND (AS CORRECTION OF \"http://raw.github.com/proycon/folia/master/setdefinitions/spellingcorrection.foliaset.xml\" WITH class \"insertion\" SUGGESTION (ADD w WITH text \"heel\")) FOR ID \"WR-P-E-J-0000000001.p.1.s.1.w.3\"" Qcomment = "ADD comment WITH text \"This is our university!\" FOR entity ID \"example.radboud.university.nijmegen.org\"" Qfeat = "SELECT feat WHERE subset = \"wvorm\" FOR pos WHERE class = \"WW(pv,tgw,met-t)\" FOR ID 
\"WR-P-E-J-0000000001.p.1.s.2.w.5\"" Qfeat2 = "EDIT feat WHERE subset = \"wvorm\" WITH class \"inf\" FOR pos WHERE class = \"WW(pv,tgw,met-t)\" FOR ID \"WR-P-E-J-0000000001.p.1.s.2.w.5\"" Qfeat3 = "ADD feat WITH subset \"wvorm\" class \"inf\" FOR pos WHERE class = \"WW(inf,vrij,zonder)\" FOR ID \"WR-P-E-J-0000000001.p.1.s.2.w.28\"" Qfeat4 = "EDIT feat WHERE subset = \"strength\" AND class = \"strong\" WITH class \"verystrong\" FOR ID \"WR-P-E-J-0000000001.text.sentiment.1\"" class Test1UnparsedQuery(unittest.TestCase): def test1_basic(self): """Basic query with some literals""" qs = Q1 qu = fql.UnparsedQuery(qs) self.assertEqual( qu.q, ['SELECT','pos','WHERE','class','=','n','FOR','w','WHERE','text','=','house','AND','class','!=','punct','RETURN','focus']) self.assertEqual( qu.mask, [0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0] ) def test2_paren(self): """Query with parentheses""" qs = Q2 qu = fql.UnparsedQuery(qs) self.assertEqual( len(qu), 9 ) self.assertTrue( isinstance(qu.q[5], fql.UnparsedQuery)) self.assertEqual( qu.mask, [0,0,0,0,1,2,0,0,0] ) def test3_complex(self): """Query with parentheses""" qu = fql.UnparsedQuery(Qboolean) self.assertEqual( len(qu.q), 6) class Test2ParseQuery(unittest.TestCase): def test01_parse(self): """Parsing """ + Q1 q = fql.Query(Q1) def test02_parse(self): """Parsing """ + Q2 q = fql.Query(Q2) def test03_parse(self): """Parsing """ + Qselect_target q = fql.Query(Qselect_target) def test04_parse(self): """Parsing """ + Qcomplexadd q = fql.Query(Qcomplexadd) self.assertEqual( len(q.action.subactions), 1) #test whether subaction is parsed self.assertTrue( isinstance(q.action.subactions[0].nextaction, fql.Action) ) #test whether subaction has proper chain of two actions def test05_parse(self): """Parsing """ + Qhas q = fql.Query(Qhas) def test06_parse(self): """Parsing """ + Qhas_shortcut q = fql.Query(Qhas_shortcut) def test07_parse(self): """Parsing """ + Qboolean q = fql.Query(Qboolean) def test08_parse(self): """Parsing """ + Qcontext q = fql.Query(Qcontext) def test09_parse(self): """Parsing """ + Qalt q = fql.Query(Qalt) def test10_parse(self): """Parsing """ + Qcorrect1 q = fql.Query(Qcorrect1) def test11_parse(self): """Parsing """ + Qcorrect2 q = fql.Query(Qcorrect2) def test12_parse(self): """Parsing """ + Qsuggest_split q = fql.Query(Qsuggest_split) self.assertIsInstance(q.action.form, fql.Correction) self.assertEqual( len(q.action.form.suggestions),1) class Test3Evaluation(unittest.TestCase): def setUp(self): self.doc = folia.Document(string=FOLIAEXAMPLE) def test01_evaluate_select_focus(self): q = fql.Query(Qselect_focus) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.LemmaAnnotation)) self.assertEqual(len(results),2) self.assertTrue(isinstance(results[1], folia.LemmaAnnotation)) def test02_evaluate_select_singlefocus(self): q = fql.Query(Qselect_singlefocus) result = q(self.doc) self.assertTrue(isinstance(result, folia.LemmaAnnotation)) def test03_evaluate_select_target(self): q = fql.Query(Qselect_target) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.Word)) self.assertEqual(len(results),2) self.assertTrue(isinstance(results[1], folia.Word)) def test04_evaluate_select_singletarget(self): q = fql.Query(Qselect_singletarget) result = q(self.doc) self.assertTrue(isinstance(result, folia.Word)) def test05_evaluate_select_nestedtargets(self): q = fql.Query(Qselect_nestedtargets) result = q(self.doc) self.assertTrue(isinstance(result, folia.Word)) def test05a_evaluate_select_multitarget_focus(self): q = 
fql.Query(Qselect_multitarget_focus) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.LemmaAnnotation)) self.assertEqual(len(results),2) self.assertTrue(isinstance(results[1], folia.LemmaAnnotation)) def test05b_evaluate_select_multitarget(self): q = fql.Query(Qselect_multitarget) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.Word)) self.assertEqual(len(results),2) self.assertTrue(isinstance(results[1], folia.Word)) def test06_evaluate_edit(self): q = fql.Query(Qedit) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.LemmaAnnotation)) self.assertEqual(results[0].cls, "blah") def test06a_evaluate_editconfidence(self): q = fql.Query(Qeditconfidence) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.LemmaAnnotation)) self.assertEqual(results[0].cls, "blah") self.assertEqual(results[0].confidence, 0.5) def test06b_evaluate_editconfidence2(self): q = fql.Query(Qeditconfidence2) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.LemmaAnnotation)) self.assertEqual(results[0].cls, "blah") self.assertEqual(results[0].confidence, None) def test07_evaluate_add(self): q = fql.Query(Qadd) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.LemmaAnnotation)) self.assertEqual(results[0].cls, "hebben") def test08_evaluate_editadd(self): q = fql.Query(Qeditadd) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.LemmaAnnotation)) self.assertEqual(results[0].cls, "hebben") def test09_evaluate_delete(self): q = fql.Query(Qdelete) results = q(self.doc) self.assertEqual(len(results),2) #returns that what was deleted def test10_evaluate_delete(self): q = fql.Query(Qdelete_target) results = q(self.doc) self.assertTrue(isinstance(results[0], folia.Word)) self.assertEqual(len(results),2) self.assertTrue(isinstance(results[1], folia.Word)) def test11_complexadd(self): q = fql.Query(Qcomplexadd) results = q(self.doc) self.assertIsInstance(results[0], folia.Word) self.assertIsInstance(results[0][0], folia.TextContent) self.assertIsInstance(results[0][1], folia.LemmaAnnotation) def test12_edittext(self): q = fql.Query(Qedittext) results = q(self.doc) self.assertEqual(results[0].text(), "terwijl") def test12b_edittext(self): q = fql.Query(Qedittext2) results = q(self.doc) self.assertEqual(results[0].text(), "terwijl") def test12c_edittext(self): q = fql.Query(Qedittext3) results = q(self.doc) self.assertEqual(results[0].text(), "de") def test12d_edittext(self): q = fql.Query(Qedittext4) results = q(self.doc) self.assertEqual(results[0].text(), "ter\nwijl") self.assertEqual(results[0].xmlstring(), "ter\nwijl") def test13_subfilter(self): q = fql.Query(Qhas) results = q(self.doc) for result in results: self.assertIn(result.text(), (".",",","(",")")) def test14_subfilter_shortcut(self): q = fql.Query(Qhas_shortcut) results = q(self.doc) self.assertTrue( len(results) > 0 ) for result in results: self.assertIn(result.text(), (".",",","(",")")) def test15_boolean(self): q = fql.Query(Qboolean) results = q(self.doc) self.assertTrue( len(results) > 0 ) for result in results: self.assertIn(result.text(), (".",",")) def test16_context(self): """Obtaining all words following 'de'""" q = fql.Query(Qcontext) results = q(self.doc) self.assertTrue( len(results) > 0 ) self.assertEqual(results[0].text(), "historische") self.assertEqual(results[1].text(), "naam") self.assertEqual(results[2].text(), "verwantschap") self.assertEqual(results[3].text(), "handschriften") self.assertEqual(results[4].text(), "juiste") 
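# the remaining expected values follow the same pattern: each result is the token
# that immediately follows an occurrence of "de" in the example document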
self.assertEqual(results[5].text(), "laatste") self.assertEqual(results[6].text(), "verwantschap") self.assertEqual(results[7].text(), "handschriften") def test16b_context(self): """Obtaining LID ADJ N sequences""" q = fql.Query(Qcontext2) results = q(self.doc) self.assertTrue( len(results) > 0 ) for result in results: self.assertIsInstance(result, fql.SpanSet) #print("RESULT: ", [w.text() for w in result]) self.assertEqual(len(result), 3) self.assertIsInstance(result[0], folia.Word) self.assertIsInstance(result[1], folia.Word) self.assertIsInstance(result[2], folia.Word) self.assertEqual(result[0].pos()[:4], "LID(") self.assertEqual(result[1].pos()[:4], "ADJ(") self.assertEqual(result[2].pos()[:2], "N(") def test17_select_span(self): """Select span""" q = fql.Query(Qselect_span) results = q(self.doc) self.assertIsInstance(results[0], folia.Entity) self.assertEqual(results[0].cls, 'per') self.assertEqual(len(list(results[0].wrefs())), 3) def test18_select_span2(self): """Select span""" q = fql.Query(Qselect_span2) results = q(self.doc) self.assertIsInstance(results[0], folia.Entity) results = list(results[0].wrefs()) self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "Maarten") self.assertIsInstance(results[1], folia.Word) self.assertEqual(results[1].text(), "van") self.assertIsInstance(results[2], folia.Word) self.assertEqual(results[2].text(), "Gompel") def test19_select_span2_returntarget(self): """Select span""" q = fql.Query(Qselect_span2_returntarget) results = q(self.doc) self.assertIsInstance(results[0], fql.SpanSet) results = results[0] self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "Maarten") self.assertIsInstance(results[1], folia.Word) self.assertEqual(results[1].text(), "van") self.assertIsInstance(results[2], folia.Word) self.assertEqual(results[2].text(), "Gompel") def test20a_add_span(self): """Add span""" q = fql.Query(Qadd_span) results = q(self.doc) self.assertIsInstance(results[0], folia.Entity) self.assertEqual(results[0].cls, 'misc') results = list(results[0].wrefs()) self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "hoofdletter") self.assertIsInstance(results[1], folia.Word) self.assertEqual(results[1].text(), "A") def test20b_add_span_returntarget(self): """Add span (return target)""" q = fql.Query(Qadd_span_returntarget) results = q(self.doc) self.assertIsInstance(results[0], fql.SpanSet ) results = results[0] self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "hoofdletter") self.assertIsInstance(results[1], folia.Word) self.assertEqual(results[1].text(), "A") def test20c_add_span_returnancestortarget(self): """Add span (return ancestor target)""" q = fql.Query(Qadd_span_returnancestortarget) results = q(self.doc) self.assertIsInstance(results[0], folia.Part ) def test20d_add_span(self): """Add span (using SPAN instead of FOR SPAN)""" q = fql.Query(Qadd_span2) results = q(self.doc) self.assertIsInstance(results[0], folia.Entity) self.assertEqual(results[0].cls, 'misc') results = list(results[0].wrefs()) self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "hoofdletter") self.assertIsInstance(results[1], folia.Word) self.assertEqual(results[1].text(), "A") self.assertEqual(len(results), 2) def test20e_add_span(self): """Add span (using RESPAN and FOR SPAN, immediately respanning)""" q = fql.Query(Qadd_span3) results = q(self.doc) self.assertIsInstance(results[0], folia.Entity) self.assertEqual(results[0].cls, 
'misc') results = list(results[0].wrefs()) self.assertIsInstance(results[0], folia.Word) self.assertEqual(len(results), 1) self.assertEqual(results[0].text(), "A") def test20f_add_span(self): """Add span (using RESPAN NONE, immediately respanning)""" q = fql.Query(Qadd_span4) results = q(self.doc) self.assertIsInstance(results[0], folia.Entity) self.assertEqual(results[0].cls, 'misc') results = list(results[0].wrefs()) self.assertEqual(len(results), 0) def test21_edit_alt(self): """Add alternative token annotation""" q = fql.Query(Qalt) results = q(self.doc) self.assertIsInstance(results[0], folia.Alternative) self.assertIsInstance(results[0][0], folia.LemmaAnnotation) self.assertEqual(results[0][0].cls, "terwijl") def test22_declare(self): """Explicit declaration""" q = fql.Query(Qdeclare) results = q(self.doc) def test23a_edit_correct(self): """Add correction on token annotation""" q = fql.Query(Qcorrect1) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "nonworderror") self.assertEqual(results[0].confidence, 0.9) self.assertIsInstance(results[0].new(0), folia.LemmaAnnotation) self.assertEqual(results[0].new(0).cls, "terwijl") def test23b_edit_correct(self): """Add correction on token annotation (2)""" q = fql.Query(Qcorrect2) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "nonworderror") self.assertEqual(results[0].confidence, 0.9) self.assertIsInstance(results[0].new(0), folia.LemmaAnnotation) self.assertEqual(results[0].new(0).cls, "terwijl") def test24a_edit_suggest(self): """Add suggestions for correction on token annotation""" q = fql.Query(Qsuggest1) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "nonworderror") self.assertEqual(results[0].parent.lemma(),"terweil") self.assertIsInstance(results[0].suggestions(0), folia.Suggestion) self.assertEqual(results[0].suggestions(0).confidence, 0.9) self.assertIsInstance(results[0].suggestions(0)[0], folia.LemmaAnnotation) self.assertEqual(results[0].suggestions(0)[0].cls, "terwijl") self.assertIsInstance(results[0].suggestions(1), folia.Suggestion) self.assertEqual(results[0].suggestions(1).confidence, 0.1) self.assertIsInstance(results[0].suggestions(1)[0], folia.LemmaAnnotation) self.assertEqual(results[0].suggestions(1)[0].cls, "gedurende") def test24b_edit_correctsuggest(self): """Add correction as well as suggestions on token annotation""" q = fql.Query(Qcorrectsuggest) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "nonworderror") self.assertEqual(results[0].confidence, 0.9) self.assertIsInstance(results[0].new(0), folia.LemmaAnnotation) self.assertEqual(results[0].new(0).cls, "terwijl") self.assertIsInstance(results[0].suggestions(0), folia.Suggestion) self.assertEqual(results[0].suggestions(0).confidence, 0.1) self.assertIsInstance(results[0].suggestions(0)[0], folia.LemmaAnnotation) self.assertEqual(results[0].suggestions(0)[0].cls, "gedurende") def test25a_edit_correct_text(self): """Add correction on text""" q = fql.Query(Qcorrect_text) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "nonworderror") self.assertEqual(results[0].confidence, 0.9) self.assertIsInstance(results[0].new(0), folia.TextContent) self.assertEqual(results[0].new(0).text(), "terwijl") def test25b_edit_suggest_text(self): """Add suggestion for correction on text""" q 
= fql.Query(Qsuggest_text) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "nonworderror") self.assertEqual(results[0].parent.text(),"terweil") #original self.assertIsInstance(results[0].suggestions(0), folia.Suggestion) self.assertEqual(results[0].suggestions(0).confidence, 0.9) self.assertIsInstance(results[0].suggestions(0)[0], folia.TextContent) self.assertEqual(results[0].suggestions(0)[0].text(), "terwijl") self.assertIsInstance(results[0].suggestions(1), folia.Suggestion) self.assertEqual(results[0].suggestions(1).confidence, 0.1) self.assertIsInstance(results[0].suggestions(1)[0], folia.TextContent) self.assertEqual(results[0].suggestions(1)[0].text(), "gedurende") def test26_correct_span(self): """Correction of span annotation""" q = fql.Query(Qcorrect_span) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertIsInstance(results[0].new(0), folia.Entity) self.assertEqual(results[0].new(0).cls, 'misc') self.assertEqual(len(list(results[0].new(0).wrefs())), 3) def test27_edit_respan(self): """Re-spanning""" q = fql.Query(Qrespan) results = q(self.doc) self.assertIsInstance(results[0], folia.SemanticRole) self.assertEqual(results[0].cls, "actor") results = list(results[0].wrefs()) self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "gaat") #yes, this is not a proper semantic role for class 'actor', I know.. but I had to make up a test self.assertIsInstance(results[1], folia.Word) self.assertEqual(results[1].text(), "men") def test28a_merge(self): """Substitute - Merging""" q = fql.Query(Qmerge) results = q(self.doc) self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "weertegeven") def test28b_split(self): """Substitute - Split""" q = fql.Query(Qsplit) results = q(self.doc) self.assertIsInstance(results[0], folia.Word) self.assertIsInstance(results[1], folia.Word) self.assertEqual(results[0].text(), "weer") self.assertEqual(results[1].text(), "gegeven") def test28a_correct_merge(self): """Merge Correction""" q = fql.Query(Qcorrect_merge) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "spliterror") self.assertEqual(results[0].new(0).text(), "weertegeven") def test28b_correct_split(self): """Split Correction""" q = fql.Query(Qcorrect_split) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "runonerror") self.assertIsInstance(results[0].new(0), folia.Word) self.assertIsInstance(results[0].new(1), folia.Word) self.assertEqual(results[0].new(0).text(), "weer") self.assertEqual(results[0].new(1).text(), "gegeven") self.assertEqual(results[0].original(0).text(), "weergegeven") def test28b_suggest_split(self): """Split Suggestion for Correction""" q = fql.Query(Qsuggest_split) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "runonerror") self.assertIsInstance(results[0].suggestions(0)[0], folia.Word) self.assertIsInstance(results[0].suggestions(0)[1], folia.Word) self.assertEqual(results[0].suggestions(0)[0].text(), "weer") self.assertEqual(results[0].suggestions(0)[1].text(), "gegeven") self.assertEqual(results[0].current(0).text(), "weergegeven") def test29a_prepend(self): """Insertion using prepend""" q = fql.Query(Qprepend) results = q(self.doc) self.assertIsInstance(results[0], folia.Word) self.assertEqual(results[0].text(), "heel") 
self.assertEqual(results[0].next(folia.Word).text(), "ander") def test29b_correct_prepend(self): """Insertion as correction (prepend)""" q = fql.Query(Qcorrect_prepend) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "insertion") self.assertEqual(results[0].text(), "heel") self.assertEqual(results[0].next(folia.Word).text(), "ander") def test30_select_startend(self): q = fql.Query(Qselect_startend) results = q(self.doc) self.assertEqual(len(results), 3) self.assertEqual(results[0].text(), "de") self.assertEqual(results[1].text(), "historische") self.assertEqual(results[2].text(), "wetenschap") def test30_select_startend2(self): q = fql.Query(Qselect_startend2) results = q(self.doc) self.assertEqual(len(results), 2) self.assertEqual(results[0].text(), "de") self.assertEqual(results[1].text(), "historische") def test31_in(self): q = fql.Query(Qin) results = q(self.doc) self.assertEqual(len(results), 2) def test31b_in2(self): q = fql.Query(Qin2) results = q(self.doc) self.assertEqual(len(results), 0) def test31c_in2ref(self): q = fql.Query(Qin2ref) results = q(self.doc) self.assertEqual(len(results), 6) #includes ph under phoneme def test31d_tfor(self): q = fql.Query("SELECT t FOR w ID \"WR-P-E-J-0000000001.sandbox.2.s.1.w.2\"") results = q(self.doc) self.assertEqual(len(results), 3) #includes t under morpheme def test31e_tin(self): q = fql.Query("SELECT t IN w ID \"WR-P-E-J-0000000001.sandbox.2.s.1.w.2\"") results = q(self.doc) self.assertEqual(len(results), 1) def test32_correct_delete(self): """Deletion as correction""" q = fql.Query(Qcorrect_delete) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "redundantword") self.assertEqual(results[0].hastext(), False) self.assertEqual(results[0].text(correctionhandling=folia.CorrectionHandling.ORIGINAL), "een") def test33_suggest_insertion(self): """Insertion as suggestion (prepend)""" q = fql.Query(Qsuggest_insertion) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "insertion") self.assertEqual(results[0].suggestions(0).text(), "heel") self.assertEqual(results[0].next(folia.Word,None).text(), "ander") def test34_suggest_insertion2(self): """Insertion as suggestion (append)""" q = fql.Query(Qsuggest_insertion2) results = q(self.doc) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].cls, "insertion") self.assertEqual(results[0].suggestions(0).text(), "heel") self.assertEqual(results[0].next(folia.Word,None).text(), "ander") def test35_comment(self): """Adding a comment""" q = fql.Query(Qcomment) results = q(self.doc) self.assertIsInstance(results[0], folia.Comment) self.assertEqual(results[0].value, "This is our university!") self.assertEqual(results[0].parent.id, "example.radboud.university.nijmegen.org") def test36_feature(self): """Selecting a feature""" q = fql.Query(Qfeat) results = q(self.doc) self.assertIsInstance(results[0], folia.Feature) self.assertEqual(results[0].subset, "wvorm") self.assertEqual(results[0].cls, "pv") def test36b_feature(self): """Editing a feature""" q = fql.Query(Qfeat2) results = q(self.doc) self.assertIsInstance(results[0], folia.Feature) self.assertEqual(results[0].subset, "wvorm") self.assertEqual(results[0].cls, "inf") def test36c_feature(self): """Adding a feature""" q = fql.Query(Qfeat3) results = q(self.doc) self.assertIsInstance(results[0], folia.Feature) self.assertEqual(results[0].subset, "wvorm") 
self.assertEqual(results[0].cls, "inf") def test36d_feature(self): """Editing a feature that has a predefined subset""" q = fql.Query(Qfeat4) results = q(self.doc) self.assertIsInstance(results[0], folia.Feature) self.assertEqual(results[0].subset, "strength") self.assertEqual(results[0].cls, "verystrong") def test37_subqueries(self): """Adding a complex span annotation with span roles, using subqueries""" q = fql.Query(Qadd_span_subqueries) results = q(self.doc) self.assertIsInstance(results[0], folia.Dependency) self.assertEqual(results[0].cls, "test") self.assertEqual(list(results[0].annotation(folia.Headspan).wrefs()), [ results[0].doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] ] ) self.assertEqual(list(results[0].annotation(folia.DependencyDependent).wrefs()), [ results[0].doc['WR-P-E-J-0000000001.p.1.s.2.w.6'] ] ) self.assertEqual(results[0].ancestor(folia.AbstractStructureElement).id, 'WR-P-E-J-0000000001.p.1.s.2') def test38_nested_span(self): """Adding a nested span""" q = fql.Query(Qadd_nested_span) results = q(self.doc) self.assertIsInstance(results[0], folia.SyntacticUnit) self.assertIsInstance(results[0].parent, folia.SyntacticUnit) self.assertEqual(results[0].parent.id, "WR-P-E-J-0000000001.p.1.s.1.su.0") self.assertEqual(list(results[0].wrefs()), [ results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.4'],results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.5'] ] ) def test39_edit_spanrole(self): """Editing a spanrole""" q = fql.Query(Qedit_spanrole) results = q(self.doc) self.assertIsInstance(results[0], folia.Dependency) self.assertEqual(list(results[0].annotation(folia.Headspan).wrefs()), [ results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.3'], results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.4'], results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.5'] ] ) self.assertEqual(results[0].ancestor(folia.AbstractStructureElement).id, 'WR-P-E-J-0000000001.p.1.s.1') def test39b_edit_spanrole(self): """Editing a spanrole (with ID)""" #ID does not exist yet, we add it first: q = fql.Query("SELECT hd FOR ID \"WR-P-E-J-0000000001.p.1.s.1.dep.2\"") hd = q(self.doc)[0] hd.id = "test" self.doc.index["test"] = hd #now the actual test: q = fql.Query(Qedit_spanrole_id) results = q(self.doc) self.assertIsInstance(results[0], folia.Dependency) self.assertEqual(list(results[0].annotation(folia.Headspan).wrefs()), [ results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.3'], results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.4'], results[0].doc['WR-P-E-J-0000000001.p.1.s.1.w.5'] ] ) self.assertEqual(results[0].ancestor(folia.AbstractStructureElement).id, 'WR-P-E-J-0000000001.p.1.s.1') class Test4CQL(unittest.TestCase): def setUp(self): self.doc = folia.Document(string=FOLIAEXAMPLE) def test01_context(self): q = fql.Query(cql.cql2fql(Qcql_context)) results = q(self.doc) self.assertTrue( len(results) > 0 ) for result in results: self.assertIsInstance(result, fql.SpanSet) #print("RESULT: ", [w.text() for w in result]) self.assertEqual(len(result), 3) self.assertIsInstance(result[0], folia.Word) self.assertIsInstance(result[1], folia.Word) self.assertIsInstance(result[2], folia.Word) self.assertEqual(result[0].text(), "de") self.assertEqual(result[1].pos()[:4], "ADJ(") self.assertEqual(result[2].pos()[:2], "N(") def test02_context(self): q = fql.Query(cql.cql2fql(Qcql_context2)) results = q(self.doc) self.assertTrue( len(results) > 0 ) textresults = [] for result in results: self.assertIsInstance(result, fql.SpanSet) textresults.append( tuple([w.text() for w in result]) ) self.assertTrue( ('het','alfabet') in textresults ) self.assertTrue( 
('vierkante','haken') in textresults ) self.assertTrue( ('plaats',) in textresults ) self.assertTrue( ('het','originele','handschrift') in textresults ) self.assertTrue( ('Een','volle','lijn') in textresults ) def test03_context(self): q = fql.Query(cql.cql2fql(Qcql_context3)) results = q(self.doc) self.assertEqual( len(results), 2 ) textresults = [] for result in results: self.assertIsInstance(result, fql.SpanSet) self.assertEqual(len(result), 2) textresults.append( tuple([w.text() for w in result]) ) #print(textresults,file=sys.stderr) self.assertTrue( ('naam','stemma') in textresults ) self.assertTrue( ('stemma','codicum') in textresults ) def test04_context(self): q = fql.Query(cql.cql2fql(Qcql_context4)) results = q(self.doc) self.assertEqual( len(results),2 ) textresults = [] for result in results: self.assertIsInstance(result, fql.SpanSet) textresults.append( tuple([w.text() for w in result]) ) #print(textresults,file=sys.stderr) self.assertTrue( ('genummerd','en','gedateerd') in textresults ) self.assertTrue( ('opgenomen','en','worden','weergegeven') in textresults ) def test05_context(self): q = fql.Query(cql.cql2fql(Qcql_context5)) results = q(self.doc) self.assertTrue( len(results) > 0 ) textresults = [] for result in results: self.assertIsInstance(result, fql.SpanSet) textresults.append( tuple([w.text() for w in result]) ) #print(textresults,file=sys.stderr) self.assertTrue( ('en','gedateerd','zodat') in textresults ) self.assertTrue( ('en','worden','weergegeven','door') in textresults ) self.assertTrue( ('zodat','ze') in textresults ) self.assertTrue( ('en','worden','tussen') in textresults ) self.assertTrue( ('terweil','een') in textresults ) def test06_context(self): q = fql.Query(cql.cql2fql(Qcql_context6)) results = q(self.doc) self.assertTrue( len(results) > 0 ) for result in results: self.assertIsInstance(result, fql.SpanSet) self.assertEqual(len(result), 1) self.assertTrue(result[0].pos()[:2] == "VZ" or result[0].pos()[:2] == "VG" ) class Test4Evaluation(unittest.TestCase): """Higher-order corrections (corrections on corrections)""" def setUp(self): self.doc = folia.Document(string=FOLIACORRECTIONEXAMPLE) def test1_split2(self): """Substitute - Split (higher-order)""" q = fql.Query(Qsplit2) results = q(self.doc) self.assertEqual(len(results), 1) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].text(), "Ik hoor") def test2_merge2(self): """Substitute - Merge (higher-order)""" q = fql.Query(Qmerge2) results = q(self.doc) self.assertEqual(len(results), 1) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].text(), "onweer") def test3_deletion2(self): """Deletion (higher-order)""" q = fql.Query(Qdeletion2) results = q(self.doc) self.assertEqual(len(results), 1) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].hastext(), False) self.assertEqual(results[0].original().text(), "een") self.assertEqual(results[0].previous(None).id, "correctionexample.s.8.w.2") self.assertEqual(results[0].next(None).id, "correctionexample.s.8.w.4") def test3_insertion2(self): """Substitute - Insertion (higher-order)""" q = fql.Query(Qinsertion2) results = q(self.doc) self.assertEqual(len(results), 1) self.assertIsInstance(results[0], folia.Correction) self.assertEqual(results[0].text(), '.') self.assertIsInstance(results[0].original()[0], folia.Correction) if os.path.exists('../../FoLiA'): FOLIAPATH = '../../FoLiA/' elif os.path.exists('../FoLiA'): FOLIAPATH = '../FoLiA/' else: FOLIAPATH = 'FoLiA' 
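# The FoLiA repository is cloned on demand here so that test/example.xml and
# test/correctionexample.xml can be loaded as test fixtures below (assumes
# network access and the `git` command are available).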
print("Downloading FoLiA",file=sys.stderr) os.system("git clone https://github.com/proycon/folia.git FoLiA") f = io.open(FOLIAPATH + '/test/example.xml', 'r',encoding='utf-8') FOLIAEXAMPLE = f.read() f.close() f = io.open(FOLIAPATH + '/test/correctionexample.xml', 'r',encoding='utf-8') FOLIACORRECTIONEXAMPLE = f.read() f.close() if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/cgn.py0000755000175000001440000002567112445064173017326 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for CGN # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys import os import unittest if sys.version < '3': from StringIO import StringIO else: from io import StringIO import lxml.etree from pynlpl.formats import cgn class CGNtest(unittest.TestCase): def test(self): """CGN - Splitting PoS tags into features""" global CLASSES #Do it again, but supress exceptions (only stderr output for missing features so we have one big list) for poscls in CLASSES.split('\n'): if poscls: cgn.parse_cgn_postag(poscls, False) #Do it again, raising an exception this time for poscls in CLASSES.split('\n'): if poscls: cgn.parse_cgn_postag(poscls, True) CLASSES = """ TSW(dial) N(soort,dial) N(eigen,dial) ADJ(dial) WW(dial) TW(hoofd,dial) TW(rang,dial) VNW(pers,pron,dial) VNW(refl,pron,dial) VNW(recip,pron,dial) VNW(bez,det,dial) VNW(vrag,pron,dial) VNW(vrag,det,dial) VNW(betr,pron,dial) VNW(betr,det,dial) VNW(excl,pron,dial) VNW(excl,det,dial) VNW(aanw,pron,dial) VNW(aanw,det,dial) VNW(onbep,pron,dial) VNW(onbep,det,dial) LID(bep,dial) LID(onbep,dial) VZ(init,dial) VZ(fin,dial) VG(neven,dial) VG(onder,dial) BW(dial) TSW() SPEC(afgebr) SPEC(onverst) SPEC(vreemd) SPEC(deeleigen) SPEC(meta) LET() SPEC(comment) SPEC(achter) SPEC(afk) SPEC(symb) N(soort,ev,basis,zijd,stan) N(soort,ev,basis,onz,stan) N(soort,ev,dim,onz,stan) N(soort,ev,basis,gen) N(soort,ev,dim,gen) N(soort,ev,basis,dat) N(soort,mv,basis) N(soort,mv,dim) N(eigen,ev,basis,zijd,stan) N(eigen,ev,basis,onz,stan) N(eigen,ev,dim,onz,stan) N(eigen,ev,basis,gen) N(eigen,ev,dim,gen) N(eigen,ev,basis,dat) N(eigen,mv,basis) N(eigen,mv,dim) ADJ(prenom,basis,zonder) ADJ(prenom,basis,met-e,stan) ADJ(prenom,basis,met-e,bijz) ADJ(prenom,comp,zonder) ADJ(prenom,comp,met-e,stan) ADJ(prenom,comp,met-e,bijz) ADJ(prenom,sup,zonder) ADJ(prenom,sup,met-e,stan) ADJ(prenom,sup,met-e,bijz) ADJ(nom,basis,zonder,zonder-n) ADJ(nom,basis,zonder,mv-n) ADJ(nom,basis,met-e,zonder-n,stan) ADJ(nom,basis,met-e,zonder-n,bijz) ADJ(nom,basis,met-e,mv-n) ADJ(nom,comp,zonder,zonder-n) ADJ(nom,comp,met-e,zonder-n,stan) ADJ(nom,comp,met-e,zonder-n,bijz) ADJ(nom,comp,met-e,mv-n) ADJ(nom,sup,zonder,zonder-n) ADJ(nom,sup,met-e,zonder-n,stan) ADJ(nom,sup,met-e,zonder-n,bijz) ADJ(nom,sup,met-e,mv-n) ADJ(postnom,basis,zonder) ADJ(postnom,basis,met-s) ADJ(postnom,comp,zonder) ADJ(postnom,comp,met-s) ADJ(vrij,basis,zonder) ADJ(vrij,comp,zonder) ADJ(vrij,sup,zonder) ADJ(vrij,dim,zonder) WW(pv,tgw,ev) WW(pv,tgw,mv) WW(pv,tgw,met-t) WW(pv,verl,ev) WW(pv,verl,mv) WW(pv,verl,met-t) WW(pv,conj,ev) WW(inf,prenom,zonder) WW(inf,prenom,met-e) WW(inf,nom,zonder,zonder-n) WW(inf,vrij,zonder) 
WW(vd,prenom,zonder) WW(vd,prenom,met-e) WW(vd,nom,met-e,zonder-n) WW(vd,nom,met-e,mv-n) WW(vd,vrij,zonder) WW(od,prenom,zonder) WW(od,prenom,met-e) WW(od,nom,met-e,zonder-n) WW(od,nom,met-e,mv-n) WW(od,vrij,zonder) TW(hoofd,prenom,stan) TW(hoofd,prenom,bijz) TW(hoofd,nom,zonder-n,basis) TW(hoofd,nom,mv-n,basis) TW(hoofd,nom,zonder-n,dim) TW(hoofd,nom,mv-n,dim) TW(hoofd,vrij) TW(rang,prenom,stan) TW(rang,prenom,bijz) TW(rang,nom,zonder-n) TW(rang,nom,mv-n) VNW(pers,pron,nomin,vol,1,ev) VNW(pers,pron,nomin,nadr,1,ev) VNW(pers,pron,nomin,red,1,ev) VNW(pers,pron,nomin,vol,1,mv) VNW(pers,pron,nomin,nadr,1,mv) VNW(pers,pron,nomin,red,1,mv) VNW(pers,pron,nomin,vol,2v,ev) VNW(pers,pron,nomin,nadr,2v,ev) VNW(pers,pron,nomin,red,2v,ev) VNW(pers,pron,nomin,nadr,3m,ev,masc) VNW(pers,pron,nomin,vol,3v,ev,fem) VNW(pers,pron,nomin,nadr,3v,ev,fem) VNW(pers,pron,obl,vol,2v,ev) VNW(pers,pron,obl,nadr,3m,ev,masc) VNW(pers,pron,gen,vol,1,ev) VNW(pers,pron,gen,vol,1,mv) VNW(pers,pron,gen,vol,3m,ev) VNW(bez,det,gen,vol,1,ev,prenom,zonder,evmo) VNW(bez,det,gen,vol,1,mv,prenom,met-e,evmo) VNW(bez,det,gen,vol,3v,ev,prenom,zonder,evmo) VNW(bez,det,dat,vol,1,ev,prenom,met-e,evmo) VNW(bez,det,dat,vol,1,ev,prenom,met-e,evf) VNW(bez,det,dat,vol,1,mv,prenom,met-e,evmo) VNW(bez,det,dat,vol,1,mv,prenom,met-e,evf) VNW(bez,det,dat,vol,2v,ev,prenom,met-e,evf) VNW(bez,det,dat,vol,3v,ev,prenom,met-e,evmo) VNW(bez,det,dat,vol,3v,ev,prenom,met-e,evf) VNW(bez,det,dat,vol,1,ev,nom,met-e,zonder-n) VNW(bez,det,dat,vol,1,mv,nom,met-e,zonder-n) VNW(bez,det,dat,vol,3m,ev,nom,met-e,zonder-n) VNW(bez,det,dat,vol,3v,ev,nom,met-e,zonder-n) VNW(betr,pron,gen,vol,3o,ev) VNW(aanw,pron,gen,vol,3m,ev) VNW(aanw,pron,gen,vol,3o,ev) VNW(aanw,det,dat,prenom,met-e,evmo) VNW(aanw,det,dat,prenom,met-e,evf) VNW(aanw,det,gen,nom,met-e,zonder-n) VNW(aanw,det,dat,nom,met-e,zonder-n) VNW(onbep,det,gen,prenom,met-e,mv) VNW(onbep,det,dat,prenom,met-e,evmo) VNW(onbep,det,dat,prenom,met-e,evf) VNW(onbep,det,gen,nom,met-e,mv-n) VNW(onbep,grad,gen,nom,met-e,mv-n,basis) LID(bep,stan,evon) LID(bep,stan,rest) LID(bep,gen,evmo) LID(bep,dat,evmo) LID(bep,dat,evf) LID(bep,dat,mv) LID(onbep,gen,evf) VZ(init) VZ(fin) VZ(versm) VG(neven) VG(onder) BW() N(soort,ev,basis,genus,stan) N(eigen,ev,basis,genus,stan) VNW(pers,pron,nomin,vol,2b,getal) VNW(pers,pron,nomin,nadr,2b,getal) VNW(pers,pron,nomin,vol,2,getal) VNW(pers,pron,nomin,nadr,2,getal) VNW(pers,pron,nomin,red,2,getal) VNW(pers,pron,nomin,vol,3,ev,masc) VNW(pers,pron,nomin,red,3,ev,masc) VNW(pers,pron,nomin,red,3p,ev,masc) VNW(pers,pron,nomin,vol,3p,mv) VNW(pers,pron,nomin,nadr,3p,mv) VNW(pers,pron,obl,vol,3,ev,masc) VNW(pers,pron,obl,red,3,ev,masc) VNW(pers,pron,obl,vol,3,getal,fem) VNW(pers,pron,obl,nadr,3v,getal,fem) VNW(pers,pron,obl,red,3v,getal,fem) VNW(pers,pron,obl,vol,3p,mv) VNW(pers,pron,obl,nadr,3p,mv) VNW(pers,pron,stan,nadr,2v,mv) VNW(pers,pron,stan,red,3,ev,onz) VNW(pers,pron,stan,red,3,ev,fem) VNW(pers,pron,stan,red,3,mv) VNW(pers,pron,gen,vol,2,getal) VNW(pers,pron,gen,vol,3v,getal) VNW(pers,pron,gen,vol,3p,mv) VNW(pr,pron,obl,vol,1,ev) VNW(pr,pron,obl,nadr,1,ev) VNW(pr,pron,obl,red,1,ev) VNW(pr,pron,obl,vol,1,mv) VNW(pr,pron,obl,nadr,1,mv) VNW(pr,pron,obl,red,2v,getal) VNW(pr,pron,obl,nadr,2v,getal) VNW(pr,pron,obl,vol,2,getal) VNW(pr,pron,obl,nadr,2,getal) VNW(refl,pron,obl,red,3,getal) VNW(refl,pron,obl,nadr,3,getal) VNW(recip,pron,obl,vol,persoon,mv) VNW(recip,pron,gen,vol,persoon,mv) VNW(bez,det,stan,vol,1,ev,prenom,zonder,agr) VNW(bez,det,stan,vol,1,ev,prenom,met-e,rest) 
VNW(bez,det,stan,red,1,ev,prenom,zonder,agr) VNW(bez,det,stan,vol,1,mv,prenom,zonder,evon) VNW(bez,det,stan,vol,1,mv,prenom,met-e,rest) VNW(bez,det,stan,vol,2,getal,prenom,zonder,agr) VNW(bez,det,stan,vol,2,getal,prenom,met-e,rest) VNW(bez,det,stan,vol,2v,ev,prenom,zonder,agr) VNW(bez,det,stan,red,2v,ev,prenom,zonder,agr) VNW(bez,det,stan,nadr,2v,mv,prenom,zonder,agr) VNW(bez,det,stan,vol,3,ev,prenom,zonder,agr) VNW(bez,det,stan,vol,3m,ev,prenom,met-e,rest) VNW(bez,det,stan,vol,3v,ev,prenom,met-e,rest) VNW(bez,det,stan,red,3,ev,prenom,zonder,agr) VNW(bez,det,stan,vol,3,mv,prenom,zonder,agr) VNW(bez,det,stan,vol,3p,mv,prenom,met-e,rest) VNW(bez,det,stan,red,3,getal,prenom,zonder,agr) VNW(bez,det,gen,vol,1,ev,prenom,met-e,rest3) VNW(bez,det,gen,vol,1,mv,prenom,met-e,rest3) VNW(bez,det,gen,vol,2,getal,prenom,zonder,evmo) VNW(bez,det,gen,vol,2,getal,prenom,met-e,rest3) VNW(bez,det,gen,vol,2v,ev,prenom,met-e,rest3) VNW(bez,det,gen,vol,3,ev,prenom,zonder,evmo) VNW(bez,det,gen,vol,3,ev,prenom,met-e,rest3) VNW(bez,det,gen,vol,3v,ev,prenom,met-e,rest3) VNW(bez,det,gen,vol,3p,mv,prenom,zonder,evmo) VNW(bez,det,gen,vol,3p,mv,prenom,met-e,rest3) VNW(bez,det,dat,vol,2,getal,prenom,met-e,evmo) VNW(bez,det,dat,vol,2,getal,prenom,met-e,evf) VNW(bez,det,dat,vol,3,ev,prenom,met-e,evmo) VNW(bez,det,dat,vol,3,ev,prenom,met-e,evf) VNW(bez,det,dat,vol,3p,mv,prenom,met-e,evmo) VNW(bez,det,dat,vol,3p,mv,prenom,met-e,evf) VNW(bez,det,stan,vol,1,ev,nom,met-e,zonder-n) VNW(bez,det,stan,vol,1,mv,nom,met-e,zonder-n) VNW(bez,det,stan,vol,2,getal,nom,met-e,zonder-n) VNW(bez,det,stan,vol,2v,ev,nom,met-e,zonder-n) VNW(bez,det,stan,vol,3m,ev,nom,met-e,zonder-n) VNW(bez,det,stan,vol,3v,ev,nom,met-e,zonder-n) VNW(bez,det,stan,vol,3p,mv,nom,met-e,zonder-n) VNW(bez,det,stan,vol,1,ev,nom,met-e,mv-n) VNW(bez,det,stan,vol,1,mv,nom,met-e,mv-n) VNW(bez,det,stan,vol,2,getal,nom,met-e,mv-n) VNW(bez,det,stan,vol,2v,ev,nom,met-e,mv-n) VNW(bez,det,stan,vol,3m,ev,nom,met-e,mv-n) VNW(bez,det,stan,vol,3v,ev,nom,met-e,mv-n) VNW(bez,det,stan,vol,3p,mv,nom,met-e,mv-n) VNW(bez,det,dat,vol,2,getal,nom,met-e,zonder-n) VNW(bez,det,dat,vol,3p,mv,nom,met-e,zonder-n) VNW(vrag,pron,stan,nadr,3o,ev) VNW(betr,pron,stan,vol,persoon,getal) VNW(betr,pron,stan,vol,3,ev) VNW(betr,det,stan,nom,zonder,zonder-n) VNW(betr,det,stan,nom,met-e,zonder-n) VNW(betr,pron,gen,vol,3o,getal) VNW(vb,pron,stan,vol,3p,getal) VNW(vb,pron,stan,vol,3o,ev) VNW(vb,pron,gen,vol,3m,ev) VNW(vb,pron,gen,vol,3v,ev) VNW(vb,pron,gen,vol,3p,mv) VNW(vb,adv-pron,obl,vol,3o,getal) VNW(excl,pron,stan,vol,3,getal) VNW(vb,det,stan,prenom,zonder,evon) VNW(vb,det,stan,prenom,met-e,rest) VNW(vb,det,stan,nom,met-e,zonder-n) VNW(excl,det,stan,vrij,zonder) VNW(aanw,pron,stan,vol,3o,ev) VNW(aanw,pron,stan,nadr,3o,ev) VNW(aanw,pron,stan,vol,3,getal) VNW(aanw,adv-pron,obl,vol,3o,getal) VNW(aanw,adv-pron,stan,red,3,getal) VNW(aanw,det,stan,prenom,zonder,evon) VNW(aanw,det,stan,prenom,zonder,rest) VNW(aanw,det,stan,prenom,zonder,agr) VNW(aanw,det,stan,prenom,met-e,rest) VNW(aanw,det,gen,prenom,met-e,rest3) VNW(aanw,det,stan,nom,met-e,zonder-n) VNW(aanw,det,stan,nom,met-e,mv-n) VNW(aanw,det,stan,vrij,zonder) VNW(onbep,pron,stan,vol,3p,ev) VNW(onbep,pron,stan,vol,3o,ev) VNW(onbep,pron,gen,vol,3p,ev) VNW(onbep,adv-pron,obl,vol,3o,getal) VNW(onbep,adv-pron,gen,red,3,getal) VNW(onbep,det,stan,prenom,zonder,evon) VNW(onbep,det,stan,prenom,zonder,agr) VNW(onbep,det,stan,prenom,met-e,evz) VNW(onbep,det,stan,prenom,met-e,mv) VNW(onbep,det,stan,prenom,met-e,rest) VNW(onbep,det,stan,prenom,met-e,agr) 
VNW(onbep,grad,stan,prenom,zonder,agr,basis) VNW(onbep,grad,stan,prenom,met-e,agr,basis) VNW(onbep,grad,stan,prenom,met-e,mv,basis) VNW(onbep,grad,stan,prenom,zonder,agr,comp) VNW(onbep,grad,stan,prenom,met-e,agr,sup) VNW(onbep,grad,stan,prenom,met-e,agr,comp) VNW(onbep,det,stan,nom,met-e,mv-n) VNW(onbep,det,stan,nom,met-e,zonder-n) VNW(onbep,det,stan,nom,zonder,zonder-n) VNW(onbep,grad,stan,nom,met-e,zonder-n,basis) VNW(onbep,grad,stan,nom,met-e,mv-n,basis) VNW(onbep,grad,stan,nom,met-e,zonder-n,sup) VNW(onbep,grad,stan,nom,met-e,mv-n,sup) VNW(onbep,grad,stan,nom,zonder,mv-n,dim) VNW(onbep,det,stan,vrij,zonder) VNW(onbep,grad,stan,vrij,zonder,basis) VNW(onbep,grad,stan,vrij,zonder,sup) VNW(onbep,grad,stan,vrij,zonder,comp) LID(bep,gen,rest3) LID(onbep,stan,agr) VNW(onbep,grad,stan,nom,zonder,zonder-n,sup) SPEC(enof) """ if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/folia_benchmark.py0000755000175000001440000001440412506555722021656 0ustar proyconusers00000000000000#!/usr/bin/env python from __future__ import print_function, unicode_literals, division, absolute_import from pynlpl.formats import folia, fql, cql import time import sys import os import glob try: from pympler import asizeof except ImportError: print("An extra dependency called pympler is required: install using pip install pympler (or other means)",file=sys.stderr) raise repetitions = 0 def timeit(f): def f_timer(*args, **kwargs): if 'filename' in kwargs: label = "on file " + kwargs['filename'] elif 'dirname' in kwargs: label = "on directory " + kwargs['dirname'] elif 'doc' in kwargs: label = "on document " + kwargs['doc'].id else: label = "" print(f.__name__ + " -- " + f.__doc__ + " -- " + label + " ...", end="") times = [] for i in range(0, repetitions): start = time.time() try: result = f(*args, **kwargs) except Exception as e: print(" -- ERROR! 
-- ", e) return None times.append(time.time() - start) if times: d = round(sum(times) / len(times),4) print('took ' + str(d) + 's (averaged over ' + str(len(times)) + ' runs)') else: d = 0 return result return f_timer @timeit def loadfile(**kwargs): """Loading file""" doc = folia.Document(file=kwargs['filename'],bypassleak=False) @timeit def savefile(**kwargs): #careful with SSDs """Saving file""" kwargs['doc'].save("/tmp/test.xml") @timeit def xml(**kwargs): """XML serialisation""" kwargs['doc'].xml() @timeit def json(**kwargs): """JSON serialisation""" kwargs['doc'].json() @timeit def text(**kwargs): """text serialisation""" kwargs['doc'].text() @timeit def countwords(**kwargs): """Counting words""" kwargs['doc'].count(folia.Word,None, True,[folia.AbstractAnnotationLayer]) @timeit def selectwords(**kwargs): """Selecting words""" for word in kwargs['doc'].words(): pass @timeit def selectwordsfql(**kwargs): """Selecting words using FQL""" query = fql.Query("SELECT w") for word in query(kwargs['doc']): pass @timeit def selectwordsfqlforp(**kwargs): """Selecting words in paragraphs using FQL""" query = fql.Query("SELECT w FOR p") for word in query(kwargs['doc']): pass @timeit def selectwordsfqlxml(**kwargs): """Selecting words using FQL (XML output)""" query = fql.Query("SELECT w FORMAT xml") for wordxml in query(kwargs['doc']): pass @timeit def selectwordsfqlwhere(**kwargs): """Selecting words using FQL (with WHERE clause)""" query = fql.Query("SELECT w WHERE text != \"blah\"") for word in query(kwargs['doc']): pass @timeit def editwordsfql(**kwargs): """Editing the text of words using FQL (with WHERE clause)""" query = fql.Query("EDIT w WITH text \"blah\"") for word in query(kwargs['doc']): pass @timeit def nextwords(**kwargs): """Find neighbour of each word""" for word in kwargs['doc'].words(): word.next() @timeit def addelement(**kwargs): """Adding a simple annotation (desc) to each word""" for word in kwargs['doc'].words(): try: word.append(folia.Description, value="test") except folia.DuplicateAnnotationError: pass @timeit def ancestors(**kwargs): """Iterating over the ancestors of each word""" for word in kwargs['doc'].words(): for ancestor in word.ancestors(): pass @timeit def readerwords(**kwargs): """Iterating over words using Reader""" reader = folia.Reader(kwargs['filename'], folia.Word) for word in reader: pass def main(): global repetitions, target files = [] try: begin = 1 if os.path.exists(sys.argv[1]): begin = 1 selectedtests = "all" repetitions = 1 else: selectedtests = sys.argv[1].split(',') if os.path.exists(sys.argv[2]): repetitions = 1 begin = 2 else: repetitions = int(sys.argv[2]) begin = 3 filesordirs = sys.argv[begin:] except: print("Syntax: folia_benchmark [testfunctions [repetitions]] files-or-directories+",file=sys.stderr) print(" testfunctions is a comma separated list of function names, or the special keyword 'all'", file=sys.stderr) print(" directories are recursively searched for files with the extension folia.xml, +gz and +bz2 is supported too.", file=sys.stderr) sys.exit(2) for fd in filesordirs: if not os.path.exists(fd): raise Exception("No such file or directory" + fd) if os.path.isfile(fd): files.append(fd) elif os.path.isdir(fd): dirs = [fd] while dirs: dir = dirs.pop(0) for filename in glob.glob(dir + "/*"): if os.path.isdir(filename): dirs.append(filename) elif filename.endswith('.folia.xml') or filename.endswith('.folia.xml.gz') or filename.endswith('.folia.xml.bz2'): files.append(filename) for f in ('loadfile','loadfileleakbypass','readerwords'): if 
f in selectedtests or 'all' in selectedtests: for filename in files: globals()[f](filename=filename) for f in ('xml','text','json','countwords','selectwords','nextwords','ancestors','selectwordsfql','selectwordsfqlforp','selectwordsfqlxml','selectwordsfqlwhere','editwordsfql', 'addelement' ): if f in selectedtests or 'all' in selectedtests: for filename in files: doc = folia.Document(file=filename) globals()[f](doc=doc) for f in ('memtest',): if f in selectedtests or 'all' in selectedtests: for filename in files: doc = folia.Document(file=filename) print("memtest -- Memory test on document " + filename + " -- memory consumption estimated at " + str(round(asizeof.asizeof(doc) / 1024 / 1024,2)) + " MB" + " (filesize " + str(round(os.path.getsize(filename)/1024/1024,2)) + " MB)") if __name__ == '__main__': main() PyNLPl-1.1.2/pynlpl/tests/evaluation_timbl/0000755000175000001440000000000013024723552021522 5ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/tests/evaluation_timbl/timbltest.sh0000755000175000001440000000005412445064173024072 0ustar proyconusers00000000000000#!/bin/bash timbl -f train -t test +v+cm+cs PyNLPl-1.1.2/pynlpl/tests/evaluation_timbl/test.IB1.O.gr.k1.out0000644000175000001440000000066312445064173024732 0ustar proyconusers00000000000000cat cat cat cat cat cat cat cat cat cat cat cat cat cat cat dog cat dog dog cat dog dog cat dog cat dog cat cat dog cat rabbit dog rabbit dog dog dog dog dog dog dog dog dog dog rabbit dog dog rabbit dog rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit PyNLPl-1.1.2/pynlpl/tests/evaluation_timbl/train0000644000175000001440000000003612445064173022564 0ustar proyconusers00000000000000cat cat dog dog rabbit rabbit PyNLPl-1.1.2/pynlpl/tests/evaluation_timbl/test0000644000175000001440000000044612445064173022433 0ustar proyconusers00000000000000cat cat cat cat cat cat cat cat cat cat dog cat dog cat dog cat cat dog cat dog rabbit dog dog dog dog dog dog dog dog rabbit dog rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit rabbit PyNLPl-1.1.2/pynlpl/tests/test.sh0000755000175000001440000000262712764005332017510 0ustar proyconusers00000000000000#!/bin/bash if [ ! -z "$1" ]; then PYTHON=$1 else PYTHON=python fi if [ ! -z "$2" ]; then TESTDIR="$2" else TESTDIR=`dirname $0` fi cd $TESTDIR GOOD=1 echo "Testing CGN">&2 $PYTHON cgn.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing datatypes">&2 $PYTHON datatypes.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing evaluation">&2 $PYTHON evaluation.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing search">&2 $PYTHON search.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing textprocessors">&2 $PYTHON textprocessors.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing statistics">&2 $PYTHON statistics.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing formats">&2 $PYTHON formats.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing folia">&2 $PYTHON folia.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing FQL">&2 $PYTHON fql.py if [ $? -ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi echo "Testing CQL">&2 $PYTHON cql.py if [ $? 
-ne 0 ]; then echo "Test failed!!!" >&2 GOOD=0 fi cd .. if [ $GOOD -eq 1 ]; then echo "Done, all tests passed!" >&2 exit 0 else echo "TESTS FAILED!!!!" >&2 exit 1 fi PyNLPl-1.1.2/pynlpl/tests/datatypes.py0000755000175000001440000000447612445064173020555 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import os import sys import unittest from pynlpl.datatypes import PriorityQueue values = [3,6,6,1,8,2] mintomax = sorted(values) maxtomin = list(reversed(mintomax)) class PriorityQueueTest(unittest.TestCase): def test_append_minimized(self): """Minimized PriorityQueue""" global values pq = PriorityQueue(values, lambda x: x, True,0,False,False) result = list(iter(pq)) self.assertEqual(result, mintomax) def test_append_maximized(self): """Maximized PriorityQueue""" global values pq = PriorityQueue(values, lambda x: x, False,0,False,False) result = list(iter(pq)) self.assertEqual(result, maxtomin) def test_append_maximized_blockworse(self): """Maximized PriorityQueue (with blockworse)""" global values pq = PriorityQueue(values, lambda x: x, False,0,True,False) result = list(iter(pq)) self.assertEqual(result, [8,6,6,3]) def test_append_maximized_blockworse_blockequal(self): """Maximized PriorityQueue (with blockworse + blockequal)""" global values pq = PriorityQueue(values, lambda x: x, False,0,True,True) result = list(iter(pq)) self.assertEqual(result, [8,6,3]) def test_append_minimized_blockworse(self): """Minimized PriorityQueue (with blockworse)""" global values pq = PriorityQueue(values, lambda x: x, True,0,True,False) result = list(iter(pq)) self.assertEqual(result, [1,3]) def test_append_minimized_fixedlength(self): """Fixed-length priority queue (min)""" global values pq = PriorityQueue(values, lambda x: x, True,4, False,False) result = list(iter(pq)) self.assertEqual(result, mintomax[:4]) def test_append_maximized_fixedlength(self): """Fixed-length priority queue (max)""" global values pq = PriorityQueue(values, lambda x: x, False,4,False,False) result = list(iter(pq)) self.assertEqual(result, maxtomin[:4]) if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/__init__.py0000644000175000001440000000000012764005332020262 0ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/tests/evaluation.py0000755000175000001440000001115112445064173020712 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for Evaluation # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import sys import os import unittest import random from pynlpl.evaluation import AbstractExperiment, WPSParamSearch, ExperimentPool, ClassEvaluation class ParamExperiment(AbstractExperiment): def defaultparameters(self): return {'a':1,'b':1,'c':1} def run(self): self.result = 0 for line in self.inputdata: self.result += int(line) * self.parameters['a'] * self.parameters['b'] - self.parameters['c'] def score(self): return self.result @staticmethod def sample(inputdata,n): n = int(n) if n > 
len(inputdata): return inputdata else: return random.sample(inputdata,int(n)) class PoolExperiment(AbstractExperiment): def start(self): self.startcommand('sleep',None,None,None,str(self.parameters['duration'])) print("STARTING: sleep " + str(self.parameters['duration'])) class WPSTest(unittest.TestCase): def test_wps(self): inputdata = [ 1,2,3,4,5,6 ] parameterscope = [ ('a',[2,4]), ('b',[2,5,8]), ('c',[3,6,9]) ] search = WPSParamSearch(ParamExperiment, inputdata, len(inputdata), parameterscope) solution = search.searchbest() self.assertEqual(solution, (('a', 4), ('b', 8), ('c', 3)) ) class ExperimentPoolTest(unittest.TestCase): def test_pool(self): pool = ExperimentPool(4) for i in range(0,15): pool.append( PoolExperiment(None, duration=random.randint(1,6)) ) for experiment in pool.run(): print("DONE: sleep " + str(experiment.parameters['duration'])) self.assertTrue(True) #if we got here, no exceptions were raised and it's okay class ClassEvaluationTest2(unittest.TestCase): def setUp(self): self.goals = ['sun','sun','rain','cloudy','sun','rain'] self.observations = ['cloudy','cloudy','cloudy','rain','sun','sun'] def test001(self): e = ClassEvaluation(self.goals, self.observations) print() print(e) print(e.confusionmatrix()) class ClassEvaluationTest(unittest.TestCase): def setUp(self): self.goals = ['cat','cat','cat','cat','cat','cat','cat','cat', 'dog', 'dog','dog','dog','dog','dog' ,'rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit'] self.observations = ['cat','cat','cat','cat','cat','dog','dog','dog', 'cat','cat','rabbit','dog','dog','dog' ,'rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','rabbit','dog','dog'] def test001(self): """Class evaluation test -- (See also http://en.wikipedia.org/wiki/Confusion_matrix , using same data)""" e = ClassEvaluation(self.goals, self.observations) print() print(e) print(e.confusionmatrix()) self.assertEqual(e.tp['cat'], 5) self.assertEqual(e.fp['cat'], 2) self.assertEqual(e.tn['cat'], 17) self.assertEqual(e.fn['cat'], 3) self.assertEqual(e.tp['rabbit'], 11) self.assertEqual(e.fp['rabbit'], 1) self.assertEqual(e.tn['rabbit'], 13) self.assertEqual(e.fn['rabbit'], 2) self.assertEqual(e.tp['dog'], 3) self.assertEqual(e.fp['dog'], 5) self.assertEqual(e.tn['dog'], 16) self.assertEqual(e.fn['dog'], 3) self.assertEqual( round(e.precision('cat'),6), 0.714286) self.assertEqual( round(e.precision('rabbit'),6), 0.916667) self.assertEqual( round(e.precision('dog'),6), 0.375000) self.assertEqual( round(e.recall('cat'),6), 0.625000) self.assertEqual( round(e.recall('rabbit'),6), 0.846154) self.assertEqual( round(e.recall('dog'),6),0.500000) self.assertEqual( round(e.fscore('cat'),6), 0.666667) self.assertEqual( round(e.fscore('rabbit'),6), 0.880000) self.assertEqual( round(e.fscore('dog'),6),0.428571) self.assertEqual( round(e.accuracy(),6), 0.703704) if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/statistics.py0000755000175000001440000000635112445064173020743 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for Statistics and Information Theory # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals
from __future__ import division from __future__ import absolute_import import sys import os import unittest from pynlpl.statistics import FrequencyList, HiddenMarkovModel from pynlpl.textprocessors import Windower sentences = ["This is a sentence .".split(' '),"Moreover , this sentence is a test .".split(' ')] class FrequencyListTest(unittest.TestCase): def test_freqlist_casesens(self): """Frequency List (case sensitive)""" global sentences f= FrequencyList() for sentence in sentences: f.append(sentence) self.assertTrue(( f['sentence'] == 2 and f['this'] == 1 and f['test'] == 1 )) def test_freqlist_caseinsens(self): """Frequency List (case insensitive)""" global sentences f= FrequencyList(None, False) for sentence in sentences: f.append(sentence) self.assertTrue(( f['sentence'] == 2 and f['this'] == 2 and f['Test'] == 1 )) def test_freqlist_tokencount(self): """Frequency List (count tokens)""" global sentences f= FrequencyList() for sentence in sentences: f.append(sentence) self.assertEqual(f.total,13) def test_freqlist_typecount(self): """Frequency List (count types)""" global sentences f= FrequencyList() for sentence in sentences: f.append(sentence) self.assertEqual(len(f),9) class BigramFrequencyListTest(unittest.TestCase): def test_freqlist_casesens(self): """Bigram Frequency List (case sensitive)""" global sentences f= FrequencyList() for sentence in sentences: f.append(Windower(sentence,2)) self.assertTrue(( f[('is','a')] == 2 and f[('This','is')] == 1)) def test_freqlist_caseinsens(self): """Bigram Frequency List (case insensitive)""" global sentences f= FrequencyList(None, False) for sentence in sentences: f.append(Windower(sentence,2)) self.assertTrue(( f[('is','a')] == 2 and f[('this','is')] == 1)) class HMMTest(unittest.TestCase): def test_viterbi(self): """Viterbi decode run on Hidden Markov Model""" hmm = HiddenMarkovModel('start') hmm.settransitions('start',{'rainy':0.6,'sunny':0.4}) hmm.settransitions('rainy',{'rainy':0.7,'sunny':0.3}) hmm.settransitions('sunny',{'rainy':0.4,'sunny':0.6}) hmm.setemission('rainy', {'walk': 0.1, 'shop': 0.4, 'clean': 0.5}) hmm.setemission('sunny', {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}) observations = ['walk', 'shop', 'clean'] prob, path = hmm.viterbi(observations) self.assertEqual( path, ['sunny', 'rainy', 'rainy']) self.assertEqual( prob, 0.01344) if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/formats.py0000755000175000001440000000717512445064173020231 0ustar proyconusers00000000000000import sys import os import unittest sys.path.append(sys.path[0] + '/../../') os.environ['PYTHONPATH'] = sys.path[0] + '/../../' from pynlpl.formats.timbl import TimblOutput if sys.version < '3': from StringIO import StringIO else: from io import StringIO class TimblTest(unittest.TestCase): def test1_simple(self): """Timbl - simple output""" s = StringIO("a b ? c\nc d ? e\n") for i, (features, referenceclass, predictedclass, distribution, distance) in enumerate(TimblOutput(s)): if i == 0: self.assertEqual(features,['a','b']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'c') self.assertEqual(distribution,None) self.assertEqual(distance,None) elif i == 1: self.assertEqual(features,['c','d']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'e') self.assertEqual(distribution,None) self.assertEqual(distance,None) def test2_db(self): """Timbl - Distribution output""" s = StringIO("a c ? c { c 1.00000, d 1.00000 }\na b ? c { c 1.00000 }\na d ? 
c { c 1.00000, e 1.00000 }") for i, (features, referenceclass, predictedclass, distribution, distance) in enumerate(TimblOutput(s)): if i == 0: self.assertEqual(features,['a','c']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'c') self.assertEqual(distribution['c'], 0.5) self.assertEqual(distribution['d'], 0.5) self.assertEqual(distance,None) elif i == 1: self.assertEqual(features,['a','b']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'c') self.assertEqual(distribution['c'], 1) self.assertEqual(distance,None) elif i == 2: self.assertEqual(features,['a','d']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'c') self.assertEqual(distribution['c'], 0.5) self.assertEqual(distribution['e'], 0.5) self.assertEqual(distance,None) def test3_dbdi(self): """Timbl - Distribution + Distance output""" s = StringIO("a c ? c { c 1.00000, d 1.00000 } 1.0000000000000\na b ? c { c 1.00000 } 0.0000000000000\na d ? c { c 1.00000, e 1.00000 } 1.0000000000000") for i, (features, referenceclass, predictedclass, distribution, distance) in enumerate(TimblOutput(s)): if i == 0: self.assertEqual(features,['a','c']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'c') self.assertEqual(distribution['c'], 0.5) self.assertEqual(distribution['d'], 0.5) self.assertEqual(distance,1.0) elif i == 1: self.assertEqual(features,['a','b']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'c') self.assertEqual(distribution['c'], 1) self.assertEqual(distance,0.0) elif i == 2: self.assertEqual(features,['a','d']) self.assertEqual(referenceclass,'?') self.assertEqual(predictedclass,'c') self.assertEqual(distribution['c'], 0.5) self.assertEqual(distribution['e'], 0.5) self.assertEqual(distance,1.0) PyNLPl-1.1.2/pynlpl/tests/textprocessors.py0000755000175000001440000001234712445064173021662 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for Text Processors # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys import os import unittest from pynlpl.textprocessors import Windower, tokenise, strip_accents, calculate_overlap text = "This is a test .".split(" ") class WindowerTest(unittest.TestCase): def test_unigrams(self): """Windower (unigrams)""" global text result = list(iter(Windower(text,1))) self.assertEqual(result,[("This",),("is",),("a",),("test",),(".",)]) def test_bigrams(self): """Windower (bigrams)""" global text result = list(iter(Windower(text,2))) self.assertEqual(result,[("","This"),("This","is"),("is","a"),("a","test"),("test","."),(".","")]) def test_trigrams(self): """Windower (trigrams)""" global text result = list(iter(Windower(text,3))) self.assertEqual(result,[('', '', 'This'), ('', 'This', 'is'), ('This', 'is', 'a'), ('is', 'a', 'test'), ('a', 'test', '.'), ('test', '.', ''), ('.', '', '')]) def test_trigrams_word(self): """Windower (trigrams) (on single word)""" global text result = list(iter(Windower(["hi"],3))) self.assertEqual(result,[('', '', 'hi'), ('', 'hi', ''), ('hi', '', '')]) class TokenizerTest(unittest.TestCase): def test_tokenize(self): """Tokeniser - One sentence""" 
self.assertEqual(tokenise("This is a test."),"This is a test .".split(" ")) def test_tokenize_sentences(self): """Tokeniser - Multiple sentences""" self.assertEqual(tokenise("This, is the first sentence! This is the second sentence."),"This , is the first sentence ! This is the second sentence .".split(" ")) def test_tokenize_noeos(self): """Tokeniser - Missing EOS Marker""" self.assertEqual(tokenise("This is a test"),"This is a test".split(" ")) def test_tokenize_url(self): """Tokeniser - URL""" global text self.assertEqual(tokenise("I go to http://www.google.com when I need to find something."),"I go to http://www.google.com when I need to find something .".split(" ")) def test_tokenize_mail(self): """Tokeniser - Mail""" global text self.assertEqual(tokenise("Write me at proycon@anaproy.nl."),"Write me at proycon@anaproy.nl .".split(" ")) def test_tokenize_numeric(self): """Tokeniser - numeric""" global text self.assertEqual(tokenise("I won € 300,000.00!"),"I won € 300,000.00 !".split(" ")) def test_tokenize_quotes(self): """Tokeniser - quotes""" global text self.assertEqual(tokenise("Hij zegt: \"Wat een lief baby'tje is dat!\""),"Hij zegt : \" Wat een lief baby'tje is dat ! \"".split(" ")) class StripAccentTest(unittest.TestCase): def test_strip_accents(self): """Strip Accents""" self.assertEqual(strip_accents("áàâãāĝŭçñßt"),"aaaaagucnt") class OverlapTest(unittest.TestCase): def test_overlap_subset(self): """Overlap - Subset""" h = [4,5,6,7] n = [5,6] self.assertEqual(calculate_overlap(h,n), [((5,6),0)]) def test_overlap_equal(self): """Overlap - Equal""" h = [4,5,6,7] n = [4,5,6,7] self.assertEqual(calculate_overlap(h,n), [((4,5,6,7),2)]) def test_overlap_none(self): """Overlap - None""" h = [4,5,6,7] n = [8,9,10] self.assertEqual(calculate_overlap(h,n), []) def test_overlap_leftpartial(self): """Overlap - Left partial""" h = [4,5,6,7] n = [1,2,3,4,5] self.assertEqual(calculate_overlap(h,n), [((4,5),-1)] ) def test_overlap_rightpartial(self): """Overlap - Right partial""" h = [4,5,6,7] n = [6,7,8,9] self.assertEqual(calculate_overlap(h,n), [((6,7),1)] ) def test_overlap_leftpartial2(self): """Overlap - Left partial (2)""" h = [1,2,3,4,5] n = [0,1,2] self.assertEqual(calculate_overlap(h,n), [((1,2),-1)] ) def test_overlap_rightpartial2(self): """Overlap - Right partial (2)""" h = [1,2,3,4,5] n = [4,5,6] self.assertEqual(calculate_overlap(h,n), [((4,5),1)] ) def test_overlap_leftfull(self): """Overlap - Left full""" h = [1,2,3,4,5] n = [1,2] self.assertEqual(calculate_overlap(h,n), [((1,2),-1)] ) def test_overlap_rightfull(self): """Overlap - Right full""" h = [1,2,3,4,5] n = [4,5] self.assertEqual(calculate_overlap(h,n), [((4,5),1)] ) if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/folia.py0000755000175000001440000113064413024723325017642 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for FoLiA # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u, isstring import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr 
stdout = sys.stdout import sys import os import unittest import io import gzip import bz2 import re FOLIARELEASE = "v1.4.0.53" if os.path.exists('../../FoLiA'): FOLIAPATH = '../../FoLiA/' elif os.path.exists('../FoLiA'): FOLIAPATH = '../FoLiA/' else: FOLIAPATH = 'FoLiA' print("Downloading FoLiA",file=sys.stderr) os.system("git clone https://github.com/proycon/folia.git FoLiA && cd FoLiA && git checkout tags/" + FOLIARELEASE + ' && cd ..') if 'TMPDIR' in os.environ: TMPDIR = os.environ['TMPDIR'] else: TMPDIR = '/tmp/' if sys.version < '3': from StringIO import StringIO else: from io import StringIO, BytesIO from datetime import datetime import lxml.objectify from pynlpl.formats import folia if folia.LXE: from lxml import etree as ElementTree else: import xml.etree.cElementTree as ElementTree def xmlcheck(xml,expect): #obj1 = lxml.objectify.fromstring(expect) #expect = lxml.etree.tostring(obj1) f = io.open(os.path.join(TMPDIR, 'foliatest.fragment.expect.xml'),'w',encoding='utf-8') f.write(expect) f.close() f = io.open(os.path.join(TMPDIR , 'foliatest.fragment.out.xml'),'w', encoding='utf-8') f.write(xml) f.close() retcode = os.system('xmldiff -c ' + os.path.join(TMPDIR, 'foliatest.fragment.expect.xml') + ' ' + os.path.join(TMPDIR,'foliatest.fragment.out.xml')) passed = (retcode == 0) #obj2 = lxml.objectify.fromstring(xml) #xml = lxml.etree.tostring(obj2) #passed = (expect == xml) if not passed: print("XML fragments don't match:",file=stderr) print("--------------------------REFERENCE-------------------------------------",file=stderr) print(expect,file=stderr) print("--------------------------ACTUAL RESULT---------------------------------",file=stderr) print(xml,file=stderr) print("------------------------------------------------------------------------",file=stderr) return passed class Test1Read(unittest.TestCase): def test1_readfromfile(self): """Reading from file""" global FOLIAEXAMPLE #write example to file f = io.open(os.path.join(TMPDIR,'foliatest.xml'),'w',encoding='utf-8') f.write(FOLIAEXAMPLE) f.close() doc = folia.Document(file=os.path.join(TMPDIR,'foliatest.xml')) self.assertTrue(isinstance(doc,folia.Document)) #sanity check: reading from file must yield the exact same data as reading from string doc2 = folia.Document(string=FOLIAEXAMPLE) self.assertEqual( doc, doc2) def test1a_readfromfile(self): """Reading from GZ file""" global FOLIAEXAMPLE #write example to file f = gzip.GzipFile(os.path.join(TMPDIR,'foliatest.xml.gz'),'w') f.write(FOLIAEXAMPLE.encode('utf-8')) f.close() doc = folia.Document(file=os.path.join(TMPDIR,'foliatest.xml.gz')) self.assertTrue(isinstance(doc,folia.Document)) #sanity check: reading from file must yield the exact same data as reading from string doc2 = folia.Document(string=FOLIAEXAMPLE) self.assertEqual( doc, doc2) def test1b_readfromfile(self): """Reading from BZ2 file""" global FOLIAEXAMPLE #write example to file f = bz2.BZ2File(os.path.join(TMPDIR,'foliatest.xml.bz2'),'w') f.write(FOLIAEXAMPLE.encode('utf-8')) f.close() doc = folia.Document(file=os.path.join(TMPDIR,'foliatest.xml.bz2')) self.assertTrue(isinstance(doc,folia.Document)) #sanity check: reading from file must yield the exact same data as reading from string doc2 = folia.Document(string=FOLIAEXAMPLE) self.assertEqual( doc, doc2) def test2_readfromstring(self): """Reading from string (unicode)""" global FOLIAEXAMPLE doc = folia.Document(string=FOLIAEXAMPLE) self.assertTrue(isinstance(doc,folia.Document)) def test2b_readfromstring(self): """Reading from string (bytes)""" global FOLIAEXAMPLE
doc = folia.Document(string=FOLIAEXAMPLE.encode('utf-8')) self.assertTrue(isinstance(doc,folia.Document)) def test3_readfromstring(self): """Reading from pre-parsed XML tree (as unicode(Py2)/str(Py3) obj)""" global FOLIAEXAMPLE if sys.version < '3': doc = folia.Document(tree=ElementTree.parse(StringIO(FOLIAEXAMPLE.encode('utf-8')))) else: doc = folia.Document(tree=ElementTree.parse(BytesIO(FOLIAEXAMPLE.encode('utf-8')))) self.assertTrue(isinstance(doc,folia.Document)) def test4_readdcoi(self): """Reading D-Coi file""" global DCOIEXAMPLE doc = folia.Document(string=DCOIEXAMPLE) #doc = folia.Document(tree=lxml.etree.parse(StringIO(DCOIEXAMPLE.encode('iso-8859-15')))) self.assertTrue(isinstance(doc,folia.Document)) self.assertEqual(len(list(doc.words())),1465) class Test2Sanity(unittest.TestCase): def setUp(self): self.doc = folia.Document(string=FOLIAEXAMPLE) def test000_count_text(self): """Sanity check - One text """ self.assertEqual( len(self.doc), 1) self.assertTrue( isinstance( self.doc[0], folia.Text )) def test001_count_paragraphs(self): """Sanity check - Paragraph count""" self.assertEqual( len(list(self.doc.paragraphs())) , 3) def test002_count_sentences(self): """Sanity check - Sentences count""" self.assertEqual( len(list(self.doc.sentences())) , 17) def test003a_count_words(self): """Sanity check - Word count""" self.assertEqual( len(list(self.doc.words())) , 190) def test003b_iter_words(self): """Sanity check - Words""" self.assertEqual( [x.id for x in self.doc.words() ], ['WR-P-E-J-0000000001.head.1.s.1.w.1', 'WR-P-E-J-0000000001.p.1.s.1.w.1', 'WR-P-E-J-0000000001.p.1.s.1.w.2', 'WR-P-E-J-0000000001.p.1.s.1.w.3', 'WR-P-E-J-0000000001.p.1.s.1.w.4', 'WR-P-E-J-0000000001.p.1.s.1.w.5', 'WR-P-E-J-0000000001.p.1.s.1.w.6', 'WR-P-E-J-0000000001.p.1.s.1.w.7', 'WR-P-E-J-0000000001.p.1.s.1.w.8', 'WR-P-E-J-0000000001.p.1.s.2.w.1', 'WR-P-E-J-0000000001.p.1.s.2.w.2', 'WR-P-E-J-0000000001.p.1.s.2.w.3', 'WR-P-E-J-0000000001.p.1.s.2.w.4', 'WR-P-E-J-0000000001.p.1.s.2.w.5', 'WR-P-E-J-0000000001.p.1.s.2.w.6', 'WR-P-E-J-0000000001.p.1.s.2.w.7', 'WR-P-E-J-0000000001.p.1.s.2.w.8', 'WR-P-E-J-0000000001.p.1.s.2.w.9', 'WR-P-E-J-0000000001.p.1.s.2.w.10', 'WR-P-E-J-0000000001.p.1.s.2.w.11', 'WR-P-E-J-0000000001.p.1.s.2.w.12', 'WR-P-E-J-0000000001.p.1.s.2.w.13', 'WR-P-E-J-0000000001.p.1.s.2.w.14', 'WR-P-E-J-0000000001.p.1.s.2.w.15', 'WR-P-E-J-0000000001.p.1.s.2.w.16', 'WR-P-E-J-0000000001.p.1.s.2.w.17', 'WR-P-E-J-0000000001.p.1.s.2.w.18', 'WR-P-E-J-0000000001.p.1.s.2.w.19', 'WR-P-E-J-0000000001.p.1.s.2.w.20', 'WR-P-E-J-0000000001.p.1.s.2.w.21', 'WR-P-E-J-0000000001.p.1.s.2.w.22', 'WR-P-E-J-0000000001.p.1.s.2.w.23', 'WR-P-E-J-0000000001.p.1.s.2.w.24-25', 'WR-P-E-J-0000000001.p.1.s.2.w.26', 'WR-P-E-J-0000000001.p.1.s.2.w.27', 'WR-P-E-J-0000000001.p.1.s.2.w.28', 'WR-P-E-J-0000000001.p.1.s.2.w.29', 'WR-P-E-J-0000000001.p.1.s.3.w.1', 'WR-P-E-J-0000000001.p.1.s.3.w.2', 'WR-P-E-J-0000000001.p.1.s.3.w.3', 'WR-P-E-J-0000000001.p.1.s.3.w.4', 'WR-P-E-J-0000000001.p.1.s.3.w.5', 'WR-P-E-J-0000000001.p.1.s.3.w.6', 'WR-P-E-J-0000000001.p.1.s.3.w.7', 'WR-P-E-J-0000000001.p.1.s.3.w.8', 'WR-P-E-J-0000000001.p.1.s.3.w.9', 'WR-P-E-J-0000000001.p.1.s.3.w.10', 'WR-P-E-J-0000000001.p.1.s.3.w.11', 'WR-P-E-J-0000000001.p.1.s.3.w.12', 'WR-P-E-J-0000000001.p.1.s.3.w.13', 'WR-P-E-J-0000000001.p.1.s.3.w.14', 'WR-P-E-J-0000000001.p.1.s.3.w.15', 'WR-P-E-J-0000000001.p.1.s.3.w.16', 'WR-P-E-J-0000000001.p.1.s.3.w.17', 'WR-P-E-J-0000000001.p.1.s.3.w.18', 'WR-P-E-J-0000000001.p.1.s.3.w.19', 'WR-P-E-J-0000000001.p.1.s.3.w.20', 
'WR-P-E-J-0000000001.p.1.s.3.w.21', 'WR-P-E-J-0000000001.p.1.s.4.w.1', 'WR-P-E-J-0000000001.p.1.s.4.w.2', 'WR-P-E-J-0000000001.p.1.s.4.w.3', 'WR-P-E-J-0000000001.p.1.s.4.w.4', 'WR-P-E-J-0000000001.p.1.s.4.w.5', 'WR-P-E-J-0000000001.p.1.s.4.w.6', 'WR-P-E-J-0000000001.p.1.s.4.w.7', 'WR-P-E-J-0000000001.p.1.s.4.w.8', 'WR-P-E-J-0000000001.p.1.s.4.w.9', 'WR-P-E-J-0000000001.p.1.s.4.w.10', 'WR-P-E-J-0000000001.p.1.s.5.w.1', 'WR-P-E-J-0000000001.p.1.s.5.w.2', 'WR-P-E-J-0000000001.p.1.s.5.w.3', 'WR-P-E-J-0000000001.p.1.s.5.w.4', 'WR-P-E-J-0000000001.p.1.s.5.w.5', 'WR-P-E-J-0000000001.p.1.s.5.w.6', 'WR-P-E-J-0000000001.p.1.s.5.w.7', 'WR-P-E-J-0000000001.p.1.s.5.w.8', 'WR-P-E-J-0000000001.p.1.s.5.w.9', 'WR-P-E-J-0000000001.p.1.s.5.w.10', 'WR-P-E-J-0000000001.p.1.s.5.w.11', 'WR-P-E-J-0000000001.p.1.s.5.w.12', 'WR-P-E-J-0000000001.p.1.s.5.w.13', 'WR-P-E-J-0000000001.p.1.s.5.w.14', 'WR-P-E-J-0000000001.p.1.s.5.w.15', 'WR-P-E-J-0000000001.p.1.s.5.w.16', 'WR-P-E-J-0000000001.p.1.s.5.w.17', 'WR-P-E-J-0000000001.p.1.s.5.w.18', 'WR-P-E-J-0000000001.p.1.s.5.w.19', 'WR-P-E-J-0000000001.p.1.s.5.w.20', 'WR-P-E-J-0000000001.p.1.s.5.w.21', 'WR-P-E-J-0000000001.p.1.s.6.w.1', 'WR-P-E-J-0000000001.p.1.s.6.w.2', 'WR-P-E-J-0000000001.p.1.s.6.w.3', 'WR-P-E-J-0000000001.p.1.s.6.w.4', 'WR-P-E-J-0000000001.p.1.s.6.w.5', 'WR-P-E-J-0000000001.p.1.s.6.w.6', 'WR-P-E-J-0000000001.p.1.s.6.w.7', 'WR-P-E-J-0000000001.p.1.s.6.w.8', 'WR-P-E-J-0000000001.p.1.s.6.w.9', 'WR-P-E-J-0000000001.p.1.s.6.w.10', 'WR-P-E-J-0000000001.p.1.s.6.w.11', 'WR-P-E-J-0000000001.p.1.s.6.w.12', 'WR-P-E-J-0000000001.p.1.s.6.w.13', 'WR-P-E-J-0000000001.p.1.s.6.w.14', 'WR-P-E-J-0000000001.p.1.s.6.w.15', 'WR-P-E-J-0000000001.p.1.s.6.w.16', 'WR-P-E-J-0000000001.p.1.s.6.w.17', 'WR-P-E-J-0000000001.p.1.s.6.w.18', 'WR-P-E-J-0000000001.p.1.s.6.w.19', 'WR-P-E-J-0000000001.p.1.s.6.w.20', 'WR-P-E-J-0000000001.p.1.s.6.w.21', 'WR-P-E-J-0000000001.p.1.s.6.w.22', 'WR-P-E-J-0000000001.p.1.s.6.w.23', 'WR-P-E-J-0000000001.p.1.s.6.w.24', 'WR-P-E-J-0000000001.p.1.s.6.w.25', 'WR-P-E-J-0000000001.p.1.s.6.w.26', 'WR-P-E-J-0000000001.p.1.s.6.w.27', 'WR-P-E-J-0000000001.p.1.s.6.w.28', 'WR-P-E-J-0000000001.p.1.s.6.w.29', 'WR-P-E-J-0000000001.p.1.s.6.w.30', 'WR-P-E-J-0000000001.p.1.s.6.w.31', 'WR-P-E-J-0000000001.p.1.s.6.w.32', 'WR-P-E-J-0000000001.p.1.s.6.w.33', 'WR-P-E-J-0000000001.p.1.s.6.w.34', 'WR-P-E-J-0000000001.p.1.s.7.w.1', 'WR-P-E-J-0000000001.p.1.s.7.w.2', 'WR-P-E-J-0000000001.p.1.s.7.w.3', 'WR-P-E-J-0000000001.p.1.s.7.w.4', 'WR-P-E-J-0000000001.p.1.s.7.w.5', 'WR-P-E-J-0000000001.p.1.s.7.w.6', 'WR-P-E-J-0000000001.p.1.s.7.w.7', 'WR-P-E-J-0000000001.p.1.s.7.w.8', 'WR-P-E-J-0000000001.p.1.s.7.w.9', 'WR-P-E-J-0000000001.p.1.s.7.w.10', 'WR-P-E-J-0000000001.p.1.s.8.w.1', 'WR-P-E-J-0000000001.p.1.s.8.w.2', 'WR-P-E-J-0000000001.p.1.s.8.w.3', 'WR-P-E-J-0000000001.p.1.s.8.w.4', 'WR-P-E-J-0000000001.p.1.s.8.w.5', 'WR-P-E-J-0000000001.p.1.s.8.w.6', 'WR-P-E-J-0000000001.p.1.s.8.w.7', 'WR-P-E-J-0000000001.p.1.s.8.w.8', 'WR-P-E-J-0000000001.p.1.s.8.w.9', 'WR-P-E-J-0000000001.p.1.s.8.w.10', 'WR-P-E-J-0000000001.p.1.s.8.w.11', 'WR-P-E-J-0000000001.p.1.s.8.w.12', 'WR-P-E-J-0000000001.p.1.s.8.w.13', 'WR-P-E-J-0000000001.p.1.s.8.w.14', 'WR-P-E-J-0000000001.p.1.s.8.w.15', 'WR-P-E-J-0000000001.p.1.s.8.w.16', 'WR-P-E-J-0000000001.p.1.s.8.w.17', 'entry.1.term.1.w.1', 'sandbox.list.1.listitem.1.s.1.w.1', 'sandbox.list.1.listitem.1.s.1.w.2', 'sandbox.list.1.listitem.2.s.1.w.1', 'sandbox.list.1.listitem.2.s.1.w.2', 'sandbox.figure.1.caption.s.1.w.1', 'sandbox.figure.1.caption.s.1.w.2', 
'WR-P-E-J-0000000001.sandbox.2.s.1.w.1', 'WR-P-E-J-0000000001.sandbox.2.s.1.w.2', 'WR-P-E-J-0000000001.sandbox.2.s.1.w.3', 'WR-P-E-J-0000000001.sandbox.2.s.1.w.4', 'WR-P-E-J-0000000001.sandbox.2.s.1.w.5', 'WR-P-E-J-0000000001.sandbox.2.s.1.w.6', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.1', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.2', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.3', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.4', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.5', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.6', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.7', 'WR-P-E-J-0000000001.sandbox.2.s.2.w.8', 'WR-P-E-J-0000000001.sandbox.2.s.3.w.1', 'WR-P-E-J-0000000001.sandbox.2.s.3.w.2', 'WR-P-E-J-0000000001.sandbox.2.s.3.w.3', 'WR-P-E-J-0000000001.sandbox.2.s.3.w.4', 'WR-P-E-J-0000000001.sandbox.2.s.3.w.6', 'example.table.1.w.1', 'example.table.1.w.2', 'example.table.1.w.3', 'example.table.1.w.4', 'example.table.1.w.5', 'example.table.1.w.6', 'example.table.1.w.7', 'example.table.1.w.8', 'example.table.1.w.9', 'example.table.1.w.10', 'example.table.1.w.11', 'example.table.1.w.12', 'example.table.1.w.13', 'example.table.1.w.14'] ) def test004_first_word(self): """Sanity check - First word""" #grab first word w = self.doc.words(0) # shortcut for doc.words()[0] self.assertTrue( isinstance(w, folia.Word) ) self.assertEqual( w.id , 'WR-P-E-J-0000000001.head.1.s.1.w.1' ) self.assertEqual( w.text() , "Stemma" ) self.assertEqual( str(w) , "Stemma" ) #should be unicode object also in Py2! if sys.version < '3': self.assertEqual( unicode(w) , "Stemma" ) def test005_last_word(self): """Sanity check - Last word""" #grab last word w = self.doc.words(-1) # shortcut for doc.words()[0] self.assertTrue( isinstance(w, folia.Word) ) self.assertEqual( w.id , "example.table.1.w.14" ) self.assertEqual( w.text() , "University" ) self.assertEqual( str(w) , "University" ) def test006_second_sentence(self): """Sanity check - Sentence""" #grab second sentence s = self.doc.sentences(1) self.assertTrue( isinstance(s, folia.Sentence) ) self.assertEqual( s.id, 'WR-P-E-J-0000000001.p.1.s.1' ) self.assertFalse( s.hastext() ) self.assertEqual( str(s), "Stemma is een ander woord voor stamboom ." 
) def test006b_sentencetest(self): """Sanity check - Sentence text (including retaining tokenisation)""" #grab second sentence s = self.doc['WR-P-E-J-0000000001.p.1.s.5'] self.assertTrue( isinstance(s, folia.Sentence) ) self.assertFalse( s.hastext() ) self.assertEqual( s.text(), "De andere handschriften krijgen ook een letter die verband kan houden met hun plaats van oorsprong óf plaats van bewaring.") self.assertEqual( s.text('current',True), "De andere handschriften krijgen ook een letter die verband kan houden met hun plaats van oorsprong óf plaats van bewaring .") #not detokenised self.assertEqual( s.toktext(), "De andere handschriften krijgen ook een letter die verband kan houden met hun plaats van oorsprong óf plaats van bewaring .") #just an alias for the above def test007_index(self): """Sanity check - Index""" #grab something using the index w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] self.assertTrue( isinstance(w, folia.Word) ) self.assertEqual( self.doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] , self.doc.index['WR-P-E-J-0000000001.p.1.s.2.w.7'] ) self.assertEqual( w.id , 'WR-P-E-J-0000000001.p.1.s.2.w.7' ) self.assertEqual( w.text() , "stamboom" ) def test008_division(self): """Sanity check - Division + head""" #grab something using the index div = self.doc['WR-P-E-J-0000000001.div0.1'] self.assertTrue( isinstance(div, folia.Division) ) self.assertEqual( div.head() , self.doc['WR-P-E-J-0000000001.head.1'] ) self.assertEqual( len(div.head()) ,1 ) #Head contains one element (one sentence) def test009_pos(self): """Sanity check - Token Annotation - Pos""" #grab first word w = self.doc.words(0) self.assertEqual( w.annotation(folia.PosAnnotation), next(w.select(folia.PosAnnotation)) ) #w.annotation() selects the single first annotation of that type, select is the generic method to retrieve pretty much everything self.assertTrue( isinstance(w.annotation(folia.PosAnnotation), folia.PosAnnotation) ) self.assertTrue( issubclass(folia.PosAnnotation, folia.AbstractTokenAnnotation) ) self.assertEqual( w.annotation(folia.PosAnnotation).cls, 'N(soort,ev,basis,onz,stan)' ) #cls is used everywhere instead of class, since class is a reserved keyword in python self.assertEqual( w.pos(),'N(soort,ev,basis,onz,stan)' ) #w.pos() is just a direct shortcut for getting the class self.assertEqual( w.annotation(folia.PosAnnotation).set, 'https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn' ) self.assertEqual( w.annotation(folia.PosAnnotation).annotator, 'frog' ) self.assertEqual( w.annotation(folia.PosAnnotation).annotatortype, folia.AnnotatorType.AUTO ) def test010_lemma(self): """Sanity check - Token Annotation - Lemma""" #grab first word w = self.doc.words(0) self.assertEqual( w.annotation(folia.LemmaAnnotation), w.annotation(folia.LemmaAnnotation) ) #w.lemma() is just a shortcut self.assertEqual( w.annotation(folia.LemmaAnnotation), next(w.select(folia.LemmaAnnotation)) ) #w.annotation() selects the single first annotation of that type, select is the generic method to retrieve pretty much everything self.assertTrue( isinstance(w.annotation(folia.LemmaAnnotation), folia.LemmaAnnotation)) self.assertEqual( w.annotation(folia.LemmaAnnotation).cls, 'stemma' ) self.assertEqual( w.lemma(),'stemma' ) #w.lemma() is just a direct shortcut for getting the class self.assertEqual( w.annotation(folia.LemmaAnnotation).set, 'lemmas-nl' ) self.assertEqual( w.annotation(folia.LemmaAnnotation).annotator, 'tadpole' ) self.assertEqual( w.annotation(folia.LemmaAnnotation).annotatortype, 
folia.AnnotatorType.AUTO ) def test011_tokenannot_notexist(self): """Sanity check - Token Annotation - Non-existing element""" #grab first word w = self.doc.words(0) self.assertEqual( w.count(folia.SenseAnnotation), 0) #list self.assertRaises( folia.NoSuchAnnotation, w.annotation, folia.SenseAnnotation) #exception def test012_correction(self): """Sanity check - Correction - Text""" w = self.doc['WR-P-E-J-0000000001.p.1.s.6.w.31'] c = w.annotation(folia.Correction) self.assertEqual( len(list(c.new())), 1) self.assertEqual( len(list(c.original())), 1) self.assertEqual( w.text(), 'vierkante') self.assertEqual( c.new(0), 'vierkante') self.assertEqual( c.original(0) , 'vierkant') def test013_correction(self): """Sanity check - Correction - Token Annotation""" w = self.doc['WR-P-E-J-0000000001.p.1.s.6.w.32'] c = w.annotation(folia.Correction) self.assertEqual( len(list(c.new())), 1) self.assertEqual( len(list(c.original())), 1) self.assertEqual( w.annotation(folia.LemmaAnnotation).cls , 'haak') self.assertEqual( c.new(0).cls, 'haak') self.assertEqual( c.original(0).cls, 'haaak') def test014_correction(self): """Sanity check - Correction - Suggestions (text)""" #grab first word w = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.14'] c = w.annotation(folia.Correction) self.assertTrue( isinstance(c, folia.Correction) ) self.assertEqual( len(list(c.suggestions())), 2 ) self.assertEqual( str(c.suggestions(0).text()), 'twijfelachtige' ) self.assertEqual( str(c.suggestions(1).text()), 'ongewisse' ) def test015_parenttest(self): """Sanity check - Checking if all elements know who's their daddy""" def check(parent, indent = ''): for child in parent: if isinstance(child, folia.AbstractElement) and not (isinstance(parent, folia.AbstractSpanAnnotation) and (isinstance(child, folia.Word) or isinstance(child, folia.Morpheme))): #words and morphemes are exempted in abstractspanannotation #print indent + repr(child), child.id, child.cls self.assertTrue( child.parent is parent) check(child, indent + ' ') return True self.assertTrue( check(self.doc.data[0],' ') ) def test016a_description(self): """Sanity Check - Description""" w = self.doc['WR-P-E-J-0000000001.p.1.s.1.w.6'] self.assertEqual( w.description(), 'Dit woordje is een voorzetsel, het is maar dat je het weet...') def test016b_description(self): """Sanity Check - Error on non-existing description""" w = self.doc['WR-P-E-J-0000000001.p.1.s.1.w.7'] self.assertRaises( folia.NoSuchAnnotation, w.description) def test017_gap(self): """Sanity Check - Gap""" gap = self.doc["WR-P-E-J-0000000001.gap.1"] self.assertEqual( gap.content().strip()[:11], 'De tekst is') self.assertEqual( gap.cls, 'backmatter') self.assertEqual( gap.description(), 'Backmatter') def test018_subtokenannot(self): """Sanity Check - Subtoken annotation (part of speech)""" w= self.doc['WR-P-E-J-0000000001.p.1.s.2.w.5'] p = w.annotation(folia.PosAnnotation) self.assertEqual( p.feat('wvorm'), 'pv' ) self.assertEqual( p.feat('pvtijd'), 'tgw' ) self.assertEqual( p.feat('pvagr'), 'met-t' ) def test019_alignment(self): """Sanity Check - Alignment in same document""" w = self.doc['WR-P-E-J-0000000001.p.1.s.3.w.10'] a = w.annotation(folia.Alignment) target = next(a.resolve()) self.assertEqual( target, self.doc['WR-P-E-J-0000000001.p.1.s.3.w.5'] ) def test020a_spanannotation(self): """Sanity Check - Span Annotation (Syntax)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.1'] l = s.annotation(folia.SyntaxLayer) self.assertTrue( isinstance(l[0], folia.SyntacticUnit ) ) self.assertEqual( l[0].cls, 'sentence' ) 
self.assertEqual( l[0][0].cls, 'subject' ) self.assertEqual( l[0][0].text(), 'Stemma' ) self.assertEqual( l[0][1].cls, 'verb' ) self.assertEqual( l[0][2].cls, 'predicate' ) self.assertEqual( l[0][2][0].cls, 'np' ) self.assertEqual( l[0][2][1].cls, 'pp' ) self.assertEqual( l[0][2][1].text(), 'voor stamboom' ) self.assertEqual( l[0][2].text(), 'een ander woord voor stamboom' ) def test020b_spanannotation(self): """Sanity Check - Span Annotation (Chunking)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.1'] l = s.annotation(folia.ChunkingLayer) self.assertTrue( isinstance(l[0], folia.Chunk ) ) self.assertEqual( l[0].text(), 'een ander woord' ) self.assertEqual( l[1].text(), 'voor stamboom' ) def test020c_spanannotation(self): """Sanity Check - Span Annotation (Entities)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.1'] l = s.annotation(folia.EntitiesLayer) self.assertTrue( isinstance(l[0], folia.Entity) ) self.assertEqual( l[0].text(), 'ander woord' ) def test020d_spanannotation(self): """Sanity Check - Span Annotation (Dependencies)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.1'] l = s.annotation(folia.DependenciesLayer) self.assertTrue( isinstance(l[0], folia.Dependency) ) self.assertEqual( l[0].head().text(), 'is' ) self.assertEqual( l[0].dependent().text(), 'Stemma' ) self.assertEqual( l[0].cls, 'su' ) self.assertTrue( isinstance(l[1], folia.Dependency) ) self.assertEqual( l[1].head().text(), 'is' ) self.assertEqual( l[1].dependent().text(), 'woord' ) self.assertEqual( l[1].cls,'predc' ) self.assertTrue( isinstance(l[2], folia.Dependency) ) self.assertEqual( l[2].head().text(), 'woord' ) self.assertEqual( l[2].dependent().text(), 'een' ) self.assertEqual( l[2].cls,'det' ) self.assertTrue( isinstance(l[3], folia.Dependency) ) self.assertEqual( l[3].head().text(), 'woord' ) self.assertEqual( l[3].dependent().text(), 'ander' ) self.assertEqual( l[3].cls,'mod' ) self.assertTrue( isinstance(l[4], folia.Dependency) ) self.assertEqual( l[4].head().text(), 'woord' ) self.assertEqual( l[4].dependent().text(), 'voor' ) self.assertEqual( l[4].cls,'mod' ) self.assertTrue( isinstance(l[5], folia.Dependency) ) self.assertEqual( l[5].head().text(), 'voor' ) self.assertEqual( l[5].dependent().text(), 'stamboom' ) self.assertEqual( l[5].cls,'obj1' ) def test020e_spanannotation(self): """Sanity Check - Span Annotation (Timedevent)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.1'] l = s.annotation(folia.TimingLayer) self.assertTrue( isinstance(l[0], folia.TimeSegment ) ) self.assertEqual( l[0].text(), 'een ander woord' ) self.assertEqual( l[1].cls, 'cough' ) self.assertEqual( l[2].text(), 'voor stamboom' ) def test020f_spanannotation(self): """Sanity Check - Co-Reference""" div = self.doc["WR-P-E-J-0000000001.div0.1"] deplayer = div.annotation(folia.DependenciesLayer) deps = list(deplayer.annotations(folia.Dependency)) self.assertEqual( deps[0].cls, 'su' ) self.assertEqual( deps[1].cls, 'predc' ) self.assertEqual( deps[2].cls, 'det' ) self.assertEqual( deps[3].cls, 'mod' ) self.assertEqual( deps[4].cls, 'mod' ) self.assertEqual( deps[5].cls, 'obj1' ) self.assertEqual( deps[2].head().wrefs(0), self.doc['WR-P-E-J-0000000001.p.1.s.1.w.5'] ) self.assertEqual( deps[2].dependent().wrefs(0), self.doc['WR-P-E-J-0000000001.p.1.s.1.w.3'] ) def test020g_spanannotation(self): """Sanity Check - Semantic Role Labelling""" s = self.doc['WR-P-E-J-0000000001.p.1.s.7'] semrolelayer = s.annotation(folia.SemanticRolesLayer) predicate = semrolelayer.annotation(folia.Predicate) self.assertEqual( predicate.cls, 'aanduiden' ) roles = 
list(predicate.annotations(folia.SemanticRole)) self.assertEqual( roles[0].cls, 'actor' ) self.assertEqual( roles[1].cls, 'patient' ) self.assertEqual( roles[0].wrefs(0), self.doc['WR-P-E-J-0000000001.p.1.s.7.w.3'] ) self.assertEqual( roles[1].wrefs(0), self.doc['WR-P-E-J-0000000001.p.1.s.7.w.4'] ) self.assertEqual( roles[1].wrefs(1), self.doc['WR-P-E-J-0000000001.p.1.s.7.w.5'] ) def test021_previousword(self): """Sanity Check - Obtaining previous word""" w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] prevw = w.previous() self.assertTrue( isinstance(prevw, folia.Word) ) self.assertEqual( prevw.text(), "zo'n" ) def test021b_previousword_noscope(self): """Sanity Check - Obtaining previous word without scope constraint""" w = self.doc['WR-P-E-J-0000000001.p.1.s.4.w.1'] prevw = w.previous(folia.Word, None) self.assertTrue( isinstance(prevw, folia.Word) ) self.assertEqual( prevw.text(), "." ) def test021c_previousword_constrained(self): """Sanity Check - Obtaining non-existing previous word with scope constraint""" w = self.doc['WR-P-E-J-0000000001.p.1.s.4.w.1'] prevw = w.previous(folia.Word, [folia.Sentence]) self.assertEqual(prevw, None) def test022_nextword(self): """Sanity Check - Obtaining next word""" w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] nextw = w.next() self.assertTrue( isinstance(nextw, folia.Word) ) self.assertEqual( nextw.text(), "," ) def test023_leftcontext(self): """Sanity Check - Obtaining left context""" w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] context = w.leftcontext(3) self.assertEqual( [ x.text() for x in context ], ['wetenschap','wordt',"zo'n"] ) def test024_rightcontext(self): """Sanity Check - Obtaining right context""" w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] context = w.rightcontext(3) self.assertEqual( [ x.text() for x in context ], [',','onder','de'] ) def test025_fullcontext(self): """Sanity Check - Obtaining full context""" w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.7'] context = w.context(3) self.assertEqual( [ x.text() for x in context ], ['wetenschap','wordt',"zo'n",'stamboom',',','onder','de'] ) def test026_feature(self): """Sanity Check - Features""" w = self.doc['WR-P-E-J-0000000001.p.1.s.6.w.1'] pos = w.annotation(folia.PosAnnotation) self.assertTrue( isinstance(pos, folia.PosAnnotation) ) self.assertEqual(pos.cls,'WW(vd,prenom,zonder)') self.assertEqual( len(pos), 1) features = list(pos.select(folia.Feature)) self.assertEqual( len(features), 1) self.assertTrue( isinstance(features[0], folia.Feature)) self.assertEqual( features[0].subset, 'head') self.assertEqual( features[0].cls, 'WW') def test027_datetime(self): """Sanity Check - Time stamp""" w = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.15'] pos = w.annotation(folia.PosAnnotation) self.assertEqual( pos.datetime, datetime(2011, 7, 20, 19, 0, 1) ) self.assertTrue( xmlcheck(pos.xmlstring(), '') ) def test028_wordparents(self): """Sanity Check - Finding parents of word""" w = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.15'] s = w.sentence() self.assertTrue( isinstance(s, folia.Sentence) ) self.assertEqual( s.id, 'WR-P-E-J-0000000001.p.1.s.8') p = w.paragraph() self.assertTrue( isinstance(p, folia.Paragraph) ) self.assertEqual( p.id, 'WR-P-E-J-0000000001.p.1') div = w.division() self.assertTrue( isinstance(div, folia.Division) ) self.assertEqual( div.id, 'WR-P-E-J-0000000001.div0.1') self.assertEqual( w.incorrection(), None) def test0029_quote(self): """Sanity Check - Quote""" q = self.doc['WR-P-E-J-0000000001.p.1.s.8.q.1'] self.assertTrue( isinstance(q, folia.Quote) ) 
self.assertEqual(q.text(), 'volle lijn') s = self.doc['WR-P-E-J-0000000001.p.1.s.8'] self.assertEqual(s.text(), 'Een volle lijn duidt op een verwantschap , terweil een stippelijn op een onzekere verwantschap duidt .') #(spelling errors are present in sentence) #a word from the quote w = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.2'] #check if sentence matches self.assertTrue( (w.sentence() is s) ) def test030_textcontent(self): """Sanity check - Text Content""" s = self.doc['WR-P-E-J-0000000001.p.1.s.4'] self.assertEqual( s.text(), 'De hoofdletter A wordt gebruikt voor het originele handschrift .') self.assertEqual( s.stricttext(), 'De hoofdletter A wordt gebruikt voor het originele handschrift.') self.assertEqual( s.textcontent().text(), 'De hoofdletter A wordt gebruikt voor het originele handschrift.') self.assertEqual( s.textcontent('original').text(), 'De hoofdletter A wordt gebruikt voor het originele handschrift.') self.assertRaises( folia.NoSuchText, s.text, 'BLAH' ) w = self.doc['WR-P-E-J-0000000001.p.1.s.4.w.2'] self.assertEqual( w.text(), 'hoofdletter') self.assertEqual( w.textcontent().text(), 'hoofdletter') self.assertEqual( w.textcontent().offset, 3) w2 = self.doc['WR-P-E-J-0000000001.p.1.s.6.w.31'] self.assertEqual( w2.text(), 'vierkante') self.assertEqual( w2.stricttext(), 'vierkante') def test030b_textcontent(self): """Sanity check - Text Content (2)""" s = self.doc['sandbox.3.head'] t = s.textcontent() self.assertEqual( len(t), 3) self.assertEqual( t.text(), "De FoLiA developers zijn:") self.assertEqual( t[0], "De ") self.assertTrue( isinstance(t[1], folia.TextMarkupString) ) self.assertEqual( t[1].text(), "FoLiA developers") self.assertEqual( t[2], " zijn:") def test031_sense(self): """Sanity Check - Lexical Semantic Sense Annotation""" w = self.doc['sandbox.list.1.listitem.1.s.1.w.1'] sense = w.annotation(folia.SenseAnnotation) self.assertEqual( sense.cls , 'some.sense.id') self.assertEqual( sense.feat('synset') , 'some.synset.id') def test032_event(self): """Sanity Check - Events""" l= self.doc['sandbox'] event = l.annotation(folia.Event) self.assertEqual( event.cls , 'applause') self.assertEqual( event.feat('actor') , 'audience') def test033_list(self): """Sanity Check - List""" l = self.doc['sandbox.list.1'] self.assertTrue( isinstance( l[0], folia.ListItem) ) self.assertEqual( l[0].n, '1' ) #testing common n attribute self.assertEqual( l[0].text(), 'Eerste testitem') self.assertTrue( isinstance( l[-1], folia.ListItem) ) self.assertEqual( l[1].text(), 'Tweede testitem') self.assertEqual( l[1].n, '2' ) def test034_figure(self): """Sanity Check - Figure""" fig = self.doc['sandbox.figure.1'] self.assertEqual( fig.src, "http://upload.wikimedia.org/wikipedia/commons/8/8e/Family_tree.svg") self.assertEqual( fig.caption(), 'Een stamboom') def test035_event(self): """Sanity Check - Event""" e = self.doc['sandbox.event.1'] self.assertEqual( e.feat('actor'), 'proycon') self.assertEqual( e.feat('begindatetime'), '2011-12-15T19:01') self.assertEqual( e.feat('enddatetime'), '2011-12-15T19:05') def test036_parsen(self): """Sanity Check - Paragraph and Sentence annotation""" p = self.doc['WR-P-E-J-0000000001.p.1'] self.assertEqual( p.cls, 'firstparagraph' ) s = self.doc['WR-P-E-J-0000000001.p.1.s.6'] self.assertEqual( s.cls, 'sentence' ) def test037a_feat(self): """Sanity Check - Feature test (including shortcut)""" xml = """
blah

blah

""".format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc['head.1.s.1.w.1'].pos() , 'NN(blah)') self.assertEqual( doc['head.1.s.1.w.1'].annotation(folia.PosAnnotation).feat('head') , 'NN') self.assertEqual( doc['p.1.s.1.w.1'].pos() , 'NN(blah)') self.assertEqual( doc['p.1.s.1.w.1'].annotation(folia.PosAnnotation).feat('head') , 'NN') def test037b_multiclassfeat(self): """Sanity Check - Multiclass feature""" xml = """

blah

""".format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc['p.1.s.1.w.1'].pos() , 'NN(a,b,c)') self.assertEqual( doc['p.1.s.1.w.1'].annotation(folia.PosAnnotation).feat('x') , ['a','b','c'] ) def test038a_morphemeboundary(self): """Sanity check - Obtaining annotation should not descend into morphology layer""" self.assertRaises( folia.NoSuchAnnotation, self.doc['WR-P-E-J-0000000001.sandbox.2.s.1.w.2'].annotation , folia.PosAnnotation) def test038b_morphemeboundary(self): """Sanity check - Obtaining morphemes and token annotation under morphemes""" w = self.doc['WR-P-E-J-0000000001.sandbox.2.s.1.w.2'] l = list(w.morphemes()) #get all morphemes self.assertEqual(len(l), 2) m = w.morpheme(1) #get second morpheme self.assertEqual(m.annotation(folia.PosAnnotation).cls, 'n') def test039_findspan(self): """Sanity Check - Find span on layer""" s = self.doc['WR-P-E-J-0000000001.p.1.s.7'] semrolelayer = s.annotation(folia.SemanticRolesLayer) roles = list(semrolelayer.annotations(folia.SemanticRole)) self.assertEqual(semrolelayer.findspan( self.doc['WR-P-E-J-0000000001.p.1.s.7.w.4'], self.doc['WR-P-E-J-0000000001.p.1.s.7.w.5']), roles[1] ) def test040_spaniter(self): """Sanity Check - Iteration over spans""" t = [] sentence = self.doc["WR-P-E-J-0000000001.p.1.s.1"] for layer in sentence.select(folia.EntitiesLayer): for entity in layer.select(folia.Entity): for word in entity.wrefs(): t.append(word.text()) self.assertEqual(t, ['ander','woord']) def test041_findspans(self): """Sanity check - Find spans given words (no set)""" t = [] word = self.doc["WR-P-E-J-0000000001.p.1.s.1.w.4"] for entity in word.findspans(folia.EntitiesLayer): for word in entity.wrefs(): t.append(word.text()) self.assertEqual(t, ['ander','woord']) def test041b_findspans(self): """Sanity check - Find spans given words (specific set)""" t = [] word = self.doc["example.table.1.w.3"] for entity in word.findspans(folia.EntitiesLayer, "http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml"): for word in entity.wrefs(): t.append(word.text()) self.assertEqual(t, ['Maarten','van','Gompel']) def test041c_findspans(self): """Sanity check - Find spans given words (specific set, by SpanAnnotation class)""" t = [] word = self.doc["example.table.1.w.3"] for entity in word.findspans(folia.Entity, "http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml"): for word in entity.wrefs(): t.append(word.text()) self.assertEqual(t, ['Maarten','van','Gompel']) def test042_table(self): """Sanity check - Table""" table = self.doc["example.table.1"] self.assertTrue( isinstance(table, folia.Table)) self.assertTrue( isinstance(table[0], folia.TableHead)) self.assertTrue( isinstance(table[0][0], folia.Row)) self.assertEqual( len(table[0][0]), 2) #two cells self.assertTrue( isinstance(table[0][0][0], folia.Cell)) self.assertEqual( table[0][0][0].text(), "Naam" ) self.assertEqual( table[0][0].text(), "Naam | Universiteit" ) #text of whole row def test043_string(self): """Sanity check - String""" s = self.doc["sandbox.3.head"] self.assertTrue( s.hasannotation(folia.String) ) st = next(s.select(folia.String)) self.assertEqual( st.text(), "FoLiA developers") self.assertEqual( st.annotation(folia.LangAnnotation).cls, "eng") def test044_textmarkup(self): """Sanity check - Text Markup""" s = self.doc["sandbox.3.head"] t = s.textcontent() self.assertEqual( s.count(folia.TextMarkupString), 1) self.assertEqual( 
t.count(folia.TextMarkupString), 1) st = next(t.select(folia.TextMarkupString)) self.assertEqual( st.text(), "FoLiA developers" ) #testing value (full text value) self.assertEqual( st.resolve(), self.doc['sandbox.3.str']) #testing resolving references self.assertTrue( isinstance( self.doc['WR-P-E-J-0000000001.p.1.s.6'].textcontent()[-1], folia.Linebreak) ) #did we get the linebreak properly? #testing nesting self.assertEqual( len(st), 2) self.assertEqual( st[0], self.doc['sandbox.3.str.bold']) #testing TextMarkup.text() self.assertEqual( st[0].text(), 'FoLiA' ) #resolving returns self if it's not a reference self.assertEqual( self.doc['sandbox.3.str.bold'].resolve(), self.doc['sandbox.3.str.bold']) def test045_spancorrection(self): """Sanity Check - Corrections over span elements""" s = self.doc['example.last.cell'] entities = list(s.select(folia.Entity,set="http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml")) self.assertEqual( len(entities),1 ) self.assertEqual( entities[0].id , "example.tilburg.university.org" ) def test046_entry(self): """Sanity Check - Checking entry, term, definition and example""" entry = self.doc['entry.1'] terms = list(entry.select(folia.Term)) self.assertEqual( len(terms),1 ) self.assertEqual( terms[0].text() ,"Stemma" ) definitions = list(entry.select(folia.Definition)) self.assertEqual( len(definitions),2 ) examples = list(entry.select(folia.Example)) self.assertEqual( len(examples),1 ) def test046a_text(self): """Sanity Check - Text serialisation test with linebreaks and whitespaces""" p = self.doc['WR-P-E-J-0000000001.p.1'] #this is a bit of a malformed paragraph due to the explicit whitespace and linebreaks in it, but makes for a nice test: self.assertEqual( p.text(), "Stemma is een ander woord voor stamboom . In de historische wetenschap wordt zo'n stamboom , onder de naam stemma codicum ( handschriftelijke genealogie ) , gebruikt om de verwantschap tussen handschriften weer te geven . \n\nWerkwijze\n\nHiervoor worden de handschriften genummerd en gedateerd zodat ze op de juiste plaats van hun afstammingsgeschiedenis geplaatst kunnen worden . De hoofdletter A wordt gebruikt voor het originele handschrift . De andere handschriften krijgen ook een letter die verband kan houden met hun plaats van oorsprong óf plaats van bewaring. Verdwenen handschriften waarvan men toch vermoedt dat ze ooit bestaan hebben worden ook in het stemma opgenomen en worden weergegeven door de laatste letters van het alfabet en worden tussen vierkante haken geplaatst .\nTenslotte gaat men de verwantschap tussen de handschriften aanduiden . 
Een volle lijn duidt op een verwantschap , terweil een stippelijn op een onzekere verwantschap duidt .") def test046b_text(self): """Sanity Check - Text serialisation on lists""" l = self.doc['sandbox.list.1'] #this is a bit of a malformed paragraph due to the explicit whitespace and linebreaks in it, but makes for a nice test: self.assertEqual( l.text(), "Eerste testitem\nTweede testitem") def test047_alignment(self): """Sanity check - Alignment""" word = self.doc['WR-P-E-J-0000000001.p.1.s.3.w.10'] a = word.annotation(folia.Alignment) self.assertEqual( a.cls, "reference") aref = next(a.select(folia.AlignReference,ignore=False)) self.assertEqual( aref.id,"WR-P-E-J-0000000001.p.1.s.3.w.5" ) self.assertEqual( aref.type, 'w' ) self.assertEqual( aref.t,"handschriften" ) def test048_observations(self): """Sanity check - Observations""" word = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.9'] observation = list(word.findspans(folia.ObservationLayer))[0] self.assertEqual( observation.cls , "ei_ij_error") self.assertEqual( observation.description() , "Confusion between EI and IJ diphtongues") def test049_sentiment(self): """Sanity check - Sentiments""" sentence = self.doc['WR-P-E-J-0000000001.sandbox.2.s.3'] sentiments = sentence.annotation(folia.SentimentLayer) sentiment = sentiments.annotation(folia.Sentiment) self.assertEqual( sentiment.cls , "disappointment") self.assertEqual( sentiment.feat('polarity') , "negative") self.assertEqual( sentiment.feat('strength') , "strong") self.assertEqual( sentiment.annotation(folia.Source).text(), "Hij") self.assertEqual( sentiment.annotation(folia.Headspan).text(), "erg teleurgesteld") def test050_statement(self): """Sanity check - Statements""" sentence = self.doc['WR-P-E-J-0000000001.sandbox.2.s.2'] sentiments = sentence.annotation(folia.StatementLayer) sentiment = sentiments.annotation(folia.Statement) self.assertEqual( sentiment.cls , "promise") self.assertEqual( sentiment.annotation(folia.Source).text(), "Hij") self.assertEqual( sentiment.annotation(folia.Relation).text(), "had beloofd") self.assertEqual( sentiment.annotation(folia.Headspan).text(), "hij zou winnen") def test099_write(self): """Sanity Check - Writing to file""" self.doc.save(os.path.join(TMPDIR,'foliasavetest.xml')) def test099b_write(self): """Sanity Check - Writing to GZ file""" self.doc.save(os.path.join(TMPDIR,'foliasavetest.xml.gz')) def test099c_write(self): """Sanity Check - Writing to BZ2 file""" self.doc.save(os.path.join(TMPDIR,'foliasavetest.xml.bz2')) def test100a_sanity(self): """Sanity Check - A - Checking output file against input (should be equal)""" f = io.open(os.path.join(TMPDIR,'foliatest.xml'),'w',encoding='utf-8') f.write(FOLIAEXAMPLE) f.close() self.doc.save(os.path.join(TMPDIR,'foliatest100.xml')) self.assertEqual( folia.Document(file=os.path.join(TMPDIR,'foliatest100.xml'),debug=False), self.doc ) def test100b_sanity_xmldiff(self): """Sanity Check - B - Checking output file against input using xmldiff (should be equal)""" f = io.open(os.path.join(TMPDIR,'foliatest.xml'),'w',encoding='utf-8') f.write(FOLIAEXAMPLE) f.close() #use xmldiff to compare the two: self.doc.save(os.path.join(TMPDIR,'foliatest100.xml')) retcode = os.system('xmldiff -c ' + os.path.join(TMPDIR,'foliatest.xml') + ' ' + os.path.join(TMPDIR,'foliatest100.xml')) #retcode = 1 #disabled (memory hog) self.assertEqual( retcode, 0) def test101a_metadataextref(self): """Sanity Check - Metadata external reference (CMDI)""" xml = """ """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' 
+ folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.metadatatype, folia.MetaDataType.CMDI ) self.assertEqual( doc.metadatafile, 'test.cmdi.xml' ) def test101b_metadataextref2(self): """Sanity Check - Metadata external reference (IMDI)""" xml = """ """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.metadatatype, folia.MetaDataType.IMDI ) self.assertEqual( doc.metadatafile, 'test.imdi.xml' ) def test101c_metadatainternal(self): """Sanity Check - Metadata internal (foreign data) (Dublin Core)""" xml = """ mydoc text/xml Example proycon proycon en Radboud University public Domain """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.metadatatype, "dc" ) self.assertEqual( doc.metadata.node.xpath('//dc:creator', namespaces={'dc':'http://purl.org/dc/elements/1.1/'})[0].text , 'proycon' ) xmlcheck(doc.xmlstring(), xml) def test101d_metadatainternal(self): """Sanity Check - Metadata internal (double)""" xml = """ mydoc text/xml Example proycon proycon en Radboud University public Domain """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.metadatatype, "dc" ) self.assertEqual( doc.metadata.node.xpath('//dc:creator', namespaces={'dc':'http://purl.org/dc/elements/1.1/'})[0].text , 'proycon' ) xmlcheck(doc.xmlstring(), xml) def test102a_declarations(self): """Sanity Check - Declarations - Default set""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( next(doc['example.text.1'].select(folia.Gap)).set, 'gap-set' ) def test102a2_declarations(self): """Sanity Check - Declarations - Default set, no further defaults""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( next(doc['example.text.1'].select(folia.Gap)).set, 'gap-set' ) self.assertEqual( next(doc['example.text.1'].select(folia.Gap)).annotator, 'proycon' ) self.assertEqual( next(doc['example.text.1'].select(folia.Gap)).annotatortype, folia.AnnotatorType.MANUAL) def test102b_declarations(self): """Sanity Check - Declarations - Set mismatching """ xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) self.assertRaises( ValueError, folia.Document, string=xml) def test102c_declarations(self): """Sanity Check - Declarations - Multiple sets for the same annotation type""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( next(doc['example.text.1'].select(folia.Gap)).set, 'gap-set' ) self.assertEqual( list(doc['example.text.1'].select(folia.Gap))[1].set, 'extended-gap-set' ) def test102d1_declarations(self): """Sanity Check - Declarations - Multiple sets for the same annotation type (testing failure)""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) self.assertRaises(ValueError, folia.Document, string=xml ) def test102d2_declarations(self): """Sanity Check - Declarations - Multiple sets for the same annotation type (testing failure)""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + 
folia.LIBVERSION) self.assertRaises(ValueError, folia.Document, string=xml ) def test102d3_declarations(self): """Sanity Check - Declarations - Ignore Duplicates""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.defaultset(folia.AnnotationType.GAP), 'gap-set' ) self.assertEqual( doc.defaultannotator(folia.AnnotationType.GAP), "sloot" ) def test102e_declarations(self): """Sanity Check - Declarations - Missing declaration""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) self.assertRaises( ValueError, folia.Document, string=xml) def test102f_declarations(self): """Sanity Check - Declarations - Declaration not needed""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) def test102g_declarations(self): """Sanity Check - Declarations - 'Undefined' set in declaration""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( next(doc['example.text.1'].select(folia.Gap)).set, 'undefined' ) def test102h_declarations(self): """Sanity Check - Declarations - Double ambiguous declarations unset default""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertRaises(folia.NoDefaultError, doc.defaultannotator, folia.AnnotationType.GAP) def test102i_declarations(self): """Sanity Check - Declarations - miscellanious trouble""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.defaultannotator(folia.AnnotationType.GAP,"gap1-set"), "sloot" ) doc.declare(folia.AnnotationType.GAP, "gap1-set", annotator='proycon' ) #slightly different behaviour from libfolia: here this overrides the earlier default self.assertEqual( doc.defaultannotator(folia.AnnotationType.GAP,"gap1-set"), "proycon" ) self.assertEqual( doc.defaultannotator(folia.AnnotationType.GAP,"gap2-set"), "sloot" ) text = doc["example.text.1"] text.append( folia.Gap(doc, set='gap1-set', cls='Y', annotator='proycon') ) text.append( folia.Gap(doc, set='gap1-set', cls='Z1' ) ) text.append( folia.Gap(doc, set='gap2-set', cls='Z2' ) ) text.append( folia.Gap(doc, set='gap2-set', cls='Y2', annotator='onbekend' ) ) gaps = list(text.select(folia.Gap)) self.assertTrue( xmlcheck(gaps[0].xmlstring(), '' ) ) self.assertTrue( xmlcheck(gaps[1].xmlstring(), '') ) self.assertTrue( xmlcheck(gaps[2].xmlstring(), '') ) self.assertTrue( xmlcheck(gaps[3].xmlstring(), '') ) self.assertTrue( xmlcheck(gaps[4].xmlstring(), '') ) def test102j_declarations(self): """Sanity Check - Declarations - Adding a declaration in other set.""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) text = doc["example.text.1"] doc.declare(folia.AnnotationType.GAP, "other-set", annotator='proycon' ) text.append( folia.Gap(doc, set='other-set', cls='Y', annotator='proycon') ) text.append( folia.Gap(doc, set='other-set', cls='Z' ) ) gaps = list(text.select(folia.Gap)) self.assertEqual( gaps[0].xmlstring(), '' ) self.assertEqual( gaps[1].xmlstring(), '' ) self.assertEqual( gaps[2].xmlstring(), '' ) def test102k_declarations(self): """Sanity Check - 
Declarations - Several annotator types.""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.defaultannotatortype(folia.AnnotationType.GAP, 'gap-set'), folia.AnnotatorType.AUTO) text = doc["example.text.1"] gaps = list(text.select(folia.Gap)) self.assertTrue( xmlcheck(gaps[0].xmlstring(), '' ) ) doc.declare(folia.AnnotationType.GAP, "gap-set", annotatortype=folia.AnnotatorType.MANUAL ) self.assertEqual( doc.defaultannotatortype(folia.AnnotationType.GAP), folia.AnnotatorType.MANUAL ) self.assertRaises( ValueError, folia.Gap, doc, set='gap-set', cls='Y', annotatortype='unknown' ) text.append( folia.Gap(doc, set='gap-set', cls='Y', annotatortype='manual' ) ) text.append( folia.Gap(doc, set='gap-set', cls='Z', annotatortype='auto' ) ) gaps = list(text.select(folia.Gap)) self.assertTrue( xmlcheck(gaps[0].xmlstring(), '') ) self.assertTrue( xmlcheck(gaps[1].xmlstring(), '') ) self.assertTrue( xmlcheck(gaps[2].xmlstring(), '') ) def test102l_declarations(self): """Sanity Check - Declarations - Datetime default.""" xml = """\n """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertEqual( doc.defaultdatetime(folia.AnnotationType.GAP, 'gap-set'), folia.parse_datetime('2011-12-15T19:00') ) self.assertEqual( next(doc["example.text.1"].select(folia.Gap)).datetime , folia.parse_datetime('2011-12-15T19:00') ) def test103_namespaces(self): """Sanity Check - Alien namespaces - Checking whether properly ignored""" xml = """\n blah word """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertTrue( len(list(doc['example.text.1.s.1'].words())) == 1 ) #second word is in alien namespace, not read self.assertRaises( KeyError, doc.__getitem__, 'example.text.1.s.1.alienword') #doesn't exist def test104_speech(self): """Sanity Check - Speech data (without attributes)""" xml = """\n həlˈəʊ wˈɜːld həlˈəʊ wˈɜːld """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertTrue( isinstance(doc.data[0], folia.Speech) ) self.assertTrue( isinstance(doc['example.speech.utt.1'], folia.Utterance) ) self.assertEqual( doc['example.speech.utt.1'].phon(), "həlˈəʊ wˈɜːld" ) self.assertRaises( folia.NoSuchText, doc['example.speech.utt.1'].text) #doesn't exist self.assertEqual( doc['example.speech.utt.2'].phon(), "həlˈəʊ wˈɜːld" ) def test104b_speech(self): """Sanity Check - Speech data with speech attributes""" xml = """\n həlˈəʊ wˈɜːld həlˈəʊ wˈɜːld """.format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertTrue( isinstance(doc.data[0], folia.Speech) ) self.assertTrue( isinstance(doc['example.speech.utt.1'], folia.Utterance) ) self.assertEqual( doc['example.speech.utt.1'].phon(), "həlˈəʊ wˈɜːld" ) self.assertRaises( folia.NoSuchText, doc['example.speech.utt.1'].text) #doesn't exist self.assertEqual( doc['example.speech.utt.2'].phon(), "həlˈəʊ wˈɜːld" ) self.assertEqual( doc['example.speech'].speech_speaker(), "proycon" ) self.assertEqual( doc['example.speech'].speech_src(), "helloworld.ogg" ) self.assertEqual( doc['example.speech.utt.1'].begintime, (0,0,0,0) ) self.assertEqual( doc['example.speech.utt.1'].endtime, (0,0,2,12) ) #testing inheritance self.assertEqual( 
doc['example.speech.utt.2.w.2'].speech_speaker(), "proycon" ) self.assertEqual( doc['example.speech.utt.2.w.2'].speech_src(), "helloworld.ogg" ) self.assertEqual( doc['example.speech.utt.2.w.2'].begintime, (0,0,1,267) ) self.assertEqual( doc['example.speech.utt.2.w.2'].endtime, (0,0,2,12) ) def test104c_speech(self): """Sanity Check - Testing serialisation of speech data with speech attributes""" speechxml = """ həlˈəʊ wˈɜːld həlˈəʊ wˈɜːld """ xml = """\n %s """ % speechxml doc = folia.Document(string=xml) self.assertTrue( xmlcheck( doc['example.speech'].xmlstring(), u(speechxml)) ) def test105_complexalignment(self): """Sanity Check - Complex alignment""" xml = """

Dit is een test. Ik wil kijken of het werkt.

""".format(version=folia.FOLIAVERSION, generator='pynlpl.formats.folia-v' + folia.LIBVERSION) doc = folia.Document(string=xml) self.assertTrue(doc.xml() is not None) #serialisation check l = doc.paragraphs(0).annotation(folia.ComplexAlignmentLayer) ca = list(l.annotations(folia.ComplexAlignment)) self.assertEqual(len(ca),1) alignments = list(ca[0].select(folia.Alignment)) self.assertEqual(len(alignments),2) class Test4Edit(unittest.TestCase): def setUp(self): global FOLIAEXAMPLE self.doc = folia.Document(string=FOLIAEXAMPLE) def test001_addsentence(self): """Edit Check - Adding a sentence to first paragraph (verbose)""" #grab last paragraph p = self.doc.paragraphs(0) #how many sentences? tmp = len(list(p.sentences())) #make a sentence s = folia.Sentence(self.doc, generate_id_in=p) #add words to the sentence s.append( folia.Word(self.doc, text='Dit',generate_id_in=s, annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) s.append( folia.Word(self.doc, text='is',generate_id_in=s, annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) s.append( folia.Word(self.doc, text='een',generate_id_in=s, annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) s.append( folia.Word(self.doc, text='nieuwe',generate_id_in=s, annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) s.append( folia.Word(self.doc, text='zin',generate_id_in=s, annotator='testscript', annotatortype=folia.AnnotatorType.AUTO, space=False ) ) s.append( folia.Word(self.doc, text='.',generate_id_in=s, annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) #add the sentence p.append(s) #ID check self.assertEqual( s[0].id, s.id + '.w.1' ) self.assertEqual( s[1].id, s.id + '.w.2' ) self.assertEqual( s[2].id, s.id + '.w.3' ) self.assertEqual( s[3].id, s.id + '.w.4' ) self.assertEqual( s[4].id, s.id + '.w.5' ) self.assertEqual( s[5].id, s.id + '.w.6' ) #index check self.assertEqual( self.doc[s.id], s ) self.assertEqual( self.doc[s.id + '.w.3'], s[2] ) #attribute check self.assertEqual( s[0].annotator, 'testscript' ) self.assertEqual( s[0].annotatortype, folia.AnnotatorType.AUTO ) #addition to paragraph correct? self.assertEqual( len(list(p.sentences())) , tmp + 1) self.assertEqual( p[-1] , s) # text() ok? self.assertEqual( s.text(), "Dit is een nieuwe zin." ) # xml() ok? self.assertTrue( xmlcheck( s.xmlstring(), 'Ditiseennieuwezin.') ) def test001b_addsentence(self): """Edit Check - Adding a sentence to first paragraph (shortcut)""" #grab last paragraph p = self.doc.paragraphs(0) #how many sentences? tmp = len(list(p.sentences())) s = p.append(folia.Sentence) s.append(folia.Word,'Dit') s.append(folia.Word,'is') s.append(folia.Word,'een') s.append(folia.Word,'nieuwe') w = s.append(folia.Word,'zin') w2 = s.append(folia.Word,'.',cls='PUNCTUATION') self.assertEqual( s.id, 'WR-P-E-J-0000000001.p.1.s.9') self.assertEqual( len(list(s.words())), 6 ) #number of words in sentence self.assertEqual( w.text(), 'zin' ) #text check self.assertEqual( self.doc[w.id], w ) #index check #addition to paragraph correct? self.assertEqual( len(list(p.sentences())) , tmp + 1) self.assertEqual( p[-1] , s) self.assertTrue( xmlcheck(s.xmlstring(), 'Ditiseennieuwezin.')) def test001c_addsentence(self): """Edit Check - Adding a sentence to first paragraph (using add instead of append)""" #grab last paragraph p = self.doc.paragraphs(0) #how many sentences? 
tmp = len(list(p.sentences())) s = p.add(folia.Sentence) s.add(folia.Word,'Dit') s.add(folia.Word,'is') s.add(folia.Word,'een') s.add(folia.Word,'nieuwe') w = s.add(folia.Word,'zin') w2 = s.add(folia.Word,'.',cls='PUNCTUATION') self.assertEqual( len(list(s.words())), 6 ) #number of words in sentence self.assertEqual( w.text(), 'zin' ) #text check self.assertEqual( self.doc[w.id], w ) #index check #addition to paragraph correct? self.assertEqual( len(list(p.sentences())) , tmp + 1) self.assertEqual( p[-1] , s) self.assertTrue( xmlcheck(s.xmlstring(), 'Ditiseennieuwezin.')) def test002_addannotation(self): """Edit Check - Adding a token annotation (pos, lemma) (pre-generated instances)""" #grab a word (naam) w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.11'] self.doc.declare(folia.PosAnnotation, 'adhocpos') self.doc.declare(folia.LemmaAnnotation, 'adhoclemma') #add a pos annotation (in a different set than the one already present, to prevent conflict) w.append( folia.PosAnnotation(self.doc, set='adhocpos', cls='NOUN', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) w.append( folia.LemmaAnnotation(self.doc, set='adhoclemma', cls='NAAM', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO, datetime=datetime(1982, 12, 15, 19, 0, 1) ) ) #retrieve and check p = w.annotation(folia.PosAnnotation, 'adhocpos') self.assertTrue( isinstance(p, folia.PosAnnotation) ) self.assertEqual( p.cls, 'NOUN' ) l = w.annotation(folia.LemmaAnnotation, 'adhoclemma') self.assertTrue( isinstance(l, folia.LemmaAnnotation) ) self.assertEqual( l.cls, 'NAAM' ) self.assertTrue( xmlcheck(w.xmlstring(), 'naam') ) def test002b_addannotation(self): """Edit Check - Adding a token annotation (pos, lemma) (instances generated on the fly)""" #grab a word (naam) w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.11'] self.doc.declare(folia.PosAnnotation, 'adhocpos') self.doc.declare(folia.LemmaAnnotation, 'adhoclemma') #add a pos annotation (in a different set than the one already present, to prevent conflict) w.append( folia.PosAnnotation, set='adhocpos', cls='NOUN', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) w.append( folia.LemmaAnnotation, set='adhoclemma', cls='NAAM', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO ) #retrieve and check p = w.annotation(folia.PosAnnotation, 'adhocpos') self.assertTrue( isinstance(p, folia.PosAnnotation) ) self.assertEqual( p.cls, 'NOUN' ) l = w.annotation(folia.LemmaAnnotation, 'adhoclemma') self.assertTrue( isinstance(l, folia.LemmaAnnotation) ) self.assertEqual( l.cls, 'NAAM' ) self.assertTrue( xmlcheck(w.xmlstring(), 'naam')) def test002c_addannotation(self): """Edit Check - Adding a token annotation (pos, lemma) (using add instead of append)""" #grab a word (naam) w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.11'] self.doc.declare(folia.PosAnnotation, 'adhocpos') self.doc.declare(folia.LemmaAnnotation, 'adhoclemma') #add a pos annotation (in a different set than the one already present, to prevent conflict) w.add( folia.PosAnnotation(self.doc, set='adhocpos', cls='NOUN', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) w.add( folia.LemmaAnnotation(self.doc, set='adhoclemma', cls='NAAM', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO, datetime=datetime(1982, 12, 15, 19, 0, 1) ) ) #retrieve and check p = w.annotation(folia.PosAnnotation, 'adhocpos') self.assertTrue( isinstance(p, folia.PosAnnotation) ) self.assertEqual( p.cls, 'NOUN' ) l = w.annotation(folia.LemmaAnnotation, 'adhoclemma') self.assertTrue( 
isinstance(l, folia.LemmaAnnotation) ) self.assertEqual( l.cls, 'NAAM' ) self.assertTrue( xmlcheck(w.xmlstring(), 'naam') ) def test004_addinvalidannotation(self): """Edit Check - Adding a token default-set annotation that clashes with the existing one""" #grab a word (naam) w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.11'] #add a pos annotation without specifying a set (should take default set), but this will clash with existing tag! self.assertRaises( folia.DuplicateAnnotationError, w.append, folia.PosAnnotation(self.doc, cls='N', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) ) self.assertRaises( folia.DuplicateAnnotationError, w.append, folia.LemmaAnnotation(self.doc, cls='naam', annotator='testscript', annotatortype=folia.AnnotatorType.AUTO ) ) def test005_addalternative(self): """Edit Check - Adding an alternative token annotation""" w = self.doc['WR-P-E-J-0000000001.p.1.s.2.w.11'] w.append( folia.Alternative(self.doc, generate_id_in=w, contents=folia.PosAnnotation(self.doc, cls='V'))) #reobtaining it: alt = list(w.alternatives()) #all alternatives set = self.doc.defaultset(folia.AnnotationType.POS) alt2 = list(w.alternatives(folia.PosAnnotation, set)) self.assertEqual( alt[0],alt2[0] ) self.assertEqual( len(alt),1 ) self.assertEqual( len(alt2),1 ) self.assertTrue( isinstance(alt[0].annotation(folia.PosAnnotation, set), folia.PosAnnotation) ) self.assertTrue( xmlcheck(w.xmlstring(), 'naam')) def test006_addcorrection(self): """Edit Check - Correcting Text""" w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn w.correct(new='stippellijn', set='corrections',cls='spelling',annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) self.assertEqual( w.annotation(folia.Correction).original(0).text() ,'stippelijn' ) self.assertEqual( w.annotation(folia.Correction).new(0).text() ,'stippellijn' ) self.assertEqual( w.text(), 'stippellijn') self.assertTrue( xmlcheck(w.xmlstring(),'stippellijnstippelijn')) def test006b_addcorrection(self): """Edit Check - Correcting Text (2)""" w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn w.correct(new=folia.TextContent(self.doc,value='stippellijn',set='undefined',cls='current'), set='corrections',cls='spelling',annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) self.assertEqual( w.annotation(folia.Correction).original(0).text() ,'stippelijn' ) self.assertEqual( w.annotation(folia.Correction).new(0).text() ,'stippellijn' ) self.assertEqual( w.text(), 'stippellijn') self.assertTrue( xmlcheck(w.xmlstring(),'stippellijnstippelijn')) def test007_addcorrection2(self): """Edit Check - Correcting a Token Annotation element""" w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn oldpos = w.annotation(folia.PosAnnotation) newpos = folia.PosAnnotation(self.doc, cls='N(soort,ev,basis,zijd,stan)') w.correct(original=oldpos,new=newpos, set='corrections',cls='spelling',annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) self.assertEqual( w.annotation(folia.Correction).original(0) ,oldpos ) self.assertEqual( w.annotation(folia.Correction).new(0),newpos ) self.assertTrue( xmlcheck(w.xmlstring(),'stippelijn')) def test008_addsuggestion(self): """Edit Check - Suggesting a text correction""" w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn w.correct(suggestion='stippellijn', set='corrections',cls='spelling',annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) self.assertTrue( isinstance(w.annotation(folia.Correction), folia.Correction) ) self.assertEqual( 
w.annotation(folia.Correction).suggestions(0).text() , 'stippellijn' ) self.assertEqual( w.text(), 'stippelijn') self.assertTrue( xmlcheck(w.xmlstring(),'stippelijnstippellijn')) def test009a_idclash(self): """Edit Check - Checking for exception on adding a duplicate ID""" w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] self.assertRaises( folia.DuplicateIDError, w.sentence().append, folia.Word, id='WR-P-E-J-0000000001.p.1.s.8.w.11', text='stippellijn') #def test009b_textcorrectionlevel(self): # """Edit Check - Checking for exception on an adding TextContent of wrong level""" # w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] # # self.assertRaises( ValueError, w.append, folia.TextContent, value='blah', corrected=folia.TextCorrectionLevel.ORIGINAL ) # #def test009c_duptextcontent(self): # """Edit Check - Checking for exception on an adding duplicate textcontent""" # w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] # # self.assertRaises( folia.DuplicateAnnotationError, w.append, folia.TextContent, value='blah', corrected=folia.TextCorrectionLevel.PROCESSED ) def test010_documentlesselement(self): """Edit Check - Creating an initially document-less tokenannotation element and adding it to a word""" #not associated with any document yet (first argument is None instead of Document instance) pos = folia.PosAnnotation(None, set='fakecgn', cls='N') w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] w.append(pos) self.assertEqual( w.annotation(folia.PosAnnotation,'fakecgn'), pos) self.assertEqual( pos.parent, w) self.assertEqual( pos.doc, w.doc) self.assertTrue( xmlcheck(w.xmlstring(), 'stippelijn')) def test011_subtokenannot(self): """Edit Check - Adding morphemes""" w = self.doc['WR-P-E-J-0000000001.p.1.s.5.w.3'] l = w.append( folia.MorphologyLayer ) l.append( folia.Morpheme(self.doc, folia.TextContent(self.doc, value='handschrift', offset=0), folia.LemmaAnnotation(self.doc, cls='handschrift'), cls='stem',function='lexical' )) l.append( folia.Morpheme(self.doc, folia.TextContent(self.doc, value='en', offset=11), cls='suffix',function='inflexional' )) self.assertEqual( len(l), 2) #two morphemes self.assertTrue( isinstance(l[0], folia.Morpheme ) ) self.assertEqual( l[0].text(), 'handschrift' ) self.assertEqual( l[0].cls , 'stem' ) self.assertEqual( l[0].feat('function'), 'lexical' ) self.assertEqual( l[1].text(), 'en' ) self.assertEqual( l[1].cls, 'suffix' ) self.assertEqual( l[1].feat('function'), 'inflexional' ) self.assertTrue( xmlcheck(w.xmlstring(),'handschriftenhandschriften')) def test012_alignment(self): """Edit Check - Adding Alignment""" w = self.doc['WR-P-E-J-0000000001.p.1.s.6.w.8'] a = w.append( folia.Alignment, cls="coreference") a.append( folia.AlignReference, id='WR-P-E-J-0000000001.p.1.s.6.w.1', type=folia.Word) a.append( folia.AlignReference, id='WR-P-E-J-0000000001.p.1.s.6.w.2', type=folia.Word) self.assertEqual( next(a.resolve()), self.doc['WR-P-E-J-0000000001.p.1.s.6.w.1'] ) self.assertEqual( list(a.resolve())[1], self.doc['WR-P-E-J-0000000001.p.1.s.6.w.2'] ) self.assertTrue( xmlcheck(w.xmlstring(),'ze')) def test013_spanannot(self): """Edit Check - Adding nested Span Annotatation (syntax)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.4'] #sentence: 'De hoofdletter A wordt gebruikt voor het originele handschrift .' 
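# (descriptive note, added for clarity: the next statement builds the whole parse tree in one nested expression;
#  child SyntacticUnit elements are supplied through the 'contents' keyword, while Word instances passed
#  positionally are linked into the span as word references -- compare the wrefs() checks in test013e below)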
layer = s.append(folia.SyntaxLayer) layer.append( folia.SyntacticUnit(self.doc,cls='s',contents=[ folia.SyntacticUnit(self.doc,cls='np', contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.1'] ,cls='det'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.2'], cls='n'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.3'], cls='n'), ]), folia.SyntacticUnit(self.doc,cls='vp',contents=[ folia.SyntacticUnit(self.doc,cls='vp',contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.4'], cls='v'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.5'], cls='participle'), ]), folia.SyntacticUnit(self.doc, cls='pp',contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.6'], cls='prep'), folia.SyntacticUnit(self.doc, cls='np',contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.7'], cls='det'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.8'], cls='adj'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.9'], cls='n'), ]) ]) ]) ]) ) self.assertTrue( xmlcheck(layer.xmlstring(),'')) def test013a_spanannot(self): """Edit Check - Adding Span Annotation (entity, from word using add)""" word = self.doc["WR-P-E-J-0000000001.p.1.s.4.w.2"] #hoofdletter word2 = self.doc["WR-P-E-J-0000000001.p.1.s.4.w.3"] #A entity = word.add(folia.Entity, word, word2, cls="misc",set="http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml") self.assertIsInstance(entity, folia.Entity) self.assertTrue(xmlcheck(entity.parent.parent.xmlstring(),'DeDehoofdletterA')) def test013b_spanannot(self): """Edit Check - Adding nested Span Annotatation (add as append)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.4'] #sentence: 'De hoofdletter A wordt gebruikt voor het originele handschrift .' 
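# (descriptive note, added for clarity: same nested construction as test013_spanannot above, but built with the
#  generic add() method instead of append(), which is what this test is meant to exercise)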
layer = s.add(folia.SyntaxLayer) layer.add( folia.SyntacticUnit(self.doc,cls='s',contents=[ folia.SyntacticUnit(self.doc,cls='np', contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.1'] ,cls='det'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.2'], cls='n'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.3'], cls='n'), ]), folia.SyntacticUnit(self.doc,cls='vp',contents=[ folia.SyntacticUnit(self.doc,cls='vp',contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.4'], cls='v'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.5'], cls='participle'), ]), folia.SyntacticUnit(self.doc, cls='pp',contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.6'], cls='prep'), folia.SyntacticUnit(self.doc, cls='np',contents=[ folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.7'], cls='det'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.8'], cls='adj'), folia.SyntacticUnit(self.doc, self.doc['WR-P-E-J-0000000001.p.1.s.4.w.9'], cls='n'), ]) ]) ]) ]) ) self.assertTrue( xmlcheck(layer.xmlstring(),'')) def test013c_spanannotcorrection(self): """Edit Check - Correcting Span Annotation""" s = self.doc['example.cell'] l = s.annotation(folia.EntitiesLayer) l.correct(original=self.doc['example.radboud.university.nijmegen.org'], new=folia.Entity(self.doc, *self.doc['example.radboud.university.nijmegen.org'].wrefs(), cls="loc",set="http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml") ,set='corrections',cls='wrongclass') self.assertTrue( xmlcheck(l.xmlstring(), 'This is our university!')) def test013d_spanannot(self): """Edit Check - Adding Span Annotation (entity, from sentence using add)""" sentence = self.doc["WR-P-E-J-0000000001.p.1.s.4"] word = self.doc["WR-P-E-J-0000000001.p.1.s.4.w.2"] #hoofdletter word2 = self.doc["WR-P-E-J-0000000001.p.1.s.4.w.3"] #A entity = sentence.add(folia.Entity, word, word2, cls="misc",set="http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml") self.assertIsInstance(entity, folia.Entity) self.assertTrue(xmlcheck(entity.parent.parent.xmlstring(),'De hoofdletter A wordt gebruikt voor het originele handschrift.De hoofdletter A wordt gebruikt voor het originele handschrift.Uppercase A is used for the original.DeDehoofdletterAwordtgebruiktvoorhetoriginelehandschrift.')) def test013e_spanannot(self): """Edit Check - Adding nested Span Annotation""" word = self.doc["WR-P-E-J-0000000001.p.1.s.1.w.7"] #stamboom for su in word.findspans(folia.SyntacticUnit): if su.cls == 'pp': parentspan = su self.assertIsInstance(parentspan, folia.SyntacticUnit) self.assertEqual(parentspan.wrefs(recurse=False) , [self.doc["WR-P-E-J-0000000001.p.1.s.1.w.6"],self.doc["WR-P-E-J-0000000001.p.1.s.1.w.7"]]) #prior to adding newspan = parentspan.add(folia.SyntacticUnit, word, cls='np') self.assertEqual(parentspan.wrefs(recurse=False) , [self.doc["WR-P-E-J-0000000001.p.1.s.1.w.6"]]) #after adding, parent span wref gone (moved to child) self.assertEqual(parentspan.wrefs(recurse=True) , [self.doc["WR-P-E-J-0000000001.p.1.s.1.w.6"],self.doc["WR-P-E-J-0000000001.p.1.s.1.w.7"]]) #result is still the same with recursion self.assertEqual(newspan.wrefs() , [self.doc["WR-P-E-J-0000000001.p.1.s.1.w.7"]]) def test014_replace(self): """Edit Check - Replacing an annotation""" word = self.doc['WR-P-E-J-0000000001.p.1.s.3.w.14'] word.replace(folia.PosAnnotation(self.doc, 
cls='BOGUS') ) self.assertEqual( len(list(word.annotations(folia.PosAnnotation))), 1) self.assertEqual( word.annotation(folia.PosAnnotation).cls, 'BOGUS') self.assertTrue( xmlcheck(word.xmlstring(), 'plaats')) def test015_remove(self): """Edit Check - Removing an annotation""" word = self.doc['WR-P-E-J-0000000001.p.1.s.3.w.14'] word.remove( word.annotation(folia.PosAnnotation) ) self.assertRaises( folia.NoSuchAnnotation, word.annotation, folia.PosAnnotation ) self.assertTrue( xmlcheck(word.xmlstring(), 'plaats')) def test016_datetime(self): """Edit Check - Time stamp""" w = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.16'] pos = w.annotation(folia.PosAnnotation) pos.datetime = datetime(1982, 12, 15, 19, 0, 1) #(the datetime of my joyful birth) self.assertTrue( xmlcheck(pos.xmlstring(), '')) def test017_wordtext(self): """Edit Check - Altering word text""" #Important note: directly altering text is usually bad practise, you'll want to use proper corrections instead. w = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.9'] self.assertEqual(w.text(), 'terweil') w.settext('terwijl') self.assertEqual(w.text(), 'terwijl') def test017b_wordtext(self): """Edit Check - Altering word text with reserved symbols""" #Important note: directly altering text is usually bad practise, you'll want to use proper corrections instead. w = self.doc['WR-P-E-J-0000000001.p.1.s.8.w.9'] w.settext('1 & 1 > 0') self.assertEqual(w.text(), '1 & 1 > 0') self.assertEqual(w.textcontent().xmlstring(), '1 & 1 > 0') def test018a_sentencetext(self): """Edit Check - Altering sentence text (untokenised by definition)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.1'] self.assertEqual(s.text(), 'Stemma is een ander woord voor stamboom .') #text is obtained from children, since there is no direct text associated self.assertFalse(s.hastext()) #no text DIRECTLY associated with the sentence #associating text directly with the sentence: de-tokenised by definition! s.settext('Stemma is een ander woord voor stamboom.') self.assertTrue(s.hastext()) self.assertEqual(s.text(), 'Stemma is een ander woord voor stamboom .') #text still obtained from children rather than directly associated text!! self.assertEqual(s.stricttext(), 'Stemma is een ander woord voor stamboom.') def test018b_sentencetext(self): """Edit Check - Altering sentence text (untokenised by definition)""" s = self.doc['WR-P-E-J-0000000001.p.1.s.8'] self.assertEqual( s.text(), 'Een volle lijn duidt op een verwantschap , terweil een stippelijn op een onzekere verwantschap duidt .' ) #dynamic from children s.settext('Een volle lijn duidt op een verwantschap, terwijl een stippellijn op een onzekere verwantschap duidt.' ) #setting the correct text here will cause a mismatch with the text on deeper levels, but is permitted (deep validation should detect it) s.settext('Een volle lijn duidt op een verwantschap, terweil een stippelijn op een onzekere verwantschap duidt.', 'original' ) self.assertEqual( s.text(), 'Een volle lijn duidt op een verwantschap , terweil een stippelijn op een onzekere verwantschap duidt .' ) #from children by default (child has erroneous stippelijn and terweil) self.assertTrue( s.hastext('original') ) self.assertEqual( s.stricttext('original'), 'Een volle lijn duidt op een verwantschap, terweil een stippelijn op een onzekere verwantschap duidt.' 
) self.assertTrue( xmlcheck(s.xmlstring(), 'Een volle lijn duidt op een verwantschap, terwijl een stippellijn op een onzekere verwantschap duidt.Een volle lijn duidt op een verwantschap, terweil een stippelijn op een onzekere verwantschap duidt.Eenvollelijnduidtopeenverwantschap,terweileenstippelijnopeenonzekeretwijfelachtigeongewisseverwantschapduidt.Confusion between EI and IJ diphtongues')) def test019_adderrordetection(self): """Edit Check - Error Detection""" w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn w.append( folia.ErrorDetection(self.doc, cls="spelling", annotator="testscript", annotatortype=folia.AnnotatorType.AUTO) ) self.assertEqual( w.annotation(folia.ErrorDetection).cls ,'spelling' ) #self.assertTrue( xmlcheck(w.xmlstring(),'stippellijnstippelijn')) #def test008_addaltcorrection(self): # """Edit Check - Adding alternative corrections""" # w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn # w.correcttext('stippellijn', set='corrections',cls='spelling',annotator='testscript', annotatortype='auto', alternative=True) # # alt = w.alternatives(folia.AnnotationType.CORRECTION) # self.assertEqual( alt[0].annotation(folia.Correction).original[0] ,'stippelijn' ) # self.assertEqual( alt[0].annotation(folia.Correction).new[0] ,'stippellijn' ) #def test009_addaltcorrection2(self): # """Edit Check - Adding an alternative and a selected correction""" # w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn # w.correcttext('stippel-lijn', set='corrections',cls='spelling',annotator='testscript', annotatortype='auto', alternative=True) # w.correcttext('stippellijn', set='corrections',cls='spelling',annotator='testscript', annotatortype='auto') # alt = w.alternatives(folia.AnnotationType.CORRECTION) # self.assertEqual( alt[0].annotation(folia.Correction).id ,'WR-P-E-J-0000000001.p.1.s.8.w.11.correction.1' ) # self.assertEqual( alt[0].annotation(folia.Correction).original[0] ,'stippelijn' ) # self.assertEqual( alt[0].annotation(folia.Correction).new[0] ,'stippel-lijn' ) # self.assertEqual( w.annotation(folia.Correction).id ,'WR-P-E-J-0000000001.p.1.s.8.w.11.correction.2' ) # self.assertEqual( w.annotation(folia.Correction).original[0] ,'stippelijn' ) # self.assertEqual( w.annotation(folia.Correction).new[0] ,'stippellijn' ) # self.assertEqual( w.text(), 'stippellijn') class Test4Create(unittest.TestCase): def test001_create(self): """Creating a FoLiA Document from scratch""" self.doc = folia.Document(id='example') self.doc.declare(folia.AnnotationType.TOKEN, 'adhocset',annotator='proycon') self.assertEqual(self.doc.defaultset(folia.AnnotationType.TOKEN), 'adhocset') self.assertEqual(self.doc.defaultannotator(folia.AnnotationType.TOKEN, 'adhocset'), 'proycon') text = folia.Text(self.doc, id=self.doc.id + '.text.1') self.doc.append( text ) text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.1', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.1.w.1', text="De"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.2', text="site"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.3', text="staat"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.4', text="online"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.5', text=".") ] ) ) self.assertEqual( len(self.doc.index[self.doc.id + '.s.1']), 5) class Test5Correction(unittest.TestCase): def setUp(self): self.doc = folia.Document(id='example') self.doc.declare(folia.AnnotationType.TOKEN, set='adhocset',annotator='proycon') self.text = folia.Text(self.doc, id=self.doc.id + '.text.1') self.doc.append( 
self.text ) def test001_splitcorrection(self): """Correction - Split correction""" self.text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.1', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.1.w.1', text="De"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.2', text="site"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.3', text="staat"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.4', text="online"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.5', text=".") ] ) ) w = self.doc.index[self.doc.id + '.s.1.w.4'] w.split( folia.Word(self.doc, id=self.doc.id + '.s.1.w.4a', text="on"), folia.Word(self.doc, id=self.doc.id + '.s.1.w.4b', text="line") ) s = self.doc.index[self.doc.id + '.s.1'] self.assertEqual( s.words(-3).text(), 'on' ) self.assertEqual( s.words(-2).text(), 'line' ) self.assertEqual( s.text(), 'De site staat on line .' ) self.assertEqual( len(list(s.words())), 6 ) self.assertTrue( xmlcheck(s.xmlstring(), 'Desitestaatonlineonline.')) def test001_splitcorrection2(self): """Correction - Split suggestion""" self.text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.1', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.1.w.1', text="De"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.2', text="site"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.3', text="staat"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.4', text="online"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.5', text=".") ] ) ) w = self.doc.index[self.doc.id + '.s.1.w.4'] s = self.doc.index[self.doc.id + '.s.1'] w.split( folia.Word(self.doc, generate_id_in=s, text="on"), folia.Word(self.doc, generate_id_in=s, text="line"), suggest=True ) self.assertEqual( len(list(s.words())), 5 ) self.assertEqual( s.words(-2).text(), 'online' ) self.assertEqual( s.text(), 'De site staat online .' 
) self.assertTrue( xmlcheck(s.xmlstring(), 'Desitestaatonlineonline.')) def test002_mergecorrection(self): """Correction - Merge corrections""" self.text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.1', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.1.w.1', text="De"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.2', text="site"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.3', text="staat"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.4', text="on"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.5', text="line"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.6', text=".") ] ) ) s = self.doc.index[self.doc.id + '.s.1'] s.mergewords( folia.Word(self.doc, 'online', id=self.doc.id + '.s.1.w.4-5') , self.doc.index[self.doc.id + '.s.1.w.4'], self.doc.index[self.doc.id + '.s.1.w.5'] ) self.assertEqual( len(list(s.words())), 5 ) self.assertEqual( s.text(), 'De site staat online .') #incorrection() test, check if newly added word correctly reports being part of a correction w = self.doc.index[self.doc.id + '.s.1.w.4-5'] self.assertTrue( isinstance(w.incorrection(), folia.Correction) ) #incorrection return the correction the word is part of, or None if not part of a correction, self.assertTrue( xmlcheck(s.xmlstring(), 'Desitestaatonlineonline.')) def test003_deletecorrection(self): """Correction - Deletion""" self.text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.1', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.1.w.1', text="Ik"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.2', text="zie"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.3', text="een"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.4', text="groot"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.5', text="huis"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.6', text=".") ] ) ) s = self.doc.index[self.doc.id + '.s.1'] s.deleteword(self.doc.index[self.doc.id + '.s.1.w.4']) self.assertEqual( len(list(s.words())), 5 ) self.assertEqual( s.text(), 'Ik zie een huis .') self.assertTrue( xmlcheck(s.xmlstring(), 'Ikzieeengroothuis.') ) def test004_insertcorrection(self): """Correction - Insert""" self.text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.1', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.1.w.1', text="Ik"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.2', text="zie"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.3', text="een"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.4', text="huis"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.5', text=".") ] ) ) s = self.doc.index[self.doc.id + '.s.1'] s.insertword( folia.Word(self.doc, id=self.doc.id+'.s.1.w.3b',text='groot'), self.doc.index[self.doc.id + '.s.1.w.3']) self.assertEqual( len(list(s.words())), 6 ) self.assertEqual( s.text(), 'Ik zie een groot huis .') self.assertTrue( xmlcheck( s.xmlstring(), 'Ikzieeengroothuis.')) def test005_reusecorrection(self): """Correction - Re-using a correction with only suggestions""" global FOLIAEXAMPLE self.doc = folia.Document(string=FOLIAEXAMPLE) w = self.doc.index['WR-P-E-J-0000000001.p.1.s.8.w.11'] #stippelijn w.correct(suggestion='stippellijn', set='corrections',cls='spelling',annotator='testscript', annotatortype=folia.AnnotatorType.AUTO) c = w.annotation(folia.Correction) self.assertTrue( isinstance(w.annotation(folia.Correction), folia.Correction) ) self.assertEqual( w.annotation(folia.Correction).suggestions(0).text() , 'stippellijn' ) self.assertEqual( w.text(), 'stippelijn') w.correct(new='stippellijn',set='corrections',cls='spelling',annotator='John Doe', 
annotatortype=folia.AnnotatorType.MANUAL,reuse=c.id) self.assertEqual( w.text(), 'stippellijn') self.assertEqual( len(list(w.annotations(folia.Correction))), 1 ) self.assertEqual( w.annotation(folia.Correction).suggestions(0).text() , 'stippellijn' ) self.assertEqual( w.annotation(folia.Correction).suggestions(0).annotator , 'testscript' ) self.assertEqual( w.annotation(folia.Correction).suggestions(0).annotatortype , folia.AnnotatorType.AUTO) self.assertEqual( w.annotation(folia.Correction).new(0).text() , 'stippellijn' ) self.assertEqual( w.annotation(folia.Correction).annotator , 'John Doe' ) self.assertEqual( w.annotation(folia.Correction).annotatortype , folia.AnnotatorType.MANUAL) self.assertTrue( xmlcheck(w.xmlstring(), 'stippellijnstippellijnstippelijn')) def test006_deletionsuggestion(self): """Correction - Suggestion for deletion with parent merge suggestion""" self.text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.1', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.1.w.1', text="De"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.2', text="site"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.3', text="staat"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.4', text="on"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.5', text="line"), folia.Word(self.doc,id=self.doc.id + '.s.1.w.6', text=".") ]), ) self.text.append( folia.Sentence(self.doc,id=self.doc.id + '.s.2', contents=[ folia.Word(self.doc,id=self.doc.id + '.s.2.w.1', text="sinds"), folia.Word(self.doc,id=self.doc.id + '.s.2.w.2', text="vorige"), folia.Word(self.doc,id=self.doc.id + '.s.2.w.3', text="week"), folia.Word(self.doc,id=self.doc.id + '.s.2.w.4', text="zondag"), folia.Word(self.doc,id=self.doc.id + '.s.2.w.6', text=".") ]) ) s = self.doc.index[self.doc.id + '.s.1'] s2 = self.doc.index[self.doc.id + '.s.2'] w = self.doc.index[self.doc.id + '.s.1.w.6'] s.remove(w) s.append( folia.Correction(self.doc, folia.Current(self.doc, w), folia.Suggestion(self.doc, merge=s2.id)) ) self.assertTrue( xmlcheck(s.xmlstring(), 'Desitestaatonline.')) class Test6Query(unittest.TestCase): def setUp(self): global FOLIAEXAMPLE self.doc = folia.Document(string=FOLIAEXAMPLE) def test001_findwords_simple(self): """Querying - Find words (simple)""" matches = list(self.doc.findwords( folia.Pattern('van','het','alfabet') )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 3 ) self.assertEqual( matches[0][0].text(), 'van' ) self.assertEqual( matches[0][1].text(), 'het' ) self.assertEqual( matches[0][2].text(), 'alfabet' ) def test002_findwords_wildcard(self): """Querying - Find words (with wildcard)""" matches = list(self.doc.findwords( folia.Pattern('van','het',True) )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 3 ) self.assertEqual( matches[0][0].text(), 'van' ) self.assertEqual( matches[0][1].text(), 'het' ) self.assertEqual( matches[0][2].text(), 'alfabet' ) def test003_findwords_annotation(self): """Querying - Find words by annotation""" matches = list(self.doc.findwords( folia.Pattern('de','historisch','wetenschap','worden', matchannotation=folia.LemmaAnnotation) )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 4 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'historische' ) self.assertEqual( matches[0][2].text(), 'wetenschap' ) self.assertEqual( matches[0][3].text(), 'wordt' ) def test004_findwords_multi(self): """Querying - Find words using a conjunction of multiple patterns """ matches = list(self.doc.findwords( 
folia.Pattern('de','historische',True, 'wordt'), folia.Pattern('de','historisch','wetenschap','worden', matchannotation=folia.LemmaAnnotation) )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 4 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'historische' ) self.assertEqual( matches[0][2].text(), 'wetenschap' ) self.assertEqual( matches[0][3].text(), 'wordt' ) def test005_findwords_none(self): """Querying - Find words that don't exist""" matches = list(self.doc.findwords( folia.Pattern('bli','bla','blu'))) self.assertEqual( len(matches), 0) def test006_findwords_overlap(self): """Querying - Find words with overlap""" doc = folia.Document(id='test') text = folia.Text(doc, id='test.text') text.append( folia.Sentence(doc,id=doc.id + '.s.1', contents=[ folia.Word(doc,id=doc.id + '.s.1.w.1', text="a"), folia.Word(doc,id=doc.id + '.s.1.w.2', text="a"), folia.Word(doc,id=doc.id + '.s.1.w.3', text="a"), folia.Word(doc,id=doc.id + '.s.1.w.4', text="A"), folia.Word(doc,id=doc.id + '.s.1.w.5', text="b"), folia.Word(doc,id=doc.id + '.s.1.w.6', text="a"), folia.Word(doc,id=doc.id + '.s.1.w.7', text="a"), ] ) ) doc.append(text) matches = list(doc.findwords( folia.Pattern('a','a'))) self.assertEqual( len(matches), 4) matches = list(doc.findwords( folia.Pattern('a','a',casesensitive=True))) self.assertEqual( len(matches), 3) def test007_findwords_context(self): """Querying - Find words with context""" matches = list(self.doc.findwords( folia.Pattern('van','het','alfabet'), leftcontext=3, rightcontext=3 )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 9 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'laatste' ) self.assertEqual( matches[0][2].text(), 'letters' ) self.assertEqual( matches[0][3].text(), 'van' ) self.assertEqual( matches[0][4].text(), 'het' ) self.assertEqual( matches[0][5].text(), 'alfabet' ) self.assertEqual( matches[0][6].text(), 'en' ) self.assertEqual( matches[0][7].text(), 'worden' ) self.assertEqual( matches[0][8].text(), 'tussen' ) def test008_findwords_disjunction(self): """Querying - Find words with disjunctions""" matches = list(self.doc.findwords( folia.Pattern('de',('historische','hedendaagse'),'wetenschap','wordt') )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 4 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'historische' ) self.assertEqual( matches[0][2].text(), 'wetenschap' ) self.assertEqual( matches[0][3].text(), 'wordt' ) def test009_findwords_regexp(self): """Querying - Find words with regular expressions""" matches = list(self.doc.findwords( folia.Pattern('de',folia.RegExp('hist.*'),folia.RegExp('.*schap'),folia.RegExp('w[oae]rdt')) )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 4 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'historische' ) self.assertEqual( matches[0][2].text(), 'wetenschap' ) self.assertEqual( matches[0][3].text(), 'wordt' ) def test010a_findwords_variablewildcard(self): """Querying - Find words with variable wildcard""" matches = list(self.doc.findwords( folia.Pattern('de','laatste','*','alfabet') )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 6 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'laatste' ) self.assertEqual( matches[0][2].text(), 'letters' ) self.assertEqual( matches[0][3].text(), 'van' ) 
self.assertEqual( matches[0][4].text(), 'het' ) self.assertEqual( matches[0][5].text(), 'alfabet' ) def test010b_findwords_varwildoverlap(self): """Querying - Find words with variable wildcard and overlap""" doc = folia.Document(id='test') text = folia.Text(doc, id='test.text') text.append( folia.Sentence(doc,id=doc.id + '.s.1', contents=[ folia.Word(doc,id=doc.id + '.s.1.w.1', text="a"), folia.Word(doc,id=doc.id + '.s.1.w.2', text="b"), folia.Word(doc,id=doc.id + '.s.1.w.3', text="c"), folia.Word(doc,id=doc.id + '.s.1.w.4', text="d"), folia.Word(doc,id=doc.id + '.s.1.w.5', text="a"), folia.Word(doc,id=doc.id + '.s.1.w.6', text="b"), folia.Word(doc,id=doc.id + '.s.1.w.7', text="c"), ] ) ) doc.append(text) matches = list(doc.findwords( folia.Pattern('a','*', 'c'))) self.assertEqual( len(matches), 3) def test011_findwords_annotation_na(self): """Querying - Find words by non existing annotation""" matches = list(self.doc.findwords( folia.Pattern('bli','bla','blu', matchannotation=folia.SenseAnnotation) )) self.assertEqual( len(matches), 0 ) class Test9Reader(unittest.TestCase): def setUp(self): self.reader = folia.Reader(os.path.join(TMPDIR,"foliatest.xml"), folia.Word) def test000_worditer(self): """Stream reader - Iterating over words""" count = 0 for word in self.reader: count += 1 self.assertEqual(count, 192) def test001_findwords_simple(self): """Querying using stream reader - Find words (simple)""" matches = list(self.reader.findwords( folia.Pattern('van','het','alfabet') )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 3 ) self.assertEqual( matches[0][0].text(), 'van' ) self.assertEqual( matches[0][1].text(), 'het' ) self.assertEqual( matches[0][2].text(), 'alfabet' ) def test002_findwords_wildcard(self): """Querying using stream reader - Find words (with wildcard)""" matches = list(self.reader.findwords( folia.Pattern('van','het',True) )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 3 ) self.assertEqual( matches[0][0].text(), 'van' ) self.assertEqual( matches[0][1].text(), 'het' ) self.assertEqual( matches[0][2].text(), 'alfabet' ) def test003_findwords_annotation(self): """Querying using stream reader - Find words by annotation""" matches = list(self.reader.findwords( folia.Pattern('de','historisch','wetenschap','worden', matchannotation=folia.LemmaAnnotation) )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 4 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'historische' ) self.assertEqual( matches[0][2].text(), 'wetenschap' ) self.assertEqual( matches[0][3].text(), 'wordt' ) def test004_findwords_multi(self): """Querying using stream reader - Find words using a conjunction of multiple patterns """ matches = list(self.reader.findwords( folia.Pattern('de','historische',True, 'wordt'), folia.Pattern('de','historisch','wetenschap','worden', matchannotation=folia.LemmaAnnotation) )) self.assertEqual( len(matches), 1 ) self.assertEqual( len(matches[0]), 4 ) self.assertEqual( matches[0][0].text(), 'de' ) self.assertEqual( matches[0][1].text(), 'historische' ) self.assertEqual( matches[0][2].text(), 'wetenschap' ) self.assertEqual( matches[0][3].text(), 'wordt' ) def test005_findwords_none(self): """Querying using stream reader - Find words that don't exist""" matches = list(self.reader.findwords( folia.Pattern('bli','bla','blu'))) self.assertEqual( len(matches), 0) def test011_findwords_annotation_na(self): """Querying using stream reader - Find words by non existing 
annotation""" matches = list(self.reader.findwords( folia.Pattern('bli','bla','blu', matchannotation=folia.SenseAnnotation) )) self.assertEqual( len(matches), 0 ) class Test7XpathQuery(unittest.TestCase): def test050_findwords_xpath(self): """Xpath Querying - Collect all words (including non-authoritative)""" count = 0 for word in folia.Query(os.path.join(TMPDIR,'foliatest.xml'),'//f:w'): count += 1 self.assertTrue( isinstance(word, folia.Word) ) self.assertEqual(count, 192) def test051_findwords_xpath(self): """Xpath Querying - Collect all words (authoritative only)""" count = 0 for word in folia.Query(os.path.join(TMPDIR,'foliatest.xml'),'//f:w[not(ancestor-or-self::*/@auth)]'): count += 1 self.assertTrue( isinstance(word, folia.Word) ) self.assertEqual(count, 190) class Test8Validation(unittest.TestCase): def test000_relaxng(self): """Validation - RelaxNG schema generation""" folia.relaxng() def test001_shallowvalidation(self): """Validation - Shallow validation against automatically generated RelaxNG schema""" folia.validate(os.path.join(TMPDIR,'foliasavetest.xml')) def test002_loadsetdefinitions(self): """Validation - Loading of set definitions""" doc = folia.Document(file=os.path.join(TMPDIR,'foliatest.xml'), loadsetdefinitions=True) assert isinstance( doc.setdefinitions["http://raw.github.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml"], folia.SetDefinition) class Test9Validation(unittest.TestCase): def test001_deepvalidation(self): """Validation - Deep Validation""" doc = folia.Document(file=os.path.join(FOLIAPATH,'test/example.deep.xml'), deepvalidation=True, allowadhocsets=True) f = io.open(FOLIAPATH + '/test/example.xml', 'r',encoding='utf-8') FOLIAEXAMPLE = f.read() f.close() #We cheat, by setting the generator and version attributes to match the library, so xmldiff doesn't complain when we compare against this reference FOLIAEXAMPLE = re.sub(r' version="[^"]*" generator="[^"]*"', ' version="' + folia.FOLIAVERSION + '" generator="pynlpl.formats.folia-v' + folia.LIBVERSION + '"', FOLIAEXAMPLE, re.MULTILINE) #Another cheat, alien namespace attributes are ignored by the folia library, strip them so xmldiff doesn't complain FOLIAEXAMPLE = re.sub(r' xmlns:alien="[^"]*" alien:attrib="[^"]*"', '', FOLIAEXAMPLE, re.MULTILINE) DCOIEXAMPLE=""" WR-P-E-J-0000125009 Aspirine 3D model van Aspirine 2009-01-27 Europe NL/B D-Coi Dutch 162304 Unknown false Unknown GNU Free Documentation License Wikimedia Foundation (NL/B)
Aspirine 3D model van Aspirine

Aspirine is een merknaam voor een medicijn van Bayer . De werkzame stof is acetylsalicylzuur . Aspirine is ook bekend onder de naam acetosal en aspro , dat een merknaam is van Nicholas Ltd. Het werkt pijnstillend , koortsverlagend en ontstekingsremmend .

Oorspronkelijk is de werking van salicylzuur als pijnstiller ontdekt doordat het werd geïdentificeerd als de werkzame stof in wilgenbast . Het zuur zelf werd echter bijzonder slecht door de maag getolereerd . De acetyl-ester is daarin veel beter . Deze stof wordt in zuivere toestand of als het iets minder maagprikkelende calciumzout op de markt gebracht ( ascal )

De werking zelf berust erop dat Aspirine irreversibel bindt aan het enzym cyclo-oxygenase ( COX ) , waardoor dit niet meer kan helpen arachidonzuur om te zetten in prostaglandines , stoffen die de zenuwuiteinden gevoelig maken voor prikkels . De vermelde maagproblemen ontstaan door de irreversibele binding aan COX-1 , een variant van het enzym die een rolspeelt bij bescherming van de maag tegen zijn eigen zure inhoud . Ook is dit COX-1 aanwezig in bloedplaatjes , vandaar de trombocytenaggregatieremmende werking . Vandaar dat de farmaceutische industrie zich richt op de ontwikkeling van COX-2 ( induceerbaar COX ) specifieke pijnstillers .

Geschiedenis van Aspirine

De ontdekking van aspirine wordt algemeen toegeschreven aan Felix Hoffmann , werkzaam bij Bayer te Elberfeld . Uit onderzoek van de labjournaals bij Bayer blijkt echter dat de werkelijke ontdekker van aspirine Arthur Eichengrün was , die onderzoek deed naar betere pijnstillers . Felix Hoffmann werkte als laboratorium-assistent onder zijn leiding . Door zijn joodse achtergrond werd Eichengrün door de Nazis uit de annalen geschrapt en werd het verhaal van de rheumatisch vader bedacht . In 1949 publiceerde Eigengrün een artikel waarin hij de uitvinding van aspirine claimde . Deze claim werd bevestigd na onderzoek van Walter Sneader van de universiteit van Glasgow in 1999 . Salicylzuur werd al gebruikt , zelfs Hippocrates kende er de werking van , maar het was een walgelijk goedje dat erg slecht op de maag lag . Dit zuur werd in eerste instantie geëxtraheerd uit bast van leden van de plantenfamilie der wilgen ( Latijnse gelachtsnaam Salix ) , vandaar de naam salicylzuur . Hetzelfde zuur was te vinden in de Moerasspirea , vandaar de ' spir ' in aspirine . Hoffmann ging systematisch te werk en zocht hardnekkig naar een nieuwe verbinding om het middel beter verteerbaar te maken . Volgens het principe van de veredeling van bestaande geneesmiddelen , waarmee hij al eerder succes heeft geboekt , ontdekt hij in 1897 de oplossing van het probleem in de acetylering van het salicylzuur . Op 10 augustus beschrijft hij in zijn laboratoriumdagboek hoe hij het acetylsalicylzuur in chemisch zuivere en bewaarbare vorm heeft samengesteld . Nadat hij de nieuwe stof samen met dokter Heinrich Dreser uitgebreid getest heeft op dieren , komt de stof in 1899 in poedervorm op de markt . Een jaar later zijn er de gedoseerde tabletten . Het wereldverbruik wordt vandaag de dag op vijftigduizend ton of ongeveer honderd miljard tabletten per jaar geschat .

Geschiedenis van Aspro

Tijdens de 1ste Wereldoorlog loofde de Britse regering een prijs uit voor eenieder die een nieuwe formule kon vinden van aspirine , gezien het feit dat de invoer uit Duitsland stil lag . Een chemicus uit het Australische Melbourne , George Nicholas , ontdekte in 1915 een synthetische oplossing , die zelfs zuiverder was dan aspirine en oplosbaar was . Hij noemde dit Aspro , wat later de gehele wereld veroverde .

Pijnstillende werking Aspirine

Pijn wordt veroorzaakt door verschillende stoffen die vrijkomen bij beschadigingen . Werkende cellen in beschadigd weefsel geven die stoffen af , onder invloed van o.a. cytokinen en mitogenen . Deze stoffen werken dan op de zenuwuiteinden die het pijnsignaal naar de hersenen doorsturen . Een hormoon , dat daarin een belangrijke rol speelt is prostaglandine . Prostaglandine geeft niet alleen een pijnsignaal af , maar speelt een belangrijke rol in het hele lichaam . Daarom eerst wat meer over Prostaglandine . Prostaglandine wordt geproduceerd in cellen en werkt alleen in de buurt waar het geproduceerd is en wordt dan afgebroken . Het stimuleert naast de pijnreactie ook de ontstekingsreactie wanneer er een infectie is en zorgt voor de verhoging van de lichaamstemperatuur . In de cellen speelt het cyclooxygenase ( COX ) enzym een onmisbare rol in het maken van prostaglandine . Cyclooxygenase katalyseert de omzetting van arachidonzuur naar prostaglandine , een reactie die anders vrijwel niet verloopt . De aspirine voorkomt de werking van Cyclooxygenase en voorkomt daarmee de vorming van prostaglandine , waardoor een groot gedeelte van de pijn verdwijnt , en ook de koorts en de ontsteking geremd worden , omdat dat de prostaglandine deze reacties niet meer kan veroorzaken . Aspirine is dus een inhibitor , een stof die de werking van een eiwit , in dit geval die van COX , remt of stopt . Daarnaast speelt prostaglandine ook nog een rol in het normaal functioneren . De prostaglandine die wordt gemaakt door COX-1 werkt in de normale processen , als boodschapper . De prostaglandine die werkt bij beschadiging en die een rol speelt in het pijnsignaal , wordt gemaakt door COX-2 . COX-1 kan als het niet functioneert maagbloedingen e.d. veroorzaken . Er is een sinds een aantal jaren een aantal andere geneesmiddelen op de markt die selectief COX-2 remmen . Zie COX-2 remmers .

Other effects

Effect on the blood platelets

Aspirin is not only an analgesic ( painkilling agent ) , it also has other effects on our body . Aspirin has an ( irreversible ) effect on the blood platelets and prevents them from clumping together : it is a platelet aggregation inhibitor . As a result , the blood's ability to staunch bleeding when a blood vessel is damaged is reduced . The commonly used term ' blood thinner ' is incorrect : the blood does not become thinner . This effect already occurs after a quarter of an aspirin tablet and lasts until the disabled blood platelets have all been replaced ( after about a week ) . Because of this last effect , the drug is nowadays very widely prescribed to people who have previously had a stroke or heart attack ; it reduces the risk of recurrence by about 40 % .

Other

There may also be applications for aspirin in the field of cancer prevention , since it counteracts tumour formation . Taking a small dose of aspirin daily for 5 years is said to reduce the risk of tumours in the oesophagus and intestinal tract by two thirds . Aspirin also appears to have a beneficial effect against Alzheimer's disease and against pregnancy-related , intestinal and cardiovascular diseases .

Side effects

Aspirin is a fairly strong stomach irritant : if it had to be registered today as a new medicine for use as a painkiller , that would probably not succeed . Its use can cause stomach complaints and even stomach bleeding . Especially in high doses aspirin has serious side effects ; apart from the stomach bleeding already mentioned , tinnitus and deafness can also occur . It is also known that its use temporarily reduces the production of testosterone , but this side effect is neither lasting nor particularly harmful . Besides being avoided during pregnancy and not being given to babies , aspirin is preferably also not combined with alcohol , because this can increase the risk of stomach complaints .

Advice for use as a painkiller

For use as a simple painkiller , paracetamol is generally preferred from a medical point of view .

Synthesis of aspirin

Making acetylsalicylic acid ( aspirin ) on a laboratory scale involves a yield of a few grams . The preparation of aspirin can start from various starting materials ; in this description the starting substance is salicylic acid . This has the advantage that only one synthesis step needs to be carried out . Starting from salicylic acid and acetic anhydride , the salicylic acid is esterified according to the following reaction :
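The reaction scheme itself is not reproduced in this plain-text version , so the intended reaction is given here as a reconstruction based on standard acetylation chemistry ( the phosphoric acid written over the arrow is the catalyst mentioned in the next paragraph ) :

salicylic acid + acetic anhydride --( H3PO4 )--> acetylsalicylic acid + acetic acid
C7H6O3 + C4H6O3 --> C9H8O4 + CH3COOH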

As shown above the reaction arrow , this synthesis takes place in an acidic environment ; in this case concentrated phosphoric acid was chosen . After the reaction the main product ( aspirin ) has to be separated from the by-products ( acetic acid and unreacted reactants ) ; this is done by means of recrystallisation . The recrystallisation is carried out by dissolving the crude product in methanol ( in a reflux setup ) and then adding just enough water that the impurities crystallise out , but the aspirin does not . The hot mixture is then filtered , so that the impurities remain on the filter and only the pure aspirin ends up in the filtrate . After this filtration the filtrate is cooled and filtered again ; the purified aspirin now remains on the filter . The aspirin obtained can then be dried and is ready for packaging or use .
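To illustrate the " few grams " scale mentioned above , a minimal Python sketch of the theoretical-yield arithmetic ; the 3.0 g of salicylic acid used here is a hypothetical starting amount , not a figure from the text :

# Theoretical yield of aspirin from salicylic acid (1:1 molar ratio,
# salicylic acid as the limiting reagent, acetic anhydride in excess).
# The 3.0 g starting mass is a hypothetical example value.
M_SALICYLIC_ACID = 138.12  # g/mol, C7H6O3
M_ASPIRIN = 180.16         # g/mol, C9H8O4

def theoretical_yield(grams_salicylic_acid):
    """Maximum obtainable mass of acetylsalicylic acid in grams."""
    return grams_salicylic_acid / M_SALICYLIC_ACID * M_ASPIRIN

print(round(theoretical_yield(3.0), 2))  # ~3.91 g, i.e. "a few grams"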

Backmatter bli bli bla, bla bla bli
""" if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/cql.py0000644000175000001440000001101512525651367017324 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for CQL using Finite State Automata # by Maarten van Gompel, Radboud University Nijmegen # proycon AT anaproy DOT nl # # Licensed under GPLv3 #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout import sys import unittest from pynlpl.formats import cql tokens = [ { 'word': 'This', 'lemma': 'this', 'pos': 'det', }, { 'word': 'is', 'lemma': 'be', 'pos': 'v', }, { 'word': 'a', 'lemma': 'a', 'pos': 'det', }, { 'word': 'first', 'lemma': 'first', 'pos': 'a', }, { 'word': 'test', 'lemma': 'test', 'pos': 'n', }, { 'word': 'of', 'lemma': 'dit', 'pos': 'prep', }, { 'word': 'the', 'lemma': 'the', 'pos': 'det', }, { 'word': 'new', 'lemma': 'new', 'pos': 'a', }, { 'word': 'module', 'lemma': 'module', 'pos': 'n', }, { 'word': '.', 'lemma': '.', 'pos': 'punc', }, ] class Test1(unittest.TestCase): def test1(self): q = cql.Query("\"the\"") result = q(tokens) self.assertEqual(len(result),1) #one result self.assertEqual(len(result[0]),1) #result 1 consists of one word self.assertEqual(result[0][0]['word'],"the") def test2(self): q = cql.Query("[ pos = \"det\" ]") result = q(tokens) self.assertEqual(len(result),3) self.assertEqual(result[0][0]['word'],"This") self.assertEqual(result[1][0]['word'],"a") self.assertEqual(result[2][0]['word'],"the") def test3(self): q = cql.Query("[ pos = \"det\" ] [ pos = \"a\" ] [ pos = \"n\" ]") result = q(tokens) self.assertEqual(len(result),2) self.assertEqual(result[0][0]['word'],"a") self.assertEqual(result[0][1]['word'],"first") self.assertEqual(result[0][2]['word'],"test") self.assertEqual(result[1][0]['word'],"the") self.assertEqual(result[1][1]['word'],"new") self.assertEqual(result[1][2]['word'],"module") def test4(self): q = cql.Query("[ pos = \"det\" ] [ pos = \"a\" ]? [ pos = \"n\" ]") result = q(tokens) self.assertEqual(len(result),2) self.assertEqual(result[0][0]['word'],"a") self.assertEqual(result[0][1]['word'],"first") self.assertEqual(result[0][2]['word'],"test") self.assertEqual(result[1][0]['word'],"the") self.assertEqual(result[1][1]['word'],"new") self.assertEqual(result[1][2]['word'],"module") def test5(self): q = cql.Query("[ pos = \"det\" ] []? 
[ pos = \"n\" ]") result = q(tokens) self.assertEqual(len(result),2) self.assertEqual(result[0][0]['word'],"a") self.assertEqual(result[0][1]['word'],"first") self.assertEqual(result[0][2]['word'],"test") self.assertEqual(result[1][0]['word'],"the") self.assertEqual(result[1][1]['word'],"new") self.assertEqual(result[1][2]['word'],"module") def test6(self): q = cql.Query("[ pos = \"det\" ] []+ [ pos = \"n\" ]") result = q(tokens) self.assertEqual(len(result),2) self.assertEqual(result[0][0]['word'],"a") self.assertEqual(result[0][1]['word'],"first") self.assertEqual(result[0][2]['word'],"test") self.assertEqual(result[1][0]['word'],"the") self.assertEqual(result[1][1]['word'],"new") self.assertEqual(result[1][2]['word'],"module") def test7(self): q = cql.Query("[ pos = \"det\" ] []* [ pos = \"n\" ]") result = q(tokens) self.assertEqual(len(result),2) self.assertEqual(result[0][0]['word'],"a") self.assertEqual(result[0][1]['word'],"first") self.assertEqual(result[0][2]['word'],"test") self.assertEqual(result[1][0]['word'],"the") self.assertEqual(result[1][1]['word'],"new") self.assertEqual(result[1][2]['word'],"module") if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/tests/search.py0000755000175000001440000001536012445064173020016 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Test Units for Search Algorithms # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #---------------------------------------------------------------- import sys import os import unittest sys.path.append(sys.path[0] + '/../../') os.environ['PYTHONPATH'] = sys.path[0] + '/../../' from pynlpl.search import AbstractSearchState, DepthFirstSearch, BreadthFirstSearch, IterativeDeepening, HillClimbingSearch, BeamSearch class ReorderSearchState(AbstractSearchState): def __init__(self, tokens, parent = None): self.tokens = tokens super(ReorderSearchState, self).__init__(parent) def expand(self): #Operator: Swap two consecutive pairs l = len(self.tokens) for i in range(0,l - 1): newtokens = self.tokens[:i] newtokens.append(self.tokens[i + 1]) newtokens.append(self.tokens[i]) if i+2 < l: newtokens += self.tokens[i+2:] yield ReorderSearchState(newtokens, self) def __hash__(self): return hash(str(self)) def __eq__(self, other): return str(self) == str(other) def __str__(self): return " ".join(self.tokens) class InformedReorderSearchState(ReorderSearchState): def __init__(self, tokens, goal = None, parent = None): self.tokens = tokens self.goal = goal super(ReorderSearchState, self).__init__(parent) def score(self): """Compute distortion""" totaldistortion = 0 for i, token in enumerate(self.goal.tokens): tokendistortion = 9999999 for j, token2 in enumerate(self.tokens): if token == token2 and abs(i - j) < tokendistortion: tokendistortion = abs(i - j) totaldistortion += tokendistortion return totaldistortion def expand(self): #Operator: Swap two consecutive pairs l = len(self.tokens) for i in range(0,l - 1): newtokens = self.tokens[:i] newtokens.append(self.tokens[i + 1]) newtokens.append(self.tokens[i]) if i+2 < l: newtokens += self.tokens[i+2:] yield InformedReorderSearchState(newtokens, self.goal, self) inputstate = ReorderSearchState("a This test . 
sentence is".split(' ')) goalstate = ReorderSearchState("This is a test sentence .".split(' ')) class DepthFirstSearchTest(unittest.TestCase): def test_solution(self): """Depth First Search""" global inputstate, goalstate search = DepthFirstSearch(inputstate ,graph=True, goal=goalstate) solution = search.searchfirst() #print "DFS:", search.traversalsize(), "nodes visited |", self.assertEqual(solution, goalstate) class BreadthFirstSearchTest(unittest.TestCase): def test_solution(self): """Breadth First Search""" global inputstate, goalstate search = BreadthFirstSearch(inputstate ,graph=True, goal=goalstate) solution = search.searchfirst() #print "BFS:", search.traversalsize(), "nodes visited |", self.assertEqual(solution, goalstate) class IterativeDeepeningTest(unittest.TestCase): def test_solution(self): """Iterative Deepening DFS""" global inputstate, goalstate search = IterativeDeepening(inputstate ,graph=True, goal=goalstate) solution = search.searchfirst() #print "It.Deep:", search.traversalsize(), "nodes visited |", self.assertEqual(solution, goalstate) informedinputstate = InformedReorderSearchState("a This test . sentence is".split(' '), goalstate) #making a simple language model class HillClimbingTest(unittest.TestCase): def test_solution(self): """Hill Climbing""" global informedinputstate search = HillClimbingSearch(informedinputstate, graph=True, minimize=True,debug=False) solution = search.searchbest() self.assertTrue(solution) #TODO: this is not a test! class BeamSearchTest(unittest.TestCase): def test_minimizeC1(self): """Beam Search needle-in-haystack problem (beam=2, minimize)""" #beamsize has been set to the minimum that yields the correct solution global informedinputstate, solution, goalstate search = BeamSearch(informedinputstate, beamsize=2, graph=True, minimize=True,debug=0, goal=goalstate) solution = search.searchbest() self.assertEqual( str(solution), str(goalstate) ) self.assertEqual( search.solutions, 1 ) def test_minimizeA1(self): """Beam Search optimisation problem A (beam=2, minimize)""" #beamsize has been set to the minimum that yields the correct solution global informedinputstate, solution, goalstate search = BeamSearch(informedinputstate, beamsize=2, graph=True, minimize=True,debug=0) solution = search.searchbest() self.assertEqual( str(solution), str(goalstate) ) self.assertTrue( search.solutions > 1 ) #everything is a solution def test_minimizeA2(self): """Beam Search optimisation problem A (beam=100, minimize)""" #if a small beamsize works, a very large one should too global informedinputstate, solution, goalstate search = BeamSearch(informedinputstate, beamsize=100, graph=True, minimize=True,debug=0) solution = search.searchbest() self.assertEqual( str(solution), str(goalstate) ) self.assertTrue( search.solutions > 1 ) #everything is a solution #def test_minimizeA3(self): # """Beam Search optimisation problem A (eager mode, beam=2, minimize)""" # #beamsize has been set to the minimum that yields the correct solution # global informedinputstate, solution, goalstate # search = BeamSearch(informedinputstate, beamsize=50, graph=True, minimize=True,eager=True,debug=2) # solution = search.searchbest() # self.assertEqual( str(solution), str(goalstate) ) def test_minimizeB1(self): """Beam Search optimisation problem (longer) (beam=3, minimize)""" #beamsize has been set to the minimum that yields the correct solution goalstate = InformedReorderSearchState("This is supposed to be a very long sentence .".split(' ')) informedinputstate = 
InformedReorderSearchState("a long very . sentence supposed to be This is".split(' '), goalstate) search = BeamSearch(informedinputstate, beamsize=3, graph=True, minimize=True,debug=False) solution = search.searchbest() self.assertEqual(str(solution),str(goalstate)) if __name__ == '__main__': unittest.main() PyNLPl-1.1.2/pynlpl/formats/0000755000175000001440000000000013024723552016475 5ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/formats/taggerdata.py0000644000175000001440000001150412445064173021156 0ustar proyconusers00000000000000#-*- coding:utf-8 -*- ############################################################### # PyNLPl - Read tagger data # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import io class Taggerdata(object): def __init__(self,filename, encoding = 'utf-8', mode ='r'): self.filename = filename self.encoding = encoding assert (mode == 'r' or mode == 'w') self.mode = mode self.reset() self.firstiter = True self.indexed = False self.writeindex = 0 def __iter__(self): words = [] lemmas = [] postags = [] for line in self.f: line = line.strip() if self.firstiter: self.indexed = (line == "#0") self.firstiter = False if not line and not self.indexed: yield (words, lemmas, postags) words = [] lemmas = [] postags = [] elif self.indexed and len(line) > 1 and line[0] == '#' and line[1:].isdigit(): if line != "#0": yield (words, lemmas, postags) words = [] lemmas = [] postags = [] elif line: try: word, lemma, pos = line.split("\t") except: word = lemma = pos = "NONE" if word == "NONE": word = None if lemma == "NONE": lemma = None if pos == "NONE": pos = None words.append(word) lemmas.append(lemma) postags.append(pos) if words: yield (words, lemmas, postags) def next(self): words = [] lemmas = [] postags = [] while True: try: line = self.f.next().strip() except StopIteration: if words: return (words, lemmas, postags) else: raise if self.firstiter: self.indexed = (line == "#0") self.firstiter = False if not line and not self.indexed: return (words, lemmas, postags) elif self.indexed and len(line) > 1 and line[0] == '#' and line[1:].isdigit(): if line != "#0": return (words, lemmas, postags) elif line: try: word, lemma, pos = line.split("\t") except: word = lemma = pos = "NONE" if word == "NONE": word = None if lemma == "NONE": lemma = None if pos == "NONE": pos = None words.append(word) lemmas.append(lemma) postags.append(pos) def align(self, referencewords, datatuple): """align the reference sentence with the tagged data""" targetwords = [] for i, (word,lemma,postag) in enumerate(zip(datatuple[0],datatuple[1],datatuple[2])): if word: subwords = word.split("_") for w in subwords: #split multiword expressions targetwords.append( (w, lemma, postag, i, len(subwords) > 1 ) ) #word, lemma, pos, index, multiword? 
referencewords = [ w.lower() for w in referencewords ] alignment = [] for i, referenceword in enumerate(referencewords): found = False best = 0 distance = 999999 for j, (targetword, lemma, pos, index, multiword) in enumerate(targetwords): if referenceword == targetword and abs(i-j) < distance: found = True best = j distance = abs(i-j) if found: alignment.append(targetwords[best]) else: alignment.append((None,None,None,None,False)) #no alignment found return alignment def reset(self): self.f = io.open(self.filename,self.mode, encoding=self.encoding) def write(self, sentence): self.f.write("#" + str(self.writeindex)+"\n") for word, lemma, pos in sentence: if not word: word = "NONE" if not lemma: lemma = "NONE" if not pos: pos = "NONE" self.f.write( word + "\t" + lemma + "\t" + pos + "\n" ) self.writeindex += 1 def close(self): self.f.close() PyNLPl-1.1.2/pynlpl/formats/moses.py0000644000175000001440000001731212445064173020204 0ustar proyconusers00000000000000############################################################### # PyNLPl - Moses formats # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # # This is a Python library classes and functions for # reading file-formats produced by Moses. Currently # contains only a class for reading a Moses PhraseTable. # (migrated to pynlpl from pbmbmt) # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import sys import bz2 import gzip import datetime import socket import io try: from twisted.internet import protocol, reactor #No Python 3 support yet :( from twisted.protocols import basic twistedimported = True except: print("WARNING: Twisted could not be imported",file=sys.stderr) twistedimported = False class PhraseTable(object): def __init__(self,filename, quiet=False, reverse=False, delimiter="|||", score_column = 3, max_sourcen = 0,sourceencoder=None, targetencoder=None, scorefilter=None): """Load a phrase table from file into memory (memory intensive!)""" self.phrasetable = {} self.sourceencoder = sourceencoder self.targetencoder = targetencoder if filename.split(".")[-1] == "bz2": f = bz2.BZ2File(filename,'r') elif filename.split(".")[-1] == "gz": f = gzip.GzipFile(filename,'r') else: f = io.open(filename,'r',encoding='utf-8') linenum = 0 prevsource = None targets = [] while True: if not quiet: linenum += 1 if (linenum % 100000) == 0: print("Loading phrase-table: @%d" % linenum, "\t(" + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + ")",file=sys.stderr) line = u(f.readline()) if not line: break #split into (trimmed) segments segments = [ segment.strip() for segment in line.split(delimiter) ] if len(segments) < 3: print("Invalid line: ", line, file=sys.stderr) continue #Do we have a score associated? 
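# Illustrative note (not in the original source): with the default
# delimiter "|||" and score_column=3, a typical Moses phrase-table line
# looks like
#
#   das kleine haus ||| the small house ||| 0.6 0.05 0.55 0.02
#
# The line is split on the delimiter into trimmed segments; the
# whitespace-separated numbers in the third segment are converted to floats
# below and stored as the scores tuple for this (source, target) pair.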
if score_column > 0 and len(segments) >= score_column: scores = tuple( ( float(x) for x in segments[score_column-1].strip().split() ) ) else: scores = tuple() #if align2_column > 0: # try: # null_alignments = segments[align2_column].count("()") # except: # null_alignments = 0 #else: # null_alignments = 0 if scorefilter: if not scorefilter(scores): continue if reverse: if max_sourcen > 0 and segments[1].count(' ') + 1 > max_sourcen: continue if self.sourceencoder: source = self.sourceencoder(segments[1]) #tuple(segments[1].split(" ")) else: source = segments[1] if self.targetencoder: target = self.targetencoder(segments[0]) #tuple(segments[0].split(" ")) else: target = segments[0] else: if max_sourcen > 0 and segments[0].count(' ') + 1 > max_sourcen: continue if self.sourceencoder: source = self.sourceencoder(segments[0]) #tuple(segments[0].split(" ")) else: source = segments[0] if self.targetencoder: target = self.targetencoder(segments[1]) #tuple(segments[1].split(" ")) else: target = segments[1] if prevsource and source != prevsource and targets: self.phrasetable[prevsource] = tuple(targets) targets = [] targets.append( (target,scores) ) prevsource = source #don't forget last one: if prevsource and targets: self.phrasetable[prevsource] = tuple(targets) f.close() def __contains__(self, phrase): """Query if a certain phrase exist in the phrase table""" if self.sourceencoder: phrase = self.sourceencoder(phrase) return (phrase in self.phrasetable) #d = self.phrasetable #for word in phrase: # if not word in d: # return False # d = d[word #return ("" in d) def __iter__(self): for phrase, targets in self.phrasetable.items(): yield phrase, targets def __len__(self): return len(self.phrasetable) def __bool__(self): return bool(self.phrasetable) def __getitem__(self, phrase): #same as translations """Return a list of (translation, scores) tuples""" if self.sourceencoder: phrase = self.sourceencoder(phrase) return self.phrasetable[phrase] #d = self.phrasetable #for word in phrase: # if not word in d: # raise KeyError # d = d[word] #if "" in d: # return d[""] #else: # raise KeyError if twistedimported: class PTProtocol(basic.LineReceiver): def lineReceived(self, phrase): try: for target,Pst,Pts,null_alignments in self.factory.phrasetable[phrase]: self.sendLine(target+"\t"+str(Pst)+"\t"+str(Pts)+"\t"+str(null_alignments)) except KeyError: self.sendLine("NOTFOUND") class PTFactory(protocol.ServerFactory): protocol = PTProtocol def __init__(self, phrasetable): self.phrasetable = phrasetable class PhraseTableServer(object): def __init__(self, phrasetable, port=65432): reactor.listenTCP(port, PTFactory(phrasetable)) reactor.run() class PhraseTableClient(object): def __init__(self,host= "localhost",port=65432): self.BUFSIZE = 4048 self.socket = socket.socket(socket.AF_INET,socket.SOCK_STREAM) #Create the socket self.socket.settimeout(120) self.socket.connect((host, port)) #Connect to server self.lastresponse = "" self.lastquery = "" def __getitem__(self, phrase): solutions = [] if phrase != self.lastquery: self.socket.send(phrase+ "\r\n") data = b"" while not data or data[-1] != '\n': data += self.socket.recv(self.BUFSIZE) else: data = self.lastresponse data = u(data) for line in data.split('\n'): line = line.strip('\r\n') if line == "NOTFOUND": raise KeyError(phrase) elif line: fields = tuple(line.split("\t")) if len(fields) == 4: solutions.append( fields ) else: print >>sys.stderr,"PHRASETABLECLIENT WARNING: Unable to parse response line" self.lastresponse = data self.lastquery = phrase return solutions 
def __contains__(self, phrase): self.socket.send(phrase.encode('utf-8')+ b"\r\n")\ data = b"" while not data or data[-1] != '\n': data += self.socket.recv(self.BUFSIZE) data = u(data) for line in data.split('\n'): line = line.strip('\r\n') if line == "NOTFOUND": return False self.lastresponse = data self.lastquery = phrase return True PyNLPl-1.1.2/pynlpl/formats/dutchsemcor.py0000644000175000001440000002020112445064173021365 0ustar proyconusers00000000000000#-*- coding:utf-8 -*- ############################################################### # PyNLPl - DutchSemCor # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # # Modified by Ruben Izquierdo # We need also to store the TIMBL distance to the nearest neighboor # # Collection of formats for the DutchSemCor project # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout from pynlpl.formats.timbl import TimblOutput from pynlpl.statistics import Distribution import io class WSDSystemOutput(object): def __init__(self, filename = None): self.data = {} self.distances={} self.maxDistance=1 if filename: self.load(filename) def append(self, word_id, senses,distance=0): # Commented by Ruben, there are some ID's that are repeated in all sonar test files... #assert (not word_id in self.data) if isinstance(senses, Distribution): self.data[word_id] = ( (x,y) for x,y in senses ) #PATCH UNDONE (#TODO: this is a patch, something's not right in Distribution?) self.distances[word_id]=distance if distance > self.maxDistance: self.maxDistance=distance return else: assert isinstance(senses, list) and len(senses) >= 1 self.distances[word_id]=distance if distance > self.maxDistance: self.maxDistance=distance if len(senses[0]) == 1: #not a (sense_id, confidence) tuple! 
compute equal confidence for all elements automatically: confidence = 1 / float(len(senses)) self.data[word_id] = [ (x,confidence) for x in senses ] else: fulldistr = True for sense, confidence in senses: if confidence == None: fulldistr = False break if fulldistr: self.data[word_id] = Distribution(senses) else: self.data[word_id] = senses def getMaxDistance(self): return self.maxDistance def __iter__(self): for word_id, senses in self.data.items(): yield word_id, senses,self.distances[word_id] def __len__(self): return len(self.data) def __getitem__(self, word_id): """Returns the sense distribution for the given word_id""" return self.data[word_id] def load(self, filename): f = io.open(filename,'r',encoding='utf-8') for line in f: fields = line.strip().split(" ") word_id = fields[0] if len(fields[1:]) == 1: #only one sense, no confidence expressed: self.append(word_id, [(fields[1],None)]) else: senses = [] distance=-1 for i in range(1,len(fields),2): if i+1==len(fields): #The last field is the distance if fields[i][:4]=='+vdi': #Support for previous format of wsdout distance=float(fields[i][4:]) else: distance=float(fields[i]) else: if fields[i+1] == '?': fields[i+1] = None senses.append( (fields[i], fields[i+1]) ) self.append(word_id, senses,distance) f.close() def save(self, filename): f = io.open(filename,'w',encoding='utf-8') for word_id, senses,distance in self: f.write(word_id) for sense, confidence in senses: if confidence == None: confidence = "?" f.write(" " + str(sense) + " " + str(confidence)) if word_id in self.distances.keys(): f.write(' '+str(self.distances[word_id])) f.write("\n") f.close() def out(self, filename): for word_id, senses,distance in self: print(word_id,distance,end="") for sense, confidence in senses: if confidence == None: confidence = "?" print(" " + sense + " " + str(confidence),end="") print() def senses(self, bestonly=False): """Returns a list of all predicted senses""" l = [] for word_id, senses,distance in self: for sense, confidence in senses: if not sense in l: l.append(sense) if bestonly: break return l def loadfromtimbl(self, filename): timbloutput = TimblOutput(io.open(filename,'r',encoding='utf-8')) for i, (features, referenceclass, predictedclass, distribution, distance) in enumerate(timbloutput): if distance != None: #distance='+vdi'+str(distance) distance=float(distance) if len(features) == 0: print("WARNING: Empty feature vector in " + filename + " (line " + str(i+1) + ") skipping!!",file=stderr) continue word_id = features[0] #note: this is an assumption that must be adhered to! if distribution: self.append(word_id, distribution,distance) def fromTimblToWsdout(self,fileTimbl,fileWsdout): timbloutput = TimblOutput(io.open(fileTimbl,'r',encoding='utf-8')) wsdoutfile = io.open(fileWsdout,'w',encoding='utf-8') for i, (features, referenceclass, predictedclass, distribution, distance) in enumerate(timbloutput): if len(features) == 0: print("WARNING: Empty feature vector in " + fileTimbl + " (line " + str(i+1) + ") skipping!!",file=stderr) continue word_id = features[0] #note: this is an assumption that must be adhered to! if distribution: wsdoutfile.write(word_id+' ') for sense, confidence in distribution: if confidence== None: confidence='?' 
wsdoutfile.write(sense+' '+str(confidence)+' ') wsdoutfile.write(str(distance)+'\n') wsdoutfile.close() class DataSet(object): #for testsets/trainingsets def __init__(self, filename): self.sense = {} #word_id => (sense_id, lemma,pos) self.targetwords = {} #(lemma,pos) => [sense_id] f = io.open(filename,'r',encoding='utf-8') for line in f: if len(line) > 0 and line[0] != '#': fields = line.strip('\n').split('\t') word_id = fields[0] sense_id = fields[1] lemma = fields[2] pos = fields[3] self.sense[word_id] = (sense_id, lemma, pos) if not (lemma,pos) in self.targetwords: self.targetwords[(lemma,pos)] = [] if not sense_id in self.targetwords[(lemma,pos)]: self.targetwords[(lemma,pos)].append(sense_id) f.close() def __getitem__(self, word_id): return self.sense[self._sanitize(word_id)] def getsense(self, word_id): return self.sense[self._sanitize(word_id)][0] def getlemma(self, word_id): return self.sense[self._sanitize(word_id)][1] def getpos(self, word_id): return self.sense[self._sanitize(word_id)][2] def _sanitize(self, word_id): return u(word_id) def __contains__(self, word_id): return (self._sanitize(word_id) in self.sense) def __iter__(self): for word_id, (sense, lemma, pos) in self.sense.items(): yield (word_id, sense, lemma, pos) def senses(self, lemma, pos): return self.targetwords[(lemma,pos)] PyNLPl-1.1.2/pynlpl/formats/foliaset.py0000644000175000001440000005162413024723325020663 0ustar proyconusers00000000000000# -*- coding: utf-8 -*- #---------------------------------------------------------------- # PyNLPl - FoLiA Set Definition Module # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # # https://proycon.github.io/folia # httsp://github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Module for reading, editing and writing FoLiA XML # # Licensed under GPLv3 # #---------------------------------------------------------------- #pylint: disable=redefined-builtin,trailing-whitespace,superfluous-parens,bad-classmethod-argument,wrong-import-order,wrong-import-position,ungrouped-imports from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys import io import rdflib from lxml import etree as ElementTree if sys.version < '3': from StringIO import StringIO #pylint: disable=import-error,wrong-import-order from urllib import urlopen #pylint: disable=no-name-in-module,wrong-import-order from urllib2 import HTTPError else: from io import StringIO, BytesIO #pylint: disable=wrong-import-order,ungrouped-imports from urllib.request import urlopen #pylint: disable=E0611,wrong-import-order,ungrouped-imports from urllib.error import HTTPError #foliaspec:namespace:NSFOLIA #The FoLiA XML namespace NSFOLIA = "http://ilk.uvt.nl/folia" #foliaspec:setdefinitionnamespace:NSFOLIASETDEFINITION NSFOLIASETDEFINITION = "http://folia.science.ru.nl/setdefinition" NSSKOS = "http://www.w3.org/2004/02/skos/core" class DeepValidationError(Exception): pass class SetDefinitionError(DeepValidationError): pass class SetType: #legacy only CLOSED, OPEN, MIXED, EMPTY = range(4) class LegacyClassDefinition(object): def __init__(self,id, label, subclasses=None): self.id = id self.label = label if subclasses: self.subclasses = subclasses else: self.subclasses = [] @classmethod def parsexml(Class, node): if not node.tag == '{' + NSFOLIA + '}class': raise Exception("Expected class tag for this xml node, got" + node.tag) if 'label' in node.attrib: label = node.attrib['label'] else: label = "" 
subclasses= [] for subnode in node: if isinstance(subnode.tag, str) or (sys.version < '3' and isinstance(subnode.tag, unicode)): #pylint: disable=undefined-variable if subnode.tag == '{' + NSFOLIA + '}class': subclasses.append( LegacyClassDefinition.parsexml(subnode) ) elif subnode.tag[:len(NSFOLIA) +2] == '{' + NSFOLIA + '}': raise Exception("Invalid tag in Class definition: " + subnode.tag) if '{http://www.w3.org/XML/1998/namespace}id' in node.attrib: idkey = '{http://www.w3.org/XML/1998/namespace}id' else: idkey = 'id' return LegacyClassDefinition(node.attrib[idkey],label, subclasses) def __iter__(self): for c in self.subclasses: yield c def json(self): jsonnode = {'id': self.id, 'label': self.label} jsonnode['subclasses'] = [] for subclass in self.subclasses: jsonnode['subclasses'].append(subclass.json()) return jsonnode def rdf(self,graph, basens,parentseturi, parentclass=None, seqnr=None): graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.RDF.type, rdflib.term.URIRef(NSSKOS + '#Concept'))) graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSSKOS + '#notation'), rdflib.term.Literal(self.id))) graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSSKOS + '#prefLabel'), rdflib.term.Literal(self.label))) graph.add((parentseturi , rdflib.term.URIRef(NSSKOS + '#member'), rdflib.term.URIRef(basens + '#' + self.id))) if seqnr is not None: graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSFOLIASETDEFINITION + '#sequenceNumber'), rdflib.term.Literal(seqnr) )) if parentclass: graph.add((rdflib.term.URIRef(basens + '#' + self.id), rdflib.term.URIRef(NSSKOS + '#broader'), rdflib.term.URIRef(basens + '#' + parentclass) )) for subclass in self.subclasses: subclass.rdf(graph,basens,parentseturi, self.id) class LegacySetDefinition(object): def __init__(self, id, type, classes = None, subsets = None, label =None): self.id = id self.type = type self.label = label if classes: self.classes = classes else: self.classes = [] if subsets: self.subsets = subsets else: self.subsets = [] @classmethod def parsexml(Class, node): issubset = node.tag == '{' + NSFOLIA + '}subset' if not issubset: assert node.tag == '{' + NSFOLIA + '}set' classes = [] subsets= [] if 'type' in node.attrib: if node.attrib['type'] == 'open': type = SetType.OPEN elif node.attrib['type'] == 'closed': type = SetType.CLOSED elif node.attrib['type'] == 'mixed': type = SetType.MIXED elif node.attrib['type'] == 'empty': type = SetType.EMPTY else: raise Exception("Invalid set type: ", type) else: type = SetType.CLOSED if 'label' in node.attrib: label = node.attrib['label'] else: label = None for subnode in node: if isinstance(subnode.tag, str) or (sys.version < '3' and isinstance(subnode.tag, unicode)): #pylint: disable=undefined-variable if subnode.tag == '{' + NSFOLIA + '}class': classes.append( LegacyClassDefinition.parsexml(subnode) ) elif not issubset and subnode.tag == '{' + NSFOLIA + '}subset': subsets.append( LegacySetDefinition.parsexml(subnode) ) elif subnode.tag == '{' + NSFOLIA + '}constraint': pass elif subnode.tag[:len(NSFOLIA) +2] == '{' + NSFOLIA + '}': raise SetDefinitionError("Invalid tag in Set definition: " + subnode.tag) return LegacySetDefinition(node.attrib['{http://www.w3.org/XML/1998/namespace}id'],type,classes, subsets, label) def json(self): jsonnode = {'id': self.id} if self.label: jsonnode['label'] = self.label if self.type == SetType.OPEN: jsonnode['type'] = 'open' elif self.type == SetType.CLOSED: jsonnode['type'] = 'closed' elif 
self.type == SetType.MIXED: jsonnode['type'] = 'mixed' elif self.type == SetType.EMPTY: jsonnode['type'] = 'empty' jsonnode['subsets'] = {} for subset in self.subsets: jsonnode['subsets'][subset.id] = subset.json() jsonnode['classes'] = {} jsonnode['classorder'] = [] for c in sorted(self.classes, key=lambda x: x.label): jsonnode['classes'][c.id] = c.json() jsonnode['classorder'].append( c.id ) return jsonnode def rdf(self,graph, basens="",parenturi=None): if not basens: basens = NSFOLIASETDEFINITION + "/" + self.id if not parenturi: graph.bind( self.id, basens + '#', override=True ) #set a prefix for our namespace (does not use @base because of issue RDFLib/rdflib#559 ) seturi = rdflib.term.URIRef(basens + '#Set') else: seturi = rdflib.term.URIRef(basens + '#Subset.' + self.id) graph.add((seturi, rdflib.RDF.type, rdflib.term.URIRef(NSSKOS + '#Collection'))) if self.id: graph.add((seturi, rdflib.term.URIRef(NSSKOS + '#notation'), rdflib.term.Literal(self.id))) if self.type == SetType.OPEN: graph.add((seturi, rdflib.term.URIRef(NSFOLIASETDEFINITION + '#open'), rdflib.term.Literal(True))) elif self.type == SetType.EMPTY: graph.add((seturi, rdflib.term.URIRef(NSFOLIASETDEFINITION + '#empty'), rdflib.term.Literal(True))) if self.label: graph.add((seturi, rdflib.term.URIRef(NSSKOS + '#prefLabel'), rdflib.term.Literal(self.label))) if parenturi: graph.add((parenturi, rdflib.term.URIRef(NSSKOS + '#member'), seturi)) for i, c in enumerate(self.classes): c.rdf(graph, basens, seturi, None, i+1) for s in self.subsets: s.rdf(graph, basens, seturi) def xmltreefromstring(s): """Internal function, deals with different Python versions, unicode strings versus bytes, and with the leak bug in lxml""" if sys.version < '3': #Python 2 if isinstance(s,unicode): #pylint: disable=undefined-variable s = s.encode('utf-8') try: return ElementTree.parse(StringIO(s), ElementTree.XMLParser(collect_ids=False)) except TypeError: return ElementTree.parse(StringIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!! else: #Python 3 if isinstance(s,str): s = s.encode('utf-8') try: return ElementTree.parse(BytesIO(s), ElementTree.XMLParser(collect_ids=False)) except TypeError: return ElementTree.parse(BytesIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!! 
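# Minimal usage sketch for the SetDefinition class defined below (illustrative
# only, not part of the original module; the URL is a placeholder):
#
#   setdef = SetDefinition("https://example.org/mysetdefinition.ttl")
#   info = setdef.mainset()          # dict with 'uri', 'id', 'label', 'open', 'empty'
#   for classinfo in setdef:         # classes in presentation order
#       print(classinfo['id'], classinfo['label'])
#   uri = setdef.testclass("someclass")  # full class URI, or DeepValidationError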
class SetDefinition(object): def __init__(self, url, format=None, basens="",verbose=False): self.graph = rdflib.Graph() self.basens = basens self.mainsetcache = {} self.subsetcache = {} self.set_id_uri_cache = {} self.verbose = verbose self.graph.bind( 'fsd', NSFOLIASETDEFINITION+'#', override=True) self.graph.bind( 'skos', NSSKOS+'#', override=True) if not format: #try to guess format from URL if url.endswith('.ttl'): format = 'text/turtle' elif url.endswith('.n3'): format = 'text/n3' elif url.endswith('.rdf.xml') or url.endswith('.rdf'): format = 'application/rdf+xml' elif url.endswith('.xml'): #other XML will be considered legacy format = 'application/foliaset+xml' #legacy if format in ('application/foliaset+xml','legacy',None): #legacy format, has some checks and fallbacks if the format turns out to be RDF anyway self.legacyset = None if url[0] == '/' or url[0] == '.': #local file f = io.open(url,'r',encoding='utf-8') else: #remote URL if not self.basens: self.basens = url try: f = urlopen(url) except: raise DeepValidationError("Unable to download " + url) try: data = f.read() except IOError: raise DeepValidationError("Unable to download " + url) finally: f.close() if data[0] in ('@',b'@',64): #this is not gonna be valid XML, but looks like turtle/n3 RDF self.graph.parse(location=url, format='text/turtle') if self.verbose: print("Loaded set " + url + " (" + str(len(self.graph)) + " triples)",file=sys.stderr) return tree = xmltreefromstring(data) root = tree.getroot() if root.tag != '{' + NSFOLIA + '}set': if root.tag.lower().find('rdf') != 1: #well, this is RDF after all... self.graph.parse(location=url, format='rdf') return else: raise SetDefinitionError("Not a FoLiA Set Definition! Unexpected root tag:"+ root.tag) legacyset = LegacySetDefinition.parsexml(root) legacyset.rdf(self.graph, self.basens) if self.verbose: print("Loaded legacy set " + url + " (" + str(len(self.graph)) + " triples)",file=sys.stderr) else: try: self.graph.parse(location=url, format=format) except HTTPError: raise DeepValidationError("Unable to download " + url) if self.verbose: print("Loaded set " + url + " (" + str(len(self.graph)) + " triples)",file=sys.stderr) def testclass(self,cls): """Test for the presence of the class, returns the full URI or raises an exception""" mainsetinfo = self.mainset() if mainsetinfo['open']: return cls #everything is okay elif mainsetinfo['empty']: if cls: raise DeepValidationError("Expected an empty class, got \"" + cls + "\"") else: if not cls: raise DeepValidationError("No class specified") #closed set set_uri = mainsetinfo['uri'] for row in self.graph.query("SELECT ?c WHERE { ?c rdf:type skos:Concept ; skos:notation \"" + cls + "\". <" + str(set_uri) + "> skos:member ?c }"): return str(row.c) raise DeepValidationError("Not a valid class: " + cls) def testsubclass(self, cls, subset, subclass): """Test for the presence of a class in a subset (used with features), returns the full URI or raises an exception""" subsetinfo = self.subset(subset) if subsetinfo['open']: return subclass #everything is okay else: subset_uri = subsetinfo['uri'] if not subset_uri: raise DeepValidationError("Not a valid subset: " + subset) query = "SELECT ?c WHERE { ?c rdf:type skos:Concept ; skos:notation \"" + subclass + "\" . 
<" + str(subset_uri) + "> skos:member ?c }" for row in self.graph.query(query): return str(row.c) raise DeepValidationError("Not a valid class in subset " + subset + ": " + subclass) def get_set_uri(self, set_id=None): if set_id in self.set_id_uri_cache: return self.set_id_uri_cache[set_id] if set_id: for row in self.graph.query("SELECT ?s WHERE { ?s rdf:type skos:Collection ; skos:notation \"" + set_id + "\" }"): self.set_id_uri_cache[set_id] = row.s return row.s raise DeepValidationError("No such set: " + str(set_id)) else: for row in self.graph.query("SELECT ?s WHERE { ?s rdf:type skos:Collection . FILTER NOT EXISTS { ?y rdf:type skos:Collection . ?y skos:member ?s } }"): self.set_id_uri_cache[set_id] = row.s return row.s raise DeepValidationError("Main set not found") def mainset(self): """Returns information regarding the set""" if self.mainsetcache: return self.mainsetcache set_uri = self.get_set_uri() for row in self.graph.query("SELECT ?seturi ?setid ?setlabel ?setopen ?setempty WHERE { ?seturi rdf:type skos:Collection . OPTIONAL { ?seturi skos:notation ?setid } OPTIONAL { ?seturi skos:prefLabel ?setlabel } OPTIONAL { ?seturi fsd:open ?setopen } OPTIONAL { ?seturi fsd:empty ?setempty } FILTER NOT EXISTS { ?y skos:member ?seturi . ?y rdf:type skos:Collection } }"): self.mainsetcache = {'uri': str(row.seturi), 'id': str(row.setid), 'label': str(row.setlabel) if row.setlabel else "", 'open': bool(row.setopen), 'empty': bool(row.setempty) } return self.mainsetcache raise DeepValidationError("Unable to find main set (set_uri=" + str(set_uri)+"), this should not happen") def subset(self, subset_id): """Returns information regarding the set""" if subset_id in self.subsetcache: return self.subsetcache[subset_id] set_uri = self.get_set_uri(subset_id) for row in self.graph.query("SELECT ?seturi ?setid ?setlabel ?setopen WHERE { ?seturi rdf:type skos:Collection . OPTIONAL { ?seturi skos:notation ?setid } OPTIONAL { ?seturi skos:prefLabel ?setlabel } OPTIONAL { ?seturi fsd:open ?setopen } FILTER (?seturi = <" + str(set_uri)+">) }"): self.subsetcache[str(row.setid)] = {'uri': str(row.seturi), 'id': str(row.setid), 'label': str(row.setlabel) if row.setlabel else "", 'open': bool(row.setopen) } return self.subsetcache[str(row.setid)] raise DeepValidationError("Unable to find subset (set_uri=" + str(set_uri)+")") def orderedclasses(self, set_uri_or_id=None, nestedhierarchy=False): """Higher-order generator function that yields class information in the right order, combines calls to :meth:`SetDefinition.classes` and :meth:`SetDefinition.classorder`""" classes = self.classes(set_uri_or_id, nestedhierarchy) for classid in self.classorder(classes): yield classes[classid] def __iter__(self): """Alias for :meth:`SetDefinition.orderedclasses`""" return self.orderedclasses() def classes(self, set_uri_or_id=None, nestedhierarchy=False): """Returns a dictionary of classes for the specified (sub)set (if None, default, the main set is selected)""" if set_uri_or_id and set_uri_or_id.startswith(('http://','https://')): set_uri = set_uri_or_id else: set_uri = self.get_set_uri(set_uri_or_id) assert set_uri is not None classes= {} uri2idmap = {} for row in self.graph.query("SELECT ?classuri ?classid ?classlabel ?parentclass ?seqnr WHERE { ?classuri rdf:type skos:Concept ; skos:notation ?classid. <" + str(set_uri) + "> skos:member ?classuri . 
OPTIONAL { ?classuri skos:prefLabel ?classlabel } OPTIONAL { ?classuri skos:broader ?parentclass } OPTIONAL { ?classuri fsd:sequenceNumber ?seqnr } }"): classinfo = {'uri': str(row.classuri), 'id': str(row.classid),'label': str(row.classlabel) if row.classlabel else "" } if nestedhierarchy: uri2idmap[str(row.classuri)] = str(row.classid) if row.parentclass: classinfo['parentclass'] = str(row.parentclass) #uri if row.seqnr: classinfo['seqnr'] = int(row.seqnr) classes[str(row.classid)] = classinfo if nestedhierarchy: #build hierarchy removekeys = [] for classid, classinfo in classes.items(): if 'parentclass' in classinfo: removekeys.append(classid) parentclassid = uri2idmap[classinfo['parentclass']] if 'subclasses' not in classes[parentclassid]: classes[parentclassid]['subclasses'] = {} classes[parentclassid]['subclasses'][classid] = classinfo for key in removekeys: del classes[key] return classes def classorder(self,classes): """Return a list of class IDs in order for presentational purposes: order is determined first and foremost by explicit ordering, else alphabetically by label or as a last resort by class ID""" return [ classid for classid, classitem in sorted( ((classid, classitem) for classid, classitem in classes.items() if 'seqnr' in classitem) , key=lambda pair: pair[1]['seqnr'] )] + \ [ classid for classid, classitem in sorted( ((classid, classitem) for classid, classitem in classes.items() if 'seqnr' not in classitem) , key=lambda pair: pair[1]['label'] if 'label' in pair[1] else pair[1]['id']) ] def subsets(self, set_uri_or_id=None): if set_uri_or_id and set_uri_or_id.startswith(('http://', 'https://')): set_uri = set_uri_or_id else: set_uri = self.get_set_uri(set_uri_or_id) assert set_uri is not None for row in self.graph.query("SELECT ?seturi ?setid ?setlabel ?setopen WHERE { ?seturi rdf:type skos:Collection . <" + str(set_uri) + "> skos:member ?seturi . OPTIONAL { ?seturi skos:notation ?setid } OPTIONAL { ?seturi skos:prefLabel ?setlabel } OPTIONAL { ?seturi fsd:open ?setopen } }"): yield {'uri': str(row.seturi), 'id': str(row.setid), 'label': str(row.setlabel) if row.setlabel else "", 'open': bool(row.setopen) } def json(self): data = {'subsets': {}} setinfo = self.mainset() #backward compatibility, set type: if setinfo['open']: setinfo['type'] = 'open' else: setinfo['type'] = 'closed' data.update(setinfo) classes = self.classes() data['classes'] = classes data['classorder'] = self.classorder(classes) for subsetinfo in self.subsets(): #backward compatibility, set type: if subsetinfo['open']: subsetinfo['type'] = 'open' else: subsetinfo['type'] = 'closed' data['subsets'][subsetinfo['id']] = subsetinfo classes = self.classes(subsetinfo['uri']) data['subsets'][subsetinfo['id']]['classes'] = classes data['subsets'][subsetinfo['id']]['classorder'] = self.classorder(classes) return data PyNLPl-1.1.2/pynlpl/formats/giza.py0000644000175000001440000002446312466637746020034 0ustar proyconusers00000000000000# -*- coding: utf-8 -*- ############################################################### # PyNLPl - WordAlignment Library for reading GIZA++ A3 files # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # In part using code by Sander Canisius # # Licensed under GPLv3 # # # This library reads GIZA++ A3 files. It contains three classes over which # you can iterate to obtain (sourcewords,targetwords,alignment) pairs. 
# # - WordAlignment - Reads target-source.A3.final files, in which each source word is aligned to one target word # - MultiWordAlignment - Reads source-target.A3.final files, in which each source word may be aligned to multiple target target words # - IntersectionAlignment - Computes the intersection between the above two alignments # # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import bz2 import gzip import copy import io from sys import stderr class GizaSentenceAlignment(object): def __init__(self, sourceline, targetline, index): self.index = index self.alignment = [] if sourceline: self.source = self._parsesource(sourceline.strip()) else: self.source = [] self.target = targetline.strip().split(' ') def _parsesource(self, line): cleanline = "" inalignment = False begin = 0 sourceindex = 0 for i in range(0,len(line)): if line[i] == ' ' or i == len(line) - 1: if i == len(line) - 1: offset = 1 else: offset = 0 word = line[begin:i+offset] if word == '})': inalignment = False begin = i + 1 continue elif word == "({": inalignment = True begin = i + 1 continue if word.strip() and word != 'NULL': if not inalignment: sourceindex += 1 if cleanline: cleanline += " " cleanline += word else: targetindex = int(word) self.alignment.append( (sourceindex-1, targetindex-1) ) begin = i + 1 return cleanline.split(' ') def intersect(self,other): if other.target != self.source: print("GizaSentenceAlignment.intersect(): Mismatch between self.source and other.target: " + repr(self.source) + " -- vs -- " + repr(other.target),file=stderr) return None intersection = copy.copy(self) intersection.alignment = [] for sourceindex, targetindex in self.alignment: for targetindex2, sourceindex2 in other.alignment: if targetindex2 == targetindex and sourceindex2 == sourceindex: intersection.alignment.append( (sourceindex, targetindex) ) return intersection def __repr__(self): s = " ".join(self.source)+ " ||| " s += " ".join(self.target) + " ||| " for S,T in sorted(self.alignment): s += self.source[S] + "->" + self.target[T] + " ; " return s def getalignedtarget(self, index): """Returns target range only if source index aligns to a single consecutive range of target tokens.""" targetindices = [] target = None foundindex = -1 for sourceindex, targetindex in self.alignment: if sourceindex == index: targetindices.append(targetindex) if len(targetindices) > 1: for i in range(1,len(targetindices)): if abs(targetindices[i] - targetindices[i-1]) != 1: break # not consecutive foundindex = (min(targetindices), max(targetindices)) target = ' '.join(self.target[min(targetindices):max(targetindices)+1]) elif targetindices: foundindex = targetindices[0] target = self.target[foundindex] return target, foundindex class GizaModel(object): def __init__(self, filename, encoding= 'utf-8'): if filename.split(".")[-1] == "bz2": self.f = bz2.BZ2File(filename,'r') elif filename.split(".")[-1] == "gz": self.f = gzip.GzipFile(filename,'r') else: self.f = io.open(filename,'r',encoding=encoding) self.nextlinebuffer = None def __iter__(self): self.f.seek(0) nextlinebuffer = u(next(self.f)) sentenceindex = 0 done = False while not done: sentenceindex += 1 line = nextlinebuffer if line[0] != '#': raise Exception("Error parsing GIZA++ Alignment at sentence " + str(sentenceindex) + ", expected new fragment, found: " + repr(line)) targetline = u(next(self.f)) 
sourceline = u(next(self.f)) yield GizaSentenceAlignment(sourceline, targetline, sentenceindex) try: nextlinebuffer = u(next(self.f)) except StopIteration: done = True def __del__(self): if self.f: self.f.close() #------------------ OLD ------------------- def parseAlignment(tokens): #by Sander Canisius assert tokens.pop(0) == "NULL" while tokens.pop(0) != "})": pass while tokens: word = tokens.pop(0) assert tokens.pop(0) == "({" positions = [] token = tokens.pop(0) while token != "})": positions.append(int(token)) token = tokens.pop(0) yield word, positions class WordAlignment: """Target to Source alignment: reads target-source.A3.final files, in which each source word is aligned to one target word""" def __init__(self,filename, encoding=False): """Open a target-source GIZA++ A3 file. The file may be bzip2 compressed. If an encoding is specified, proper unicode strings will be returned""" if filename.split(".")[-1] == "bz2": self.stream = bz2.BZ2File(filename,'r') else: self.stream = open(filename) self.encoding = encoding def __del__(self): self.stream.close() def __iter__(self): #by Sander Canisius line = self.stream.readline() while line: assert line.startswith("#") src = self.stream.readline().split() trg = [] alignment = [None for i in xrange(len(src))] for i, (targetWord, positions) in enumerate(parseAlignment(self.stream.readline().split())): trg.append(targetWord) for pos in positions: assert alignment[pos - 1] is None alignment[pos - 1] = i if self.encoding: yield [ u(w,self.encoding) for w in src ], [ u(w,self.encoding) for w in trg ], alignment else: yield src, trg, alignment line = self.stream.readline() def targetword(self, index, targetwords, alignment): """Return the aligned targetword for a specified index in the source words""" if alignment[index]: return targetwords[alignment[index]] else: return None def reset(self): self.stream.seek(0) class MultiWordAlignment: """Source to Target alignment: reads source-target.A3.final files, in which each source word may be aligned to multiple target words (adapted from code by Sander Canisius)""" def __init__(self,filename, encoding = False): """Load a target-source GIZA++ A3 file. The file may be bzip2 compressed. If an encoding is specified, proper unicode strings will be returned""" if filename.split(".")[-1] == "bz2": self.stream = bz2.BZ2File(filename,'r') else: self.stream = open(filename) self.encoding = encoding def __del__(self): self.stream.close() def __iter__(self): line = self.stream.readline() while line: assert line.startswith("#") trg = self.stream.readline().split() src = [] alignment = [] for i, (word, positions) in enumerate(parseAlignment(self.stream.readline().split())): src.append(word) alignment.append( [ p - 1 for p in positions ] ) if self.encoding: yield [ unicode(w,self.encoding) for w in src ], [ unicode(w,self.encoding) for w in trg ], alignment else: yield src, trg, alignment line = self.stream.readline() def targetword(self, index, targetwords, alignment): """Return the aligned targeword for a specified index in the source words. 
Multiple words are concatenated together with a space in between""" return " ".join(targetwords[alignment[index]]) def targetwords(self, index, targetwords, alignment): """Return the aligned targetwords for a specified index in the source words""" return [ targetwords[x] for x in alignment[index] ] def reset(self): self.stream.seek(0) class IntersectionAlignment: def __init__(self,source2target,target2source,encoding=False): self.s2t = MultiWordAlignment(source2target, encoding) self.t2s = WordAlignment(target2source, encoding) self.encoding = encoding def __iter__(self): for (src, trg, alignment), (revsrc, revtrg, revalignment) in zip(self.s2t,self.t2s): #will take unnecessary memory in Python 2.x, optimal in Python 3 if src != revsrc or trg != revtrg: raise Exception("Files are not identical!") else: #keep only those alignments that are present in both intersection = [] for i, x in enumerate(alignment): if revalignment[i] in x: intersection.append(revalignment[i]) else: intersection.append(None) yield src, trg, intersection def reset(self): self.s2t.reset() self.t2s.reset() PyNLPl-1.1.2/pynlpl/formats/fql.py0000644000175000001440000030154513024723325017637 0ustar proyconusers00000000000000#--------------------------------------------------------------- # PyNLPl - FoLiA Query Language # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://proycon.github.com/folia # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Module for reading, editing and writing FoLiA XML using # the FoLiA Query Language # # Licensed under GPLv3 # #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.formats import folia from copy import copy import json import re import sys import random import datetime OPERATORS = ('=','==','!=','>','<','<=','>=','CONTAINS','NOTCONTAINS','MATCHES','NOTMATCHES') MASK_NORMAL = 0 MASK_LITERAL = 1 MASK_EXPRESSION = 2 MAXEXPANSION = 99 FOLIAVERSION = '1.4.0' FQLVERSION = '0.4.0' class SyntaxError(Exception): pass class QueryError(Exception): pass def getrandomid(query,prefix=""): randomid = "" while not randomid or randomid in query.doc.index: randomid = prefix + "%08x" % random.getrandbits(32) #generate a random ID return randomid class UnparsedQuery(object): """This class takes care of handling grouped blocks in parentheses and handling quoted values""" def __init__(self, s, i=0): self.q = [] self.mask = [] l = len(s) begin = 0 while i < l: c = s[i] if c == " ": #process previous word if begin < i: w = s[begin:i] self.q.append(w) self.mask.append(MASK_NORMAL) begin = i + 1 elif i == l - 1: #process last word w = s[begin:] self.q.append(w) self.mask.append(MASK_NORMAL) if c == '(': #groups #find end quote and process block level = 0 quoted = False s2 = "" for j in range(i+1,l): c2 = s[j] if c2 == '"': if s[j-1] != "\\": #check it isn't escaped quoted = not quoted if not quoted: if c2 == '(': level += 1 elif c2 == ')': if level == 0: s2 = s[i+1:j] break else: level -= 1 if s2: self.q.append(UnparsedQuery(s2)) self.mask.append(MASK_EXPRESSION) i = j begin = i+1 else: raise SyntaxError("Unmatched parenthesis at char " + str(i)) elif c == '"': #literals if i == 0 or (i > 0 and s[i-1] != "\\"): #check it isn't escaped #find end quote and process block s2 = None for j in range(i+1,l): c2 = s[j] if c2 == '"': if s[j-1] != "\\": #check it isn't escaped s2 = 
s[i+1:j] break if not s2 is None: self.q.append(s2.replace('\\"','"').replace("\\n","\n")) #undo escaped quotes and newlines self.mask.append(MASK_LITERAL) i = j begin = i+1 else: raise SyntaxError("Unterminated string literal at char " + str(i)) i += 1 remove = [] #process shortcut notation for i, (w,m) in enumerate(zip(self.q,self.mask)): if m == MASK_NORMAL and w[0] == ':': #we have shortcut notation for a HAS statement, rewrite: self.q[i] = UnparsedQuery(w[1:] + " HAS class " + self.q[i+1] + " \"" + self.q[i+2] + "\"") self.mask[i] = MASK_EXPRESSION remove += [i+1,i+2] if remove: for index in reversed(remove): del self.q[index] del self.mask[index] def __iter__(self): for w in self.q: yield w def __len__(self): return len(self.q) def __getitem__(self, index): try: return self.q[index] except: return "" def kw(self, index, value): try: if isinstance(value, tuple): return self.q[index] in value and self.mask[index] == MASK_NORMAL else: return self.q[index] == value and self.mask[index] == MASK_NORMAL except: return False def __exists__(self, keyword): for k,m in zip(self.q,self.mask): if keyword == k and m == MASK_NORMAL: return True return False def __setitem__(self, index, value): self.q[index] = value def __str__(self): s = [] for w,m in zip(self.q,self.mask): if m == MASK_NORMAL: s.append(w) elif m == MASK_LITERAL: s.append('"' + w.replace('"','\\"') + '"') elif m == MASK_EXPRESSION: s.append('(' + str(w) + ')') return " ".join(s) class Filter(object): #WHERE .... def __init__(self, filters, negation=False,disjunction=False): self.filters = filters self.negation = negation self.disjunction = disjunction @staticmethod def parse(q, i=0): filters = [] negation = False logop = "" l = len(q) while i < l: if q.kw(i, "NOT"): negation = True i += 1 elif isinstance(q[i], UnparsedQuery): filter,_ = Filter.parse(q[i]) filters.append(filter) i += 1 if q.kw(i,"AND") or q.kw(i, "OR"): if logop and q[i] != logop: raise SyntaxError("Mixed logical operators, use parentheses: " + str(q)) logop = q[i] i += 1 else: break #done elif i == 0 and (q[i].startswith("PREVIOUS") or q[i].startswith("NEXT") or q.kw(i, ("LEFTCONTEXT","RIGHTCONTEXT","CONTEXT","PARENT","ANCESTOR","CHILD") )): #we have a context expression, always occuring in its own subquery modifier = q[i] i += 1 selector,i = Selector.parse(q,i) filters.append( (modifier, selector,None) ) break elif q[i+1] in OPERATORS and q[i] and q[i+2]: operator = q[i+1] if q[i] == "class": v = lambda x,y='cls': getattr(x,y) elif q[i] in ("text","value","phon"): v = lambda x,y='text': getattr(x,'value') if isinstance(x, (folia.Description, folia.Comment, folia.Content)) else getattr(x,'phon') if isinstance(x,folia.PhonContent) else getattr(x,'text')() else: v = lambda x,y=q[i]: getattr(x,y) if q[i] == 'confidence': cnv = float else: cnv = lambda x: x if operator == '=' or operator == '==': filters.append( lambda x,y=q[i+2],v=v : v(x) == y ) elif operator == '!=': filters.append( lambda x,y=q[i+2],v=v : v(x) != y ) elif operator == '>': filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) > y ) elif operator == '<': filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) < y ) elif operator == '>=': filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) >= y ) elif operator == '<=': filters.append( lambda x,y=cnv(q[i+2]),v=v : False if v(x) is None else v(x) <= y ) elif operator == 'CONTAINS': filters.append( lambda x,y=q[i+2],v=v : v(x).find( y ) != -1 ) elif operator == 'NOTCONTAINS': 
filters.append( lambda x,y=q[i+2],v=v : v(x).find( y ) == -1 ) elif operator == 'MATCHES': filters.append( lambda x,y=re.compile(q[i+2]),v=v : y.search(v(x)) is not None ) elif operator == 'NOTMATCHES': filters.append( lambda x,y=re.compile(q[i+2]),v=v : y.search(v(x)) is None ) if q.kw(i+3,("AND","OR")): if logop and q[i+3] != logop: raise SyntaxError("Mixed logical operators, use parentheses: " + str(q)) logop = q[i+3] i += 4 else: i += 3 break #done elif 'HAS' in q[i:]: #has statement (spans full UnparsedQuery by definition) selector,i = Selector.parse(q,i) if not q.kw(i,"HAS"): raise SyntaxError("Expected HAS, got " + str(q[i]) + " at position " + str(i) + " in: " + str(q)) i += 1 subfilter,i = Filter.parse(q,i) filters.append( ("CHILD",selector,subfilter) ) else: raise SyntaxError("Expected comparison operator, got " + str(q[i+1]) + " in: " + str(q)) if negation and len(filters) > 1: raise SyntaxError("Expecting parentheses when NOT is used with multiple conditions") return Filter(filters, negation, logop == "OR"), i def __call__(self, query, element, debug=False): """Tests the filter on the specified element, returns a boolean""" match = True if debug: print("[FQL EVALUATION DEBUG] Filter - Testing filter [" + str(self) + "] for ", repr(element),file=sys.stderr) for filter in self.filters: if isinstance(filter,tuple): modifier, selector, subfilter = filter if debug: print("[FQL EVALUATION DEBUG] Filter - Filter is a subfilter of type " + modifier + ", descending...",file=sys.stderr) #we have a subfilter, i.e. a HAS statement on a subelement match = False if modifier == "CHILD": for subelement,_ in selector(query, [element], True, debug): #if there are multiple subelements, they are always treated disjunctly if not subfilter: match = True else: match = subfilter(query, subelement, debug) if match: break #only one subelement has to match by definition, then the HAS statement is matched elif modifier == "PARENT": match = selector.match(query, element.parent,debug) elif modifier == "NEXT": neighbour = element.next() if neighbour: match = selector.match(query, neighbour,debug) elif modifier == "PREVIOUS": neighbour = element.previous() if neighbour: match = selector.match(query, neighbour,debug) else: raise NotImplementedError("Context keyword " + modifier + " not implemented yet") elif isinstance(filter, Filter): #we have a nested filter (parentheses) match = filter(query, element, debug) else: #we have a condition function we can evaluate match = filter(element) if self.negation: match = not match if match: if self.disjunction: if debug: print("[FQL EVALUATION DEBUG] Filter returns True",file=sys.stderr) return True else: if not self.disjunction: #implies conjunction if debug: print("[FQL EVALUATION DEBUG] Filter returns False",file=sys.stderr) return False if debug: print("[FQL EVALUATION DEBUG] Filter returns ", str(match),file=sys.stderr) return match def __str__(self): q = "" if self.negation: q += "NOT " for i, filter in enumerate(self.filters): if i > 0: if self.disjunction: q += "OR " else: q += "AND " if isinstance(filter, Filter): q += "(" + str(filter) + ") " elif isinstance(filter, tuple): modifier,selector,subfilter = filter q += "(" + modifier + " " + str(selector) + " HAS " + str(subfilter) + ") " else: #original filter can't be reconstructed, place dummy: q += "...\"" + str(filter.__defaults__[0]) +"\"" return q.strip() class SpanSet(list): def select(self,*args): raise QueryError("Got a span set for a non-span element") def partof(self, collection): for e in collection: 
if isinstance(e, SpanSet): if len(e) != len(self): return False for c1,c2 in zip(e,self): if c1 is not c2: return False return True class Selector(object): def __init__(self, Class, set=None,id=None, filter=None, nextselector=None, expansion = None): self.Class = Class self.set = set self.id = id self.filter = filter self.nextselector = nextselector #selectors can be chained self.expansion = expansion #{min,max} occurrence interval, allowed only in Span and evaluated there instead of here def chain(self, targets): assert targets[0] is self selector = self selector.nextselector = None for target in targets[1:]: selector.nextselector = target selector = target @staticmethod def parse(q, i=0, allowexpansion=False): l = len(q) set = None id = None filter = None expansion = None if q[i] == "ID" and q[i+1]: id = q[i+1] Class = None i += 2 else: if q[i] == "ALL": Class = "ALL" else: try: Class = folia.XML2CLASS[q[i]] except: raise SyntaxError("Expected element type, got " + str(q[i]) + " in: " + str(q)) i += 1 if q[i] and q[i][0] == "{" and q[i][-1] == "}": if not allowexpansion: raise SyntaxError("Expansion expressions not allowed at this point, got one at position " + str(i) + " in: " + str(q)) expansion = q[i][1:-1] expansion = expansion.split(',') i += 1 try: if len(expansion) == 1: expansion = (int(expansion[0]), int(expansion[0])) elif len(expansion) == 2 and expansion[0] == "": expansion = (0,int(expansion[1])) elif len(expansion) == 2 and expansion[1] == "": expansion = (int(expansion[0]),MAXEXPANSION) elif len(expansion) == 2: expansion = tuple(int(x) for x in expansion if x) else: raise SyntaxError("Invalid expansion expression: " + ",".join(expansion)) except ValueError: raise SyntaxError("Invalid expansion expression: " + ",".join(expansion)) while i < l: if q.kw(i,"OF") and q[i+1]: set = q[i+1] i += 2 elif q.kw(i,"ID") and q[i+1]: id = q[i+1] i += 2 elif q.kw(i, "WHERE"): #ok, big filter coming up! filter, i = Filter.parse(q,i+1) break else: #something we don't handle break return Selector(Class,set,id,filter, None, expansion), i def __call__(self, query, contextselector, recurse=True, debug=False): #generator, lazy evaluation!
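        #Evaluation flow of the generator below:
        # - contextselector is either an iterable of elements, or a two-tuple (function, args) recipe that is invoked lazily to obtain such an iterable
        # - for each context element, the whole chain of selectors (self, self.nextselector, ...) is walked; candidates are found by ID lookup, by "ALL" (any element), or by Class/set selection, with special handling for span annotations and SpanSets
        # - every matching candidate is yielded as a (candidate, contextelement) tuple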
if isinstance(contextselector,tuple) and len(contextselector) == 2: selection = contextselector[0](*contextselector[1]) else: selection = contextselector count = 0 for e in selection: selector = self while True: #will loop through the chain of selectors, only the first one is called explicitly if debug: print("[FQL EVALUATION DEBUG] Select - Running selector [", str(self), "] on ", repr(e),file=sys.stderr) if selector.id: if debug: print("[FQL EVALUATION DEBUG] Select - Selecting ID " + selector.id,file=sys.stderr) try: candidate = query.doc[selector.id] selector.Class = candidate.__class__ if not selector.filter or selector.filter(query,candidate, debug): if debug: print("[FQL EVALUATION DEBUG] Select - Yielding (by ID) ", repr(candidate),file=sys.stderr) yield candidate, e except KeyError: if debug: print("[FQL EVALUATION DEBUG] Select - Selecting by ID failed for ID " + selector.id,file=sys.stderr) pass #silently ignore ID mismatches elif selector.Class == "ALL": for candidate in e: if isinstance(candidate, folia.AbstractElement): yield candidate, e elif selector.Class: if debug: print("[FQL EVALUATION DEBUG] Select - Selecting Class " + selector.Class.XMLTAG + " with set " + str(selector.set),file=sys.stderr) if selector.Class.XMLTAG in query.defaultsets: selector.set = query.defaultsets[selector.Class.XMLTAG] isspan = issubclass(selector.Class, folia.AbstractSpanAnnotation) if isinstance(e, tuple): e = e[0] if isspan and (isinstance(e, folia.Word) or isinstance(e, folia.Morpheme)): for candidate in e.findspans(selector.Class, selector.set): if not selector.filter or selector.filter(query,candidate, debug): if debug: print("[FQL EVALUATION DEBUG] Select - Yielding span, single reference: ", repr(candidate),file=sys.stderr) yield candidate, e elif isspan and isinstance(e, SpanSet): #we take the first item of the span to find the candidates for candidate in e[0].findspans(selector.Class, selector.set): if not selector.filter or selector.filter(query,candidate, debug): #test if all the other elements in the span are in this candidate matched = True spanelements = list(candidate.wrefs()) for e2 in e[1:]: if e2 not in spanelements: matched = False break if matched: if debug: print("[FQL EVALUATION DEBUG] Select - Yielding span, multiple references: ", repr(candidate),file=sys.stderr) yield candidate, e elif isinstance(e, SpanSet): yield e, e else: #print("DEBUG: doing select " + selector.Class.__name__ + " (recurse=" + str(recurse)+") on " + repr(e)) for candidate in e.select(selector.Class, selector.set, recurse): try: if candidate.changedbyquery is query: #this candidate has been added/modified by the query, don't select it again continue except AttributeError: pass if not selector.filter or selector.filter(query,candidate, debug): if debug: print("[FQL EVALUATION DEBUG] Select - Yielding ", repr(candidate), " in ", repr(e),file=sys.stderr) yield candidate, e if selector.nextselector is None: if debug: print("[FQL EVALUATION DEBUG] Select - End of chain",file=sys.stderr) break # end of chain else: if debug: print("[FQL EVALUATION DEBUG] Select - Selecting next in chain",file=sys.stderr) selector = selector.nextselector def match(self, query, candidate, debug = False): if debug: print("[FQL EVALUATION DEBUG] Select - Matching selector [", str(self), "] on ", repr(candidate),file=sys.stderr) if self.id: if candidate.id != self.id: return False elif self.Class: if not isinstance(candidate,self.Class): return False if self.filter and not self.filter(query,candidate, debug): return False if 
debug: print("[FQL EVALUATION DEBUG] Select - Selector matches! ", repr(candidate),file=sys.stderr) return True def autodeclare(self,doc): if self.Class and self.set: if not doc.declared(self.Class, self.set): doc.declare(self.Class, self.set) if self.nextselector: self.nextselector.autodeclare() def __str__(self): s = "" if self.Class: s += self.Class.XMLTAG + " " if self.set: s += "OF " + self.set + " " if self.id: s += "ID " + self.id + " " if self.filter: s += "WHERE " + str(self.filter) if self.nextselector: s += str(self.nextselector) return s.strip() class Span(object): def __init__(self, targets, intervals = []): self.targets = targets #Selector instances making up the span def __len__(self): return len(self.targets) @staticmethod def parse(q, i=0): targets = [] l = len(q) while i < l: if q.kw(i,"ID") or q[i] in folia.XML2CLASS: target,i = Selector.parse(q,i, True) targets.append(target) elif q.kw(i,"&"): #we're gonna have more targets i += 1 elif q.kw(i,"NONE"): #empty span return Span([]), i+1 else: break if not targets: raise SyntaxError("Expected one or more span targets, got " + str(q[i]) + " in: " + str(q)) return Span(targets), i def __call__(self, query, contextselector, recurse=True,debug=False): #returns a list of element in a span if debug: print("[FQL EVALUATION DEBUG] Span - Building span from target selectors (" + str(len(self.targets)) + ")",file=sys.stderr) backtrack = [] l = len(self.targets) if l == 0: #span is explicitly empty, this is allowed in RESPAN context if debug: print("[FQL EVALUATION DEBUG] Span - Yielding explicitly empty SpanSet",file=sys.stderr) yield SpanSet() else: #find the first non-optional element, it will be our pivot: pivotindex = None for i, target in enumerate(self.targets): if self.targets[i].id or not self.targets[i].expansion or self.targets[i].expansion[0] > 0: pivotindex = i break if pivotindex is None: raise QueryError("All parts in the SPAN expression are optional, at least one non-optional component is required") #get first target for element, target in self.targets[pivotindex](query, contextselector, recurse,debug): if debug: print("[FQL EVALUATION DEBUG] Span - First item of span found (pivotindex=" + str(pivotindex) + ",l=" + str(l) + "," + str(repr(element)) + ")",file=sys.stderr) spanset = SpanSet() #elemnent is added later match = True #we attempt to disprove this #now see if consecutive elements match up #--- matching prior to pivot ------- #match optional elements before pivotindex i = pivotindex currentelement = element while i > 0: i -= 1 if i < 0: break selector = self.targets[i] minmatches = selector.expansion[0] assert minmatches == 0 #everything before pivot has to have minmatches 0 maxmatches = selector.expansion[1] done = False matches = 0 while True: prevelement = element element = element.previous(selector.Class, None) if not element or (target and target not in element.ancestors()): if debug: print("[FQL EVALUATION DEBUG] Span - Prior element not found or out of scope",file=sys.stderr) done = True #no more elements left break elif element and not selector.match(query, element,debug): if debug: print("[FQL EVALUATION DEBUG] Span - Prior element does not match filter",file=sys.stderr) element = prevelement #reset break if debug: print("[FQL EVALUATION DEBUG] Span - Prior element matches",file=sys.stderr) #we have a match matches += 1 spanset.insert(0,element) if matches >= maxmatches: if debug: print("[FQL EVALUATION DEBUG] Span - Maximum threshold reached for span selector " + str(i) + ", breaking", 
file=sys.stderr) break if done: break #--- matching pivot and selectors after pivot ------- done = False #are we done with this selector? element = currentelement i = pivotindex - 1 #loop does +1 at the start of each iteration, we want to start with the pivotindex while i < l: i += 1 if i == l: if debug: print("[FQL EVALUATION DEBUG] Span - No more selectors to try",i,l, file=sys.stderr) break selector = self.targets[i] if selector.id: #selection by ID, don't care about consecutiveness try: element = query.doc[selector.id] if debug: print("[FQL EVALUATION DEBUG] Span - Obtained subsequent span item from ID: ", repr(element), file=sys.stderr) except KeyError: if debug: print("[FQL EVALUATION DEBUG] Span - Obtained subsequent with specified ID does not exist ", file=sys.stderr) match = False break if element and not selector.match(query, element,debug): if debug: print("[FQL EVALUATION DEBUG] Span - Subsequent element does not match filter",file=sys.stderr) else: spanset.append(element) else: #element must be consecutive if selector.expansion: minmatches = selector.expansion[0] maxmatches = selector.expansion[1] else: minmatches = maxmatches = 1 if debug: print("[FQL EVALUATION DEBUG] Span - Preparing to match selector " + str(i) + " of span, expansion={" + str(minmatches) + "," + str(maxmatches) + "}", file=sys.stderr) matches = 0 while True: submatch = True #does the element currenty under consideration match? (the match variable is reserved for the entire match) done = False #are we done with this span selector? holdelement = False #do not go to next element if debug: print("[FQL EVALUATION DEBUG] Span - Processing element with span selector " + str(i) + ": ", repr(element), file=sys.stderr) if not element or (target and target not in element.ancestors()): if debug: if not element: print("[FQL EVALUATION DEBUG] Span - Element not found",file=sys.stderr) elif target and not target in element.ancestors(): print("[FQL EVALUATION DEBUG] Span - Element out of scope",file=sys.stderr) submatch = False elif element and not selector.match(query, element,debug): if debug: print("[FQL EVALUATION DEBUG] Span - Element does not match filter",file=sys.stderr) submatch = False if submatch: matches += 1 if debug: print("[FQL EVALUATION DEBUG] Span - Element is a match, got " + str(matches) + " match(es) now", file=sys.stderr) if matches > minmatches: #check if the next selector(s) match too, then we have a point where we might branch two ways #j = 1 #while i+j < len(self.targets): # nextselector = self.targets[i+j] # if nextselector.match(query, element,debug): # #save this point for backtracking, when we get stuck, we'll roll back to this point # backtrack.append( (i+j, prevelement, copy(spanset) ) ) #using prevelement, nextelement will be recomputed after backtracking, using different selector # if not nextselector.expansion or nextselector.expansion[0] > 0: # break # j += 1 #TODO: implement pass elif matches < minmatches: if debug: print("[FQL EVALUATION DEBUG] Span - Minimum threshold not reached yet for span selector " + str(i), file=sys.stderr) spanset.append(element) if matches >= maxmatches: if debug: print("[FQL EVALUATION DEBUG] Span - Maximum threshold reached for span selector " + str(i) + ", breaking", file=sys.stderr) done = True #done with this selector else: if matches < minmatches: #can we backtrack? 
if backtrack: #(not reached currently) if debug: print("[FQL EVALUATION DEBUG] Span - Backtracking",file=sys.stderr) index, element, spanset = backtrack.pop() i = index - 1 #next iteration will do +1 again match = True #default continue else: #nope, all is lost, we have no match if debug: print("[FQL EVALUATION DEBUG] Span - Minimum threshold could not be attained for span selector " + str(i), file=sys.stderr) match = False break else: if debug: print("[FQL EVALUATION DEBUG] Span - No match for span selector " + str(i) + ", but no problem since matching threshold was already reached", file=sys.stderr) holdelement = True done = True break if not holdelement: prevelement = element #get next element element = element.next(selector.Class, None) if debug: print("[FQL EVALUATION DEBUG] Span - Selecting next element for next round", repr(element), file=sys.stderr) if done or not match: if debug: print("[FQL EVALUATION DEBUG] Span - Done with span selector " + str(i), repr(element), file=sys.stderr) break if not match: break if match: if debug: print("[FQL EVALUATION DEBUG] Span - Span found, returning spanset (" + repr(spanset) + ")",file=sys.stderr) yield spanset else: if debug: print("[FQL EVALUATION DEBUG] Span - Span not found",file=sys.stderr) class Target(object): #FOR/IN... expression def __init__(self, targets, strict=False,nested = None, start=None, end=None,endinclusive=True,repeat=False): self.targets = targets #Selector instances self.strict = strict #True for IN self.nested = nested #in a nested another target self.start = start self.end = end self.endinclusive = endinclusive self.repeat = repeat @staticmethod def parse(q, i=0): if q.kw(i,'FOR'): strict = False elif q.kw(i,'IN'): strict = True else: raise SyntaxError("Expected target expression, got " + str(q[i]) + " in: " + str(q)) i += 1 targets = [] nested = None start = end = None endinclusive = True repeat = False l = len(q) while i < l: if q.kw(i,'SPAN'): target,i = Span.parse(q,i+1) targets.append(target) elif q.kw(i,"ID") or q[i] in folia.XML2CLASS or q[i] == "ALL": target,i = Selector.parse(q,i) targets.append(target) elif q.kw(i,","): #we're gonna have more targets i += 1 elif q.kw(i, ('FOR','IN')): nested,i = Selector.parse(q,i+1) elif q.kw(i,"START"): start,i = Selector.parse(q,i+1) elif q.kw(i,("END","ENDAFTER")): #inclusive end,i = Selector.parse(q,i+1) endinclusive = True elif q.kw(i,"ENDBEFORE"): #exclusive end,i = Selector.parse(q,i+1) endinclusive = False elif q.kw(i,"REPEAT"): repeat = True i += 1 else: break if not targets: raise SyntaxError("Expected one or more targets, got " + str(q[i]) + " in: " + str(q)) return Target(targets,strict,nested,start,end,endinclusive, repeat), i def __call__(self, query, contextselector, recurse, debug=False): #generator, lazy evaluation! 
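        #Evaluation flow of the generator below:
        # - a nested FOR/IN target is wrapped into a lazy (function, args) recipe first, so the outer targets are evaluated relative to that nested selection
        # - SPAN targets yield SpanSet instances and may not be mixed with ordinary element targets in a single selection
        # - ordinary targets are chained selectors; START and END/ENDAFTER/ENDBEFORE delimit the range that is yielded, and REPEAT allows a new start/end window to match again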
if self.nested: if debug: print("[FQL EVALUATION DEBUG] Target - Deferring to nested target first",file=sys.stderr) contextselector = (self.nested, (query, contextselector, not self.strict)) if debug: print("[FQL EVALUATION DEBUG] Target - Chaining and calling target selectors (" + str(len(self.targets)) + ")",file=sys.stderr) if self.targets: if isinstance(self.targets[0], Span): for span in self.targets: if not isinstance(span, Span): raise QueryError("SPAN statement may not be mixed with non-span statements in a single selection") if debug: print("[FQL EVALUATION DEBUG] Target - Evaluation span ",file=sys.stderr) for spanset in span(query, contextselector, recurse, debug): if debug: print("[FQL EVALUATION DEBUG] Target - Yielding spanset ",file=sys.stderr) yield spanset else: selector = self.targets[0] selector.chain(self.targets) started = (self.start is None) dobreak = False for e,_ in selector(query, contextselector, recurse, debug): if not started: if self.start.match(query, e): if debug: print("[FQL EVALUATION DEBUG] Target - Matched start! Starting from here...",e, file=sys.stderr) started = True if started: if self.end: if self.end.match(query, e): if not self.endinclusive: if debug: print("[FQL EVALUATION DEBUG] Target - Matched end! Breaking before yielding...",e, file=sys.stderr) started = False if self.repeat: continue else: break else: if debug: print("[FQL EVALUATION DEBUG] Target - Matched end! Breaking after yielding...",e, file=sys.stderr) started = False dobreak = True if debug: print("[FQL EVALUATION DEBUG] Target - Yielding ",repr(e), file=sys.stderr) yield e if dobreak and not self.repeat: break class Alternative(object): #AS ALTERNATIVE ... expression def __init__(self, subassignments={},assignments={},filter=None, nextalternative=None): self.subassignments = subassignments self.assignments = assignments self.filter = filter self.nextalternative = nextalternative @staticmethod def parse(q,i=0): if q.kw(i,'AS') and q[i+1] == "ALTERNATIVE": i += 1 subassignments = {} assignments = {} filter = None if q.kw(i,'ALTERNATIVE'): i += 1 if not q.kw(i,'WITH'): i = getassignments(q, i, subassignments) if q.kw(i,'WITH'): i = getassignments(q, i+1, assignments) if q.kw(i,'WHERE'): filter, i = Filter.parse(q, i+1) else: raise SyntaxError("Expected ALTERNATIVE, got " + str(q[i]) + " in: " + str(q)) if q.kw(i,'ALTERNATIVE'): #we have another! 
nextalternative,i = Alternative.parse(q,i) else: nextalternative = None return Alternative(subassignments, assignments, filter, nextalternative), i def __call__(self, query, action, focus, target,debug=False): """Action delegates to this function""" isspan = isinstance(action.focus.Class, folia.AbstractSpanAnnotation) subassignments = {} #make a copy for key, value in action.assignments.items(): subassignments[key] = value for key, value in self.subassignments.items(): subassignments[key] = value if action.action == "SELECT": if not focus: raise QueryError("SELECT requires a focus element") if not isspan: for alternative in focus.alternatives(action.focus.Class, focus.set): if not self.filter or (self.filter and self.filter.match(query, alternative, debug)): yield alternative else: raise NotImplementedError("Selecting alternative span not implemented yet") elif action.action == "EDIT" or action.action == "ADD": if not isspan: if focus: parent = focus.ancestor(folia.AbstractStructureElement) alternative = folia.Alternative( query.doc, action.focus.Class( query.doc , **subassignments), **self.assignments) parent.append(alternative) yield alternative else: alternative = folia.Alternative( query.doc, action.focus.Class( query.doc , **subassignments), **self.assignments) target.append(alternative) yield alternative else: raise NotImplementedError("Editing alternative span not implemented yet") else: raise QueryError("Alternative does not handle action " + action.action) def autodeclare(self, doc): pass #nothing to declare def substitute(self, *args): raise QueryError("SUBSTITUTE not supported with AS ALTERNATIVE") class Correction(object): #AS CORRECTION/SUGGESTION expression... def __init__(self, set,actionassignments={}, assignments={},filter=None,suggestions=[], bare=False): self.set = set self.actionassignments = actionassignments #the assignments in the action self.assignments = assignments #the assignments for the correction self.filter = filter self.suggestions = suggestions # [ (subassignments, suggestionassignments) ] self.bare = bare @staticmethod def parse(q,i, focus): if q.kw(i,'AS') and q.kw(i+1,'CORRECTION'): i += 1 bare = False if q.kw(i,'AS') and q.kw(i+1,'BARE') and q.kw(i+2,'CORRECTION'): bare = True i += 2 set = None actionassignments = {} assignments = {} filter = None suggestions = [] if q.kw(i,'CORRECTION'): i += 1 if q.kw(i,'OF') and q[i+1]: set = q[i+1] i += 2 if not q.kw(i,'WITH'): i = getassignments(q, i, actionassignments, focus) if q.kw(i,'WHERE'): filter, i = Filter.parse(q, i+1) if q.kw(i,'WITH'): i = getassignments(q, i+1, assignments) else: raise SyntaxError("Expected CORRECTION, got " + str(q[i]) + " in: " + str(q)) l = len(q) while i < l: if q.kw(i,'SUGGESTION'): i+= 1 suggestion = ( {}, {} ) #subassignments, suggestionassignments if isinstance(q[i], UnparsedQuery): if not q[i].kw(0,'SUBSTITUTE') and not q[i].kw(0,'ADD'): raise SyntaxError("Subexpression after SUGGESTION, expected ADD or SUBSTITUTE, got " + str(q[i])) Correction.parsesubstitute(q[i],suggestion) i += 1 elif q.kw(i,'MERGE') or q.kw(i,'SPLIT'): if q.kw(i,'MERGE'): suggestion[1]['merge'] = True else: suggestion[1]['split'] = True i+= 1 if q.kw(i,'DELETION'): #No need to do anything, DELETION is just to make things more explicit in the syntax, it will result in an empty suggestion i+=1 elif isinstance(q[i], UnparsedQuery): if not q[i].kw(0,'SUBSTITUTE') and not q[i].kw(0,'ADD'): raise SyntaxError("Subexpression after SUGGESTION, expected ADD or SUBSTITUTE, got " + str(q[i])) 
Correction.parsesubstitute(q[i],suggestion) i += 1 elif not q.kw(i,'WITH'): i = getassignments(q, i, suggestion[0], focus) #subassignments (the actual element in the suggestion) elif not q.kw(i,'WITH'): i = getassignments(q, i, suggestion[0], focus) #subassignments (the actual element in the suggestion) if q.kw(i,'WITH'): i = getassignments(q, i+1, suggestion[1]) #assignments for the suggestion suggestions.append(suggestion) else: raise SyntaxError("Expected SUGGESTION or end of AS clause, got " + str(q[i]) + " in: " + str(q)) return Correction(set, actionassignments, assignments, filter, suggestions, bare), i @staticmethod def parsesubstitute(q,suggestion): suggestion[0]['substitute'],_ = Action.parse(q) def __call__(self, query, action, focus, target,debug=False): """Action delegates to this function""" if debug: print("[FQL EVALUATION DEBUG] Correction - Processing ", repr(focus),file=sys.stderr) isspan = isinstance(action.focus.Class, folia.AbstractSpanAnnotation) actionassignments = {} #make a copy for key, value in action.assignments.items(): if key == 'class': key = 'cls' actionassignments[key] = value for key, value in self.actionassignments.items(): if key == 'class': key = 'cls' actionassignments[key] = value if actionassignments: if (not 'set' in actionassignments or actionassignments['set'] is None) and action.focus.Class: try: actionassignments['set'] = query.defaultsets[action.focus.Class.XMLTAG] except KeyError: actionassignments['set'] = query.doc.defaultset(action.focus.Class) if action.focus.Class.REQUIRED_ATTRIBS and folia.Attrib.ID in action.focus.Class.REQUIRED_ATTRIBS: actionassignments['id'] = getrandomid(query, "corrected." + action.focus.Class.XMLTAG + ".") kwargs = {} if self.set: kwargs['set'] = self.set for key, value in self.assignments.items(): if key == 'class': key = 'cls' kwargs[key] = value if action.action == "SELECT": if not focus: raise QueryError("SELECT requires a focus element") correction = focus.incorrection() if correction: if not self.filter or (self.filter and self.filter.match(query, correction, debug)): yield correction elif action.action in ("EDIT","ADD","PREPEND","APPEND"): if focus: correction = focus.incorrection() else: correction = False inheritchildren = [] if focus and not self.bare: #copy all data within inheritchildren = list(focus.copychildren(query.doc, True)) if action.action == "EDIT" and action.span: #respan #delete all word references from the copy first, we will add new ones inheritchildren = [ c for c in inheritchildren if not isinstance(c, folia.WordReference) ] if not isinstance(focus, folia.AbstractSpanAnnotation): raise QueryError("Can only perform RESPAN on span annotation elements!") contextselector = target if target else query.doc spanset = next(action.span(query, contextselector, True, debug)) #there can be only one for w in spanset: inheritchildren.append(w) if actionassignments: kwargs['new'] = action.focus.Class(query.doc,*inheritchildren, **actionassignments) if focus and action.action not in ('PREPEND','APPEND'): kwargs['original'] = focus #TODO: if not bare, fix all span annotation references to this element elif focus and action.action not in ('PREPEND','APPEND'): if isinstance(focus, folia.AbstractStructureElement): kwargs['current'] = focus #current only needed for structure annotation if correction and (not 'set' in kwargs or correction.set == kwargs['set']) and (not 'cls' in kwargs or correction.cls == kwargs['cls']): #reuse the existing correction element print("Reusing " + correction.id,file=sys.stderr) 
kwargs['reuse'] = correction if action.action in ('PREPEND','APPEND'): #get parent relative to target parent = target.ancestor( (folia.AbstractStructureElement, folia.AbstractSpanAnnotation, folia.AbstractAnnotationLayer) ) elif focus: if 'reuse' in kwargs and kwargs['reuse']: parent = focus.ancestor( (folia.AbstractStructureElement, folia.AbstractSpanAnnotation, folia.AbstractAnnotationLayer) ) else: parent = focus.ancestor( (folia.AbstractStructureElement, folia.AbstractSpanAnnotation, folia.AbstractAnnotationLayer, folia.Correction) ) else: parent = target if 'id' not in kwargs and 'reuse' not in kwargs: kwargs['id'] = parent.generate_id(folia.Correction) kwargs['suggestions'] = [] for subassignments, suggestionassignments in self.suggestions: subassignments = copy(subassignments) #assignment for the element in the suggestion for key, value in action.assignments.items(): if not key in subassignments: if key == 'class': key = 'cls' subassignments[key] = value if (not 'set' in subassignments or subassignments['set'] is None) and action.focus.Class: try: subassignments['set'] = query.defaultsets[action.focus.Class.XMLTAG] except KeyError: subassignments['set'] = query.doc.defaultset(action.focus.Class) if focus and not self.bare: #copy all data within (we have to do this again for each suggestion as it will generate different ID suffixes) inheritchildren = list(focus.copychildren(query.doc, True)) if action.focus.Class.REQUIRED_ATTRIBS and folia.Attrib.ID in action.focus.Class.REQUIRED_ATTRIBS: subassignments['id'] = getrandomid(query, "suggestion.") kwargs['suggestions'].append( folia.Suggestion(query.doc, action.focus.Class(query.doc, *inheritchildren,**subassignments), **suggestionassignments ) ) if action.action == 'PREPEND': index = parent.getindex(target,True) #recursive if index == -1: raise QueryError("Insertion point for PREPEND action not found") kwargs['insertindex'] = index kwargs['nooriginal'] = True elif action.action == 'APPEND': index = parent.getindex(target,True) #recursive if index == -1: raise QueryError("Insertion point for APPEND action not found") kwargs['insertindex'] = index+1 kwargs['insertindex_offset'] = 1 #used by correct if it needs to recompute the index kwargs['nooriginal'] = True yield parent.correct(**kwargs) #generator elif action.action == "DELETE": if debug: print("[FQL EVALUATION DEBUG] Correction - Deleting ", repr(focus), " (in " + repr(focus.parent) + ")",file=sys.stderr) if not focus: raise QueryError("DELETE AS CORRECTION did not find a focus to operate on") kwargs['original'] = focus kwargs['new'] = [] #empty new c = focus.parent.correct(**kwargs) #generator yield c else: raise QueryError("Correction does not handle action " + action.action) def autodeclare(self,doc): if self.set: if not doc.declared(folia.Correction, self.set): doc.declare(folia.Correction, self.set) def prepend(self, query, content, contextselector, debug): return self.insert(query, content, contextselector, 0, debug) def append(self, query, content, contextselector, debug): return self.insert(query, content, contextselector, 1, debug) def insert(self, query, content, contextselector, offset, debug): kwargs = {} if self.set: kwargs['set'] = self.set for key, value in self.assignments.items(): if key == 'class': key = 'cls' kwargs[key] = value self.autodeclare(query.doc) if not content: #suggestions only, no subtitution obtained from main action yet, we have to process it still if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Initialising for suggestions 
only",file=sys.stderr) if isinstance(contextselector,tuple) and len(contextselector) == 2: contextselector = contextselector[0](*contextselector[1]) target = list(contextselector)[0] #not a spanset insertindex = 0 #find insertion index: if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Finding insertion index for target ", repr(target), " in ", repr(target.parent),file=sys.stderr) for i, e in enumerate(target.parent): if e is target: if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Target ", repr(target), " found in ", repr(target.parent), " at index ", i,file=sys.stderr) insertindex = i break content = {'parent': target.parent,'new':[]} kwargs['insertindex'] = insertindex + offset else: kwargs['insertindex'] = content['index'] + offset if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Initialising correction",file=sys.stderr) kwargs['new'] = [] #stuff will be appended kwargs['nooriginal'] = True #this is an insertion, there is no original kwargs = self.assemblesuggestions(query,content,debug,kwargs) if debug: print("[FQL EVALUATION DEBUG] Correction.insert - Applying and returning correction ", repr(kwargs),file=sys.stderr) return content['parent'].correct(**kwargs) def substitute(self, query, substitution, contextselector, debug): kwargs = {} if self.set: kwargs['set'] = self.set for key, value in self.assignments.items(): if key == 'class': key = 'cls' kwargs[key] = value self.autodeclare(query.doc) if not substitution: #suggestions only, no subtitution obtained from main action yet, we have to process it still if debug: print("[FQL EVALUATION DEBUG] Correction.substitute - Initialising for suggestions only",file=sys.stderr) if isinstance(contextselector,tuple) and len(contextselector) == 2: contextselector = contextselector[0](*contextselector[1]) target = list(contextselector)[0] if not isinstance(target, SpanSet): raise QueryError("SUBSTITUTE expects target SPAN") prev = target[0].parent for e in target[1:]: if e.parent != prev: raise QueryError("SUBSTITUTE can only be performed when the target items share the same parent. 
First parent is " + repr(prev) + ", parent of " + repr(e) + " is " + repr(e.parent)) insertindex = 0 #find insertion index: for i, e in enumerate(target[0].parent): if e is target[0]: insertindex = i break substitution = {'parent': target[0].parent,'new':[]} kwargs['insertindex'] = insertindex kwargs['current'] = target else: kwargs['insertindex'] = substitution['index'] kwargs['original'] = substitution['span'] if debug: print("[FQL EVALUATION DEBUG] Correction.substitute - Initialising correction",file=sys.stderr) kwargs['new'] = [] #stuff will be appended kwargs = self.assemblesuggestions(query,substitution,debug,kwargs) if debug: print("[FQL EVALUATION DEBUG] Correction.substitute - Applying and returning correction",file=sys.stderr) return substitution['parent'].correct(**kwargs) def assemblesuggestions(self, query, substitution, debug, kwargs): if self.suggestions: kwargs['suggestions'] = [] #stuff will be appended for i, (Class, actionassignments, subactions) in enumerate(substitution['new']): if actionassignments: if (not 'set' in actionassignments or actionassignments['set'] is None): try: actionassignments['set'] = query.defaultsets[Class.XMLTAG] except KeyError: actionassignments['set'] = query.doc.defaultset(Class) actionassignments['id'] = "corrected.%08x" % random.getrandbits(32) #generate a random ID e = Class(query.doc, **actionassignments) if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Adding to new",file=sys.stderr) kwargs['new'].append(e) for subaction in subactions: subaction.focus.autodeclare(query.doc) if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Invoking subaction", subaction.action,file=sys.stderr) subaction(query, [e], debug ) #note: results of subactions will be silently discarded for subassignments, suggestionassignments in self.suggestions: suggestionchildren = [] if 'substitute' in subassignments: #SUBTITUTE (or synonym ADD) action = subassignments['substitute'] del subassignments['substitute'] else: #we have a suggested deletion action = None if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Adding suggestion",file=sys.stderr) while action: subassignments = copy(subassignments) #assignment for the element in the suggestion if isinstance(action.focus, tuple) and len(action.focus) == 2: action.focus = action.focus[0] for key, value in action.assignments.items(): if key == 'class': key = 'cls' subassignments[key] = value if (not 'set' in subassignments or subassignments['set'] is None) and action.focus.Class: try: subassignments['set'] = query.defaultsets[action.focus.Class.XMLTAG] except KeyError: subassignments['set'] = query.doc.defaultset(action.focus.Class) focus = action.focus focus.autodeclare(query.doc) if focus.Class.REQUIRED_ATTRIBS and folia.Attrib.ID in focus.Class.REQUIRED_ATTRIBS: subassignments['id'] = getrandomid(query, "suggestion.") suggestionchildren.append( focus.Class(query.doc, **subassignments)) action = action.nextaction if debug: print("[FQL EVALUATION DEBUG] Correction.assemblesuggestions - Suggestionchildren: ", len(suggestionchildren),file=sys.stderr) if 'split' in suggestionassignments and suggestionassignments['split']: nextitem = substitution['parent'].next(substitution['parent'].__class__, None) if nextitem: suggestionassignments['split'] = nextitem.id else: del suggestionassignments['split'] if 'merge' in suggestionassignments and suggestionassignments['merge']: nextitem = substitution['parent'].next(substitution['parent'].__class__, None) if nextitem: 
suggestionassignments['merge'] = nextitem.id else: del suggestionassignments['merge'] kwargs['suggestions'].append( folia.Suggestion(query.doc,*suggestionchildren, **suggestionassignments ) ) return kwargs def getassignments(q, i, assignments, focus=None): l = len(q) while i < l: if q.kw(i, ('id','set','subset','annotator','class','n')): if q[i+1] == 'NONE': assignments[q[i]] = None else: assignments[q[i]] = q[i+1] i+=2 elif q.kw(i,'confidence'): if q[i+1] == 'NONE': assignments[q[i]] = None else: try: assignments[q[i]] = float(q[i+1]) except: raise SyntaxError("Invalid value for confidence: " + str(q[i+1])) i+=2 elif q.kw(i,'annotatortype'): if q[i+1] == "auto": assignments[q[i]] = folia.AnnotatorType.AUTO elif q[i+1] == "manual": assignments[q[i]] = folia.AnnotatorType.MANUAL elif q[i+1] == "NONE": assignments[q[i]] = None else: raise SyntaxError("Invalid value for annotatortype: " + str(q[i+1])) i+=2 elif q.kw(i,('text','value','phon')): if not focus is None and focus.Class in (folia.TextContent, folia.Description, folia.Comment): key = 'value' elif not focus is None and focus.Class is folia.PhonContent: key = 'phon' else: key = 'text' assignments[key] = q[i+1] i+=2 elif q.kw(i, 'datetime'): if q[i+1] == "now": assignments[q[i]] = datetime.datetime.now() elif q[i+1] == "NONE": assignments[q[i]] = None elif q[i+1].isdigit(): try: assignments[q[i]] = datetime.datetime.fromtimestamp(float(q[i+1])) except: raise SyntaxError("Unable to parse datetime: " + str(q[i+1])) else: try: assignments[q[i]] = datetime.datetime.strptime(q[i+1], "%Y-%m-%dT%H:%M:%S") except: raise SyntaxError("Unable to parse datetime: " + str(q[i+1])) i += 2 else: if not assignments: raise SyntaxError("Expected assignments after WITH statement, but no valid attribute found, got " + str(q[i]) + " at position " + str(i) + " in: " + str(q)) break return i class Action(object): #Action expression def __init__(self, action, focus, assignments={}): self.action = action self.focus = focus #Selector self.assignments = assignments self.form = None self.subactions = [] self.nextaction = None self.span = None #encodes an extra SPAN/RESPAN action @staticmethod def parse(q,i=0): if q.kw(i, ('SELECT','EDIT','DELETE','ADD','APPEND','PREPEND','SUBSTITUTE')): action = q[i] else: raise SyntaxError("Expected action, got " + str(q[i]) + " in: " + str(q)) assignments = {} i += 1 if (action in ('SUBSTITUTE','APPEND','PREPEND')) and (isinstance(q[i],UnparsedQuery)): focus = None #We have a SUBSTITUTE/APPEND/PREPEND (AS CORRECTION) expression elif (action == 'SELECT') and q.kw(i,('FOR','IN')): #select statement without focus, pure target focus = None else: focus, i = Selector.parse(q,i) if action == "ADD" and focus.filter: raise SyntaxError("Focus has WHERE statement but ADD action does not support this") if q.kw(i,"WITH"): if action in ("SELECT", "DELETE"): raise SyntaxError("Focus has WITH statement but " + action + " does not support this: " +str(q)) i += 1 i = getassignments(q,i ,assignments, focus) #we have enough to set up the action now action = Action(action, focus, assignments) if action.action in ("EDIT","ADD", "APPEND","PREPEND") and q.kw(i,("RESPAN","SPAN")): action.span, i = Span.parse(q,i+1) done = False while not done: if isinstance(q[i], UnparsedQuery): #we have a sub expression if q[i].kw(0, ('EDIT','DELETE','ADD')): #It's a sub-action!
if action.action in ("DELETE"): raise SyntaxError("Subactions are not allowed for action " + action.action + ", in: " + str(q)) subaction, _ = Action.parse(q[i]) action.subactions.append( subaction ) elif q[i].kw(0, 'AS'): if q[i].kw(1, "ALTERNATIVE"): action.form,_ = Alternative.parse(q[i]) elif q[i].kw(1, "CORRECTION") or (q[i].kw(1,"BARE") and q[i].kw(2, "CORRECTION")): action.form,_ = Correction.parse(q[i],0,action.focus) else: raise SyntaxError("Invalid keyword after AS: " + str(q[i][1])) i+=1 else: done = True if q.kw(i, ('SELECT','EDIT','DELETE','ADD','APPEND','PREPEND','SUBSTITUTE')): #We have another action! action.nextaction, i = Action.parse(q,i) return action, i def __call__(self, query, contextselector, debug=False): """Returns a list focusselection after having performed the desired action on each element therein""" #contextselector is a two-tuple function recipe (f,args), so we can reobtain the generator which it returns #select all focuss, not lazy because we are going return them all by definition anyway if debug: print("[FQL EVALUATION DEBUG] Action - Preparing to evaluate action chain starting with ", self.action,file=sys.stderr) #handles all actions further in the chain, not just this one!!! This actual method is only called once actions = [self] a = self while a.nextaction: actions.append(a.nextaction) a = a.nextaction if len(actions) > 1: #multiple actions to perform, apply contextselector once and load in memory (will be quicker at higher memory cost, proportionate to the target selection size) if isinstance(contextselector, tuple) and len(contextselector) == 2: contextselector = list(contextselector[0](*contextselector[1])) focusselection_all = [] constrainedtargetselection_all = [] for action in actions: if action.action != "SELECT" and action.focus: #check if set is declared, if not, auto-declare if debug: print("[FQL EVALUATION DEBUG] Action - Auto-declaring ",action.focus.Class.__name__, " of ", str(action.focus.set),file=sys.stderr) action.focus.autodeclare(query.doc) if action.form and isinstance(action.form, Correction) and action.focus: if debug: print("[FQL EVALUATION DEBUG] Action - Auto-declaring ",action.focus.Class.__name__, " of ", str(action.focus.set),file=sys.stderr) action.form.autodeclare(query.doc) substitution = {} if self.action == 'SUBSTITUTE' and not self.focus and self.form: #we have a SUBSTITUTE (AS CORRECTION) statement with no correction but only suggestions #defer substitute to form result = self.form.substitute(query, None, contextselector, debug) focusselection = [result] constrainedtargetselection = [] #(no further chaining possible in this setup) elif self.action == 'PREPEND' and not self.focus and self.form: #we have a PREPEND (AS CORRECTION) statement with no correction but only suggestions #defer substitute to form result = self.form.prepend(query, None, contextselector, debug) focusselection = [result] constrainedtargetselection = [] #(no further chaining possible in this setup) elif self.action == 'APPEND' and not self.focus and self.form: #we have a APPEND (AS CORRECTION) statement with no correction but only suggestions #defer substitute to form result = self.form.append(query, None, contextselector, debug) focusselection = [result] constrainedtargetselection = [] #(no further chaining possible in this setup) else: for action in actions: if debug: print("[FQL EVALUATION DEBUG] Action - Evaluating action ", action.action,file=sys.stderr) focusselection = [] constrainedtargetselection = [] #selecting focus elements constrains 
the target selection processed_form = [] if substitution and action.action != "SUBSTITUTE": raise QueryError("SUBSTITUTE can not be chained with " + action.action) if action.action == "SELECT" and not action.focus: #SELECT without focus, pure target-select if isinstance(contextselector, tuple) and len(contextselector) == 2: for e in contextselector[0](*contextselector[1]): constrainedtargetselection.append(e) focusselection.append(e) else: for e in contextselector: constrainedtargetselection.append(e) focusselection.append(e) elif action.action not in ("ADD","APPEND","PREPEND"): #only for actions that operate on an existing focus if contextselector is query.doc and action.focus.Class in ('ALL',folia.Text): focusselector = ( (x,x) for x in query.doc ) #Patch to make root-level SELECT ALL work as intended else: strict = query.targets and query.targets.strict focusselector = action.focus(query,contextselector, not strict, debug) if debug: print("[FQL EVALUATION DEBUG] Action - Obtaining focus...",file=sys.stderr) for focus, target in focusselector: if target and action.action != "SUBSTITUTE": if isinstance(target, SpanSet): if not target.partof(constrainedtargetselection): if debug: print("[FQL EVALUATION DEBUG] Action - Got target result (spanset), adding ", repr(target),file=sys.stderr) constrainedtargetselection.append(target) elif not any(x is target for x in constrainedtargetselection): if debug: print("[FQL EVALUATION DEBUG] Action - Got target result, adding ", repr(target),file=sys.stderr) constrainedtargetselection.append(target) if action.form and action.action != "SUBSTITUTE": #Delegate action to form (= correction or alternative) if not any(x is focus for x in processed_form): if debug: print("[FQL EVALUATION DEBUG] Action - Got focus result, processing using form ", repr(focus),file=sys.stderr) processed_form.append(focus) focusselection += list(action.form(query, action,focus,target,debug)) else: if debug: print("[FQL EVALUATION DEBUG] Action - Focus result already obtained, skipping... ", repr(focus),file=sys.stderr) continue else: if isinstance(focus,SpanSet): if not focus.partof(focusselection): if debug: print("[FQL EVALUATION DEBUG] Action - Got focus result (spanset), adding ", repr(target),file=sys.stderr) focusselection.append(target) else: if debug: print("[FQL EVALUATION DEBUG] Action - Focus result (spanset) already obtained, skipping... ", repr(target),file=sys.stderr) continue elif not any(x is focus for x in focusselection): if debug: print("[FQL EVALUATION DEBUG] Action - Got focus result, adding ", repr(focus),file=sys.stderr) focusselection.append(focus) else: if debug: print("[FQL EVALUATION DEBUG] Action - Focus result already obtained, skipping... 
", repr(focus),file=sys.stderr) continue if action.action == "EDIT": if debug: print("[FQL EVALUATION DEBUG] Action - Applying EDIT to focus ", repr(focus),file=sys.stderr) for attr, value in action.assignments.items(): if attr in ("text","value","phon"): if isinstance(focus, (folia.Description, folia.Comment, folia.Content)): if debug: print("[FQL EVALUATION DEBUG] Action - setting value ("+ value+ ") on focus ", repr(focus),file=sys.stderr) focus.value = value elif isinstance(focus, (folia.PhonContent)): if debug: print("[FQL EVALUATION DEBUG] Action - setphon("+ value+ ") on focus ", repr(focus),file=sys.stderr) focus.setphon(value) else: if debug: print("[FQL EVALUATION DEBUG] Action - settext("+ value+ ") on focus ", repr(focus),file=sys.stderr) focus.settext(value) elif attr == "class": if debug: print("[FQL EVALUATION DEBUG] Action - " + attr + " = " + value + " on focus ", repr(focus),file=sys.stderr) focus.cls = value else: if debug: print("[FQL EVALUATION DEBUG] Action - " + attr + " = " + value + " on focus ", repr(focus),file=sys.stderr) setattr(focus, attr, value) if action.span is not None: #respan if not isinstance(focus, folia.AbstractSpanAnnotation): raise QueryError("Can only perform RESPAN on span annotation elements!") spanset = next(action.span(query, contextselector, True, debug)) #there can be only one focus.setspan(*spanset) query._touch(focus) elif action.action == "DELETE": if debug: print("[FQL EVALUATION DEBUG] Action - Applying DELETE to focus ", repr(focus),file=sys.stderr) p = focus.parent p.remove(focus) #we set the parent back on the element we return, so return types like ancestor-focus work focus.parent = p elif action.action == "SUBSTITUTE": if debug: print("[FQL EVALUATION DEBUG] Action - Applying SUBSTITUTE to target ", repr(focus),file=sys.stderr) if not isinstance(target,SpanSet) or not target: raise QueryError("SUBSTITUTE requires a target SPAN") focusselection.remove(focus) if not substitution: #this is the first SUBSTITUTE in a chain prev = target[0].parent for e in target[1:]: if e.parent != prev: raise QueryError("SUBSTITUTE can only be performed when the target items share the same parent") substitution['parent'] = target[0].parent substitution['index'] = 0 substitution['span'] = target substitution['new'] = [] #find insertion index: for i, e in enumerate(target[0].parent): if e is target[0]: substitution['index'] = i break substitution['new'].append( (action.focus.Class, action.assignments, action.subactions) ) if action.action in ("ADD","APPEND","PREPEND") or (action.action == "EDIT" and not focusselection): if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " to targets",file=sys.stderr) if not action.focus.Class: raise QueryError("Focus of action has no class!") isspan = issubclass(action.focus.Class, folia.AbstractSpanAnnotation) isspanrole = issubclass(action.focus.Class, folia.AbstractSpanRole) if 'set' not in action.assignments and action.focus.Class not in (folia.Description, folia.Comment, folia.Feature) and not isspanrole: if action.focus.set and action.focus.set != "undefined": action.assignments['set'] = action.focus.set elif action.focus.Class.XMLTAG in query.defaultsets: action.assignments['set'] = action.focus.set = query.defaultsets[action.focus.Class.XMLTAG] else: action.assignments['set'] = action.focus.set = query.doc.defaultset(action.focus.Class) if isinstance(contextselector, tuple) and len(contextselector) == 2: targetselection = contextselector[0](*contextselector[1]) else: targetselection = 
contextselector for target in targetselection: if action.form: #Delegate action to form (= correction or alternative) focusselection += list( action.form(query, action,None,target,debug) ) else: if isinstance(target, SpanSet): if action.action == "ADD" or action.action == "EDIT": if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ + " to target spanset " + repr(target),file=sys.stderr) if action.span is not None and len(action.span) == 0: action.assignments['emptyspan'] = True focusselection.append( target[0].add(action.focus.Class, *target, **action.assignments) ) #handles span annotation too query._touch(focusselection[-1]) else: if action.action == "ADD" or action.action == "EDIT": if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ + " to target " + repr(target),file=sys.stderr) focusselection.append( target.add(action.focus.Class, **action.assignments) ) #handles span annotation too query._touch(focusselection[-1]) elif action.action == "APPEND": if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ +" to target " + repr(target),file=sys.stderr) index = target.parent.getindex(target) if index == -1: raise QueryError("Insertion point for APPEND action not found") focusselection.append( target.parent.insert(index+1, action.focus.Class, **action.assignments) ) query._touch(focusselection[-1]) elif action.action == "PREPEND": if debug: print("[FQL EVALUATION DEBUG] Action - Applying " + action.action + " of " + action.focus.Class.__name__ +" to target " + repr(target),file=sys.stderr) index = target.parent.getindex(target) if index == -1: raise QueryError("Insertion point for PREPEND action not found") focusselection.append( target.parent.insert(index, action.focus.Class, **action.assignments) ) query._touch(focusselection[-1]) if isinstance(target, SpanSet): if not target.partof(constrainedtargetselection): constrainedtargetselection.append(target) elif not any(x is target for x in constrainedtargetselection): constrainedtargetselection.append(target) if focusselection and action.span: #process SPAN keyword (ADD .. SPAN .. FOR .. rather than ADD ... FOR SPAN ..) 
if not isspan: raise QueryError("Can only use SPAN with span annotation elements!") for focus in focusselection: spanset = next(action.span(query, contextselector, True, debug)) #there can be only one focus.setspan(*spanset) if focusselection and action.subactions and not substitution: for subaction in action.subactions: #check if set is declared, if not, auto-declare if debug: print("[FQL EVALUATION DEBUG] Action - Auto-declaring ",action.focus.Class.__name__, " of ", str(action.focus.set),file=sys.stderr) subaction.focus.autodeclare(query.doc) if debug: print("[FQL EVALUATION DEBUG] Action - Invoking subaction ", subaction.action,file=sys.stderr) subaction(query, focusselection, debug ) #note: results of subactions will be silently discarded, they can never select anything if len(actions) > 1: #consolidate results: focusselection_all = [] for e in focusselection: if isinstance(e, SpanSet): if not e.partof(focusselection_all): focusselection_all.append(e) elif not any(x is e for x in focusselection_all): focusselection_all.append(e) constrainedtargetselection_all = [] for e in constrainedtargetselection: if isinstance(e, SpanSet): if not e.partof(constrainedtargetselection_all): constrainedtargetselection_all.append(e) elif not any(x is e for x in constrainedtargetselection_all): constrainedtargetselection_all.append(e) if substitution: constrainedtargetselection_all = [] constrainedtargetselection = [] if action.form: result = action.form.substitute(query, substitution, None, debug) if len(actions) > 1: focusselection_all.append(result) else: focusselection.append(result) else: if debug: print("[FQL EVALUATION DEBUG] Action - Substitution - Removing target",file=sys.stderr) for e in substitution['span']: substitution['parent'].remove(e) for i, (Class, assignments, subactions) in enumerate(substitution['new']): if debug: print("[FQL EVALUATION DEBUG] Action - Substitution - Inserting substitution",file=sys.stderr) e = substitution['parent'].insert(substitution['index']+i, Class, **assignments) for subaction in subactions: subaction.focus.autodeclare(query.doc) if debug: print("[FQL EVALUATION DEBUG] Action - Invoking subaction (in substitution) ", subaction.action,file=sys.stderr) subaction(query, [e], debug ) #note: results of subactions will be silently discarded, they can never select anything if len(actions) > 1: focusselection_all.append(e) else: focusselection.append(e) if len(actions) > 1: return focusselection_all, constrainedtargetselection_all else: return focusselection, constrainedtargetselection class Context(object): def __init__(self): self.format = "python" self.returntype = "focus" self.request = "all" self.defaults = {} self.defaultsets = {} class Query(object): """This class represents an FQL query. 
Selecting a word with a particular text is done as follows, where ``doc`` is an instance of :class:`pynlpl.formats.folia.Document`::

    query = fql.Query('SELECT w WHERE text = "house"')
    for word in query(doc):
        print(word) #this will be an instance of folia.Word

Regular expression matching can be done using the ``MATCHES`` operator::

    query = fql.Query('SELECT w WHERE text MATCHES "^house.*$"')
    for word in query(doc):
        print(word)

The classes of other annotation types can be easily queried as follows::

    query = fql.Query('SELECT w WHERE :pos = "v" AND :lemma = "be"')
    for word in query(doc):
        print(word)

You can constrain your queries to a particular target selection using the ``FOR`` keyword::

    query = fql.Query('SELECT w WHERE text MATCHES "^house.*$" FOR s WHERE text CONTAINS "sell"')
    for word in query(doc):
        print(word)

This construction also allows you to select the actual annotations. To select all people (a named entity) for words that are not "John"::

    query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John"')
    for entity in query(doc):
        print(entity) #this will be an instance of folia.Entity

**FOR** statements may be chained, and explicit IDs can be passed using the ``ID`` keyword::

    query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John" FOR div ID "section.21"')
    for entity in query(doc):
        print(entity)

Sets are specified using the **OF** keyword; it can be omitted if there is only one set for the annotation type, but is required otherwise::

    query = fql.Query('SELECT su OF "http://some/syntax/set" WHERE class = "np"')
    for su in query(doc):
        print(su) #this will be an instance of folia.SyntacticUnit

We have covered only the **SELECT** keyword so far; FQL has further keywords for manipulating documents, such as **EDIT**, **ADD**, **APPEND** and **PREPEND**.

Note: Consult the FQL documentation at https://github.com/proycon/foliadocserve/blob/master/README.rst for further documentation on the language.
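
The output format and return type can also be set from within the query string itself. The following is an illustrative sketch based on the ``FORMAT`` and ``RETURN`` handling in :meth:`Query.parse` and :meth:`Query.__call__` below, not an excerpt from the FQL documentation::

    #ask for the matching words serialised as a JSON list rather than folia.Word instances
    query = fql.Query('SELECT w WHERE text = "house" FORMAT json')
    jsonstring = query(doc)

Valid formats include ``python`` (the default), ``xml``, ``json`` and their ``single-*`` variants; ``RETURN`` selects whether the focus or the target of the query is returned.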
""" def __init__(self, q, context=Context()): self.action = None self.targets = None self.declarations = [] self.format = context.format self.returntype = context.returntype self.request = copy(context.request) self.defaults = copy(context.defaults) self.defaultsets = copy(context.defaultsets) self.parse(q) def parse(self, q, i=0): if not isinstance(q,UnparsedQuery): q = UnparsedQuery(q) l = len(q) if q.kw(i,"DECLARE"): try: Class = folia.XML2CLASS[q[i+1]] except: raise SyntaxError("DECLARE statement expects a FoLiA element, got: " + str(q[i+1])) if not Class.ANNOTATIONTYPE: raise SyntaxError("DECLARE statement for undeclarable element type: " + str(q[i+1])) i += 2 defaults = {} decset = None if q.kw(i,"OF") and q[i+1]: i += 1 decset = q[i] i += 1 if q.kw(i,"WITH"): i = getassignments(q,i+1,defaults) if not decset: raise SyntaxError("DECLARE statement must state a set") self.declarations.append( (Class, decset, defaults) ) if i < l: self.action,i = Action.parse(q,i) if q.kw(i,("FOR","IN")): self.targets, i = Target.parse(q,i) while i < l: if q.kw(i,"RETURN"): self.returntype = q[i+1] i+=2 elif q.kw(i,"FORMAT"): self.format = q[i+1] i+=2 elif q.kw(i,"REQUEST"): self.request = q[i+1].split(",") i+=2 else: raise SyntaxError("Unexpected " + str(q[i]) + " at position " + str(i) + " in: " + str(q)) if i != l: raise SyntaxError("Expected end of query, got " + str(q[i]) + " in: " + str(q)) def __call__(self, doc, wrap=True,debug=False): """Execute the query on the specified document""" self.doc = doc if debug: print("[FQL EVALUATION DEBUG] Query - Starting on document ", doc.id,file=sys.stderr) if self.declarations: for Class, decset, defaults in self.declarations: if debug: print("[FQL EVALUATION DEBUG] Processing declaration for ", Class.__name__, "of",str(decset),file=sys.stderr) doc.declare(Class,decset,**defaults) if self.action: targetselector = doc if self.targets and not (isinstance(self.targets.targets[0], Selector) and self.targets.targets[0].Class in ("ALL", folia.Text)): targetselector = (self.targets, (self, targetselector, True, debug)) #function recipe to get the generator for the targets, (f, *args) (first is always recursive) focusselection, targetselection = self.action(self, targetselector, debug) #selecting focus elements further constrains the target selection (if any), return values will be lists if self.returntype == "nothing": return "" elif self.returntype == "focus": responseselection = focusselection elif self.returntype == "target" or self.returntype == "inner-target": responseselection = [] for e in targetselection: if not any(x is e for x in responseselection): #filter out duplicates responseselection.append(e) elif self.returntype == "outer-target": raise NotImplementedError elif self.returntype == "ancestor" or self.returntype == "ancestor-focus": responseselection = [] try: responseselection.append( next(folia.commonancestors(folia.AbstractStructureElement,*focusselection)) ) except StopIteration: raise QueryError("No ancestors found for focus: " + str(repr(focusselection))) elif self.returntype == "ancestor-target": elems = [] for e in targetselection: if isinstance(e, SpanSet): elems += e else: elems.append(e) responseselection = [] try: responseselection.append( next(folia.commonancestors(folia.AbstractStructureElement,*elems)) ) except StopIteration: raise QueryError("No ancestors found for targets: " + str(repr(targetselection))) else: raise QueryError("Invalid return type: " + self.returntype) else: responseselection = [] if self.returntype == "nothing": 
#we're done return "" #convert response selection to proper format and return if self.format.startswith('single'): if len(responseselection) > 1: raise QueryError("A single response was expected, but multiple are returned") if self.format == "single-xml": if debug: print("[FQL EVALUATION DEBUG] Query - Returning single-xml",file=sys.stderr) if not responseselection: return "" else: if isinstance(responseselection[0], SpanSet): r = "\n" for e in responseselection[0]: r += e.xmlstring(True) r += "\n" return r else: return responseselection[0].xmlstring(True) elif self.format == "single-json": if debug: print("[FQL EVALUATION DEBUG] Query - Returning single-json",file=sys.stderr) if not responseselection: return "null" else: return json.dumps(responseselection[0].json()) elif self.format == "single-python": if debug: print("[FQL EVALUATION DEBUG] Query - Returning single-python",file=sys.stderr) if not responseselection: return None else: return responseselection[0] else: if self.format == "xml": if debug: print("[FQL EVALUATION DEBUG] Query - Returning xml",file=sys.stderr) if not responseselection: if wrap: return "" else: return "" else: if wrap: r = "\n" else: r = "" for e in responseselection: if isinstance(e, SpanSet): r += "\n" for e2 in e: r += "" + e2.xmlstring(True) + "\n" r += "\n" else: r += "\n" + e.xmlstring(True) + "\n" if wrap: r += "\n" return r elif self.format == "json": if debug: print("[FQL EVALUATION DEBUG] Query - Returning json",file=sys.stderr) if not responseselection: if wrap: return "[]" else: return "" else: if wrap: s = "[ " else: s = "" for e in responseselection: if isinstance(e, SpanSet): s += json.dumps([ e2.json() for e2 in e ] ) + ", " else: s += json.dumps(e.json()) + ", " s = s.strip(", ") if wrap: s += "]" return s else: #python and undefined formats if debug: print("[FQL EVALUATION DEBUG] Query - Returning python",file=sys.stderr) return responseselection return QueryError("Invalid format: " + self.format) def _touch(self, *args): for e in args: if isinstance(e, folia.AbstractElement): e.changedbyquery = self self._touch(*e.data) PyNLPl-1.1.2/pynlpl/formats/cgn.py0000644000175000001440000000736012445064173017627 0ustar proyconusers00000000000000#-*- coding:utf-8 -*- ############################################################### # PyNLPl - Corpus Gesproken Nederlands # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # # Classes for reading CGN (still to be added). Most notably, contains a function for decoding # PoS features like "N(soort,ev,basis,onz,stan)" into a data structure. 
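#
# A minimal usage sketch (illustrative only, based on the parse_cgn_postag
# function defined below; the example tag is the one mentioned above):
#
#   from pynlpl.formats import folia
#   from pynlpl.formats.cgn import parse_cgn_postag
#
#   posannotation = parse_cgn_postag("N(soort,ev,basis,onz,stan)")  #a folia.PosAnnotation
#   for feature in posannotation.select(folia.Feature):
#       print(feature.subset, feature.cls)  #prints the head ("N") and each decoded feature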
# ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout from pynlpl.formats import folia from pynlpl.common import Enum class InvalidTagException(Exception): pass class InvalidFeatureException(Exception): pass subsets = { 'ntype': ['soort','eigen'], 'getal': ['ev','mv','getal',], 'genus': ['zijd','onz','masc','fem','genus'], 'naamval': ['stan','gen','dat','nomin','obl','bijz'], 'spectype': ['afgebr','afk','deeleigen','symb','vreemd','enof','meta','achter','comment','onverst'], 'conjtype': ['neven','onder'], 'vztype': ['init','versm','fin'], 'npagr': ['agr','evon','rest','evz','mv','agr3','evmo','rest3','evf'], 'lwtype': ['bep','onbep'], 'vwtype': ['pers','pr','refl','recip','bez','vb','vrag','betr','excl','aanw','onbep'], 'pdtype': ['adv-pron','pron','det','grad'], 'status': ['vol','red','nadr'], 'persoon': ['1','2','2v','2b','3','3p','3m','3v','3o','persoon'], 'positie': ['prenom','postnom', 'nom','vrij'], 'buiging': ['zonder','met-e','met-s'], 'getal-n' : ['zonder-v','mv-n','zonder-n'], 'graad' : ['basis','comp','sup','dim'], 'wvorm': ['pv','inf','vd','od'], 'pvtijd': ['tgw','verl','conj'], 'pvagr': ['ev','mv','met-t'], 'numtype': ['hoofd','rang'], 'dial': ['dial'], } constraints = { 'getal':['N','VNW'], 'npagr':['VNW','LID'], 'pvagr':['WW'], } def parse_cgn_postag(rawtag, raisefeatureexceptions = False): global subsets, constraints """decodes PoS features like "N(soort,ev,basis,onz,stan)" into a PosAnnotation data structure based on CGN tag overview compiled by Matje van de Camp""" begin = rawtag.find('(') if rawtag[-1] == ')' and begin > 0: tag = folia.PosAnnotation(None, cls=rawtag,set='http://ilk.uvt.nl/folia/sets/cgn') head = rawtag[0:begin] tag.append( folia.Feature, subset='head',cls=head) rawfeatures = rawtag[begin+1:-1].split(',') for rawfeature in rawfeatures: if rawfeature: found = False for subset, classes in subsets.items(): if rawfeature in classes: if subset in constraints: if not head in constraints[subset]: continue #constraint not met! found = True tag.append( folia.Feature, subset=subset,cls=rawfeature) break if not found: print("\t\tUnknown feature value: " + rawfeature + " in " + rawtag, file=stderr) if raisefeatureexceptions: raise InvalidFeatureException("Unknown feature value: " + rawfeature + " in " + rawtag) else: continue return tag else: raise InvalidTagException("Not a valid CGN tag") PyNLPl-1.1.2/pynlpl/formats/timbl.py0000644000175000001440000001074513024723323020161 0ustar proyconusers00000000000000############################################################### # PyNLPl - Timbl Classifier Output Library # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Derived from code by Sander Canisius # # Licensed under GPLv3 # # This library offers a TimblOutput class for reading Timbl # classifier output. 
It supports full distributions (+v+db) and comment (#) # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout from pynlpl.statistics import Distribution class TimblOutput(object): """A class for reading Timbl classifier output, supports the +v+db option and ignores comments starting with #""" def __init__(self, stream, delimiter = ' ', ignorecolumns = [], ignorevalues = []): self.stream = stream self.delimiter = delimiter self.ignorecolumns = ignorecolumns #numbers, ignore the specified FEATURE columns: first column is 1 self.ignorevalues = ignorevalues #Ignore columns with the following values def __iter__(self): # Note: distance parsing (+v+di) works only if distributions (+v+db) are also enabled! for line in self.stream: endfvec = None line = line.strip() if line and line[0] != '#': #ignore empty lines and comments segments = [ x for i, x in enumerate(line.split(self.delimiter)) if x not in self.ignorevalues and i+1 not in self.ignorecolumns ] #segments = [ x for x in line.split() if x != "^" and not (len(x) == 3 and x[0:2] == "n=") ] #obtain segments, and filter null fields and "n=?" feature (in fixed-feature configuration) if not endfvec: try: # Modified by Ruben. There are some cases where one of the features is a {, and then # the module is not able to obtain the distribution of scores and senses # We have to look for the last { in the vector, and due to there is no rindex method # we obtain the reverse and then apply index. aux=list(reversed(segments)).index("{") endfvec=len(segments)-aux-1 #endfvec = segments.index("{") except ValueError: endfvec = None if endfvec and endfvec > 2: # only for +v+db try: enddistr = segments.index('}',endfvec) except ValueError: raise distribution = self.parseDistribution(segments, endfvec, enddistr) if len(segments) > enddistr + 1: distance = float(segments[-1]) else: distance = None else: endfvec = len(segments) distribution = None distance = None #features, referenceclass, predictedclass, distribution, distance yield segments[:endfvec - 2], segments[endfvec - 2], segments[endfvec - 1], distribution, distance def parseDistribution(self, instance, start,end= None): dist = {} i = start + 1 if not end: end = len(instance) - 1 while i < end: #instance[i] != "}": label = instance[i] try: score = float(instance[i+1].rstrip(",")) dist[label] = score except: print("ERROR: pynlpl.input.timbl.TimblOutput -- Could not fetch score for class '" + label + "', expected float, but found '"+instance[i+1].rstrip(",")+"'. Instance= " + " ".join(instance)+ ".. Attempting to compensate...",file=stderr) i = i - 1 i += 2 if not dist: print("ERROR: pynlpl.input.timbl.TimblOutput -- Did not find class distribution for ", instance,file=stderr) return Distribution(dist) PyNLPl-1.1.2/pynlpl/formats/sonar.py0000644000175000001440000002374212445064173020204 0ustar proyconusers00000000000000#--------------------------------------------------------------- # PyNLPl - Simple Read library for D-Coi/SoNaR format # by Maarten van Gompel, ILK, Universiteit van Tilburg # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # # This library facilitates parsing and reading corpora in # the SoNaR/D-Coi format. 
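#
# A minimal usage sketch (illustrative only; CorpusDocument is defined below,
# the filename is hypothetical):
#
#   from pynlpl.formats.sonar import CorpusDocument
#
#   doc = CorpusDocument("WR-P-E-A-0000000001.pos")
#   for word, id, pos, lemma in doc:            #iterate over all words
#       print(word, pos, lemma)
#   for sentence_id, sentence in doc.sentences():  #sentence is a list of (word,id,pos,lemma) tuples
#       print(sentence_id, " ".join(w for w, _, _, _ in sentence))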
# #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import io import re import glob import os.path import sys from lxml import etree as ElementTree if sys.version < '3': from StringIO import StringIO else: from io import StringIO namespaces = { 'dcoi': "http://lands.let.ru.nl/projects/d-coi/ns/1.0", 'standalone':"http://ilk.uvt.nl/dutchsemcor-standalone", 'dsc':"http://ilk.uvt.nl/dutchsemcor", 'xml':"http://www.w3.org/XML/1998/namespace" } class CorpusDocument(object): """This class represent one document/text of the Corpus (read-only)""" def __init__(self, filename, encoding = 'iso-8859-15'): self.filename = filename self.id = os.path.basename(filename).split(".")[0] self.f = io.open(filename,'r', encoding=encoding) self.metadata = {} def _parseimdi(self,line): r = re.compile('(.*)') matches = r.findall(line) if matches: self.metadata['title'] = matches[0] if not 'date' in self.metadata: r = re.compile('(.*)') matches = r.findall(line) if matches: self.metadata['date'] = matches[0] def __iter__(self): """Iterate over all words, a four-tuple (word,id,pos,lemma), in the document""" r = re.compile('(.*)') for line in self.f.readlines(): matches = r.findall(line) for id, attribs, word in matches: pos = lemma = None m = re.findall('pos="([^"]+)"', attribs) if m: pos = m[0] m = re.findall('lemma="([^"]+)"', attribs) if m: lemma = m[0] yield word, id, pos, lemma if line.find('imdi:') != -1: self._parseimdi(line) def words(self): #alias return iter(self) def sentences(self): """Iterate over all sentences (sentence_id, sentence) in the document, sentence is a list of 4-tuples (word,id,pos,lemma)""" prevp = 0 prevs = 0 sentence = []; sentence_id = "" for word, id, pos, lemma in iter(self): try: doc_id, ptype, p, s, w = re.findall('([\w\d-]+)\.(p|head)\.(\d+)\.s\.(\d+)\.w\.(\d+)',id)[0] if ((p != prevp) or (s != prevs)) and sentence: yield sentence_id, sentence sentence = [] sentence_id = doc_id + '.' + ptype + '.' + str(p) + '.s.' + str(s) prevp = p except IndexError: doc_id, s, w = re.findall('([\w\d-]+)\.s\.(\d+)\.w\.(\d+)',id)[0] if s != prevs and sentence: yield sentence_id, sentence sentence = [] sentence_id = doc_id + '.s.' + str(s) sentence.append( (word,id,pos,lemma) ) prevs = s if sentence: yield sentence_id, sentence def paragraphs(self, with_id = False): """Extracts paragraphs, returns list of plain-text(!) paragraphs""" prevp = 0 partext = [] for word, id, pos, lemma in iter(self): doc_id, ptype, p, s, w = re.findall('([\w\d-]+)\.(p|head)\.(\d+)\.s\.(\d+)\.w\.(\d+)',id)[0] if prevp != p and partext: yield ( doc_id + "." + ptype + "." + prevp , " ".join(partext) ) partext = [] partext.append(word) prevp = p if partext: yield (doc_id + "." + ptype + "." + prevp, " ".join(partext) ) class Corpus: def __init__(self,corpusdir, extension = 'pos', restrict_to_collection = "", conditionf=lambda x: True, ignoreerrors=False): self.corpusdir = corpusdir self.extension = extension self.restrict_to_collection = restrict_to_collection self.conditionf = conditionf self.ignoreerrors = ignoreerrors def __iter__(self): if not self.restrict_to_collection: for f in glob.glob(self.corpusdir+"/*." 
+ self.extension): if self.conditionf(f): try: yield CorpusDocument(f) except: print("Error, unable to parse " + f,file=sys.stderr) if not self.ignoreerrors: raise for d in glob.glob(self.corpusdir+"/*"): if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)): for f in glob.glob(d+ "/*." + self.extension): if self.conditionf(f): try: yield CorpusDocument(f) except: print("Error, unable to parse " + f,file=sys.stderr) if not self.ignoreerrors: raise ####################################################### def ns(namespace): """Resolves the namespace identifier to a full URL""" global namespaces return '{'+namespaces[namespace]+'}' class CorpusFiles(Corpus): def __iter__(self): if not self.restrict_to_collection: for f in glob.glob(self.corpusdir+"/*." + self.extension): if self.conditionf(f): yield f for d in glob.glob(self.corpusdir+"/*"): if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)): for f in glob.glob(d+ "/*." + self.extension): if self.conditionf(f): yield f class CorpusX(Corpus): def __iter__(self): if not self.restrict_to_collection: for f in glob.glob(self.corpusdir+"/*." + self.extension): if self.conditionf(f): try: yield CorpusDocumentX(f) except: print("Error, unable to parse " + f,file=sys.stderr) if not self.ignoreerrors: raise for d in glob.glob(self.corpusdir+"/*"): if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)): for f in glob.glob(d+ "/*." + self.extension): if self.conditionf(f): try: yield CorpusDocumentX(f) except: print("Error, unable to parse " + f,file=sys.stderr) if not self.ignoreerrors: raise class CorpusDocumentX: """This class represent one document/text of the Corpus, loaded into memory at once and retaining the full structure""" def __init__(self, filename, tree = None, index=True ): global namespaces self.filename = filename if not tree: self.tree = ElementTree.parse(self.filename) self.committed = True elif isinstance(tree, ElementTree._Element): self.tree = tree self.committed = False #Grab root element and determine if we run inline or standalone self.root = self.xpath("/dcoi:DCOI") if self.root: self.root = self.root[0] self.inline = True else: raise Exception("Not in DCOI/SoNaR format!") #self.root = self.xpath("/standalone:text") #self.inline = False #if not self.root: # raise FormatError() #build an index self.index = {} if index: self._index(self.root) def _index(self,node): if ns('xml') + 'id' in node.attrib: self.index[node.attrib[ns('xml') + 'id']] = node for subnode in node: #TODO: can we do this with xpath instead? 
self._index(subnode) def validate(self, formats_dir="../formats/"): """checks if the document is valid""" #TODO: download XSD from web if self.inline: xmlschema = ElementTree.XMLSchema(ElementTree.parse(StringIO("\n".join(open(formats_dir+"dcoi-dsc.xsd").readlines())))) xmlschema.assertValid(self.tree) #return xmlschema.validate(self) else: xmlschema = ElementTree.XMLSchema(ElementTree.parse(StringIO("\n".join(open(formats_dir+"dutchsemcor-standalone.xsd").readlines())))) xmlschema.assertValid(self.tree) #return xmlschema.validate(self) def xpath(self, expression): """Executes an xpath expression using the correct namespaces""" global namespaces return self.tree.xpath(expression, namespaces=namespaces) def __exists__(self, id): return (id in self.index) def __getitem__(self, id): return self.index[id] def paragraphs(self, node=None): """iterate over paragraphs""" if node == None: node = self return node.xpath("//dcoi:p") def sentences(self, node=None): """iterate over sentences""" if node == None: node = self return node.xpath("//dcoi:s") def words(self,node=None): """iterate over words""" if node == None: node = self return node.xpath("//dcoi:w") def save(self, filename=None, encoding='iso-8859-15'): if not filename: filename = self.filename self.tree.write(filename, encoding=encoding, method='xml', pretty_print=True, xml_declaration=True) PyNLPl-1.1.2/pynlpl/formats/__init__.py0000644000175000001440000000012512445064173020607 0ustar proyconusers00000000000000"""This package contains modules for reading and/or writing specific file formats""" PyNLPl-1.1.2/pynlpl/formats/imdi.py0000644000175000001440000017477212445064173020016 0ustar proyconusers00000000000000RELAXNG_IMDI = """ The root element for IMDI descriptions Instantiation of a VocabularyDef_Type Revision history of the metadata description Information on creation location for this data The name of a continent The name of a country The name of a geographic region The address List of a number of key name value pairs. Should be used to add information that is not covered by other metadata elements at this level Groups information about the languages used in the session Description for the list of languages spoken by this participant Groups information about access rights for this data Availability of the data Date when access rights were evaluated Name of owner resource Publisher responsible for distribution of this data Resource is preferably a metadata resource. In the case of a well-defined merged metadata/content format such as TEI or legacy resources for which no further metadata is available it is the resource itself. If the external resource is an IMDI session with written resources Type & SubType will be the same as the Type & SubType of the primary written resource in that session. If it is a session with IMDI multi-media resources the Type of the Media File will designate it. SubType is used only for written resources. Non-IMDI metadata resource types need to be mapped to IMDI types The type of the external (metadata) resource The sub type of the external (metadata) resource. 
Only used in case its metadata for a written resource The metadata format The URL of the external metadata record Project Information A short name or abbreviation for the project The full title of the project A unique identifier for the project Contact information for this project Description for this project Type for group of metadata pertaining to a session Groups information about the location where the session was created Groups information about the project for which the session was (originally) created Project keys Groups information about the content of the session. The content description takes place in several (overlapping) dimensions Groups information about all actors in the session Major genre classification Sub genre classification List of he major tasks carried out in the session List of modalities used in the session Classifies the subject of the session. Uses preferably an existing library classification scheme such as LCSH. The element has a scheme attribute that indicates what scheme is used. Comments: The element can be repeated but the user should guarantee consistency This groups information concerning the context of communication degree of interactivity Degree of planning of the event Indicates in how far the researcher was involved in the linguistic event Indicates the social context the event took place in Indicates the structure of the communication event Indicates the channel of the communication Description for the content of this session Description about the actors as a group Group of actors Functional role of the actor e.g. consultant, contributor, interviewer, researcher, publisher, collector, translator Name of the actor as used by others in the transcription Official name of the actor Short unique code to identify the actor as used in the transcription The family social role of the actor The actor languages The ethnic groups of the actor The age of the actor The birthdate of the actor The sex of the actor The education of the actor Indicates if real names or anonymized codes are used to identify the actor Contact information of the actor Actor keys Description for this individual actor Type for a corpus that points to either other corpora or sessions Name of the (sub-)corpus Title for the (sub-)corpus Description of the (sub-)corpus Link to other resource. Attribute name is for the benefit of browsing Type for group metadata pertaining to published corpora Name of the published corpus Title of the published corpus Identifier of the published corpus Description of the published corpus The languages used for documentation of the corpus Description for the list of languages The languages in the corpus that are subject of analysis Description for the list of languages Content type of the published corpus Publisher responsible for distribution of the published corpus Authors for the resources Human readabusle string that indicates total size of corpus Pricing info of the corpus Person to be contacted about the resource URL to the resource URL to the metadata for the resource List of any publications related to the resource Groups information of language resources connected to the session Groups all media resources Groups information about a Written Resource Groups information only pertaining to a Lexical resource Groups information only pertaining to a lexiconComponent Groups information about the source; e.g. media-carrier, book, newspaper archive etc. 
Groups data about name conversions for persons who are anonymised Groups information about external documentation associated with this session Every description is a reference Groups information about the media file URL to media file Major part of mime-type Minor part of mime-type Size of media file Quality of the recording describes technical conditions of recording Groups information about a Written Resource URL to file containing the annotations/transcription URL to media file from which the annotations/transcriptions originate Date when Written Resource was created The type of the WrittenResource The subtype of the WrittenResource File format used for Written Resource The size of the Written Resource file. Integer value with addition of M (mega) or K (kilo) How this document relates to another resource Character encoding used in the written resource Content encoding used in the written resource Language used in the resource Indicates if data has been anonymised. CV boolean Groups information only pertaining to a Lexical resource URL to lexical resource Date when lexical resource was created The type of the WrittenResource The format of the LexicalResource The character encoding of the LexicalResource The size of the LexicalResource in bytes The number of head entries of the LexicalResource The number of sub entries of the LexicalResource OCV: Sentence, Phrase, Wordform, Lemma, ... OCV: HyphenatedSpelling, SyllabifiedSpelling, ... OCV: Stem,StemALlomorphy, Segmentation, ... OCV: POS, Inflexion, Countability, ... OCV: Complementation, Alternation, Modification, ... OCV: Transcription, IPA Transcription, CV pattern, ... OCV: Sense dstinction A block to describe the languages that are used to define terms, to describe meaning Groups information only pertaining to a lexiconComponent URL to lexiconComponent Date when lexiconComponent was created The type of the lexiconComponent The format of the lexiconComponent The character encoding of the lexiconComponent The size of the lexiconComponent in bytes Describes the tree in which the component can be embedded Describes the possible parents of the lexiconComponent in the schema tree Descibes the preferred parent of the lexiconComponent in the schema tree Describes the possible component children of the lexiconComponent in the schema tree Describes the possible category children of the lexiconComponent in the schema tree Gives information on the lexical applications of the lexiconComponent Describes whether the lexiconComponent can be used to add orthography to the lexicon schema Describes whether the lexiconComponent can be used to add morphology to the lexicon schema. Describes whether the lexiconComponent can be used to add morphosyntactic features to the lexicon schema Describes whether the lexiconComponent can be used to add syntactic features to the lexicon schema Describes whether the lexiconComponent can be used to add phonology to the lexicon schema. Describes whether the lexiconComponent can be used to add a semantic element to the lexicon schema A block to describe the languages that are used to define terms, to describe meaning Groups information about the original source; e.g. media-carrier, book, newspaper archive etc. Unique code to identify the original source Physical storage format of the source Quality of original recording Description for the original source Groups data about name conversions for persons who are anonymised URL to information to convert pseudo named to real-names The definition of a vocabulary. 
Attributes: Date of creattion, Link to origin. Contails a Description be element to descr+++ ibe the domain of the vocabulary and a (unspecified) number of value enries Human readable description in the form of a text with language id specification and/or a link to a file with a description and language id specification. The name attribute is to name the link (if present) Contact information for this data The validation used for the resource CV: content, type, manual, automatic, semi-automatic Validation methodology Percentage of resource validated Specifies age of a person with differerent counting methods Specifies age of a person in the form of a range An element from a set of languages used in the session Unique code to identify a language Name of the language Is it the speakers mother tongue. Only applicable if used in the context of a speakers language Is it the speakers primary language. Only applicable if used in the context of a speakers language Is it the most frequently used language in the document. Only applicable if used in the context of the resource's language Direction of translation. Only applicable in case it is the context of a lexicon resource Direction of translation. Only applicable in case it is the context of a lexicon resource Description for this particular language Indicates if language is dominant language Indicates if language is source language Indicates if language is target language Description of the language Information on language name and id Unique code to identify a language The name of the language String type for single spaced, single line strings Comma separated string The age of a person ([0-9]+)*(;[0-9]+)*(.[0-9]+)*|Unknown|Unspecified The age of a person given as a range ([0-9]+)?(;[0-9]+)?(.[0-9]+)?(/([0-9]+)?(;[0-9]+)?(.[0-9]+)?)?|Unknown|Unspecified The age counting method SinceConception SinceBirth Vocabulary content and attributes Link to a vocabulary definition Position (start (+end) ) on a old fashioned tape without time indication Position in a media file or modern tape The start time position of a recording The end time position of a recording Quality indication Unspecified is a non-existing (null) value. Unknown is a informational value indicating that the real value is not known Unknown Unspecified empty string definition 0 Comma seperated string [^,]*(,[^,]+)* Loose boolean value where empty values are allowed xsd:boolean imdi:Empty_Value_Type xsd:boolean imdi:Empty_Value_Type integer + Unspecified and Unknown xsd:unsignedInt imdi:Empty_Value_Type xsd:unsignedInt imdi:Empty_Value_Type Defines a date that can also be empty or Unknown or Unspecified imdi:DateRange_Value_Type imdi:EmptyString_Value_Type imdi:Empty_Value_Typeimdi:DateRange_Value_Type imdi:EmptyString_Value_Type imdi:Empty_Value_Typeimdi:DateRange_Value_Type imdi:EmptyString_Value_Type imdi:Empty_Value_Type Defines a date range that can also be Unspecified or Unknown ([0-9]+)((-[0-9]+)(-[0-9]+)?)?(/([0-9]+)((-[0-9]+)(-[0-9]+)?)?)?|Unknown|Unspecified Language identifiers (ISO639(-1|-2|-3)?:.*)? (RFC3066:.*)? (RFC1766:.*)? (SIL:.*)? Unknown Unspecified Time position in the hh:mm:ss:ff format [0-9][0-9]:[0-9][0-9]:[0-9][0-9]:?[0-9]*|Unknown|Unspecified Quality values (1 .. 
5) also allows empty values 1 2 3 4 5 All possible vocabulary type values ClosedVocabulary ClosedVocabularyList OpenVocabulary OpenVocabularyList Allowed values for metadata transcripts SESSION SESSION.Profile LEXICON_RESOURCE_BUNDLE LEXICON_RESOURCE_BUNDLE.Profile CATALOGUE CATALOGUE.Profile CORPUS CORPUS.Profile Attributes allowed for profiles """ PyNLPl-1.1.2/pynlpl/formats/folia.py0000644000175000001440000131130213024723325020140 0ustar proyconusers00000000000000# -*- coding: utf-8 -*- #---------------------------------------------------------------- # PyNLPl - FoLiA Format Module # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # # https://proycon.github.io/folia # httsp://github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Module for reading, editing and writing FoLiA XML # # Licensed under GPLv3 # #---------------------------------------------------------------- #pylint: disable=redefined-builtin,trailing-whitespace,superfluous-parens,bad-classmethod-argument,wrong-import-order,wrong-import-position,ungrouped-imports from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys from copy import copy, deepcopy from datetime import datetime import inspect import itertools import glob import os import re try: import io except ImportError: #old-Python 2.6 fallback import codecs as io import multiprocessing import bz2 import gzip import random from lxml import etree as ElementTree from lxml.builder import ElementMaker if sys.version < '3': from StringIO import StringIO #pylint: disable=import-error,wrong-import-order from urllib import urlopen #pylint: disable=no-name-in-module,wrong-import-order else: from io import StringIO, BytesIO #pylint: disable=wrong-import-order,ungrouped-imports from urllib.request import urlopen #pylint: disable=E0611,wrong-import-order,ungrouped-imports if sys.version < '3': from codecs import getwriter #pylint: disable=wrong-import-order,ungrouped-imports stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout from pynlpl.common import u, isstring from pynlpl.formats.foliaset import SetDefinition, DeepValidationError import pynlpl.algorithms LXE=True #use lxml instead of built-in ElementTree (default) #foliaspec:version:FOLIAVERSION #The FoLiA version FOLIAVERSION = "1.4.0" LIBVERSION = FOLIAVERSION + '.84' #== FoLiA version + library revision #0.9.1.31 is the first version with Python 3 support #foliaspec:namespace:NSFOLIA #The FoLiA XML namespace NSFOLIA = "http://ilk.uvt.nl/folia" NSDCOI = "http://lands.let.ru.nl/projects/d-coi/ns/1.0" nslen = len(NSFOLIA) + 2 nslendcoi = len(NSDCOI) + 2 TMPDIR = "/tmp/" #will be used for downloading temporary data (external subdocuments) DOCSTRING_GENERIC_ATTRIBS = """ id (str): An ID for the element. IDs must be unique for the entire document. They may not contain colons or spaces, and must start with a letter. (they must adhere to XML's NCName type). This is a generic FoLiA attribute. set (str): The FoLiA set for this element. This is a generic FoLiA attribute. cls (str): The class for this element. This is a generic FoLiA attribute. annotator (str): A name or ID for the annotator. This is a generic FoLiA attribute. annotatortype: Should be either ``AnnotatorType.MANUAL`` or ``AnnotatorType.AUTO``, indicating whether the annotation was performed manually or by an automated process. 
This is a generic FoLiA attribute. confidence (float): A value between 0 and 1 indicating the degree of confidence the annotator has that this the annotation is correct.. This is a generic FoLiA attribute. n (int): An index number to indicate the element is part of an sequence (does not affect the placement of the element). src (str): Speech annotation attribute, refers to a media file (audio/video) that this element describes. This is a generic FoLiA attribute. speaker (str): Speech annotation attribute: a name or ID of the speaker. This is a generic FoLiA attribute. begintime (str): Speech annotation attribute: the time (in ``hh:mm:ss.mmm`` format, relative to the media file in ``src``) when the audio that this element describes starts. This is a generic FoLiA attribute. endtime (str): Speech annotation attribute: the time (in ``hh:mm:ss.mmm`` format, relative to the media file in ``src``) when the audio that this element describes starts. This is a generic FoLiA attribute. contents (list): Alternative for ``*args``, exists for purely syntactic reasons. """ ILLEGAL_UNICODE_CONTROL_CHARACTERS = {} #XML does not like unicode control characters for ordinal in range(0x20): if chr(ordinal) not in '\t\r\n': ILLEGAL_UNICODE_CONTROL_CHARACTERS[ordinal] = None class Mode: MEMORY = 0 #The entire FoLiA structure will be loaded into memory. This is the default and is required for any kind of document manipulation. XPATH = 1 #The full XML structure will be loaded into memory, but conversion to FoLiA objects occurs only upon querying. The full power of XPath is available. class AnnotatorType: UNSET = None AUTO = "auto" MANUAL = "manual" #foliaspec:attributes #Defines all common FoLiA attributes (as part of the Attrib enumeration) class Attrib: ID, CLASS, ANNOTATOR, CONFIDENCE, N, DATETIME, BEGINTIME, ENDTIME, SRC, SPEAKER = range(10) #foliaspec:annotationtype #Defines all annotation types (as part of the AnnotationType enumeration) class AnnotationType: TEXT, TOKEN, DIVISION, PARAGRAPH, LIST, FIGURE, WHITESPACE, LINEBREAK, SENTENCE, POS, LEMMA, DOMAIN, SENSE, SYNTAX, CHUNKING, ENTITY, CORRECTION, ERRORDETECTION, PHON, SUBJECTIVITY, MORPHOLOGICAL, EVENT, DEPENDENCY, TIMESEGMENT, GAP, NOTE, ALIGNMENT, COMPLEXALIGNMENT, COREFERENCE, SEMROLE, METRIC, LANG, STRING, TABLE, STYLE, PART, UTTERANCE, ENTRY, TERM, DEFINITION, EXAMPLE, PHONOLOGICAL, PREDICATE, OBSERVATION, SENTIMENT, STATEMENT = range(46) #Alternative is a special one, not declared and not used except for ID generation class TextCorrectionLevel: #THIS IS NOW COMPLETELY OBSOLETE AND ONLY HERE FOR BACKWARD COMPATIBILITY! CORRECTED, UNCORRECTED, ORIGINAL, INLINE = range(4) class MetaDataType: #THIS IS NOW COMPLETELY OBSOLETE AND ONLY HERE FOR BACKWARD COMPATIBILITY! 
Metadata type is a free-fill field with only native predefined NATIVE = "native" CMDI = "cmdi" IMDI = "imdi" class NoSuchAnnotation(Exception): """Exception raised when the requested type of annotation does not exist for the selected element""" pass class NoSuchText(Exception): """Exception raised when the requested type of text content does not exist for the selected element""" pass class NoSuchPhon(Exception): """Exception raised when the requested type of phonetic content does not exist for the selected element""" pass class DuplicateAnnotationError(Exception): pass class DuplicateIDError(Exception): """Exception raised when an identifier that is already in use is assigned again to another element""" pass class NoDefaultError(Exception): pass class UnresolvableTextContent(Exception): pass class MalformedXMLError(Exception): pass class ModeError(Exception): pass class MetaDataError(Exception): pass class DocumentNotLoaded(Exception): #for alignments to external documents pass class GenerateIDException(Exception): pass class CorrectionHandling: EITHER,CURRENT, ORIGINAL = range(3) def checkversion(version): """Checks FoLiA version, returns 1 if the document is newer than the library, -1 if it is older, 0 if it is equal""" try: for refversion, docversion in zip([int(x) for x in FOLIAVERSION.split('.')], [int(x) for x in version.split('.')]): if docversion > refversion: return 1 #doc is newer than library elif docversion < refversion: return -1 #doc is older than library return 0 #versions are equal except ValueError: raise ValueError("Unable to parse document FoLiA version, invalid syntax") def parsetime(s): """Internal function to parse the time parses time in HH:MM:SS.mmm format. Returns: a four-tuple ``(hours,minutes,seconds,milliseconds)`` """ try: fields = s.split('.') subfields = fields[0].split(':') H = int(subfields[0]) M = int(subfields[1]) S = int(subfields[2]) if len(subfields) > 3: m = int(subfields[3]) else: m = 0 if len(fields) > 1: m = int(fields[1]) return (H,M,S,m) except: raise ValueError("Invalid timestamp, must be in HH:MM:SS.mmm format: " + s) def parsecommonarguments(object, doc, annotationtype, required, allowed, **kwargs): """Internal function to parse common FoLiA attributes and sets up the instance accordingly. 
Do not invoke directly.""" object.doc = doc #The FoLiA root document if required is None: required = tuple() if allowed is None: allowed = tuple() supported = required + allowed if 'generate_id_in' in kwargs: try: kwargs['id'] = kwargs['generate_id_in'].generate_id(object.__class__) except GenerateIDException: pass #ID could not be generated, just skip del kwargs['generate_id_in'] if 'id' in kwargs: if Attrib.ID not in supported: raise ValueError("ID is not supported on " + object.__class__.__name__) isncname(kwargs['id']) object.id = kwargs['id'] del kwargs['id'] elif Attrib.ID in required: raise ValueError("ID is required for " + object.__class__.__name__) else: object.id = None if 'set' in kwargs: if Attrib.CLASS not in supported and not object.SETONLY: raise ValueError("Set is not supported on " + object.__class__.__name__) if not kwargs['set']: object.set ="undefined" else: object.set = kwargs['set'] del kwargs['set'] if object.set: if doc and (not (annotationtype in doc.annotationdefaults) or not (object.set in doc.annotationdefaults[annotationtype])): if doc.autodeclare: doc.annotations.append( (annotationtype, object.set ) ) doc.annotationdefaults[annotationtype] = {object.set: {} } else: raise ValueError("Set '" + object.set + "' is used for " + object.__class__.__name__ + ", but has no declaration!") elif annotationtype in doc.annotationdefaults and len(doc.annotationdefaults[annotationtype]) == 1: object.set = list(doc.annotationdefaults[annotationtype].keys())[0] elif object.ANNOTATIONTYPE == AnnotationType.TEXT: object.set = "undefined" #text content needs never be declared (for backward compatibility) and is in set 'undefined' elif Attrib.CLASS in required: #or (hasattr(object,'SETONLY') and object.SETONLY): raise ValueError("Set is required for " + object.__class__.__name__) if 'class' in kwargs: if not Attrib.CLASS in supported: raise ValueError("Class is not supported for " + object.__class__.__name__) object.cls = kwargs['class'] del kwargs['class'] elif 'cls' in kwargs: if not Attrib.CLASS in supported: raise ValueError("Class is not supported on " + object.__class__.__name__) object.cls = kwargs['cls'] del kwargs['cls'] elif Attrib.CLASS in required: raise ValueError("Class is required for " + object.__class__.__name__) if object.cls and not object.set: if doc and doc.autodeclare: if not (annotationtype, 'undefined') in doc.annotations: doc.annotations.append( (annotationtype, 'undefined') ) doc.annotationdefaults[annotationtype] = {'undefined': {} } object.set = 'undefined' else: raise ValueError("Set is required for " + object.__class__.__name__ + ". 
Class '" + object.cls + "' assigned without set.") if 'annotator' in kwargs: if not Attrib.ANNOTATOR in supported: raise ValueError("Annotator is not supported for " + object.__class__.__name__) object.annotator = kwargs['annotator'] del kwargs['annotator'] elif doc and annotationtype in doc.annotationdefaults and object.set in doc.annotationdefaults[annotationtype] and 'annotator' in doc.annotationdefaults[annotationtype][object.set]: object.annotator = doc.annotationdefaults[annotationtype][object.set]['annotator'] elif Attrib.ANNOTATOR in required: raise ValueError("Annotator is required for " + object.__class__.__name__) if 'annotatortype' in kwargs: if not Attrib.ANNOTATOR in supported: raise ValueError("Annotatortype is not supported for " + object.__class__.__name__) if kwargs['annotatortype'] == 'auto' or kwargs['annotatortype'] == AnnotatorType.AUTO: object.annotatortype = AnnotatorType.AUTO elif kwargs['annotatortype'] == 'manual' or kwargs['annotatortype'] == AnnotatorType.MANUAL: object.annotatortype = AnnotatorType.MANUAL else: raise ValueError("annotatortype must be 'auto' or 'manual', got " + repr(kwargs['annotatortype'])) del kwargs['annotatortype'] elif doc and annotationtype in doc.annotationdefaults and object.set in doc.annotationdefaults[annotationtype] and 'annotatortype' in doc.annotationdefaults[annotationtype][object.set]: object.annotatortype = doc.annotationdefaults[annotationtype][object.set]['annotatortype'] elif Attrib.ANNOTATOR in required: raise ValueError("Annotatortype is required for " + object.__class__.__name__) if 'confidence' in kwargs: if not Attrib.CONFIDENCE in supported: raise ValueError("Confidence is not supported") if kwargs['confidence'] is not None: try: object.confidence = float(kwargs['confidence']) assert object.confidence >= 0.0 and object.confidence <= 1.0 except: raise ValueError("Confidence must be a floating point number between 0 and 1, got " + repr(kwargs['confidence']) ) del kwargs['confidence'] elif Attrib.CONFIDENCE in required: raise ValueError("Confidence is required for " + object.__class__.__name__) if 'n' in kwargs: if not Attrib.N in supported: raise ValueError("N is not supported for " + object.__class__.__name__) object.n = kwargs['n'] del kwargs['n'] elif Attrib.N in required: raise ValueError("N is required for " + object.__class__.__name__) if 'datetime' in kwargs: if not Attrib.DATETIME in supported: raise ValueError("Datetime is not supported") if isinstance(kwargs['datetime'], datetime): object.datetime = kwargs['datetime'] else: #try: object.datetime = parse_datetime(kwargs['datetime']) #except: # raise ValueError("Unable to parse datetime: " + str(repr(kwargs['datetime']))) del kwargs['datetime'] elif doc and annotationtype in doc.annotationdefaults and object.set in doc.annotationdefaults[annotationtype] and 'datetime' in doc.annotationdefaults[annotationtype][object.set]: object.datetime = doc.annotationdefaults[annotationtype][object.set]['datetime'] elif Attrib.DATETIME in required: raise ValueError("Datetime is required for " + object.__class__.__name__) if 'src' in kwargs: if not Attrib.SRC in supported: raise ValueError("Source is not supported for " + object.__class__.__name__) object.src = kwargs['src'] del kwargs['src'] elif Attrib.SRC in required: raise ValueError("Source is required for " + object.__class__.__name__) if 'begintime' in kwargs: if not Attrib.BEGINTIME in supported: raise ValueError("Begintime is not supported for " + object.__class__.__name__) object.begintime = 
parsetime(kwargs['begintime']) del kwargs['begintime'] elif Attrib.BEGINTIME in required: raise ValueError("Begintime is required for " + object.__class__.__name__) if 'endtime' in kwargs: if not Attrib.ENDTIME in supported: raise ValueError("Endtime is not supported for " + object.__class__.__name__) object.endtime = parsetime(kwargs['endtime']) del kwargs['endtime'] elif Attrib.ENDTIME in required: raise ValueError("Endtime is required for " + object.__class__.__name__) if 'speaker' in kwargs: if not Attrib.SPEAKER in supported: raise ValueError("Speaker is not supported for " + object.__class__.__name__) object.speaker = kwargs['speaker'] del kwargs['speaker'] elif Attrib.SPEAKER in required: raise ValueError("Speaker is required for " + object.__class__.__name__) if 'auth' in kwargs: if kwargs['auth'] in ('no','false'): object.auth = False else: object.auth = bool(kwargs['auth']) del kwargs['auth'] else: object.auth = object.__class__.AUTH if 'text' in kwargs: if kwargs['text']: object.settext(kwargs['text']) del kwargs['text'] if 'phon' in kwargs: if kwargs['phon']: object.setphon(kwargs['phon']) del kwargs['phon'] if object.XLINK: if 'href' in kwargs: object.href =kwargs['href'] del kwargs['href'] if 'xlinktype' in kwargs: object.xlinktype = kwargs['xlinktype'] del kwargs['xlinktype'] if 'xlinkrole' in kwargs: object.xlinkrole = kwargs['xlinkrole'] del kwargs['xlinkrole'] if 'xlinklabel' in kwargs: object.xlinklabel = kwargs['xlinklabel'] del kwargs['xlinklabel'] if 'xlinkshow' in kwargs: object.xlinkshow = kwargs['xlinkshow'] del kwargs['xlinklabel'] if 'xlinktitle' in kwargs: object.xlinktitle = kwargs['xlinktitle'] del kwargs['xlinktitle'] if doc and doc.debug >= 2: print(" @id = ", repr(object.id),file=stderr) print(" @set = ", repr(object.set),file=stderr) print(" @class = ", repr(object.cls),file=stderr) print(" @annotator = ", repr(object.annotator),file=stderr) print(" @annotatortype= ", repr(object.annotatortype),file=stderr) print(" @confidence = ", repr(object.confidence),file=stderr) print(" @n = ", repr(object.n),file=stderr) print(" @datetime = ", repr(object.datetime),file=stderr) #set index if object.id and doc: if object.id in doc.index: if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Duplicate ID not permitted:" + object.id,file=stderr) raise DuplicateIDError("Duplicate ID not permitted: " + object.id) else: if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Adding to index: " + object.id,file=stderr) doc.index[object.id] = object #Parse feature attributes (shortcut for feature specification for some elements) for c in object.ACCEPTED_DATA: if issubclass(c, Feature): if c.SUBSET in kwargs: if kwargs[c.SUBSET]: object.append(c,cls=kwargs[c.SUBSET]) del kwargs[c.SUBSET] return kwargs def parse_datetime(s): #source: http://stackoverflow.com/questions/2211362/how-to-parse-xsddatetime-format """Returns (datetime, tz offset in minutes) or (None, None).""" m = re.match(r""" ^ (?P-?[0-9]{4}) - (?P[0-9]{2}) - (?P[0-9]{2}) T (?P[0-9]{2}) : (?P[0-9]{2}) : (?P[0-9]{2}) (?P\.[0-9]{1,6})? (?P Z | (?P[-+][0-9]{2}) : (?P[0-9]{2}) )? 
$ """, s, re.X) if m is not None: values = m.groupdict() #if values["tz"] in ("Z", None): # tz = 0 #else: # tz = int(values["tz_hr"]) * 60 + int(values["tz_min"]) if values["microsecond"] is None: values["microsecond"] = 0 else: values["microsecond"] = values["microsecond"][1:] values["microsecond"] += "0" * (6 - len(values["microsecond"])) values = dict((k, int(v)) for k, v in values.items() if not k.startswith("tz")) try: return datetime(**values) # , tz except ValueError: pass return None def xmltreefromstring(s): """Internal function, deals with different Python versions, unicode strings versus bytes, and with the leak bug in lxml""" if sys.version < '3': #Python 2 if isinstance(s,unicode): #pylint: disable=undefined-variable s = s.encode('utf-8') try: return ElementTree.parse(StringIO(s), ElementTree.XMLParser(collect_ids=False)) except TypeError: return ElementTree.parse(StringIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!! else: #Python 3 if isinstance(s,str): s = s.encode('utf-8') try: return ElementTree.parse(BytesIO(s), ElementTree.XMLParser(collect_ids=False)) except TypeError: return ElementTree.parse(BytesIO(s), ElementTree.XMLParser()) #older lxml, may leak!!!! def xmltreefromfile(filename): """Internal function to read an XML file""" try: return ElementTree.parse(filename, ElementTree.XMLParser(collect_ids=False)) except TypeError: return ElementTree.parse(filename, ElementTree.XMLParser()) #older lxml, may leak!! def makeelement(E, tagname, **kwargs): """Internal function""" if sys.version < '3': try: kwargs2 = {} for k,v in kwargs.items(): kwargs2[k.encode('utf-8')] = v.encode('utf-8') #return E._makeelement(tagname.encode('utf-8'), **{ k.encode('utf-8'): v.encode('utf-8') for k,v in kwargs.items() } ) #In one go fails on some older Python 2.6s return E._makeelement(tagname.encode('utf-8'), **kwargs2 ) #pylint: disable=protected-access except ValueError as e: try: #older versions of lxml may misbehave, compensate: e = E._makeelement(tagname.encode('utf-8')) #pylint: disable=protected-access for k,v in kwargs.items(): e.attrib[k.encode('utf-8')] = v return e except ValueError: print(e,file=stderr) print("tagname=",tagname,file=stderr) print("kwargs=",kwargs,file=stderr) raise e else: return E._makeelement(tagname,**kwargs) #pylint: disable=protected-access def commonancestors(Class, *args): """Generator function to find common ancestors of a particular type for any two or more FoLiA element instances. The function produces all common ancestors of the type specified, starting from the closest one up to the most distant one. Parameters: Class: The type of ancestor to find, should be the :class:`AbstractElement` class or any subclass thereof (not an instance!) *args: The elements to find the common ancestors of, elements are instances derived from :class:`AbstractElement` Yields: instance derived from :class:`AbstractElement`: A common ancestor of the arguments, an instance of the specified ``Class``. """ commonancestors = None #pylint: disable=redefined-outer-name for sibling in args: ancestors = list( sibling.ancestors(Class) ) if commonancestors is None: commonancestors = copy(ancestors) else: removeancestors = [] for a in commonancestors: #pylint: disable=not-an-iterable if not a in ancestors: removeancestors.append(a) for a in removeancestors: commonancestors.remove(a) if commonancestors: for commonancestor in commonancestors: yield commonancestor class AbstractElement(object): """Abstract base class from which all FoLiA elements are derived. 
This class implements many generic methods that are available on all FoLiA elements. To see if an element is a FoLiA element, as opposed to any other python object, do:: isinstance(x, AbstractElement) Generic FoLiA attributes can be accessed on all instances derived from this class: * ``element.id`` (str) - The unique identifier of the element * ``element.set`` (str) - The set the element pertains to. * ``element.cls`` (str) - The assigned class, i.e. the actual value of \ the annotation, defined in the set. Classes correspond with tagsets in this case of many annotation types. \ Note that since *class* is already a reserved keyword in python, the library consistently uses ``cls`` everywhere. * ``element.annotator`` (str) - The name or ID of the annotator who added/modified this element * ``element.annotatortype`` - The type of annotator, can be either ``folia.AnnotatorType.MANUAL`` or ``folia.AnnotatorType.AUTO`` * ``element.confidence`` (float) - A confidence value expressing * ``element.datetime`` (datetime.datetime) - The date and time when the element was added/modified. * ``element.n`` (str) - An ordinal label, used for instance in enumerated list contexts, numbered sections, etc.. The following generic attributes are specific to a speech context: * ``element.src`` (str) - A URL or filename referring the an audio or video file containing the speech. Access this attribute using the ``element.speaker_src()`` method, as it is inheritable from ancestors. * ``element.speaker`` (str) - The name of ID of the speaker. Access this attribute using the ``element.speech_speaker()`` method, as it is inheritable from ancestors. * ``element.begintime`` (4-tuple) - The time in the above source fragment when the phonetic content of this element starts, this is a ``(hours, minutes,seconds,milliseconds)`` tuple. * ``element.endtime`` (4-tuple) - The time in the above source fragment when the phonetic content of this element ends, this is a ``(hours, minutes,seconds,milliseconds)`` tuple. Not all attributes are allowed, unset or unavailable attributes will always default to ``None``. Note: This class should never be instantiated directly, as it is abstract! See also: :meth:`AbstractElement.__init__` """ def __init__(self, doc, *args, **kwargs): """Constructor for most FoLiA elements. Parameters: doc (:class:`Document`): The FoLiA document this element will pertain to. It will not be automatically added though. *args: Child elements to add to this element, mostly instances derived from :class:`AbstractElement` Keyword Arguments: {generic_attribs} generate_id_in (:class:`AbstractElement`): Instead of providing an explicit ID, the library can attempt to automatically generate an ID based on a convention where suffixes are applied to the ID of the parent element. This keyword argument takes the intended parent element (an instance derived from :class:`AbstractElement`) as value. Not all of the generic FoLiA attributes are applicable to all elements. The class properties ``REQUIRED_ATTRIBS`` and ``OPTIONAL_ATTRIBS`` prescribe which are required or allowed. 
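
        Example:
            An illustrative sketch, assuming ``sentence`` is an existing :class:`Sentence` in ``doc``. Child elements are
            commonly created through :meth:`AbstractElement.append` on the intended parent, which passes its keyword
            arguments on to this constructor::

                word = sentence.append(Word, text="house")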
""".format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS) if not isinstance(doc, Document) and not doc is None: raise Exception("Expected first parameter to be instance of Document, got " + str(type(doc))) self.doc = doc self.parent = None self.data = [] kwargs = parsecommonarguments(self, doc, self.ANNOTATIONTYPE, self.REQUIRED_ATTRIBS, self.OPTIONAL_ATTRIBS,**kwargs) for child in args: self.append(child) if 'contents' in kwargs: if isinstance(kwargs['contents'], list): for child in kwargs['contents']: self.append(child) else: self.append(kwargs['contents']) del kwargs['contents'] for key in kwargs: if key[0] == '{': #this is a parameter in a different alien namespace, ignore it continue else: raise ValueError("Parameter '" + key + "' not supported by " + self.__class__.__name__) def __getattr__(self, attr): """Internal method""" #overriding getattr so we can get defaults here rather than needing a copy on each element, saves memory if attr in ('set','cls','confidence','annotator','annotatortype','datetime','n','href','src','speaker','begintime','endtime','xlinktype','xlinktitle','xlinklabel','xlinkrole','xlinkshow','label'): return None else: return super(AbstractElement, self).__getattribute__(attr) #def __del__(self): # if self.doc and self.doc.debug: # print >>stderr, "[PyNLPl FoLiA DEBUG] Removing " + repr(self) # for child in self.data: # del child # self.doc = None # self.parent = None # del self.data def description(self): """Obtain the description associated with the element. Raises: :class:`NoSuchAnnotation` if there is no associated description.""" for e in self: if isinstance(e, Description): return e.value raise NoSuchAnnotation def textcontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT): """Get the text content explicitly associated with this element (of the specified class). Unlike :meth:`text`, this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the :class:`TextContent` instance rather than the actual text! Parameters: cls (str): The class of the text content to obtain, defaults to ``current``. correctionhandling: Specifies what content to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care. 
Returns: The text content (:class:`TextContent`) Raises: :class:`NoSuchText` if there is no text content for the element See also: :meth:`text` :meth:`phoncontent` :meth:`phon` """ if not self.PRINTABLE: #only printable elements can hold text raise NoSuchText #Find explicit text content (same class) for e in self: if isinstance(e, TextContent): if cls is None or e.cls == cls: return e elif isinstance(e, Correction): try: return e.textcontent(cls, correctionhandling) except NoSuchText: pass raise NoSuchText def stricttext(self, cls='current'): """Alias for :meth:`text` with ``strict=True``""" return self.text(cls,strict=True) def toktext(self,cls='current'): """Alias for :meth:`text` with ``retaintokenisation=True``""" return self.text(cls,retaintokenisation=True) def text(self, cls='current', retaintokenisation=False, previousdelimiter="",strict=False, correctionhandling=CorrectionHandling.CURRENT): """Get the text associated with this element (of the specified class) The text will be constructed from child elements wherever possible, as they are more specific. If no text can be obtained from the children and the element itself has text associated with it, then that will be used. Parameters: cls (str): The class of the text content to obtain, defaults to ``current``. retaintokenisation (bool): If set, the space attribute on words will be ignored, otherwise it will be adhered to and text will be detokenised as much as possible. Defaults to ``False``. previousdelimiter (str): Can be set to a delimiter that was last output, useful when chaining calls to :meth:`text`. Defaults to an empty string. strict (bool): Set this if you are strictly interested in the text explicitly associated with the element, without recursing into children. Defaults to ``False``. correctionhandling: Specifies what text to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current text. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the text prior to correction, and ``CorrectionHandling.EITHER`` if you don't care. Example:: word.text() Returns: The text of the element (``unicode`` instance in Python 2, ``str`` in Python 3) Raises: :class:`NoSuchText`: if no text is found at all. """ if strict: return self.textcontent(cls, correctionhandling).text() if self.TEXTCONTAINER: s = "" for e in self: if isstring(e): s += e else: if s: s += e.TEXTDELIMITER #for AbstractMarkup, will usually be "" s += e.text() return s elif not self.PRINTABLE: #only printable elements can hold text raise NoSuchText else: #Get text from children first delimiter = "" s = "" for e in self: if e.PRINTABLE and not isinstance(e, TextContent) and not isinstance(e, String): try: s += e.text(cls,retaintokenisation, delimiter,False,correctionhandling) #the delimiter is buffered and only printed upon the next iteration, this prevents the delimiter from being output at the end of a sequence and from being compounded with other delimiters delimiter = e.gettextdelimiter(retaintokenisation) except NoSuchText: #No text, that's okay, just continue continue if not s and self.hastext(cls, correctionhandling): s = self.textcontent(cls, correctionhandling).text() if s and previousdelimiter: return previousdelimiter + s elif s: return s else: #No text found at all :`( raise NoSuchText def phoncontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT): """Get the phonetic content explicitly associated with this element (of the specified class).
Unlike :meth:`phon`, this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the PhonContent instance rather than the actual text! Parameters: cls (str): The class of the phonetic content to obtain, defaults to ``current``. correctionhandling: Specifies what content to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care. Returns: The phonetic content (:class:`PhonContent`) Raises: :class:`NoSuchPhon` if there is no phonetic content for the element See also: :meth:`phon` :meth:`textcontent` :meth:`text` """ if not self.SPEAKABLE: #only printable elements can hold text raise NoSuchPhon #Find explicit text content (same class) for e in self: if isinstance(e, PhonContent): if cls is None or e.cls == cls: return e elif isinstance(e, Correction): try: return e.phoncontent(cls, correctionhandling) except NoSuchPhon: pass raise NoSuchPhon def speech_src(self): """Retrieves the URL/filename of the audio or video file associated with the element. The source is inherited from ancestor elements if none is specified. For this reason, always use this method rather than access the ``src`` attribute directly. Returns: str or None if not found """ if self.src: return self.src elif self.parent: return self.parent.speech_src() else: return None def speech_speaker(self): """Retrieves the speaker of the audio or video file associated with the element. The source is inherited from ancestor elements if none is specified. For this reason, always use this method rather than access the ``src`` attribute directly. Returns: str or None if not found """ if self.speaker: return self.speaker elif self.parent: return self.parent.speech_speaker() else: return None def phon(self, cls='current', previousdelimiter="", strict=False,correctionhandling=CorrectionHandling.CURRENT): """Get the phonetic representation associated with this element (of the specified class) The phonetic content will be constructed from child-elements whereever possible, as they are more specific. If no phonetic content can be obtained from the children and the element has itself phonetic content associated with it, then that will be used. Parameters: cls (str): The class of the phonetic content to obtain, defaults to ``current``. retaintokenisation (bool): If set, the space attribute on words will be ignored, otherwise it will be adhered to and phonetic content will be detokenised as much as possible. Defaults to ``False``. previousdelimiter (str): Can be set to a delimiter that was last outputed, useful when chaining calls to :meth:`phon`. Defaults to an empty string. strict (bool): Set this if you are strictly interested in the phonetic content explicitly associated with the element, without recursing into children. Defaults to ``False``. correctionhandling: Specifies what phonetic content to retrieve when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current phonetic content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the phonetic content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care. 
Example:: word.phon() Returns: The phonetic content of the element (``unicode`` instance in Python 2, ``str`` in Python 3) Raises: :class:`NoSuchPhon`: if no phonetic conent is found at all. See also: :meth:`phoncontent`: Retrieves the phonetic content as an element rather than a string :meth:`text` :meth:`textcontent` """ if strict: return self.phoncontent(cls,correctionhandling).phon() if self.PHONCONTAINER: s = "" for e in self: if isstring(e): s += e else: try: if s: s += e.TEXTDELIMITER #We use TEXTDELIMITER for phon too except AttributeError: pass s += e.phon() return s elif not self.SPEAKABLE: #only readable elements can hold phonetic content raise NoSuchPhon else: #Get text from children first delimiter = "" s = "" for e in self: if e.SPEAKABLE and not isinstance(e, PhonContent) and not isinstance(e,String): try: s += e.phon(cls, delimiter,False,correctionhandling) #delimiter will be buffered and only printed upon next iteration, this prevents the delimiter being outputted at the end of a sequence and to be compounded with other delimiters delimiter = e.gettextdelimiter() #We use TEXTDELIMITER for phon too except NoSuchPhon: #No text, that's okay, just continue continue if not s and self.hasphon(cls): s = self.phoncontent(cls,correctionhandling).phon() if s and previousdelimiter: return previousdelimiter + s elif s: return s else: #No text found at all :`( raise NoSuchPhon def originaltext(self,cls='original'): """Alias for retrieving the original uncorrect text. A call to :meth:`text` with ``correctionhandling=CorrectionHandling.ORIGINAL``""" return self.text(cls,correctionhandling=CorrectionHandling.ORIGINAL) def gettextdelimiter(self, retaintokenisation=False): """Return the text delimiter for this class. Uses the ``TEXTDELIMITER`` attribute but may return a customised one instead.""" if self.TEXTDELIMITER is None: #no text delimiter of itself, recurse into children to inherit delimiter for child in reversed(self): if isinstance(child, AbstractElement): return child.gettextdelimiter(retaintokenisation) return "" else: return self.TEXTDELIMITER def feat(self,subset): """Obtain the feature class value of the specific subset. If a feature occurs multiple times, the values will be returned in a list. Example:: sense = word.annotation(folia.Sense) synset = sense.feat('synset') Returns: str or list """ r = None for f in self: if isinstance(f, Feature) and f.subset == subset: if r: #support for multiclass features if isinstance(r,list): r.append(f.cls) else: r = [r, f.cls] else: r = f.cls if r is None: raise NoSuchAnnotation else: return r def __ne__(self, other): return not (self == other) def __eq__(self, other): #pylint: disable=too-many-return-statements """Equality method, tests whether two elements are equal. 
Elements are equal if all their attributes and children are equal.""" if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - " + repr(self) + " vs " + repr(other),file=stderr) #Check if we are of the same time if type(self) != type(other): #pylint: disable=unidiomatic-typecheck if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Type mismatch: " + str(type(self)) + " vs " + str(type(other)),file=stderr) return False #Check FoLiA attributes if self.id != other.id: if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - ID mismatch: " + str(self.id) + " vs " + str(other.id),file=stderr) return False if self.set != other.set: if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Set mismatch: " + str(self.set) + " vs " + str(other.set),file=stderr) return False if self.cls != other.cls: if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Class mismatch: " + repr(self.cls) + " vs " + repr(other.cls),file=stderr) return False if self.annotator != other.annotator: if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Annotator mismatch: " + repr(self.annotator) + " vs " + repr(other.annotator),file=stderr) return False if self.annotatortype != other.annotatortype: if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Annotator mismatch: " + repr(self.annotatortype) + " vs " + repr(other.annotatortype),file=stderr) return False #Check if we have same amount of children: mychildren = list(self) yourchildren = list(other) if len(mychildren) != len(yourchildren): if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Unequal amount of children",file=stderr) return False #Now check equality of children for mychild, yourchild in zip(mychildren, yourchildren): if mychild != yourchild: if self.doc and self.doc.debug: print("[PyNLPl FoLiA DEBUG] AbstractElement Equality Check - Child mismatch: " + repr(mychild) + " vs " + repr(yourchild) + " (in " + repr(self) + ", id: " + str(self.id) + ")",file=stderr) return False #looks like we made it! \o/ return True def __len__(self): """Returns the number of child elements under the current element.""" return len(self.data) def __nonzero__(self): #Python 2.x return True def __bool__(self): return True def __hash__(self): if self.id: return hash(self.id) else: raise TypeError("FoLiA elements are only hashable if they have an ID") def __iter__(self): """Iterate over all children of this element. Example:: for annotation in word: ... """ return iter(self.data) def __contains__(self, element): """Tests if the specified element is part of the children of the element""" return element in self.data def __getitem__(self, key): try: return self.data[key] except KeyError: raise def __unicode__(self): #Python 2 only """Alias for :meth:`text`. Python 2 only.""" return self.text() def __str__(self): """Alias for :meth:`text`""" return self.text() def copy(self, newdoc=None, idsuffix=""): """Make a deep copy of this element and all its children. Parameters: newdoc (:class:`Document`): The document the copy should be associated with. idsuffix (str or bool): If set to a string, the ID of the copy will be append with this (prevents duplicate IDs when making copies for the same document). If set to ``True``, a random suffix will be generated. 
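Example (a small sketch, assuming ``word`` is an element of the document ``doc``)::

    duplicate = word.copy(doc, idsuffix=True) #a random suffix prevents duplicate IDs within the same document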
Returns: a copy of the element """ if idsuffix is True: idsuffix = ".copy." + "%08x" % random.getrandbits(32) #random 32-bit value for each copy, same one will be reused for all children c = deepcopy(self) if idsuffix: c.addidsuffix(idsuffix) c.setparents() c.setdoc(newdoc) return c def copychildren(self, newdoc=None, idsuffix=""): """Generator creating a deep copy of the children of this element. Invokes :meth:`copy` on all children, parameters are the same. """ if idsuffix is True: idsuffix = ".copy." + "%08x" % random.getrandbits(32) #random 32-bit value for each copy, same one will be reused for all children for c in self: if isinstance(c, AbstractElement): yield c.copy(newdoc,idsuffix) def addidsuffix(self, idsuffix, recursive = True): """Appends a suffix to this element's ID, and optionally to all child IDs as well. There is usually no need to call this directly, it is invoked implicitly by :meth:`copy`""" if self.id: self.id += idsuffix if recursive: for e in self: try: e.addidsuffix(idsuffix, recursive) except Exception: pass def setparents(self): """Correct all parent relations for elements within the scope. There is usually no need to call this directly, it is invoked implicitly by :meth:`copy`""" for c in self: if isinstance(c, AbstractElement): c.parent = self c.setparents() def setdoc(self,newdoc): """Set a different document. Usually no need to call this directly, invoked implicitly by :meth:`copy`""" self.doc = newdoc if self.doc and self.id: self.doc.index[self.id] = self for c in self: if isinstance(c, AbstractElement): c.setdoc(newdoc) def hastext(self,cls='current',strict=True, correctionhandling=CorrectionHandling.CURRENT): #pylint: disable=too-many-return-statements """Does this element have text (of the specified class)? By default, and unlike :meth:`text`, this checks strictly, i.e. the element itself must have the text and it is not inherited from its children. Parameters: cls (str): The class of the text content to obtain, defaults to ``current``. strict (bool): Set this if you are strictly interested in the text explicitly associated with the element, without recursing into children. Defaults to ``True``. correctionhandling: Specifies what text to check for when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current text. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the text prior to correction, and ``CorrectionHandling.EITHER`` if you don't care. Returns: bool """ if not self.PRINTABLE: #only printable elements can hold text return False elif self.TEXTCONTAINER: return True else: try: if strict: self.textcontent(cls, correctionhandling) #will raise NoSuchText when not found return True else: #Check children for e in self: if e.PRINTABLE and not isinstance(e, TextContent): if e.hastext(cls, strict, correctionhandling): return True self.textcontent(cls, correctionhandling) #will raise NoSuchText when not found return True except NoSuchText: return False def hasphon(self,cls='current',strict=True,correctionhandling=CorrectionHandling.CURRENT): #pylint: disable=too-many-return-statements """Does this element have phonetic content (of the specified class)? By default, and unlike :meth:`phon`, this checks strictly, i.e. the element itself must have the phonetic content and it is not inherited from its children. Parameters: cls (str): The class of the phonetic content to obtain, defaults to ``current``.
strict (bool): Set this if you are strictly interested in the phonetic content explicitly associated with the element, without recursing into children. Defaults to ``True``. correctionhandling: Specifies what phonetic content to check for when corrections are encountered. The default is ``CorrectionHandling.CURRENT``, which will retrieve the corrected/current phonetic content. You can set this to ``CorrectionHandling.ORIGINAL`` if you want the phonetic content prior to correction, and ``CorrectionHandling.EITHER`` if you don't care. Returns: bool """ if not self.SPEAKABLE: #only printable elements can hold text return False elif self.PHONCONTAINER: return True else: try: if strict: self.phoncontent(cls, correctionhandling) return True else: #Check children for e in self: if e.SPEAKABLE and not isinstance(e, PhonContent): if e.hasphon(cls, strict, correctionhandling): return True self.phoncontent(cls) #will raise NoSuchTextException when not found return True except NoSuchPhon: return False def settext(self, text, cls='current'): """Set the text for this element. Arguments: text (str): The text cls (str): The class of the text, defaults to ``current`` (leave this unless you know what you are doing). There may be only one text content element of each class associated with the element. """ self.replace(TextContent, value=text, cls=cls) def setdocument(self, doc): """Associate a document with this element. Arguments: doc (:class:`Document`): A document Each element must be associated with a FoLiA document. """ assert isinstance(doc, Document) if not self.doc: self.doc = doc if self.id: if self.id in doc: raise DuplicateIDError(self.id) else: self.doc.index[id] = self for e in self: #recursive for all children if isinstance(e,AbstractElement): e.setdocument(doc) @classmethod def addable(Class, parent, set=None, raiseexceptions=True): """Tests whether a new element of this class can be added to the parent. This method is mostly for internal use. This will use the ``OCCURRENCES`` property, but may be overidden by subclasses for more customised behaviour. Parameters: parent (:class:`AbstractElement`): The element that is being added to set (str or None): The set raiseexceptions (bool): Raise an exception if the element can't be added? Returns: bool Raises: ValueError """ if not Class in parent.ACCEPTED_DATA: #Class is not in accepted data, but perhaps any of its ancestors is? found = False c = Class try: while c.__base__: if c.__base__ in parent.ACCEPTED_DATA: found = True break c = c.__base__ except Exception: pass if not found: if raiseexceptions: if parent.id: extra = ' (id=' + parent.id + ')' else: extra = '' raise ValueError("Unable to add object of type " + Class.__name__ + " to " + parent.__class__.__name__ + " " + extra + ". Type not allowed as child.") else: return False if Class.OCCURRENCES > 0: #check if the parent doesn't have too many already count = parent.count(Class,None,True,[True, AbstractStructureElement]) #never descend into embedded structure annotatioton if count >= Class.OCCURRENCES: if raiseexceptions: if parent.id: extra = ' (id=' + parent.id + ')' else: extra = '' raise DuplicateAnnotationError("Unable to add another object of type " + Class.__name__ + " to " + parent.__class__.__name__ + " " + extra + ". 
There are already " + str(count) + " instances of this class, which is the maximum.") else: return False if Class.OCCURRENCES_PER_SET > 0 and set and Class.REQUIRED_ATTRIBS and Attrib.CLASS in Class.REQUIRED_ATTRIBS: count = parent.count(Class,set,True, [True, AbstractStructureElement]) if count >= Class.OCCURRENCES_PER_SET: if raiseexceptions: if parent.id: extra = ' (id=' + parent.id + ')' else: extra = '' raise DuplicateAnnotationError("Unable to add another object of set " + set + " and type " + Class.__name__ + " to " + parent.__class__.__name__ + " " + extra + ". There are already " + str(count) + " instances of this class, which is the maximum for the set.") else: return False return True def postappend(self): """This method will be called after an element is added to another and does some checks. It can do extra checks and if necessary raise exceptions to prevent addition. By default makes sure the right document is associated. This method is mostly for internal use. """ #If the element was not associated with a document yet, do so now (and for all unassociated children: if not self.doc and self.parent.doc: self.setdocument(self.parent.doc) if self.doc and self.doc.deepvalidation: self.deepvalidation() def addtoindex(self,norecurse=[]): """Makes sure this element (and all subelements), are properly added to the index. Mostly for internal use.""" if self.id: self.doc.index[self.id] = self for e in self.data: if all([not isinstance(e, C) for C in norecurse]): try: e.addtoindex(norecurse) except AttributeError: pass def deepvalidation(self): """Perform deep validation of this element. Raises: :class:`DeepValidationError` """ if self.doc and self.doc.deepvalidation and self.set and self.set[0] != '_': try: self.doc.setdefinitions[self.set].testclass(self.cls) except KeyError: if self.cls and not self.doc.allowadhocsets: raise DeepValidationError("Set definition " + self.set + " for " + self.XMLTAG + " not loaded!") except DeepValidationError as e: errormsg = str(e) + " (in set " + self.set+" for " + self.XMLTAG if self.id: errormsg += " with ID " + self.id errormsg += ")" raise DeepValidationError(errormsg) def append(self, child, *args, **kwargs): """Append a child element. Arguments: child (instance or class): 1) The instance to add (usually an instance derived from :class:`AbstractElement`. or 2) a class subclassed from :class:`AbstractElement`. Keyword Arguments: {generic_attribs} If an *instance* is passed as first argument, it will be appended If a *class* derived from :class:`AbstractElement` is passed as first argument, an instance will first be created and then appended. Keyword arguments: alternative (bool): If set to True, the element will be made into an alternative. 
(default to False) Generic example, passing a pre-generated instance:: word.append( folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) ) Generic example, passing a class to be generated:: word.append( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) Generic example, setting text with a class: word.append( "house", cls='original' ) Returns: the added element Raises: ValueError: The element is not valid in this context :class:`DuplicateAnnotationError`: There is already such an annotation See also: :meth:`add` :meth:`insert` :meth:`replace` """.format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS) #obtain the set (if available, necessary for checking addability) if 'set' in kwargs: set = kwargs['set'] else: try: set = child.set except AttributeError: set = None #Check if a Class rather than an instance was passed Class = None #do not set to child.__class__ if inspect.isclass(child): Class = child if Class.addable(self, set): if 'id' not in kwargs and 'generate_id_in' not in kwargs and ((Class.REQUIRED_ATTRIBS and (Attrib.ID in Class.REQUIRED_ATTRIBS)) or Class.AUTO_GENERATE_ID): kwargs['generate_id_in'] = self child = Class(self.doc, *args, **kwargs) elif args: raise Exception("Too many arguments specified. Only possible when first argument is a class and not an instance") dopostappend = True #Do the actual appending if not Class and isstring(child): if self.TEXTCONTAINER or self.PHONCONTAINER: #element is a text/phon container and directly allows strings as content, add the string as such: self.data.append(u(child)) dopostappend = False elif TextContent in self.ACCEPTED_DATA: #you can pass strings directly (just for convenience), will be made into textcontent automatically. child = TextContent(self.doc, child ) self.data.append(child) child.parent = self elif PhonContent in self.ACCEPTED_DATA: #you can pass strings directly (just for convenience), will be made into phoncontent automatically (note that textcontent always takes precedence, so you most likely will have to do it explicitly) child = PhonContent(self.doc, child ) #pylint: disable=redefined-variable-type self.data.append(child) child.parent = self else: raise ValueError("Unable to append object of type " + child.__class__.__name__ + " to " + self.__class__.__name__ + ". Type not allowed as child.") elif Class or (isinstance(child, AbstractElement) and child.__class__.addable(self, set)): #(prevents calling addable again if already done above) if 'alternative' in kwargs and kwargs['alternative']: child = Alternative(self.doc, child, generate_id_in=self) self.data.append(child) child.parent = self else: raise ValueError("Unable to append object of type " + child.__class__.__name__ + " to " + self.__class__.__name__ + ". Type not allowed as child.") if dopostappend: child.postappend() return child def insert(self, index, child, *args, **kwargs): """Insert a child element at specified index. Returns the added element If an *instance* is passed as first argument, it will be appended If a *class* derived from AbstractElement is passed as first argument, an instance will first be created and then appended. Arguments: index (int): The position where to insert the chldelement child: Instance or class Keyword arguments: alternative (bool): If set to True, the element will be made into an alternative. corrected (bool): Used only when passing strings to be made into TextContent elements. 
{generic_attribs} Generic example, passing a pre-generated instance:: word.insert( 3, folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) ) Generic example, passing a class to be generated:: word.insert( 3, folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) Generic example, setting text:: word.insert( 3, "house" ) Returns: the added element Raises: ValueError: The element is not valid in this context :class:`DuplicateAnnotationError`: There is already such an annotation See also: :meth:`append` :meth:`replace` """.format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS) #obtain the set (if available, necessary for checking addability) if 'set' in kwargs: set = kwargs['set'] else: try: set = child.set except AttributeError: set = None #Check if a Class rather than an instance was passed Class = None #do not set to child.__class__ if inspect.isclass(child): Class = child if Class.addable(self, set): if 'id' not in kwargs and 'generate_id_in' not in kwargs and ((Class.REQUIRED_ATTRIBS and Attrib.ID in Class.REQUIRED_ATTRIBS) or (Class.OPTIONAL_ATTRIBS and Attrib.ID in Class.OPTIONAL_ATTRIBS)): kwargs['generate_id_in'] = self child = Class(self.doc, *args, **kwargs) elif args: raise Exception("Too many arguments specified. Only possible when first argument is a class and not an instance") #Do the actual appending if not Class and (isinstance(child,str) or (sys.version < '3' and isinstance(child,unicode))) and TextContent in self.ACCEPTED_DATA: #pylint: disable=undefined-variable #you can pass strings directly (just for convenience), will be made into textcontent automatically. child = TextContent(self.doc, child ) self.data.insert(index, child) child.parent = self elif Class or (isinstance(child, AbstractElement) and child.__class__.addable(self, set)): #(prevents calling addable again if already done above) if 'alternative' in kwargs and kwargs['alternative']: child = Alternative(self.doc, child, generate_id_in=self) #pylint: disable=redefined-variable-type self.data.insert(index, child) child.parent = self else: raise ValueError("Unable to append object of type " + child.__class__.__name__ + " to " + self.__class__.__name__ + ". Type not allowed as child.") child.postappend() return child def add(self, child, *args, **kwargs): """Add a child element. This is a higher level function that adds (appends) an annotation to an element, it will simply call :meth:`AbstractElement.append` for token annotation elements that fit within the scope. For span annotation, it will create and find or create the proper annotation layer and insert the element there. Arguments: child (instance or class): 1) The instance to add (usually an instance derived from :class:`AbstractElement`. or 2) a class subclassed from :class:`AbstractElement`. If an *instance* is passed as first argument, it will be appended If a *class* derived from :class:`AbstractElement` is passed as first argument, an instance will first be created and then appended. Keyword arguments: alternative (bool): If set to True, the element will be made into an alternative. 
(default to False) {generic_attribs} Generic example, passing a pre-generated instance:: word.add( folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) ) Generic example, passing a class to be generated:: word.add( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) Generic example, setting text with a class:: word.add( "house", cls='original' ) Returns: the added element Raises: ValueError: The element is not valid in this context :class:`DuplicateAnnotationError`: There is already such an annotation See also: :meth:`add` :meth:`insert` :meth:`replace` """.format(generic_attribs=DOCSTRING_GENERIC_ATTRIBS) addspanfromspanned = False #add a span annotation element from that which is spanned (i.e. a Word, Morpheme) addspanfromstructure = False #add a span annotation elements from a structural parent which holds the span layers? (e.g. a Sentence, Paragraph) if (inspect.isclass(child) and issubclass(child, AbstractSpanAnnotation)) or (not inspect.isclass(child) and isinstance(child, AbstractSpanAnnotation)): layerclass = ANNOTATIONTYPE2LAYERCLASS[child.ANNOTATIONTYPE] if isinstance(self, (Word, Morpheme)): addspanfromspanned = True elif isinstance(self,AbstractStructureElement): #add a span addspanfromstructure = True if addspanfromspanned or addspanfromstructure: #get the set if 'set' in kwargs: set = kwargs['set'] else: try: set = self.doc.defaultset(layerclass) except KeyError: raise Exception("No set defined when adding span annotation and none could be inferred") if addspanfromspanned: #pylint: disable=too-many-nested-blocks #collect ancestors of the current element, allowedparents = [self] + list(self.ancestors(AbstractStructureElement)) #find common ancestors of structure elements in the arguments, and check whether it has the required annotation layer, create one if necessary for e in commonancestors(AbstractStructureElement, *[ x for x in args if isinstance(x, AbstractStructureElement)] ): if e in allowedparents: #is the element in the list of allowed parents according to this element? if AbstractAnnotationLayer in e.ACCEPTED_DATA or layerclass in e.ACCEPTED_DATA: try: layer = next(e.select(layerclass,set,True)) except StopIteration: layer = e.append(layerclass) if 'emptyspan' in kwargs and kwargs['emptyspan']: del kwargs['emptyspan'] return layer.append(child,*[],**kwargs) else: return layer.append(child,*args,**kwargs) raise Exception("Unable to find suitable common ancestor to create annotation layer") elif addspanfromstructure: layer = None for layer in self.layers(child.ANNOTATIONTYPE, set): pass #last one (only one actually) should be available in outer context if layer is None: layer = self.append(layerclass) return layer.append(child,*args,**kwargs) else: #normal behaviour, append return self.append(child,*args,**kwargs) @classmethod def findreplaceables(Class, parent, set=None,**kwargs): """Internal method to find replaceable elements. Auxiliary function used by :meth:`AbstractElement.replace`. Can be overriden for more fine-grained control.""" return list(parent.select(Class,set,False)) def updatetext(self): """Recompute textual value based on the text content of the children. 
Only supported on elements that are a ``TEXTCONTAINER``""" if self.TEXTCONTAINER: s = "" for child in self: if isinstance(child, AbstractElement): child.updatetext() s += child.text() elif isstring(child): s += child self.data = [s] def replace(self, child, *args, **kwargs): """Appends a child element like ``append()``, but replaces any existing child element of the same type and set. If no such child element exists, this will act the same as append() Keyword arguments: alternative (bool): If set to True, the *replaced* element will be made into an alternative. Simply use :meth:`AbstractElement.append` if you want the added element to be an alternative. See :meth:`AbstractElement.append` for more information and all parameters. """ if 'set' in kwargs: set = kwargs['set'] del kwargs['set'] else: try: set = child.set except AttributeError: set = None if inspect.isclass(child): Class = child replace = Class.findreplaceables(self, set, **kwargs) elif (self.TEXTCONTAINER or self.PHONCONTAINER) and isstring(child): #replace will replace ALL text content, removing text markup along the way! self.data = [] return self.append(child, *args,**kwargs) else: Class = child.__class__ kwargs['instance'] = child replace = Class.findreplaceables(self,set,**kwargs) del kwargs['instance'] kwargs['set'] = set #was deleted temporarily for findreplaceables if len(replace) == 0: #nothing to replace, simply call append if 'alternative' in kwargs: del kwargs['alternative'] #has other meaning in append() return self.append(child, *args, **kwargs) elif len(replace) > 1: raise Exception("Unable to replace. Multiple candidates found, unable to choose.") elif len(replace) == 1: if 'alternative' in kwargs and kwargs['alternative']: #old version becomes alternative if replace[0] in self.data: self.data.remove(replace[0]) alt = self.append(Alternative) alt.append(replace[0]) del kwargs['alternative'] #has other meaning in append() else: #remove old version competely self.remove(replace[0]) e = self.append(child, *args, **kwargs) self.updatetext() return e def ancestors(self, Class=None): """Generator yielding all ancestors of this element, effectively back-tracing its path to the root element. A tuple of multiple classes may be specified. Arguments: *Class: The class or classes (:class:`AbstractElement` or subclasses). Not instances! Yields: elements (instances derived from :class:`AbstractElement`) """ e = self while e: if e.parent: e = e.parent if not Class or isinstance(e,Class): yield e elif isinstance(Class, tuple): for C in Class: if isinstance(e,C): yield e else: break def ancestor(self, *Classes): """Find the most immediate ancestor of the specified type, multiple classes may be specified. Arguments: *Classes: The possible classes (:class:`AbstractElement` or subclasses) to select from. Not instances! Example:: paragraph = word.ancestor(folia.Paragraph) """ for e in self.ancestors(tuple(Classes)): return e raise NoSuchAnnotation def xml(self, attribs = None,elements = None, skipchildren = False): """Serialises the FoLiA element and all its contents to XML. Arguments are mostly for internal use. 
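Example (an illustrative sketch; ``word`` may be any FoLiA element)::

    xmlnode = word.xml() #an lxml.etree.Element
    print(word.xmlstring(pretty_print=True)) #or serialise directly to a string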
Returns: an lxml.etree.Element See also: :meth:`AbstractElement.xmlstring` - for direct string output """ E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) if not attribs: attribs = {} if not elements: elements = [] if self.id: attribs['{http://www.w3.org/XML/1998/namespace}id'] = self.id #Some attributes only need to be added if they are not the same as what's already set in the declaration if not isinstance(self, AbstractAnnotationLayer): if '{' + NSFOLIA + '}set' not in attribs: #do not override if overloaded function already set it try: if self.set: if not self.ANNOTATIONTYPE in self.doc.annotationdefaults or len(self.doc.annotationdefaults[self.ANNOTATIONTYPE]) != 1 or list(self.doc.annotationdefaults[self.ANNOTATIONTYPE].keys())[0] != self.set: if self.set != None: attribs['{' + NSFOLIA + '}set'] = self.set except AttributeError: pass if '{' + NSFOLIA + '}class' not in attribs: #do not override if caller already set it try: if self.cls: attribs['{' + NSFOLIA + '}class'] = self.cls except AttributeError: pass if '{' + NSFOLIA + '}annotator' not in attribs: #do not override if caller already set it try: if self.annotator and ((not (self.ANNOTATIONTYPE in self.doc.annotationdefaults)) or (not ( 'annotator' in self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set])) or (self.annotator != self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set]['annotator'])): attribs['{' + NSFOLIA + '}annotator'] = self.annotator if self.annotatortype and ((not (self.ANNOTATIONTYPE in self.doc.annotationdefaults)) or (not ('annotatortype' in self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set])) or (self.annotatortype != self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set]['annotatortype'])): if self.annotatortype == AnnotatorType.AUTO: attribs['{' + NSFOLIA + '}annotatortype'] = 'auto' elif self.annotatortype == AnnotatorType.MANUAL: attribs['{' + NSFOLIA + '}annotatortype'] = 'manual' except AttributeError: pass if '{' + NSFOLIA + '}confidence' not in attribs: #do not override if caller already set it if self.confidence: attribs['{' + NSFOLIA + '}confidence'] = str(self.confidence) if '{' + NSFOLIA + '}n' not in attribs: #do not override if caller already set it if self.n: attribs['{' + NSFOLIA + '}n'] = str(self.n) if '{' + NSFOLIA + '}auth' not in attribs: #do not override if caller already set it try: if not self.AUTH or not self.auth: #(former is static, latter isn't) attribs['{' + NSFOLIA + '}auth'] = 'no' except AttributeError: pass if '{' + NSFOLIA + '}datetime' not in attribs: #do not override if caller already set it if self.datetime and ((not (self.ANNOTATIONTYPE in self.doc.annotationdefaults)) or (not ( 'datetime' in self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set])) or (self.datetime != self.doc.annotationdefaults[self.ANNOTATIONTYPE][self.set]['datetime'])): attribs['{' + NSFOLIA + '}datetime'] = self.datetime.strftime("%Y-%m-%dT%H:%M:%S") if '{' + NSFOLIA + '}src' not in attribs: #do not override if caller already set it if self.src: attribs['{' + NSFOLIA + '}src'] = self.src if '{' + NSFOLIA + '}speaker' not in attribs: #do not override if caller already set it if self.speaker: attribs['{' + NSFOLIA + '}speaker'] = self.speaker if '{' + NSFOLIA + '}begintime' not in attribs: #do not override if caller already set it if self.begintime: attribs['{' + NSFOLIA + '}begintime'] = "%02d:%02d:%02d.%03d" % self.begintime if '{' + NSFOLIA + '}endtime' not in attribs: #do not override if caller already set it if 
self.endtime: attribs['{' + NSFOLIA + '}endtime'] = "%02d:%02d:%02d.%03d" % self.endtime if self.XLINK: if self.href: attribs['{http://www.w3.org/1999/xlink}href'] = self.href if not self.xlinktype: attribs['{http://www.w3.org/1999/xlink}type'] = "simple" if self.xlinktype: attribs['{http://www.w3.org/1999/xlink}type'] = self.xlinktype if self.xlinklabel: attribs['{http://www.w3.org/1999/xlink}label'] = self.xlinklabel if self.xlinkrole: attribs['{http://www.w3.org/1999/xlink}role'] = self.xlinkrole if self.xlinkshow: attribs['{http://www.w3.org/1999/xlink}show'] = self.xlinkshow if self.xlinktitle: attribs['{http://www.w3.org/1999/xlink}title'] = self.xlinktitle omitchildren = [] #Are there predetermined Features in ACCEPTED_DATA? for c in self.ACCEPTED_DATA: if issubclass(c, Feature) and c.SUBSET: #Do we have any of those? for c2 in self.data: if c2.__class__ is c and c.SUBSET == c2.SUBSET and c2.cls: #Yes, serialize them as attributes attribs[c2.SUBSET] = c2.cls omitchildren.append(c2) #and skip them as elements break #only one e = makeelement(E, '{' + NSFOLIA + '}' + self.XMLTAG, **attribs) if not skipchildren and self.data: #append children, # we want make sure that text elements are in the right order, 'current' class first # so we first put them in a list textelements = [] otherelements = [] for child in self: if isinstance(child, TextContent): if child.cls == 'current': textelements.insert(0, child) else: textelements.append(child) elif not child in omitchildren: otherelements.append(child) for child in textelements+otherelements: if (self.TEXTCONTAINER or self.PHONCONTAINER) and isstring(child): if len(e) == 0: if e.text: e.text += child else: e.text = child else: #add to tail of last child if e[-1].tail: e[-1].tail += child else: e[-1].tail = child else: xml = child.xml() #may return None in rare occassions, meaning we wan to skip if not xml is None: e.append(xml) if elements: #extra elements for e2 in elements: if isinstance(e2, str) or (sys.version < '3' and isinstance(e2, unicode)): if e.text is None: e.text = e2 else: e.text += e2 else: e.append(e2) return e def json(self, attribs=None, recurse=True, ignorelist=False): """Serialises the FoLiA element and all its contents to a Python dictionary suitable for serialisation to JSON. 
Example:: import json json.dumps(word.json()) Returns: dict """ jsonnode = {} jsonnode['type'] = self.XMLTAG if self.id: jsonnode['id'] = self.id if self.set: jsonnode['set'] = self.set if self.cls: jsonnode['class'] = self.cls if self.annotator: jsonnode['annotator'] = self.annotator if self.annotatortype: if self.annotatortype == AnnotatorType.AUTO: jsonnode['annotatortype'] = "auto" elif self.annotatortype == AnnotatorType.MANUAL: jsonnode['annotatortype'] = "manual" if self.confidence is not None: jsonnode['confidence'] = self.confidence if self.n: jsonnode['n'] = self.n if self.auth: jsonnode['auth'] = self.auth if self.datetime: jsonnode['datetime'] = self.datetime.strftime("%Y-%m-%dT%H:%M:%S") if recurse: #pylint: disable=too-many-nested-blocks jsonnode['children'] = [] if self.TEXTCONTAINER: jsonnode['text'] = self.text() if self.PHONCONTAINER: jsonnode['phon'] = self.phon() for child in self: if self.TEXTCONTAINER and isstring(child): jsonnode['children'].append(child) elif not self.PHONCONTAINER: #check ignore list ignore = False if ignorelist: for e in ignorelist: if isinstance(child,e): ignore = True break if not ignore: jsonnode['children'].append(child.json(attribs,recurse,ignorelist)) if attribs: for attrib in attribs: jsonnode[attrib] = attribs return jsonnode def xmlstring(self, pretty_print=False): """Serialises this FoLiA element and all its contents to XML. Returns: str: a string with XML representation for this element and all its children""" s = ElementTree.tostring(self.xml(), xml_declaration=False, pretty_print=pretty_print, encoding='utf-8') if sys.version < '3': if isinstance(s, str): s = unicode(s,'utf-8') #pylint: disable=undefined-variable else: if isinstance(s,bytes): s = str(s,'utf-8') s = s.replace('ns0:','') #ugly patch to get rid of namespace prefix s = s.replace(':ns0','') return s def select(self, Class, set=None, recursive=True, ignore=True, node=None): #pylint: disable=bad-classmethod-argument,redefined-builtin """Select child elements of the specified class. A further restriction can be made based on set. Arguments: Class (class): The class to select; any python class (not instance) subclassed off :class:`AbstractElement` Set (str): The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned. recursive (bool): Select recursively? Descending into child elements? Defaults to ``True``. ignore: A list of Classes to ignore, if set to ``True`` instead of a list, all non-authoritative elements will be skipped (this is the default behaviour and corresponds to the following elements: :class:`Alternative`, :class:`AlternativeLayer`, :class:`Suggestion`, and :class:`folia.Original`. These elements and those contained within are never *authorative*. You may also include the boolean True as a member of a list, if you want to skip additional tags along the predefined non-authoritative ones. * ``node``: Reserved for internal usage, used in recursion. Yields: Elements (instances derived from :class:`AbstractElement`) Example:: for sense in text.select(folia.Sense, 'cornetto', True, [folia.Original, folia.Suggestion, folia.Alternative] ): .. 
""" #if ignorelist is True: # ignorelist = default_ignore if not node: node = self for e in self.data: #pylint: disable=too-many-nested-blocks if (not self.TEXTCONTAINER and not self.PHONCONTAINER) or isinstance(e, AbstractElement): if ignore is True: try: if not e.auth: continue except AttributeError: #not all elements have auth attribute.. pass elif ignore: #list doignore = False for c in ignore: if c is True: try: if not e.auth: doignore =True break except AttributeError: #not all elements have auth attribute.. pass elif c == e.__class__ or issubclass(e.__class__,c): doignore = True break if doignore: continue if isinstance(e, Class): if not set is None: try: if e.set != set: continue except AttributeError: continue yield e if recursive: for e2 in e.select(Class, set, recursive, ignore, e): if not set is None: try: if e2.set != set: continue except AttributeError: continue yield e2 def count(self, Class, set=None, recursive=True, ignore=True, node=None): """Like :meth:`AbstractElement.select`, but instead of returning the elements, it merely counts them. Returns: int """ return sum(1 for i in self.select(Class,set,recursive,ignore,node) ) def items(self, founditems=[]): #pylint: disable=dangerous-default-value """Returns a depth-first flat list of *all* items below this element (not limited to AbstractElement)""" l = [] for e in self.data: if e not in founditems: #prevent going in recursive loops l.append(e) if isinstance(e, AbstractElement): l += e.items(l) return l def getindex(self, child, recursive=True, ignore=True): """Get the index at which an element occurs, recursive by default! Returns: int """ #breadth first search for i, c in enumerate(self.data): if c is child: return i if recursive: #pylint: disable=too-many-nested-blocks for i, c in enumerate(self.data): if ignore is True: try: if not c.auth: continue except AttributeError: #not all elements have auth attribute.. pass elif ignore: #list doignore = False for e in ignore: if e is True: try: if not c.auth: doignore =True break except AttributeError: #not all elements have auth attribute.. pass elif e == c.__class__ or issubclass(c.__class__,e): doignore = True break if doignore: continue if isinstance(c, AbstractElement): j = c.getindex(child, recursive) if j != -1: return i #yes, i ... not j! return -1 def next(self, Class=True, scope=True, reverse=False): """Returns the next element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no next element is found. Non-authoritative elements are never returned. Arguments: * ``Class``: The class to select; any python class subclassed off `'AbstractElement``, may also be a tuple of multiple classes. Set to ``True`` to constrain to the same class as that of the current instance, set to ``None`` to not constrain at all * ``scope``: A list of classes which are never crossed looking for a next element. Set to ``True`` to constrain to a default list of structure elements (Sentence,Paragraph,Division,Event, ListItem,Caption), set to ``None`` to not constrain at all. 
""" if Class is True: Class = self.__class__ if scope is True: scope = STRUCTURESCOPE structural = Class is not None and issubclass(Class,AbstractStructureElement) if reverse: order = reversed descendindex = -1 else: order = lambda x: x #pylint: disable=redefined-variable-type descendindex = 0 child = self parent = self.parent while parent: #pylint: disable=too-many-nested-blocks if len(parent) > 1: returnnext = False for e in order(parent): if e is child: #we found the current item, next item will be the one to return returnnext = True elif returnnext and e.auth and not isinstance(e,AbstractAnnotationLayer) and (not structural or (structural and (not isinstance(e,(AbstractTokenAnnotation,TextContent)) ) )): if structural and isinstance(e,Correction): if not list(e.select(AbstractStructureElement)): #skip-over non-structural correction continue if Class is None or (isinstance(Class,tuple) and (any(isinstance(e,C) for C in Class))) or isinstance(e,Class): return e else: #this is not yet the element of the type we are looking for, we are going to descend again in the very leftmost (rightmost if reversed) branch only while e.data: e = e.data[descendindex] if not isinstance(e, AbstractElement): return None #we've gone too far if e.auth and not isinstance(e,AbstractAnnotationLayer): if Class is None or (isinstance(Class,tuple) and (any(isinstance(e,C) for C in Class))) or isinstance(e,Class): return e else: #descend deeper continue return None #generational iteration child = parent if scope is not None and child.__class__ in scope: #you shall not pass! break parent = parent.parent return None def previous(self, Class=True, scope=True): """Returns the previous element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no next element is found. Non-authoritative elements are never returned. Arguments: * ``Class``: The class to select; any python class subclassed off `'AbstractElement``. Set to ``True`` to constrain to the same class as that of the current instance, set to ``None`` to not constrain at all * ``scope``: A list of classes which are never crossed looking for a next element. Set to ``True`` to constrain to a default list of structure elements (Sentence,Paragraph,Division,Event, ListItem,Caption), set to ``None`` to not constrain at all. """ return self.next(Class,scope, True) def leftcontext(self, size, placeholder=None, scope=None): """Returns the left context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope""" if size == 0: return [] #for efficiency context = [] e = self while len(context) < size: e = e.previous(True,scope) if not e: break context.append(e) if placeholder: while len(context) < size: context.append(placeholder) context.reverse() return context def rightcontext(self, size, placeholder=None, scope=None): """Returns the right context for an element, as a list. 
This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope""" if size == 0: return [] #for efficiency context = [] e = self while len(context) < size: e = e.next(True,scope) if not e: break context.append(e) if placeholder: while len(context) < size: context.append(placeholder) return context def context(self, size, placeholder=None, scope=None): """Returns this word in context, {size} words to the left, the current word, and {size} words to the right""" return self.leftcontext(size, placeholder,scope) + [self] + self.rightcontext(size, placeholder,scope) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None, origclass = None): """Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)""" E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if origclass: cls = origclass preamble = [] try: if cls.__doc__: E2 = ElementMaker(namespace="http://relaxng.org/ns/annotation/0.9", nsmap={'a':'http://relaxng.org/ns/annotation/0.9'} ) preamble.append(E2.documentation(cls.__doc__)) except AttributeError: pass if cls.REQUIRED_ATTRIBS is None: cls.REQUIRED_ATTRIBS = () #bit hacky if cls.OPTIONAL_ATTRIBS is None: cls.OPTIONAL_ATTRIBS = () #bit hacky attribs = [ ] if cls.REQUIRED_ATTRIBS and Attrib.ID in cls.REQUIRED_ATTRIBS: attribs.append( E.attribute(name='id', ns="http://www.w3.org/XML/1998/namespace") ) elif Attrib.ID in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute(name='id', ns="http://www.w3.org/XML/1998/namespace") ) ) if Attrib.CLASS in cls.REQUIRED_ATTRIBS: #Set is a tough one, we can't require it as it may be defined in the declaration: we make it optional and need schematron to resolve this later attribs.append( E.attribute(name='class') ) attribs.append( E.optional( E.attribute( name='set' ) ) ) elif Attrib.CLASS in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute(name='class') ) ) attribs.append( E.optional( E.attribute( name='set' ) ) ) if Attrib.ANNOTATOR in cls.REQUIRED_ATTRIBS or Attrib.ANNOTATOR in cls.OPTIONAL_ATTRIBS: #Similarly tough attribs.append( E.optional( E.attribute(name='annotator') ) ) attribs.append( E.optional( E.attribute(name='annotatortype') ) ) if Attrib.CONFIDENCE in cls.REQUIRED_ATTRIBS: attribs.append( E.attribute(E.data(type='double',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='confidence') ) elif Attrib.CONFIDENCE in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute(E.data(type='double',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='confidence') ) ) if Attrib.N in cls.REQUIRED_ATTRIBS: attribs.append( E.attribute( name='n') ) elif Attrib.N in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute( name='n') ) ) if Attrib.DATETIME in cls.REQUIRED_ATTRIBS: attribs.append( E.attribute(E.data(type='dateTime',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='datetime') ) elif Attrib.DATETIME in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute( E.data(type='dateTime',datatypeLibrary='http://www.w3.org/2001/XMLSchema-datatypes'), name='datetime') ) ) if Attrib.BEGINTIME in cls.REQUIRED_ATTRIBS: attribs.append(E.attribute(name='begintime') ) elif Attrib.BEGINTIME in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( 
E.attribute(name='begintime') ) ) if Attrib.ENDTIME in cls.REQUIRED_ATTRIBS: attribs.append(E.attribute(name='endtime') ) elif Attrib.ENDTIME in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute(name='endtime') ) ) if Attrib.SRC in cls.REQUIRED_ATTRIBS: attribs.append(E.attribute(name='src') ) elif Attrib.SRC in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute(name='src') ) ) if Attrib.SPEAKER in cls.REQUIRED_ATTRIBS: attribs.append(E.attribute(name='speaker') ) elif Attrib.SPEAKER in cls.OPTIONAL_ATTRIBS: attribs.append( E.optional( E.attribute(name='speaker') ) ) if cls.XLINK: attribs += [ #loose interpretation of specs, not checking whether xlink combinations are valid E.optional(E.attribute(name='href',ns="http://www.w3.org/1999/xlink"),E.attribute(name='type',ns="http://www.w3.org/1999/xlink") ), E.optional(E.attribute(name='role',ns="http://www.w3.org/1999/xlink")), E.optional(E.attribute(name='title',ns="http://www.w3.org/1999/xlink")), E.optional(E.attribute(name='label',ns="http://www.w3.org/1999/xlink")), E.optional(E.attribute(name='show',ns="http://www.w3.org/1999/xlink")), ] attribs.append( E.optional( E.attribute( name='auth' ) ) ) if extraattribs: for e in extraattribs: attribs.append(e) #s attribs.append( E.ref(name="allow_foreign_attributes") ) elements = [] #(including attributes) if cls.TEXTCONTAINER or cls.PHONCONTAINER: elements.append( E.text()) #We actually want to require non-empty text (E.text() is not sufficient) #but this is not solved yet, see https://github.com/proycon/folia/issues/19 #elements.append( E.data(E.param(r".+",name="pattern"),type='string')) #elements.append( E.data(E.param(r"(.|\n|\r)*\S+(.|\n|\r)*",name="pattern"),type='string')) done = {} if includechildren and cls.ACCEPTED_DATA: #pylint: disable=too-many-nested-blocks for c in cls.ACCEPTED_DATA: if c.__name__[:8] == 'Abstract' and inspect.isclass(c): for c2 in globals().values(): try: if inspect.isclass(c2) and issubclass(c2, c): try: if c2.XMLTAG and c2.XMLTAG not in done: if c2.OCCURRENCES == 1: elements.append( E.optional( E.ref(name=c2.XMLTAG) ) ) else: elements.append( E.zeroOrMore( E.ref(name=c2.XMLTAG) ) ) if c2.XMLTAG == 'item': #nasty hack for backward compatibility with deprecated listitem element elements.append( E.zeroOrMore( E.ref(name='listitem') ) ) done[c2.XMLTAG] = True except AttributeError: continue except TypeError: pass elif issubclass(c, Feature) and c.SUBSET: attribs.append( E.optional( E.attribute(name=c.SUBSET))) #features as attributes else: try: if c.XMLTAG and c.XMLTAG not in done: if cls.REQUIRED_DATA and c in cls.REQUIRED_DATA: if c.OCCURRENCES == 1: elements.append( E.ref(name=c.XMLTAG) ) else: elements.append( E.oneOrMore( E.ref(name=c.XMLTAG) ) ) elif c.OCCURRENCES == 1: elements.append( E.optional( E.ref(name=c.XMLTAG) ) ) else: elements.append( E.zeroOrMore( E.ref(name=c.XMLTAG) ) ) if c.XMLTAG == 'item': #nasty hack for backward compatibility with deprecated listitem element elements.append( E.zeroOrMore( E.ref(name='listitem') ) ) done[c.XMLTAG] = True except AttributeError: continue if extraelements: for e in extraelements: elements.append( e ) if elements: if len(elements) > 1: attribs.append( E.interleave(*elements) ) else: attribs.append( *elements ) if not attribs: attribs.append( E.empty() ) if cls.XMLTAG in ('desc','comment'): return E.define( E.element(E.text(), *(preamble + attribs), **{'name': cls.XMLTAG}), name=cls.XMLTAG, ns=NSFOLIA) else: return E.define( E.element(*(preamble + attribs), **{'name': cls.XMLTAG}), 
name=cls.XMLTAG, ns=NSFOLIA) @classmethod def parsexml(Class, node, doc, **kwargs): #pylint: disable=bad-classmethod-argument """Internal class method used for turning an XML element into an instance of the Class. Args: * ``node`` - XML Element * ``doc`` - Document Returns: An instance of the current Class. """ assert issubclass(Class, AbstractElement) if doc.preparsexmlcallback: result = doc.preparsexmlcallback(node) if not result: return None if isinstance(result, AbstractElement): return result dcoi = node.tag.startswith('{' + NSDCOI + '}') args = [] if not kwargs: kwargs = {} text = None #for dcoi support if (Class.TEXTCONTAINER or Class.PHONCONTAINER) and node.text: args.append(node.text) for subnode in node: #pylint: disable=too-many-nested-blocks #don't trip over comments if not isinstance(subnode, ElementTree._Comment): #pylint: disable=protected-access if subnode.tag.startswith('{' + NSFOLIA + '}'): if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Processing subnode " + subnode.tag[nslen:],file=stderr) e = doc.parsexml(subnode, Class) if e is not None: args.append(e) if (Class.TEXTCONTAINER or Class.PHONCONTAINER) and subnode.tail: args.append(subnode.tail) elif subnode.tag.startswith('{' + NSDCOI + '}'): #Dcoi support if Class is Text and subnode.tag[nslendcoi:] == 'body': for subsubnode in subnode: if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Processing DCOI subnode " + subnode.tag[nslendcoi:],file=stderr) e = doc.parsexml(subsubnode, Class) if e is not None: args.append(e) else: if doc.debug >= 1: print( "[PyNLPl FoLiA DEBUG] Processing DCOI subnode " + subnode.tag[nslendcoi:],file=stderr) e = doc.parsexml(subnode, Class) if e is not None: args.append(e) elif doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Ignoring subnode outside of FoLiA namespace: " + subnode.tag,file=stderr) if dcoi: dcoipos = dcoilemma = dcoicorrection = dcoicorrectionoriginal = None for key, value in node.attrib.items(): if key[0] == '{' or key =='XMLid': if key == '{http://www.w3.org/XML/1998/namespace}id' or key == 'XMLid': key = 'id' elif key.startswith( '{' + NSFOLIA + '}'): key = key[nslen:] if key == 'id': #ID in FoLiA namespace is always a reference, passed in kwargs as follows: key = 'idref' elif Class.XLINK and key.startswith('{http://www.w3.org/1999/xlink}'): key = key[30:] if key != 'href': key = 'xlink' + key #xlinktype, xlinkrole, xlinklabel, xlinkshow, etc.. 
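# (Editorial illustration, not part of the original source) The branches above and below
# normalise raw XML attribute names to the keyword arguments the element constructors expect.
# Assuming a node carrying these (hypothetical) attributes, the mapping would be:
#   {http://www.w3.org/XML/1998/namespace}id -> 'id'
#   {http://ilk.uvt.nl/folia}id              -> 'idref'   (a FoLiA-namespaced id is a reference)
#   {http://www.w3.org/1999/xlink}href       -> 'href'
#   {http://www.w3.org/1999/xlink}type       -> 'xlinktype'
# D-Coi namespaced attributes are handled by the next branch below.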
elif key.startswith('{' + NSDCOI + '}'): key = key[nslendcoi:] #D-Coi support: if dcoi: if Class is Word and key == 'pos': dcoipos = value continue elif Class is Word and key == 'lemma': dcoilemma = value continue elif Class is Word and key == 'correction': dcoicorrection = value #class continue elif Class is Word and key == 'original': dcoicorrectionoriginal = value continue elif Class is Gap and key == 'reason': key = 'class' elif Class is Gap and key == 'hand': key = 'annotator' elif Class is Division and key == 'type': key = 'cls' kwargs[key] = value #D-Coi support: if dcoi and TextContent in Class.ACCEPTED_DATA and node.text: text = node.text.strip() kwargs['text'] = text if not AnnotationType.TOKEN in doc.annotationdefaults: doc.declare(AnnotationType.TOKEN, set='http://ilk.uvt.nl/folia/sets/ilktok.foliaset') if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found " + node.tag[nslen:],file=stderr) instance = Class(doc, *args, **kwargs) #if id: # if doc.debug >= 1: print >>stderr, "[PyNLPl FoLiA DEBUG] Adding to index: " + id # doc.index[id] = instance if dcoi: if dcoipos: if not AnnotationType.POS in doc.annotationdefaults: doc.declare(AnnotationType.POS, set='http://ilk.uvt.nl/folia/sets/cgn-legacy.foliaset') instance.append( PosAnnotation(doc, cls=dcoipos) ) if dcoilemma: if not AnnotationType.LEMMA in doc.annotationdefaults: doc.declare(AnnotationType.LEMMA, set='http://ilk.uvt.nl/folia/sets/mblem-nl.foliaset') instance.append( LemmaAnnotation(doc, cls=dcoilemma) ) if dcoicorrection and dcoicorrectionoriginal and text: if not AnnotationType.CORRECTION in doc.annotationdefaults: doc.declare(AnnotationType.CORRECTION, set='http://ilk.uvt.nl/folia/sets/dcoi-corrections.foliaset') instance.correct(generate_id_in=instance, cls=dcoicorrection, original=dcoicorrectionoriginal, new=text) if doc.parsexmlcallback: result = doc.parsexmlcallback(instance) if not result: return None if isinstance(result, AbstractElement): return result return instance def resolveword(self, id): return None def remove(self, child): """Removes the child element""" if not isinstance(child, AbstractElement): raise ValueError("Expected AbstractElement, got " + str(type(child))) if child.parent == self: child.parent = None self.data.remove(child) #delete from index if child.id and self.doc and child.id in self.doc.index: del self.doc.index[child.id] def incorrection(self): """Is this element part of a correction? 
If it is, it returns the Correction element (evaluating to True), otherwise it returns None""" e = self.parent while e: if isinstance(e, Correction): return e if isinstance(e, AbstractStructureElement): break e = e.parent return None class Description(AbstractElement): """Description is an element that can be used to associate a description with almost any other FoLiA element""" def __init__(self,doc, *args, **kwargs): """Required keyword arguments: * ``value=``: The text content for the description (``str`` or ``unicode``) """ if 'value' in kwargs: if kwargs['value'] is None: self.value = "" elif isstring(kwargs['value']): self.value = u(kwargs['value']) else: if sys.version < '3': raise Exception("value= parameter must be unicode or str instance, got " + str(type(kwargs['value']))) else: raise Exception("value= parameter must be str instance, got " + str(type(kwargs['value']))) del kwargs['value'] else: raise Exception("Description expects value= parameter") super(Description,self).__init__(doc, *args, **kwargs) def __nonzero__(self): #Python 2.x return bool(self.value) def __bool__(self): return bool(self.value) def __unicode__(self): return self.value def __str__(self): return self.value def xml(self, attribs = None,elements = None, skipchildren = False): return super(Description, self).xml(attribs, [self.value],skipchildren) def json(self,attribs =None, recurse=True, ignorelist=False): jsonnode = {'type': self.XMLTAG, 'value': self.value} if attribs: for attrib in attribs: jsonnode[attrib] = attrib return jsonnode @classmethod def parsexml(Class, node, doc, **kwargs): if not kwargs: kwargs = {} kwargs['value'] = node.text return super(Description,Class).parsexml(node, doc, **kwargs) class Comment(AbstractElement): """Comment is an element that can be used to associate a comment with almost any other FoLiA element""" def __init__(self,doc, *args, **kwargs): """Required keyword arguments: * ``value=``: The text content for the comment (``str`` or ``unicode``) """ if 'value' in kwargs: if kwargs['value'] is None: self.value = "" elif isstring(kwargs['value']): self.value = u(kwargs['value']) else: if sys.version < '3': raise Exception("value= parameter must be unicode or str instance, got " + str(type(kwargs['value']))) else: raise Exception("value= parameter must be str instance, got " + str(type(kwargs['value']))) del kwargs['value'] else: raise Exception("Comment expects value= parameter") super(Comment,self).__init__(doc, *args, **kwargs) def __nonzero__(self): #Python 2.x return bool(self.value) def __bool__(self): return bool(self.value) def __unicode__(self): return self.value def __str__(self): return self.value def xml(self, attribs = None,elements = None, skipchildren = False): return super(Comment, self).xml(attribs, [self.value],skipchildren) def json(self,attribs =None, recurse=True, ignorelist=False): jsonnode = {'type': self.XMLTAG, 'value': self.value} if attribs: for attrib in attribs: jsonnode[attrib] = attrib return jsonnode @classmethod def parsexml(Class, node, doc, **kwargs): if not kwargs: kwargs = {} kwargs['value'] = node.text return super(Comment,Class).parsexml(node, doc, **kwargs) class AllowCorrections(object): def correct(self, **kwargs): """Apply a correction (TODO: documentation to be written still)""" if 'insertindex_offset' in kwargs: del kwargs['insertindex_offset'] #dealt with in an earlier stage if 'confidence' in kwargs and kwargs['confidence'] is None: del kwargs['confidence'] if 'reuse' in kwargs: #reuse an existing correction instead of making a 
new one if isinstance(kwargs['reuse'], Correction): c = kwargs['reuse'] else: #assume it's an index try: c = self.doc.index[kwargs['reuse']] assert isinstance(c, Correction) except: raise ValueError("reuse= must point to an existing correction (id or instance)! Got " + str(kwargs['reuse'])) suggestionsonly = (not c.hasnew(True) and not c.hasoriginal(True) and c.hassuggestions(True)) if 'new' in kwargs and c.hascurrent(): #can't add new if there's current, so first set original to current, and then delete current if 'current' in kwargs: raise Exception("Can't set both new= and current= !") if 'original' not in kwargs: kwargs['original'] = c.current() c.remove(c.current()) else: if 'id' not in kwargs and 'generate_id_in' not in kwargs: kwargs['generate_id_in'] = self kwargs2 = copy(kwargs) for x in ['new','original','suggestion', 'suggestions','current', 'insertindex','nooriginal']: if x in kwargs2: del kwargs2[x] c = Correction(self.doc, **kwargs2) addnew = False if 'insertindex' in kwargs: insertindex = int(kwargs['insertindex']) del kwargs['insertindex'] else: insertindex = -1 #append if 'nooriginal' in kwargs and kwargs['nooriginal']: nooriginal = True del kwargs['nooriginal'] else: nooriginal = False if 'current' in kwargs: if 'original' in kwargs or 'new' in kwargs: raise Exception("When setting current=, original= and new= can not be set!") if not isinstance(kwargs['current'], list) and not isinstance(kwargs['current'], tuple): kwargs['current'] = [kwargs['current']] #support both lists (for multiple elements at once), as well as single element c.replace(Current(self.doc, *kwargs['current'])) for o in kwargs['current']: #delete current from current element if o in self and isinstance(o, AbstractElement): #pylint: disable=unsupported-membership-test if insertindex == -1: insertindex = self.data.index(o) self.remove(o) del kwargs['current'] if 'new' in kwargs: if not isinstance(kwargs['new'], list) and not isinstance(kwargs['new'], tuple): kwargs['new'] = [kwargs['new']] #support both lists (for multiple elements at once), as well as single element addnew = New(self.doc, *kwargs['new']) #pylint: disable=redefined-variable-type c.replace(addnew) for current in c.select(Current): #delete current if present c.remove(current) del kwargs['new'] if 'original' in kwargs and kwargs['original']: if not isinstance(kwargs['original'], list) and not isinstance(kwargs['original'], tuple): kwargs['original'] = [kwargs['original']] #support both lists (for multiple elements at once), as well as single element c.replace(Original(self.doc, *kwargs['original'])) for o in kwargs['original']: #delete original from current element if o in self and isinstance(o, AbstractElement): #pylint: disable=unsupported-membership-test if insertindex == -1: insertindex = self.data.index(o) self.remove(o) for o in kwargs['original']: #make sure IDs are still properly set after removal o.addtoindex() for current in c.select(Current): #delete current if present c.remove(current) del kwargs['original'] elif addnew and not nooriginal: #original not specified, find automagically: original = [] for new in addnew: kwargs2 = {} if isinstance(new, TextContent): kwargs2['cls'] = new.cls try: set = new.set except AttributeError: set = None #print("DEBUG: Finding replaceables within " + str(repr(self)) + " for ", str(repr(new)), " set " ,set , " args " ,repr(kwargs2),file=sys.stderr) replaceables = new.__class__.findreplaceables(self, set, **kwargs2) #print("DEBUG: " , len(replaceables) , " found",file=sys.stderr) original += 
replaceables if not original: #print("DEBUG: ", self.xmlstring(),file=sys.stderr) raise Exception("No original= specified and unable to automatically infer on " + str(repr(self)) + " for " + str(repr(new)) + " with set " + set) else: c.replace( Original(self.doc, *original)) for current in c.select(Current): #delete current if present c.remove(current) if addnew and not nooriginal: for original in c.original(): if original in self: #pylint: disable=unsupported-membership-test self.remove(original) if 'suggestion' in kwargs: kwargs['suggestions'] = [kwargs['suggestion']] del kwargs['suggestion'] if 'suggestions' in kwargs: for suggestion in kwargs['suggestions']: if isinstance(suggestion, Suggestion): c.append(suggestion) elif isinstance(suggestion, list) or isinstance(suggestion, tuple): c.append(Suggestion(self.doc, *suggestion)) else: c.append(Suggestion(self.doc, suggestion)) del kwargs['suggestions'] if 'reuse' in kwargs: if addnew and suggestionsonly: #What was previously only a suggestion, now becomes a real correction #If annotator, annotatortypes #are associated with the correction as a whole, move it to the suggestions #correction-wide annotator, annotatortypes might be overwritten for suggestion in c.suggestions(): if c.annotator and not suggestion.annotator: suggestion.annotator = c.annotator if c.annotatortype and not suggestion.annotatortype: suggestion.annotatortype = c.annotatortype if 'annotator' in kwargs: c.annotator = kwargs['annotator'] #pylint: disable=attribute-defined-outside-init if 'annotatortype' in kwargs: c.annotatortype = kwargs['annotatortype'] #pylint: disable=attribute-defined-outside-init if 'confidence' in kwargs: c.confidence = float(kwargs['confidence']) #pylint: disable=attribute-defined-outside-init c.addtoindex() del kwargs['reuse'] else: c.addtoindex() if insertindex == -1: self.append(c) else: self.insert(insertindex, c) return c class AllowTokenAnnotation(AllowCorrections): """Elements that allow token annotation (including extended annotation) must inherit from this class""" def annotations(self,Class,set=None): """Obtain child elements (annotations) of the specified class. A further restriction can be made based on set. Arguments: Class (class): The class to select; any python class (not instance) subclassed off :class:`AbstractElement` Set (str): The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned. Yields: Elements (instances derived from :class:`AbstractElement`) Example:: for sense in text.annotations(folia.Sense, 'http://some/path/cornetto'): .. See also: :meth:`AbstractElement.select` Raises: :meth:`AllowTokenAnnotation.annotations` :class:`NoSuchAnnotation` if no such annotation exists """ found = False for e in self.select(Class,set,True,default_ignore_annotations): found = True yield e if not found: raise NoSuchAnnotation() def hasannotation(self,Class,set=None): """Returns an integer indicating whether such as annotation exists, and if so, how many. See :meth:`AllowTokenAnnotation.annotations`` for a description of the parameters.""" return sum( 1 for _ in self.select(Class,set,True,default_ignore_annotations)) def annotation(self, type, set=None): """Obtain a single annotation element. A further restriction can be made based on set. 
Arguments: Class (class): The class to select; any python class (not instance) subclassed off :class:`AbstractElement` Set (str): The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned. Returns: An element (instance derived from :class:`AbstractElement`) Example:: sense = word.annotation(folia.Sense, 'http://some/path/cornetto').cls See also: :meth:`AllowTokenAnnotation.annotations` :meth:`AbstractElement.select` Raises: :class:`NoSuchAnnotation` if no such annotation exists """ """Will return a **single** annotation (even if there are multiple). Raises a ``NoSuchAnnotation`` exception if none was found""" for e in self.select(type,set,True,default_ignore_annotations): return e raise NoSuchAnnotation() def alternatives(self, Class=None, set=None): """Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set. Arguments: Class (class): The python Class you want to retrieve (e.g. PosAnnotation). Or set to ``None`` to select all alternatives regardless of what type they are. set (str): The set you want to retrieve (defaults to ``None``, which selects irregardless of set) Yields: :class:`Alternative` elements """ for e in self.select(Alternative,None, True, []): #pylint: disable=too-many-nested-blocks if Class is None: yield e elif len(e) >= 1: #child elements? for e2 in e: try: if isinstance(e2, Class): try: if set is None or e2.set == set: yield e #not e2 break #yield an alternative only once (in case there are multiple matches) except AttributeError: continue except AttributeError: continue class AllowGenerateID(object): """Classes inherited from this class allow for automatic ID generation, using the convention of adding a period, the name of the element , another period, and a sequence number""" def _getmaxid(self, xmltag): try: if xmltag in self.maxid: return self.maxid[xmltag] else: return 0 except AttributeError: return 0 def _setmaxid(self, child): #print "set maxid on " + repr(self) + " for " + repr(child) try: self.maxid except AttributeError: self.maxid = {}#pylint: disable=attribute-defined-outside-init try: if child.id and child.XMLTAG: fields = child.id.split(self.doc.IDSEPARATOR) if len(fields) > 1 and fields[-1].isdigit(): if not child.XMLTAG in self.maxid: self.maxid[child.XMLTAG] = int(fields[-1]) #print "set maxid on " + repr(self) + ", " + child.XMLTAG + " to " + fields[-1] else: if self.maxid[child.XMLTAG] < int(fields[-1]): self.maxid[child.XMLTAG] = int(fields[-1]) #print "set maxid on " + repr(self) + ", " + child.XMLTAG + " to " + fields[-1] except AttributeError: pass def generate_id(self, cls): if isinstance(cls,str): xmltag = cls else: try: xmltag = cls.XMLTAG except: raise GenerateIDException("Unable to generate ID, expected a class such as Alternative, Correction, etc...") maxid = self._getmaxid(xmltag) id = None if self.id: id = self.id else: #this element has no ID, fall back to closest parent ID: e = self while e.parent: if e.id: id = e.id break e = e.parent if id is None: raise GenerateIDException("Unable to generate ID, no parent ID could be found") origid = id while True: maxid += 1 id = origid + '.' + xmltag + '.' 
+ str(maxid) if not self.doc or id not in self.doc.index: #extra check break try: self.maxid except AttributeError: self.maxid = {}#pylint: disable=attribute-defined-outside-init self.maxid[xmltag] = maxid #Set MAX ID return id class AbstractStructureElement(AbstractElement, AllowTokenAnnotation, AllowGenerateID): """Abstract element, all structure elements inherit from this class. Never instantiated directly.""" def __init__(self, doc, *args, **kwargs): super(AbstractStructureElement,self).__init__(doc, *args, **kwargs) def resolveword(self, id): for child in self: r = child.resolveword(id) if r: return r return None def append(self, child, *args, **kwargs): """See ``AbstractElement.append()``""" e = super(AbstractStructureElement,self).append(child, *args, **kwargs) self._setmaxid(e) return e def words(self, index = None): """Returns a generator of Word elements found (recursively) under this element. Arguments: * ``index``: If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning the list of all """ if index is None: return self.select(Word,None,True,default_ignore_structure) else: if index < 0: index = self.count(Word,None,True,default_ignore_structure) + index for i, e in enumerate(self.select(Word,None,True,default_ignore_structure)): if i == index: return e raise IndexError def paragraphs(self, index = None): """Returns a generator of Paragraph elements found (recursively) under this element. Arguments: index (int or None): If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning the generator of all """ if index is None: return self.select(Paragraph,None,True,default_ignore_structure) else: if index < 0: index = self.count(Paragraph,None,True,default_ignore_structure) + index for i,e in enumerate(self.select(Paragraph,None,True,default_ignore_structure)): if i == index: return e raise IndexError def sentences(self, index = None): """Returns a generator of Sentence elements found (recursively) under this element Arguments: index (int or None): If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning a generator of all """ if index is None: return self.select(Sentence,None,True,default_ignore_structure) else: if index < 0: index = self.count(Sentence,None,True,default_ignore_structure) + index for i,e in enumerate(self.select(Sentence,None,True,default_ignore_structure)): if i == index: return e raise IndexError def layers(self, annotationtype=None,set=None): """Returns a list of annotation layers found *directly* under this element, does not include alternative layers""" if inspect.isclass(annotationtype): annotationtype = annotationtype.ANNOTATIONTYPE return [ x for x in self.select(AbstractAnnotationLayer,set,False,True) if annotationtype is None or x.ANNOTATIONTYPE == annotationtype ] def hasannotationlayer(self, annotationtype=None,set=None): """Does the specified annotation layer exist?""" l = self.layers(annotationtype, set) return (len(l) > 0) def __eq__(self, other): return super(AbstractStructureElement, self).__eq__(other) class AbstractTokenAnnotation(AbstractElement, AllowGenerateID): """Abstract element, all token annotation elements are derived from this class""" def append(self, child, *args, **kwargs): """See ``AbstractElement.append()``""" e = super(AbstractTokenAnnotation,self).append(child, *args, **kwargs) self._setmaxid(e) return e class AbstractExtendedTokenAnnotation(AbstractTokenAnnotation): pass class 
AbstractTextMarkup(AbstractElement): """Abstract class for text markup elements, elements that appear with the :class:`TextContent` (``t``) element. Markup elements pertain primarily to styling, but also have other roles. Iterating over the element of a :class:`TextContent` element will first and foremost produce strings, but also uncover these markup elements when present. """ def __init__(self, doc, *args, **kwargs): """See :meth:`AbstractElement.__init__`, text is passed as a string in ``*args``.""" if 'idref' in kwargs: self.idref = kwargs['idref'] del kwargs['idref'] else: self.idref = None if 'value' in kwargs: #for backward compatibility kwargs['text'] = kwargs['value'] del kwargs['value'] super(AbstractTextMarkup,self).__init__(doc, *args, **kwargs) #if self.value and (self.value != self.value.translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)): # raise ValueError("There are illegal unicode control characters present in Text Markup Content: " + repr(self.value)) def settext(self, text): """Sets the text content of the markup element. Arguments: text (str) """ self.data = [text] if not self.data: raise ValueError("Empty text content elements are not allowed") def resolve(self): if self.idref: return self.doc[self.idref] else: return self def xml(self, attribs = None,elements = None, skipchildren = False): """See :meth:`AbstractElement.xml`""" if not attribs: attribs = {} if self.idref: attribs['id'] = self.idref return super(AbstractTextMarkup,self).xml(attribs,elements, skipchildren) def json(self,attribs =None, recurse=True, ignorelist=False): """See :meth:`AbstractElement.json`""" if not attribs: attribs = {} if self.idref: attribs['id'] = self.idref return super(AbstractTextMarkup,self).json(attribs,recurse, ignorelist) @classmethod def parsexml(Class, node, doc, **kwargs): if not kwargs: kwargs ={} if 'id' in node.attrib: kwargs['idref'] = node.attrib['id'] del node.attrib['id'] return super(AbstractTextMarkup,Class).parsexml(node, doc, **kwargs) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if not extraattribs: extraattribs = [] extraattribs.append( E.optional(E.attribute(name='id' ))) #id reference return super(AbstractTextMarkup, cls).relaxng(includechildren, extraattribs, extraelements) class TextMarkupString(AbstractTextMarkup): """Markup element to mark arbitrary substrings in text content (:class:`TextContent`)""" class TextMarkupGap(AbstractTextMarkup): """Markup element to mark gaps in text content (:class:`TextContent`) Only consider this element for gaps in spans of untokenised text. The use of structural element :class:`Gap` is preferred. """ class TextMarkupCorrection(AbstractTextMarkup): """Markup element to mark corrections in text content (:class:`TextContent`). Only consider this element for corrections on untokenised text. The use of :class:`Correction` is preferred. 
""" def __init__(self, doc, *args, **kwargs): if 'original' in kwargs: self.original = kwargs['original'] del kwargs['original'] else: self.original = None super(TextMarkupCorrection,self).__init__(doc, *args, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): if not attribs: attribs = {} if self.original: attribs['original'] = self.original return super(TextMarkupCorrection,self).xml(attribs,elements, skipchildren) def json(self,attribs =None, recurse=True, ignorelist=False): if not attribs: attribs = {} if self.original: attribs['original'] = self.original return super(TextMarkupCorrection,self).json(attribs,recurse,ignorelist) @classmethod def parsexml(Class, node, doc, **kwargs): if not kwargs: kwargs = {} if 'original' in node.attrib: kwargs['original'] = node.attrib['original'] del node.attrib['original'] return super(TextMarkupCorrection,Class).parsexml(node, doc, **kwargs) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if not extraattribs: extraattribs = [] extraattribs.append( E.optional(E.attribute(name='original' ))) return super(TextMarkupCorrection, cls).relaxng(includechildren, extraattribs, extraelements) class TextMarkupError(AbstractTextMarkup): """Markup element to mark gaps in text content (:class:`TextContent`) Only consider this element for gaps in spans of untokenised text. The use of structural element :class:`ErrorDetection` is preferred. """ class TextMarkupStyle(AbstractTextMarkup): """Markup element to style text content (:class:`TextContent`), e.g. make text bold, italics, underlined, coloured, etc..""" class TextContent(AbstractElement): """Text content element (``t``), holds text to be associated with whatever element the text content element is a child of. Text content elements on structure elements like :class:`Paragraph` and :class:`Sentence` are by definition untokenised. Only on :class:`Word`` level and deeper they are by definition tokenised. Text content elements can specify offset that refer to text at a higher parent level. Use the following keyword arguments: * ``ref=``: The instance to point to, this points to the element holding the text content element, not the text content element itself. * ``offset=``: The offset where this text is found, offsets start at 0 """ def __init__(self, doc, *args, **kwargs): """ Example:: text = folia.TextContent(doc, 'test') text = folia.TextContent(doc, 'test',cls='original') """ #for backward compatibility: if 'value' in kwargs: kwargs['text'] = kwargs['value'] del kwargs['value'] if 'offset' in kwargs: #offset self.offset = int(kwargs['offset']) del kwargs['offset'] else: self.offset = None if 'ref' in kwargs: #reference to offset if isinstance(kwargs['ref'], AbstractElement): self.ref = kwargs['ref'] else: try: self.ref = doc.index[kwargs['ref']] except: raise UnresolvableTextContent("Unable to resolve textcontent reference: " + kwargs['ref'] + " (class=" + self.cls+")") del kwargs['ref'] else: self.ref = None #will be set upon parent.append() #If no class is specified, it defaults to 'current'. 
(FoLiA uncharacteristically predefines two classes for t: current and original) if 'cls' not in kwargs and 'class' not in kwargs: kwargs['cls'] = 'current' super(TextContent,self).__init__(doc, *args, **kwargs) doc.textclasses.add(self.cls) if not self.data: raise ValueError("Empty text content elements are not allowed") #if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)): # raise ValueError("There are illegal unicode control characters present in TextContent: " + repr(self.data[0])) def text(self): """Obtain the text (unicode instance)""" return super(TextContent,self).text() #AbstractElement will handle it now, merely overridden to get rid of parameters that don't make sense in this context def settext(self, text): self.data = [text] if not self.data: raise ValueError("Empty text content elements are not allowed") #if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)): # raise ValueError("There are illegal unicode control characters present in TextContent: " + repr(self.data[0])) def validateref(self): """Validates the Text Content's references. Raises UnresolvableTextContent when invalid""" if self.offset is None: return True #nothing to test if self.ref: ref = self.ref else: ref = self.finddefaultreference() if not ref: raise UnresolvableTextContent("Default reference for textcontent not found!") elif not ref.hastext(self.cls): raise UnresolvableTextContent("Reference has no such text (class=" + self.cls+")") elif self.text() != ref.textcontent(self.cls).text()[self.offset:self.offset+len(self.data[0])]: raise UnresolvableTextContent("Referenced text found but does not match!") else: #finally, we made it! return True def deepvalidation(self): return True def __unicode__(self): return self.text() def __str__(self): return self.text() def __eq__(self, other): if isinstance(other, TextContent): return self.text() == other.text() elif isstring(other): return self.text() == u(other) else: return False def finddefaultreference(self): """Find the default reference for text offsets: The parent of the current textcontent's parent (counting only Structure Elements and Subtoken Annotation Elements) Note: This does not return a TextContent element, but its parent. Whether the textcontent actually exists is checked later/elsewhere """ depth = 0 e = self while True: if e.parent: e = e.parent #pylint: disable=redefined-variable-type else: #no parent, breaking return False if isinstance(e,AbstractStructureElement) or isinstance(e,AbstractSubtokenAnnotation): depth += 1 if depth == 2: return e return False #Change in behaviour (FoLiA 0.10), iter() no longer iterates over the text itself!! #Change in behaviour (FoLiA 0.10), len() no longer returns the length of the text!!
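# (Editorial sketch, not part of the original source) Minimal illustration of the offset/ref
# mechanism handled by __init__ and validateref() above, using hypothetical identifiers;
# it assumes `doc` is a folia.Document and "example.s.1" is a sentence whose current text
# starts with "Hello":
#
#   t = folia.TextContent(doc, "Hello", offset=0, ref="example.s.1")
#   t.validateref()   # compares "Hello" against the referenced text starting at offset 0
#
# When no cls= is given, __init__ above falls back to the predefined class 'current'.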
@classmethod def findreplaceables(Class, parent, set, **kwargs): """(Method for internal usage, see AbstractElement)""" #some extra behaviour for text content elements, replace also based on the 'corrected' attribute: if 'cls' not in kwargs: kwargs['cls'] = 'current' replace = super(TextContent, Class).findreplaceables(parent, set, **kwargs) replace = [ x for x in replace if x.cls == kwargs['cls']] del kwargs['cls'] #always delete what we processed return replace @classmethod def parsexml(Class, node, doc, **kwargs): """(Method for internal usage, see AbstractElement)""" if not kwargs: kwargs = {} if 'offset' in node.attrib: kwargs['offset'] = int(node.attrib['offset']) if 'ref' in node.attrib: kwargs['ref'] = node.attrib['ref'] return super(TextContent,Class).parsexml(node,doc, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): """See :meth:`AbstractElement.xml`""" attribs = {} if not self.offset is None: attribs['{' + NSFOLIA + '}offset'] = str(self.offset) if self.parent and self.ref: attribs['{' + NSFOLIA + '}ref'] = self.ref.id #if self.cls != 'current' and not (self.cls == 'original' and any( isinstance(x, Original) for x in self.ancestors() ) ): # attribs['{' + NSFOLIA + '}class'] = self.cls #else: # if '{' + NSFOLIA + '}class' in attribs: # del attribs['{' + NSFOLIA + '}class'] #return E.t(self.value, **attribs) e = super(TextContent,self).xml(attribs,elements,skipchildren) if '{' + NSFOLIA + '}class' in e.attrib and e.attrib['{' + NSFOLIA + '}class'] == "current": #delete 'class=current' del e.attrib['{' + NSFOLIA + '}class'] return e def json(self, attribs =None, recurse =True,ignorelist=False): """See :meth:`AbstractElement.json`""" attribs = {} if not self.offset is None: attribs['offset'] = self.offset if self.parent and self.ref: attribs['ref'] = self.ref.id return super(TextContent,self).json(attribs, recurse,ignorelist) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if not extraattribs: extraattribs = [] extraattribs.append( E.optional(E.attribute(name='offset' ))) extraattribs.append( E.optional(E.attribute(name='ref' ))) return super(TextContent, cls).relaxng(includechildren, extraattribs, extraelements) def postappend(self): super(TextContent,self).postappend() found = set() for c in self.parent: if isinstance(c,TextContent): if c.cls in found: raise DuplicateAnnotationError("Can not add multiple text content elements with the same class (" + c.cls + ") to the same structural element!") else: found.add(c.cls) class PhonContent(AbstractElement): """Phonetic content element (``ph``), holds a phonetic representation to be associated with whatever element the phonetic content element is a child of. Phonetic content elements behave much like text content elements. Phonetic content elements can specify offset that refer to phonetic content at a higher parent level. Use the following keyword arguments: * ``ref=``: The instance to point to, this points to the element holding the text content element, not the text content element itself. 
* ``offset=``: The offset where this text is found, offsets start at 0 """ def __init__(self, doc, *args, **kwargs): """ Example:: phon = folia.PhonContent(doc, 'hɛˈləʊ̯') phon = folia.PhonContent(doc, 'hɛˈləʊ̯', cls="original") """ if 'offset' in kwargs: #offset self.offset = int(kwargs['offset']) del kwargs['offset'] else: self.offset = None if 'ref' in kwargs: #reference to offset if isinstance(kwargs['ref'], AbstractElement): self.ref = kwargs['ref'] else: try: self.ref = doc.index[kwargs['ref']] except: raise UnresolvableTextContent("Unable to resolve phonetic content reference: " + kwargs['ref'] + " (class=" + self.cls+")") del kwargs['ref'] else: self.ref = None #will be set upon parent.append() #If no class is specified, it defaults to 'current'. (FoLiA uncharacteristically predefines two classes for t: current and original) if 'cls' not in kwargs and 'class' not in kwargs: kwargs['cls'] = 'current' super(PhonContent,self).__init__(doc, *args, **kwargs) if not self.data: raise ValueError("Empty phonetic content elements are not allowed") #if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)): # raise ValueError("There are illegal unicode control characters present in TextContent: " + repr(self.data[0])) def phon(self): """Obtain the actual phonetic representation (unicode/str instance)""" return super(PhonContent,self).phon() #AbstractElement will handle it now, merely overridden to get rid of parameters that don't make sense in this context def setphon(self, phon): """Set the representation for the phonetic content (unicode instance), called whenever phon= is passed as a keyword argument to an element constructor """ self.data = [phon] if not self.data: raise ValueError("Empty phonetic content elements are not allowed") #if isstring(self.data[0]) and (self.data[0] != self.data[0].translate(ILLEGAL_UNICODE_CONTROL_CHARACTERS)): # raise ValueError("There are illegal unicode control characters present in TextContent: " + repr(self.data[0])) def validateref(self): """Validates the Phonetic Content's references. Raises UnresolvableTextContent when invalid""" if self.offset is None: return True #nothing to test if self.ref: ref = self.ref else: ref = self.finddefaultreference() if not ref: raise UnresolvableTextContent("Default reference for phonetic content not found!") elif not ref.hasphon(self.cls): raise UnresolvableTextContent("Reference has no such phonetic content (class=" + self.cls+")") elif self.phon() != ref.textcontent(self.cls).phon()[self.offset:self.offset+len(self.data[0])]: raise UnresolvableTextContent("Referenced phonetic content found but does not match!") else: #finally, we made it! return True def deepvalidation(self): return True def __unicode__(self): return self.phon() def __str__(self): return self.phon() def __eq__(self, other): if isinstance(other, PhonContent): return self.phon() == other.phon() elif isstring(other): return self.phon() == u(other) else: return False #append is implemented, the default suffices def postappend(self): """(Method for internal usage, see ``AbstractElement.postappend()``)""" if isinstance(self.parent, Original): if self.cls == 'current': self.cls = 'original' #pylint: disable=attribute-defined-outside-init super(PhonContent, self).postappend() def finddefaultreference(self): """Find the default reference for text offsets: The parent of the current textcontent's parent (counting only Structure Elements and Subtoken Annotation Elements) Note: This does not return a TextContent element, but its parent.
Whether the textcontent actually exists is checked later/elsewhere """ depth = 0 e = self while True: if e.parent: e = e.parent #pylint: disable=redefined-variable-type else: #no parent, breaking return False if isinstance(e,AbstractStructureElement) or isinstance(e,AbstractSubtokenAnnotation): depth += 1 if depth == 2: return e return False #Change in behaviour (FoLiA 0.10), iter() no longer iterates over the text itself!! #Change in behaviour (FoLiA 0.10), len() no longer return the length of the text!! @classmethod def findreplaceables(Class, parent, set, **kwargs):#pylint: disable=bad-classmethod-argument """(Method for internal usage, see AbstractElement)""" #some extra behaviour for text content elements, replace also based on the 'corrected' attribute: if 'cls' not in kwargs: kwargs['cls'] = 'current' replace = super(PhonContent, Class).findreplaceables(parent, set, **kwargs) replace = [ x for x in replace if x.cls == kwargs['cls']] del kwargs['cls'] #always delete what we processed return replace @classmethod def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument """(Method for internal usage, see AbstractElement)""" if not kwargs: kwargs = {} if 'offset' in node.attrib: kwargs['offset'] = int(node.attrib['offset']) if 'ref' in node.attrib: kwargs['ref'] = node.attrib['ref'] return super(PhonContent,Class).parsexml(node,doc, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): attribs = {} if not self.offset is None: attribs['{' + NSFOLIA + '}offset'] = str(self.offset) if self.parent and self.ref: attribs['{' + NSFOLIA + '}ref'] = self.ref.id #if self.cls != 'current' and not (self.cls == 'original' and any( isinstance(x, Original) for x in self.ancestors() ) ): # attribs['{' + NSFOLIA + '}class'] = self.cls #else: # if '{' + NSFOLIA + '}class' in attribs: # del attribs['{' + NSFOLIA + '}class'] #return E.t(self.value, **attribs) e = super(PhonContent,self).xml(attribs,elements,skipchildren) if '{' + NSFOLIA + '}class' in e.attrib and e.attrib['{' + NSFOLIA + '}class'] == "current": #delete 'class=current' del e.attrib['{' + NSFOLIA + '}class'] return e def json(self, attribs =None, recurse =True,ignorelist=False): attribs = {} if not self.offset is None: attribs['offset'] = self.offset if self.parent and self.ref: attribs['ref'] = self.ref.id return super(PhonContent,self).json(attribs, recurse, ignorelist) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if not extraattribs: extraattribs = [] extraattribs.append( E.optional(E.attribute(name='offset' ))) extraattribs.append( E.optional(E.attribute(name='ref' ))) return super(PhonContent, cls).relaxng(includechildren, extraattribs, extraelements) class Content(AbstractElement): #used for raw content, subelement for Gap """A container element that takes raw content, used by :class:`Gap`""" def __init__(self,doc, *args, **kwargs): if 'value' in kwargs: if isstring(kwargs['value']): self.value = u(kwargs['value']) elif kwargs['value'] is None: self.value = "" else: raise Exception("value= parameter must be unicode or str instance") del kwargs['value'] else: raise Exception("Content expects value= parameter") super(Content,self).__init__(doc, *args, **kwargs) def __nonzero__(self): return 
bool(self.value) def __bool__(self): return bool(self.value) def __unicode__(self): return self.value def __str__(self): return self.value def xml(self, attribs = None,elements = None, skipchildren = False): E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) if not attribs: attribs = {} return E.content(self.value, **attribs) def json(self,attribs =None, recurse=True, ignorelist=False): jsonnode = {'type': self.XMLTAG, 'value': self.value} if attribs: for attrib in attribs: jsonnode[attrib] = attrib return jsonnode @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) return E.define( E.element(E.text(), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA) @classmethod def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument if not kwargs: kwargs = {} kwargs['value'] = node.text return Content(doc, **kwargs) class Part(AbstractStructureElement): """Generic structure element used to mark a part inside another block. Do **not** use this for morphology, use :class:`Morpheme` instead. """ class Gap(AbstractElement): """Gap element, represents skipped portions of the text. Usually contains :class:`Content` and possibly also a :class:`Description` element""" def __init__(self, doc, *args, **kwargs): if 'content' in kwargs: self.value = kwargs['content'] del kwargs['content'] elif 'description' in kwargs: self.description = kwargs['description'] del kwargs['description'] super(Gap,self).__init__(doc, *args, **kwargs) def content(self): for e in self: if isinstance(e, Content): return e.value return "" class Linebreak(AbstractStructureElement, AbstractTextMarkup): #this element has a double role!! """Line break element, signals a line break. This element acts both as a structure element as well as a text markup element. 
""" def __init__(self, doc, *args, **kwargs): if 'linenr' in kwargs: self.linenr = kwargs['linenr'] del kwargs['linenr'] else: self.linenr = None if 'pagenr' in kwargs: self.pagenr = kwargs['pagenr'] del kwargs['pagenr'] else: self.pagenr = None if 'newpage' in kwargs and kwargs['newpage']: self.newpage = True del kwargs['newpage'] else: self.newpage = False super(Linebreak, self).__init__(doc, *args, **kwargs) def text(self, cls='current', retaintokenisation=False, previousdelimiter="", strict=False, correctionhandling=None): return previousdelimiter.strip(' ') + "\n" @classmethod def parsexml(Class, node, doc):#pylint: disable=bad-classmethod-argument kwargs = {} if 'linenr' in node.attrib: kwargs['linenr'] = node.attrib['linenr'] if 'pagenr' in node.attrib: kwargs['pagenr'] = node.attrib['pagenr'] if 'newpage' in node.attrib and node.attrib['newpage'] == 'yes': kwargs['newpage'] = True return Linebreak(doc, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): if attribs is None: attribs = {} if self.linenr is not None: attribs['{' + NSFOLIA + '}linenr'] = str(self.linenr) if self.pagenr is not None: attribs['{' + NSFOLIA + '}pagenr'] = str(self.pagenr) if self.newpage: attribs['{' + NSFOLIA + '}newpage'] = "yes" return super(Linebreak, self).xml(attribs,elements,skipchildren) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) attribs = [] attribs.append(E.optional(E.attribute(name='pagenr'))) attribs.append(E.optional(E.attribute(name='linenr'))) attribs.append(E.optional(E.attribute(name='newpage'))) return super(Linebreak,cls).relaxng(includechildren,attribs,extraelements) class Whitespace(AbstractStructureElement): """Whitespace element, signals a vertical whitespace""" def text(self, cls='current', retaintokenisation=False, previousdelimiter="", strict=False,correctionhandling=None): return previousdelimiter.strip(' ') + "\n\n" class Word(AbstractStructureElement, AllowCorrections): """Word (aka token) element. Holds a word/token and all its related token annotations.""" #will actually be determined by gettextdelimiter() def __init__(self, doc, *args, **kwargs): """Constructor for words. See :class:`AbstractElement.__init__` for all inherited keyword arguments and parameters. 
Keyword arguments: * space (bool): Indicates whether this token is followed by a space (defaults to True) Example:: sentence.append( folia.Word, 'This') sentence.append( folia.Word, 'is') sentence.append( folia.Word, 'a') sentence.append( folia.Word, 'test', space=False) sentence.append( folia.Word, '.') See also: :class:`AbstractElement.__init__` """ self.space = True if 'space' in kwargs: self.space = kwargs['space'] del kwargs['space'] super(Word,self).__init__(doc, *args, **kwargs) def sentence(self): """Obtain the sentence this word is a part of, otherwise return None""" return self.ancestor(Sentence) def paragraph(self): """Obtain the paragraph this word is a part of, otherwise return None""" return self.ancestor(Paragraph) def division(self): """Obtain the deepest division this word is a part of, otherwise return None""" return self.ancestor(Division) def pos(self,set=None): """Shortcut: returns the FoLiA class of the PoS annotation (will return only one if there are multiple!)""" return self.annotation(PosAnnotation,set).cls def lemma(self, set=None): """Shortcut: returns the FoLiA class of the lemma annotation (will return only one if there are multiple!)""" return self.annotation(LemmaAnnotation,set).cls def sense(self,set=None): """Shortcut: returns the FoLiA class of the sense annotation (will return only one if there are multiple!)""" return self.annotation(SenseAnnotation,set).cls def domain(self,set=None): """Shortcut: returns the FoLiA class of the domain annotation (will return only one if there are multiple!)""" return self.annotation(DomainAnnotation,set).cls def morphemes(self,set=None): """Generator yielding all morphemes (in a particular set if specified). For retrieving one specific morpheme by index, use morpheme() instead""" for layer in self.select(MorphologyLayer): for m in layer.select(Morpheme, set): yield m def phonemes(self,set=None): """Generator yielding all phonemes (in a particular set if specified). 
For retrieving one specific phoneme by index, use phoneme() instead""" for layer in self.select(PhonologyLayer): for p in layer.select(Phoneme, set): yield p def morpheme(self,index, set=None): """Returns a specific morpheme, the n'th morpheme (given the particular set if specified).""" for layer in self.select(MorphologyLayer): for i, m in enumerate(layer.select(Morpheme, set)): if index == i: return m raise NoSuchAnnotation def phoneme(self,index, set=None): """Returns a specific phoneme, the n'th phoneme (given the particular set if specified).""" for layer in self.select(PhonologyLayer): for i, p in enumerate(layer.select(Phoneme, set)): if index == i: return p raise NoSuchAnnotation def gettextdelimiter(self, retaintokenisation=False): """Returns the text delimiter""" if self.space or retaintokenisation: return ' ' else: return '' def resolveword(self, id): if id == self.id: return self else: return None def getcorrection(self,set=None,cls=None): try: return self.getcorrections(set,cls)[0] except: raise NoSuchAnnotation def getcorrections(self, set=None,cls=None): try: l = [] for correction in self.annotations(Correction): if ((not set or correction.set == set) and (not cls or correction.cls == cls)): l.append(correction) return l except NoSuchAnnotation: raise @classmethod def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument assert Class is Word instance = super(Word,Class).parsexml(node, doc, **kwargs) #we do this the old way (no kwargs used), because we need to check whether instance evaluates to True if 'space' in node.attrib and instance: if node.attrib['space'] == 'no': instance.space = False return instance def xml(self, attribs = None,elements = None, skipchildren = False): if not attribs: attribs = {} if not self.space: attribs['space'] = 'no' return super(Word,self).xml(attribs,elements, False) def json(self,attribs =None, recurse=True, ignorelist=False): if not attribs: attribs = {} if not self.space: attribs['space'] = 'no' return super(Word,self).json(attribs, recurse,ignorelist) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) if not extraattribs: extraattribs = [ E.optional(E.attribute(name='space')) ] else: extraattribs.append( E.optional(E.attribute(name='space')) ) return AbstractStructureElement.relaxng(includechildren, extraattribs, extraelements, cls) def split(self, *newwords, **kwargs): self.sentence().splitword(self, *newwords, **kwargs) def findspans(self, type,set=None): """Yields span annotation elements of the specified type that include this word. Arguments: type: The annotation type, can be passed using any of the :class:`AnnotationType` members, or by passing the relevant :class:`AbstractSpanAnnotation` or :class:`AbstractAnnotationLayer` class.
set (str or None): Constrain by set Example:: for chunk in word.findspans(folia.Chunk): print(" Chunk class=", chunk.cls, " words=") for word2 in chunk.wrefs(): #print all words in the chunk (of which the word is a part) print(word2, end="") print() Yields: Matching span annotation instances (derived from :class:`AbstractSpanAnnotation`) """ if issubclass(type, AbstractAnnotationLayer): layerclass = type else: layerclass = ANNOTATIONTYPE2LAYERCLASS[type.ANNOTATIONTYPE] e = self while True: if not e.parent: break e = e.parent for layer in e.select(layerclass,set,False): if type is layerclass: for e2 in layer.select(AbstractSpanAnnotation,set,True, (True, Word, Morpheme)): if not isinstance(e2, AbstractSpanRole) and self in e2.wrefs(): yield e2 else: for e2 in layer.select(type,set,True, (True, Word, Morpheme)): if not isinstance(e2, AbstractSpanRole) and self in e2.wrefs(): yield e2 #for e2 in layer: # if (type is layerclass and isinstance(e2, AbstractSpanAnnotation)) or (type is not layerclass and isinstance(e2, type)): # if self in e2.wrefs(): # yield e2 class Feature(AbstractElement): """Feature elements can be used to associate subsets and subclasses with almost any annotation element""" def __init__(self,doc, *args, **kwargs): #pylint: disable=super-init-not-called """Constructor. Keyword Arguments: subset (str): the subset cls (str): the class """ self.id = None self.set = None self.data = [] self.annotator = None self.annotatortype = None self.confidence = None self.n = None self.datetime = None if not isinstance(doc, Document) and not (doc is None): raise Exception("First argument of Feature constructor must be a Document instance, not " + str(type(doc))) self.doc = doc self.auth = True if self.SUBSET: self.subset = self.SUBSET elif 'subset' in kwargs: self.subset = kwargs['subset'] else: raise Exception("No subset specified for " + self.__class__.__name__) if 'cls' in kwargs: self.cls = kwargs['cls'] elif 'class' in kwargs: self.cls = kwargs['class'] else: raise Exception("No class specified for " + self.__class__.__name__) if isinstance(self.cls, datetime): self.cls = self.cls.strftime("%Y-%m-%dT%H:%M:%S") def xml(self): E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) attribs = {} if self.subset != self.SUBSET: attribs['{' + NSFOLIA + '}subset'] = self.subset attribs['{' + NSFOLIA + '}class'] = self.cls return makeelement(E,'{' + NSFOLIA + '}' + self.XMLTAG, **attribs) def json(self,attribs=None, recurse=True, ignorelist=False): jsonnode= {'type': Feature.XMLTAG} jsonnode['subset'] = self.subset jsonnode['class'] = self.cls return jsonnode @classmethod def relaxng(cls, includechildren=True, extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) return E.define( E.element(E.attribute(name='subset'), E.attribute(name='class'),name=cls.XMLTAG), name=cls.XMLTAG,ns=NSFOLIA) def deepvalidation(self): """Perform deep validation of this element. 
Raises: :class:`DeepValidationError` """ if self.doc and self.doc.deepvalidation and self.parent.set and self.parent.set[0] != '_': try: self.doc.setdefinitions[self.parent.set].testsubclass(self.parent.cls, self.subset, self.cls) except KeyError as e: if self.parent.cls and not self.doc.allowadhocsets: raise DeepValidationError("Set definition " + self.parent.set + " for " + self.parent.XMLTAG + " not loaded (feature validation failed)!") except DeepValidationError as e: errormsg = str(e) + " (in set " + self.parent.set+" for " + self.parent.XMLTAG if self.parent.id: errormsg += " with ID " + self.parent.id errormsg += ")" raise DeepValidationError(errormsg) class ValueFeature(Feature): """Value feature, to be used within :class:`Metric`""" pass class Metric(AbstractElement): """Metric elements provide a key/value pair to allow the annotation of any kind of metric with any kind of annotation element. It is used for example for statistical measures to be added to elements as annotation.""" pass class AbstractSubtokenAnnotation(AbstractElement, AllowGenerateID): """Abstract element, all subtoken annotation elements are derived from this class""" pass class AbstractSpanAnnotation(AbstractElement, AllowGenerateID, AllowCorrections): """Abstract element, all span annotation elements are derived from this class""" def xml(self, attribs = None,elements = None, skipchildren = False): """See :meth:`AbstractElement.xml`""" if not attribs: attribs = {} E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={None: "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) e = super(AbstractSpanAnnotation,self).xml(attribs, elements, True) for child in self: if isinstance(child, Word) or isinstance(child, Morpheme) or isinstance(child, Phoneme): #Include REFERENCES to word items instead of word items themselves attribs['{' + NSFOLIA + '}id'] = child.id if child.PRINTABLE and child.hastext(): attribs['{' + NSFOLIA + '}t'] = child.text() e.append( E.wref(**attribs) ) elif not (isinstance(child, Feature) and child.SUBSET): #Don't add pre-defined features, they are already added as attributes e.append( child.xml() ) return e def append(self, child, *args, **kwargs): """See :meth:`AbstractElement.append`""" if (isinstance(child, Word) or isinstance(child, Morpheme) or isinstance(child, Phoneme)) and WordReference in self.ACCEPTED_DATA: #Accept Word instances instead of WordReference, references will be automagically used upon serialisation self.data.append(child) return child else: return super(AbstractSpanAnnotation,self).append(child, *args, **kwargs) def setspan(self, *args): """Sets the span of the span element anew, erases all data inside. Arguments: *args: Instances of :class:`Word`, :class:`Morpheme` or :class:`Phoneme` """ self.data = [] for child in args: self.append(child) def add(self, child, *args, **kwargs): #alias for append return self.append(child, *args, **kwargs) def hasannotation(self,Class,set=None): """Returns an integer indicating whether such as annotation exists, and if so, how many. See ``annotations()`` for a description of the parameters.""" return self.count(Class,set,True,default_ignore_annotations) def annotation(self, type, set=None): """Will return a **single** annotation (even if there are multiple). Raises a ``NoSuchAnnotation`` exception if none was found""" l = list(self.select(type,set,True,default_ignore_annotations)) if len(l) >= 1: return l[0] else: raise NoSuchAnnotation() def annotations(self,Class,set=None): """Obtain annotations. 
Very similar to ``select()`` but raises an error if the annotation was not found. Arguments: * ``Class`` - The Class you want to retrieve (e.g. PosAnnotation) * ``set`` - The set you want to retrieve (defaults to None, which selects irregardless of set) Yields: elements Raises: ``NoSuchAnnotation`` if the specified annotation does not exist. """ found = False for e in self.select(Class,set,True,default_ignore_annotations): found = True yield e if not found: raise NoSuchAnnotation() def _helper_wrefs(self, targets, recurse=True): """Internal helper function""" for c in self: if isinstance(c,Word) or isinstance(c,Morpheme) or isinstance(c, Phoneme): targets.append(c) elif isinstance(c,WordReference): try: targets.append(self.doc[c.id]) #try to resolve except KeyError: targets.append(c) #add unresolved elif isinstance(c, AbstractSpanAnnotation) and recurse: #recursion c._helper_wrefs(targets) #pylint: disable=protected-access elif isinstance(c, Correction) and c.auth: #recurse into corrections for e in c: if isinstance(e, AbstractCorrectionChild) and e.auth: for e2 in e: if isinstance(e2, AbstractSpanAnnotation): #recursion e2._helper_wrefs(targets) #pylint: disable=protected-access def wrefs(self, index = None, recurse=True): """Returns a list of word references, these can be Words but also Morphemes or Phonemes. Arguments: index (int or None): If set to an integer, will retrieve and return the n'th element (starting at 0) instead of returning the list of all """ targets =[] self._helper_wrefs(targets, recurse) if index is None: return targets else: return targets[index] def addtoindex(self,norecurse=None): """Makes sure this element (and all subelements), are properly added to the index""" if not norecurse: norecurse = (Word, Morpheme, Phoneme) if self.id: self.doc.index[self.id] = self for e in self.data: if all([not isinstance(e, C) for C in norecurse]): try: e.addtoindex(norecurse) except AttributeError: pass def copychildren(self, newdoc=None, idsuffix=""): """Generator creating a deep copy of the children of this element. If idsuffix is a string, if set to True, a random idsuffix will be generated including a random 32-bit hash""" if idsuffix is True: idsuffix = ".copy." 
+ "%08x" % random.getrandbits(32) #random 32-bit hash for each copy, same one will be reused for all children for c in self: if isinstance(c, Word): yield WordReference(newdoc, id=c.id) else: yield c.copy(newdoc,idsuffix) def postappend(self): super(AbstractSpanAnnotation,self).postappend() #If a span annotation element with wrefs x y z is added in the scope of parent span annotation element with wrefs u v w x y z, then x y z is removed from the parent span (no duplication, implicit through recursion) e = self.parent directwrefs = None #will be populated on first iteration while isinstance(e, AbstractSpanAnnotation): if directwrefs is None: directwrefs = self.wrefs(recurse=False) for wref in directwrefs: try: e.data.remove(wref) except ValueError: pass e = e.parent class AbstractAnnotationLayer(AbstractElement, AllowGenerateID, AllowCorrections): """Annotation layers for Span Annotation are derived from this abstract base class""" def __init__(self, doc, *args, **kwargs): if 'set' in kwargs: self.set = kwargs['set'] elif self.ANNOTATIONTYPE in doc.annotationdefaults and len(doc.annotationdefaults[self.ANNOTATIONTYPE]) == 1: self.set = list(doc.annotationdefaults[self.ANNOTATIONTYPE].keys())[0] else: self.set = False # ok, let's not raise an error yet, may may still be able to derive a set from elements that are appended super(AbstractAnnotationLayer,self).__init__(doc, *args, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): """See :meth:`AbstractElement.xml`""" if self.set is False or self.set is None: if len(self.data) == 0: #just skip if there are no children return None else: raise ValueError("No set specified or derivable for annotation layer " + self.__class__.__name__) return super(AbstractAnnotationLayer, self).xml(attribs, elements, skipchildren) def append(self, child, *args, **kwargs): """See :meth:`AbstractElement.append`""" #if no set is associated with the layer yet, we learn it from span annotation elements that are added if self.set is False or self.set is None: if inspect.isclass(child): if issubclass(child,AbstractSpanAnnotation): if 'set' in kwargs: self.set = kwargs['set'] elif isinstance(child, AbstractSpanAnnotation): if child.set: self.set = child.set elif isinstance(child, Correction): #descend into corrections to find the proper set for this layer (derived from span annotation elements) for e in itertools.chain( child.new(), child.original(), child.suggestions() ): if isinstance(e, AbstractSpanAnnotation) and e.set: self.set = e.set break return super(AbstractAnnotationLayer, self).append(child, *args, **kwargs) def add(self, child, *args, **kwargs): #alias for append return self.append(child, *args, **kwargs) def annotations(self,Class,set=None): """Obtain annotations. Very similar to ``select()`` but raises an error if the annotation was not found. Arguments: * ``Class`` - The Class you want to retrieve (e.g. PosAnnotation) * ``set`` - The set you want to retrieve (defaults to None, which selects irregardless of set) Yields: elements Raises: ``NoSuchAnnotation`` if the specified annotation does not exist. """ found = False for e in self.select(Class,set,True,default_ignore_annotations): found = True yield e if not found: raise NoSuchAnnotation() def hasannotation(self,Class,set=None): """Returns an integer indicating whether such as annotation exists, and if so, how many. 
See ``annotations()`` for a description of the parameters.""" return self.count(Class,set,True,default_ignore_annotations) def annotation(self, type, set=None): """Will return a **single** annotation (even if there are multiple). Raises a ``NoSuchAnnotation`` exception if none was found""" for e in self.select(type,set,True,default_ignore_annotations): return e raise NoSuchAnnotation() def alternatives(self, Class=None, set=None): """Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set. Arguments: * ``Class`` - The Class you want to retrieve (e.g. PosAnnotation). Or set to None to select all alternatives regardless of what type they are. * ``set`` - The set you want to retrieve (defaults to None, which selects irregardless of set) Returns: Generator over Alternative elements """ for e in self.select(AlternativeLayers,None, True, ['Original','Suggestion']): #pylint: disable=too-many-nested-blocks if Class is None: yield e elif len(e) >= 1: #child elements? for e2 in e: try: if isinstance(e2, Class): try: if set is None or e2.set == set: yield e #not e2 break #yield an alternative only once (in case there are multiple matches) except AttributeError: continue except AttributeError: continue def findspan(self, *words): """Returns the span element which spans over the specified words or morphemes. See also: :meth:`Word.findspans` """ for span in self.select(AbstractSpanAnnotation,None,True): if tuple(span.wrefs()) == words: return span raise NoSuchAnnotation @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None, origclass = None): """Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)""" E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if not extraattribs: extraattribs = [] extraattribs.append(E.optional(E.attribute(E.text(), name='set')) ) return AbstractElement.relaxng(includechildren, extraattribs, extraelements, cls) def deepvalidation(self): return True # class AbstractSubtokenAnnotationLayer(AbstractElement, AllowGenerateID): # """Annotation layers for Subtoken Annotation are derived from this abstract base class""" # OPTIONAL_ATTRIBS = () # PRINTABLE = False # def __init__(self, doc, *args, **kwargs): # if 'set' in kwargs: # self.set = kwargs['set'] # del kwargs['set'] # super(AbstractSubtokenAnnotationLayer,self).__init__(doc, *args, **kwargs) class String(AbstractElement, AllowTokenAnnotation): """String""" pass class AbstractCorrectionChild(AbstractElement): def generate_id(self, cls): #Delegate ID generation to parent return self.parent.generate_id(cls) def deepvalidation(self): return True class Reference(AbstractStructureElement): """A structural element that denotes a reference, internal or external. 
Examples are references to footnotes, bibliographies, hyperlinks.""" def __init__(self, doc, *args, **kwargs): if 'idref' in kwargs: self.idref = kwargs['idref'] del kwargs['idref'] else: self.idref = None if 'type' in kwargs: self.type = kwargs['type'] del kwargs['type'] else: self.type = None if 'format' in kwargs: self.format = kwargs['format'] del kwargs['format'] else: self.format = "text/folia+xml" super(Reference,self).__init__(doc, *args, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): if not attribs: attribs = {} if self.idref: attribs['id'] = self.idref if self.type: attribs['type'] = self.type if self.format and self.format != "text/folia+xml": attribs['format'] = self.format return super(Reference,self).xml(attribs,elements, skipchildren) def json(self, attribs=None, recurse=True, ignorelist=False): if attribs is None: attribs = {} if self.idref: attribs['idref'] = self.idref if self.type: attribs['type'] = self.type if self.format: attribs['format'] = self.format return super(Reference,self).json(attribs,recurse,ignorelist) def resolve(self): if self.idref: return self.doc[self.idref] else: return self @classmethod def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument if not kwargs: kwargs = {} if 'id' in node.attrib: kwargs['idref'] = node.attrib['id'] del node.attrib['id'] if 'type' in node.attrib: kwargs['type'] = node.attrib['type'] del node.attrib['type'] if 'format' in node.attrib: kwargs['format'] = node.attrib['format'] del node.attrib['format'] return super(Reference,Class).parsexml(node, doc, **kwargs) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if not extraattribs: extraattribs = [] extraattribs.append( E.attribute(name='id')) #id reference extraattribs.append( E.optional(E.attribute(name='type' ))) extraattribs.append( E.optional(E.attribute(name='format' ))) return super(Reference, cls).relaxng(includechildren, extraattribs, extraelements) class AlignReference(AbstractElement): """The AlignReference element is used to point to specific elements inside the aligned source. 
It is used with :class:`Alignment` which is responsible for pointing to the external resource.""" def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called #Special constructor, not calling super constructor if 'id' not in kwargs: raise Exception("ID required for AlignReference") if 'type' in kwargs: if isinstance(kwargs['type'], AbstractElement) or inspect.isclass(kwargs['type']): self.type = kwargs['type'].XMLTAG else: self.type = kwargs['type'] else: self.type = None if 't' in kwargs: self.t = kwargs['t'] else: self.t = None assert(isinstance(doc,Document)) self.doc = doc self.id = kwargs['id'] self.annotator = None self.annotatortype = None self.confidence = None self.n = None self.datetime = None self.auth = False self.set = None self.cls = None self.data = [] @classmethod def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument assert Class is AlignReference or issubclass(Class, AlignReference) #special handling for word references if not kwargs: kwargs = {} kwargs['id'] = node.attrib['id'] if not 'type' in node.attrib: raise ValueError("No type in alignment reference") if 't' in node.attrib: kwargs['t'] = node.attrib['t'] try: kwargs['type'] = node.attrib['type'] except KeyError: raise ValueError("No such type: " + node.attrib['type']) return AlignReference(doc,**kwargs) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) return E.define( E.element(E.attribute(E.text(), name='id'), E.optional(E.attribute(E.text(), name='t')), E.optional(E.attribute(E.text(), name='type')), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA) def resolve(self, alignmentcontext=None, documents={}): if not alignmentcontext or not hasattr(alignmentcontext, 'href') or not alignmentcontext.href: #no target document, same document return self.doc[self.id] else: #other document if alignmentcontext.href in documents: return documents[alignmentcontext.href][self.id] else: raise DocumentNotLoaded() def xml(self, attribs = None,elements = None, skipchildren = False): E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) if not attribs: attribs = {} attribs['id'] = self.id if self.type: attribs['type'] = self.type if self.t: attribs['t'] = self.t return E.aref( **attribs) def json(self, attribs=None, recurse=True, ignorelist=False): return {} #alignment not supported yet, TODO class Alignment(AbstractElement): """ The Alignment element is a form of higher-order annotation taht is used to point to an external resource. It concerns references as annotation rather than references which are explicitly part of the text, such as hyperlinks and :class:`Reference`. Inside the Alignment element, the :class:`AlignReference` element may be used to point to specific elements (multiple denotes a span). 
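A construction and resolution sketch (the identifier below is made up for illustration; for a target within the same document no ``href`` is needed)::

    alignment = word.append(folia.Alignment)
    alignment.append(folia.AlignReference, id='example.p.1.s.1.w.3', type='w')
    for target in alignment.resolve():
        print(target.text())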
""" def __init__(self, doc, *args, **kwargs): if 'format' in kwargs: self.format = kwargs['format'] del kwargs['format'] else: self.format = "text/folia+xml" super(Alignment,self).__init__(doc, *args, **kwargs) @classmethod def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument if 'format' in node.attrib: kwargs['format'] = node.attrib['format'] del node.attrib['format'] return super(Alignment,Class).parsexml(node, doc, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): if not attribs: attribs = {} if self.format and self.format != "text/folia+xml": attribs['format'] = self.format return super(Alignment,self).xml(attribs,elements, skipchildren) def json(self, attribs =None, recurse=True, ignorelist=False): return {} #alignment not supported yet, TODO def resolve(self, documents=None): if documents is None: documents = {} #documents is a dictionary of urls to document instances, to aid in resolving cross-document alignments for x in self.select(AlignReference,None,True,False): yield x.resolve(self, documents) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) if extraattribs is None: extraattribs = [] extraattribs.append(E.optional(E.attribute(name="format"))) return super(Alignment,cls).relaxng(includechildren, extraattribs, extraelements) class ErrorDetection(AbstractExtendedTokenAnnotation): """The ErrorDetection element is used to signal the presence of errors in a structural element.""" pass class Suggestion(AbstractCorrectionChild): """Suggestions are used in the context of :class:`Correction`, but rather than provide an authoritative correction, it instead offers a suggestion for correction.""" def __init__(self, doc, *args, **kwargs): if 'split' in kwargs: self.split = kwargs['split'] del kwargs['split'] else: self.split = None if 'merge' in kwargs: self.merge = kwargs['merge'] del kwargs['merge'] else: self.merge = None super(Suggestion,self).__init__(doc, *args, **kwargs) @classmethod def parsexml(Class, node, doc, **kwargs): #pylint: disable=bad-classmethod-argument if not kwargs: kwargs = {} if 'split' in node.attrib: kwargs['split'] = node.attrib['split'] if 'merge' in node.attrib: kwargs['merge'] = node.attrib['merge'] return super(Suggestion,Class).parsexml(node, doc, **kwargs) def xml(self, attribs = None,elements = None, skipchildren = False): if not attribs: attribs= {} if self.split: attribs['split'] = self.split if self.merge: attribs['merge'] = self.merge return super(Suggestion, self).xml(attribs, elements, skipchildren) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace",'a':"http://relaxng.org/ns/annotation/0.9" }) if not extraattribs: extraattribs = [] extraattribs.append( E.optional(E.attribute(name='split' ))) extraattribs.append( E.optional(E.attribute(name='merge' ))) return super(Suggestion, cls).relaxng(includechildren, extraattribs, extraelements) def json(self, attribs = None, recurse=True,ignorelist=False): if self.split: if not attribs: attribs = {} attribs['split'] = self.split if self.merge: if not attribs: attribs = {} 
attribs['merge'] = self.merge return super(Suggestion, self).json(attribs, recurse, ignorelist) class New(AbstractCorrectionChild): @classmethod def addable(Class, parent, set=None, raiseexceptions=True):#pylint: disable=bad-classmethod-argument if not super(New,Class).addable(parent,set,raiseexceptions): return False if any( ( isinstance(c, Current) for c in parent ) ): if raiseexceptions: raise ValueError("Can't add New element to Correction if there is a Current item") else: return False return True def correct(self, **kwargs): return self.parent.correct(**kwargs) class Original(AbstractCorrectionChild): """Used in the context of :class:`Correction` to encapsulate the original annotations *prior* to correction.""" @classmethod def addable(Class, parent, set=None, raiseexceptions=True):#pylint: disable=bad-classmethod-argument if not super(Original,Class).addable(parent,set,raiseexceptions): return False if any( ( isinstance(c, Current) for c in parent ) ): if raiseexceptions: raise Exception("Can't add Original item to Correction if there is a Current item") else: return False return True class Current(AbstractCorrectionChild): """Used in the context of :class:`Correction` to encapsulate the currently authoritative annotations. Needed only when suggestions for correction are proposed (:class:`Suggestion`) for structural elements. """ @classmethod def addable(Class, parent, set=None, raiseexceptions=True): if not super(Current,Class).addable(parent,set,raiseexceptions): return False if any( ( isinstance(c, New) or isinstance(c, Original) for c in parent ) ): if raiseexceptions: raise Exception("Can't add Current element to Correction if there is a New or Original element") else: return False return True def correct(self, **kwargs): return self.parent.correct(**kwargs) class Correction(AbstractElement, AllowGenerateID): """ Corrections are one of the most complex annotation types in FoLiA. Corrections can be applied not just over text, but over any type of structure annotation, token annotation or span annotation. Corrections explicitly preserve the original, and recursively so if corrections are done over other corrections. Despite their complexity, the library treats correction transparently. Whenever you query for a particular element, and it is part of a correction, you get the corrected version rather than the original. The original is always *non-authoritative* and normal selection methods will ignore it. 
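For example, to inspect corrections explicitly (a sketch, assuming ``sentence`` is a :class:`Sentence` that contains textual corrections; for purely structural or token-annotation corrections ``text()`` may not be available)::

    for correction in sentence.select(folia.Correction):
        if correction.hasnew():
            print("new:     ", correction.new().text())
        if correction.hasoriginal():
            print("original:", correction.original().text())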
This class takes four classes as children, that in turn encapsulate the actual annotations: * :class:`New` - Encapsulates the newly corrected annotation(s) * :class:`Original` - Encapsulated the old original annotation(s) * :class:`Current` - Encapsulates the current authoritative annotation(s) * :class:`Suggestions` - Encapsulates the annotation(s) that are a non-authoritative suggestion for correction """ def append(self, child, *args, **kwargs): """See ``AbstractElement.append()``""" e = super(Correction,self).append(child, *args, **kwargs) self._setmaxid(e) return e def hasnew(self,allowempty=False): """Does the correction define new corrected annotations?""" for e in self.select(New,None,False, False): if not allowempty and len(e) == 0: continue return True return False def hasoriginal(self,allowempty=False): """Does the correction record the old annotations prior to correction?""" for e in self.select(Original,None,False, False): if not allowempty and len(e) == 0: continue return True return False def hascurrent(self, allowempty=False): """Does the correction record the current authoritative annotation (needed only in a structural context when suggestions are proposed)""" for e in self.select(Current,None,False, False): if not allowempty and len(e) == 0: continue return True return False def hassuggestions(self,allowempty=False): """Does the correction propose suggestions for correction?""" for e in self.select(Suggestion,None,False, False): if not allowempty and len(e) == 0: continue return True return False def textcontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT): """See :meth:`AbstractElement.textcontent`""" if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER): for e in self: if isinstance(e, New) or isinstance(e, Current): return e.textcontent(cls,correctionhandling) if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER): for e in self: if isinstance(e, Original): return e.textcontent(cls,correctionhandling) raise NoSuchText def phoncontent(self, cls='current', correctionhandling=CorrectionHandling.CURRENT): """See :meth:`AbstractElement.phoncontent`""" if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER): for e in self: if isinstance(e, New) or isinstance(e, Current): return e.phoncontent(cls, correctionhandling) if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER): for e in self: if isinstance(e, Original): return e.phoncontent(cls, correctionhandling) raise NoSuchPhon def hastext(self, cls='current',strict=True, correctionhandling=CorrectionHandling.CURRENT): """See :meth:`AbstractElement.hastext`""" if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER): for e in self: if isinstance(e, New) or isinstance(e, Current): return e.hastext(cls,strict, correctionhandling) if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER): for e in self: if isinstance(e, Original): return e.hastext(cls,strict, correctionhandling) return False def text(self, cls = 'current', retaintokenisation=False, previousdelimiter="",strict=False, correctionhandling=CorrectionHandling.CURRENT): """See :meth:`AbstractElement.text`""" if cls 
== 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER): for e in self: if isinstance(e, New) or isinstance(e, Current): return previousdelimiter + e.text(cls, retaintokenisation,"", strict, correctionhandling) if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER): for e in self: if isinstance(e, Original): return previousdelimiter + e.text(cls, retaintokenisation,"", strict, correctionhandling) raise NoSuchText def hasphon(self, cls='current',strict=True, correctionhandling=CorrectionHandling.CURRENT): """See :meth:`AbstractElement.hasphon`""" if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER): for e in self: if isinstance(e, New) or isinstance(e, Current): return e.hasphon(cls,strict, correctionhandling) if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER): for e in self: if isinstance(e, Original): return e.hasphon(cls,strict, correctionhandling) return False def phon(self, cls = 'current', previousdelimiter="",strict=False, correctionhandling=CorrectionHandling.CURRENT): """See :meth:`AbstractElement.phon`""" if cls == 'original': correctionhandling = CorrectionHandling.ORIGINAL #backward compatibility if correctionhandling in (CorrectionHandling.CURRENT, CorrectionHandling.EITHER): for e in self: if isinstance(e, New) or isinstance(e, Current): return previousdelimiter + e.phon(cls, "", strict, correctionhandling) if correctionhandling in (CorrectionHandling.ORIGINAL, CorrectionHandling.EITHER): for e in self: if isinstance(e, Original): return previousdelimiter + e.phon(cls, "", correctionhandling) raise NoSuchPhon def gettextdelimiter(self, retaintokenisation=False): """See :meth:`AbstractElement.gettextdelimiter`""" for e in self: if isinstance(e, New) or isinstance(e, Current): return e.gettextdelimiter(retaintokenisation) return "" def new(self,index = None): """Get the new corrected annotation. This returns only one annotation if multiple exist, use `index` to select another in the sequence. Returns: an annotation element (:class:`AbstractElement`) Raises: :class:`NoSuchAnnotation` """ if index is None: try: return next(self.select(New,None,False)) except StopIteration: raise NoSuchAnnotation else: for e in self.select(New,None,False): return e[index] raise NoSuchAnnotation def original(self,index=None): """Get the old annotation prior to correction. This returns only one annotation if multiple exist, use `index` to select another in the sequence. Returns: an annotation element (:class:`AbstractElement`) Raises: :class:`NoSuchAnnotation` """ if index is None: try: return next(self.select(Original,None,False, False)) except StopIteration: raise NoSuchAnnotation else: for e in self.select(Original,None,False, False): return e[index] raise NoSuchAnnotation def current(self,index=None): """Get the current authoritative annotation (used with suggestions in a structural context) This returns only one annotation if multiple exist, use `index` to select another in the sequence. 
Returns: an annotation element (:class:`AbstractElement`) Raises: :class:`NoSuchAnnotation` """ if index is None: try: return next(self.select(Current,None,False)) except StopIteration: raise NoSuchAnnotation else: for e in self.select(Current,None,False): return e[index] raise NoSuchAnnotation def suggestions(self,index=None): """Get suggestions for correction. Yields: :class:`Suggestion` elements that encapsulate the suggested annotations (if index is ``None``, default) Returns: a :class:`Suggestion` element that encapsulates the suggested annotations (if index is set) Raises: :class:`IndexError` """ if index is None: return self.select(Suggestion,None,False, False) else: for i, e in enumerate(self.select(Suggestion,None,False, False)): if index == i: return e raise IndexError def __unicode__(self): return str(self) def __str__(self): return self.text('current', False, "", False, CorrectionHandling.EITHER) def correct(self, **kwargs): if 'new' in kwargs: if 'nooriginal' not in kwargs: #if not an insertion kwargs['original'] = self elif 'current' in kwargs: kwargs['current'] = self if 'insertindex' in kwargs: #recompute insertindex index = self.parent.getindex(self) if index != -1: kwargs['insertindex'] = index if 'insertindex_offset' in kwargs: kwargs['insertindex'] += kwargs['insertindex_offset'] del kwargs['insertindex_offset'] else: raise Exception("Can't find insertion point for higher-order correction") return self.parent.correct(**kwargs) #obsolete #def select(self, cls, set=None, recursive=True, ignorelist=[], node=None): # """Select on Correction only descends in either "NEW" or "CURRENT" branch""" # if ignorelist is False: # #to override and go into all branches, set ignorelist explicitly to False # return super(Correction,self).select(cls,set,recursive, ignorelist, node) # else: # if ignorelist is True: # ignorelist = copy(default_ignore) # else: # ignorelist = copy(ignorelist) #we don't want to alter a passed ignorelist (by ref) # ignorelist.append(Original) # ignorelist.append(Suggestion) # return super(Correction,self).select(cls,set,recursive, ignorelist, node) class Alternative(AbstractElement, AllowTokenAnnotation, AllowGenerateID): """Element grouping alternative token annotation(s). Multiple alternative elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent. A key feature of FoLiA is its ability to make alternative annotations explicit; for token annotations, this class is used to that end. Alternative annotations are embedded in this structure. This implies the annotation is *not authoritative*, but is merely an alternative to the actual annotation (if any). Alternatives may typically occur in larger numbers, representing a distribution, each with a confidence value (not mandatory). Each alternative is wrapped in an instance of this class, as multiple elements inside a single alternative are considered dependent and part of the same alternative. Combining multiple annotations in one alternative makes sense for mixed annotation types, where for instance a pos tag alternative is tied to a particular lemma. """ def deepvalidation(self): return True class AlternativeLayers(AbstractElement): """Element grouping alternative subtoken annotation(s). Multiple altlayers elements may occur, each denoting a different alternative.
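For token annotations, alternatives are typically inspected through the parent element; a sketch, assuming ``word`` carries alternative part-of-speech tags and supports the ``alternatives()`` method of token-annotation holders::

    for alternative in word.alternatives(folia.PosAnnotation):
        for pos in alternative.select(folia.PosAnnotation):
            print(pos.cls, pos.confidence)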
Elements grouped inside an alternative block are considered dependent.""" def deepvalidation(self): return True class External(AbstractElement): def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called #Special constructor, not calling super constructor if 'source' not in kwargs: raise Exception("Source required for External") assert(isinstance(doc,Document)) self.doc = doc self.id = None self.source = kwargs['source'] if 'include' in kwargs and kwargs['include'] != 'no': self.include = bool(kwargs['include']) else: self.include = False self.annotator = None self.annotatortype = None self.confidence = None self.n = None self.datetime = None self.auth = False self.data = [] self.subdoc = None if self.include: if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Loading subdocument for inclusion: " + self.source,file=stderr) #load subdocument #check if it is already loaded, if multiple references are made to the same doc we reuse the instance if self.source in self.doc.subdocs: self.subdoc = self.doc.subdocs[self.source] elif self.source[:7] == 'http://' or self.source[:8] == 'https://': #document is remote, download (in memory) try: f = urlopen(self.source) except: raise DeepValidationError("Unable to download subdocument for inclusion: " + self.source) try: content = u(f.read()) except IOError: raise DeepValidationError("Unable to download subdocument for inclusion: " + self.source) f.close() self.subdoc = Document(string=content, parentdoc = self.doc, setdefinitions=self.doc.setdefinitions) elif os.path.exists(self.source): #document is on disk: self.subdoc = Document(file=self.source, parentdoc = self.doc, setdefinitions=self.doc.setdefinitions) else: #document not found raise DeepValidationError("Unable to find subdocument for inclusion: " + self.source) self.subdoc.parentdoc = self.doc self.doc.subdocs[self.source] = self.subdoc #TODO: verify there are no clashes in declarations between parent and child #TODO: check validity of elements under subdoc/text with respect to self.parent @classmethod def parsexml(Class, node, doc, **kwargs): assert Class is External or issubclass(Class, External) if not kwargs: kwargs = {} #special handling for external source = node.attrib['src'] if 'include' in node.attrib: kwargs['include'] = node.attrib['include'] else: kwargs['include'] = False if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found external",file=stderr) return External(doc, source=source, include=kwargs['include']) def xml(self, attribs = None,elements = None, skipchildren = False): if not attribs: attribs= {} attribs['src'] = self.source if self.include: attribs['include'] = 'yes' else: attribs['include'] = 'no' return super(External, self).xml(attribs, elements, skipchildren) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) return E.define( E.element(E.attribute(E.text(), name='src'), E.optional(E.attribute(E.text(), name='include')), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA) def select(self, Class, set=None, recursive=True, ignore=True, node=None): """See :meth:`AbstractElement.select`""" if self.include: return self.subdoc.data[0].select(Class,set,recursive, ignore, node) #pass it on to the text node of the subdoc else: return iter([]) class WordReference(AbstractElement): """Word reference.
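When building span annotations programmatically you normally pass :class:`Word` (or :class:`Morpheme`/:class:`Phoneme`) instances directly, and they are serialised as word references. A sketch (the set URL and class are illustrative)::

    layer = sentence.append(folia.EntitiesLayer)
    entity = layer.append(folia.Entity, word1, word2,
                          cls='person',
                          set='https://example.org/sets/entities.foliaset.xml')
    print(entity.wrefs())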
Used to refer to words or morphemes from span annotation elements. The Python class will only be used when word reference can not be resolved, if they can, Word or Morpheme objects will be used""" def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called #Special constructor, not calling super constructor if 'idref' not in kwargs and 'id' not in kwargs: raise Exception("ID required for WordReference") assert isinstance(doc,Document) self.doc = doc if 'idref' in kwargs: self.id = kwargs['idref'] else: self.id = kwargs['id'] self.annotator = None self.annotatortype = None self.confidence = None self.n = None self.datetime = None self.data = [] self.set = None self.cls = None self.auth = True @classmethod def parsexml(Class, node, doc, **kwargs):#pylint: disable=bad-classmethod-argument assert Class is WordReference or issubclass(Class, WordReference) #special handling for word references id = node.attrib['id'] if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found word reference",file=stderr) try: return doc[id] except KeyError: if doc.debug >= 1: print("[PyNLPl FoLiA DEBUG] ...Unresolvable!",file=stderr) return WordReference(doc, id=id) @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) return E.define( E.element(E.attribute(E.text(), name='id'), E.optional(E.attribute(E.text(), name='t')), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA) def xml(self, attribs = None,elements = None, skipchildren = False): """Serialises the FoLiA element to XML, by returning an XML Element (in lxml.etree) for this element and all its children. For string output, consider the xmlstring() method instead.""" E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) if not attribs: attribs = {} if not elements: elements = [] if self.id: attribs['id'] = self.id try: w = self.doc[self.id] attribs['t'] = w.text() except KeyError: pass e = makeelement(E, '{' + NSFOLIA + '}' + self.XMLTAG, **attribs) return e class SyntacticUnit(AbstractSpanAnnotation): """Syntactic Unit, span annotation element to be used in :class:`SyntaxLayer`""" pass class Chunk(AbstractSpanAnnotation): """Chunk element, span annotation element to be used in :class:`ChunkingLayer`""" pass class Entity(AbstractSpanAnnotation): """Entity element, for entities such as named entities, multi-word expressions, temporal entities. This is a span annotation element to be used in :class:`EntitiesLayer`""" pass class AbstractSpanRole(AbstractSpanAnnotation): #TODO: span roles don't take classes, derived off spanannotation allows too much pass class Headspan(AbstractSpanRole): #generic head element """The headspan role is used to mark the head of a span annotation. It can be used in various contexts, for instance to mark the head of a :class:`Dependency`. It is allowed by most span annotations. """ DependencyHead = Headspan #alias, backwards compatibility with FoLiA 0.8 class DependencyDependent(AbstractSpanRole): """Span role element that marks the dependent in a dependency relation. Used in :class:`Dependency`. 
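A retrieval sketch, assuming ``sentence`` carries a dependency layer::

    for dependency in sentence.select(folia.Dependency):
        print(dependency.cls,
              dependency.head().wrefs(0).text(),
              dependency.dependent().wrefs(0).text())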
:class:`Headspan` in turn is used to mark the head of a dependency relation.""" pass class Source(AbstractSpanRole): """The source span role is used to mark the source in a :class:`Sentiment` or :class:`Statement` """ class Target(AbstractSpanRole): """The target span role is used to mark the target in a :class:`Sentiment` """ class Relation(AbstractSpanRole): """The relation span role is used to mark the relation between the content of a statement and its source in a :class:`Statement`""" class Dependency(AbstractSpanAnnotation): """Span annotation element to encode dependency relations""" def head(self): """Returns the head of the dependency relation. Instance of :class:`Headspan`""" return next(self.select(Headspan)) def dependent(self): """Returns the dependent of the dependency relation. Instance of :class:`DependencyDependent`""" return next(self.select(DependencyDependent)) class ModalityFeature(Feature): """Modality feature, to be used with coreferences""" class TimeFeature(Feature): """Time feature, to be used with coreferences""" class LevelFeature(Feature): """Level feature, to be used with coreferences""" class CoreferenceLink(AbstractSpanRole): """Coreference link. Used in :class:`CoreferenceChain`""" class CoreferenceChain(AbstractSpanAnnotation): """Coreference chain. Holds :class:`CoreferenceLink` instances.""" class SemanticRole(AbstractSpanAnnotation): """Semantic Role""" class Predicate(AbstractSpanAnnotation): """Predicate, used within :class:`SemanticRolesLayer`, takes :class:`SemanticRole` annotations as children, but has its own annotation type and separate declaration""" class Sentiment(AbstractSpanAnnotation): """Sentiment. Takes span roles :class:`Headspan`, :class:`Source` and :class:`Target` as children""" class Statement(AbstractSpanAnnotation): """Statement. Takes span roles :class:`Headspan`, :class:`Source` and :class:`Relation` as children""" class Observation(AbstractSpanAnnotation): """Observation.""" class ComplexAlignment(AbstractElement): """Complex Alignment""" #same as for AbstractSpanAnnotation, which this technically is not (hence copy) def hasannotation(self,Class,set=None): """Returns an integer indicating whether such as annotation exists, and if so, how many. See ``annotations()`` for a description of the parameters.""" return self.count(Class,set,True,default_ignore_annotations) #same as for AbstractSpanAnnotation, which this technically is not (hence copy) def annotation(self, type, set=None): """Will return a **single** annotation (even if there are multiple). 
Raises a ``NoSuchAnnotation`` exception if none was found""" l = list(self.select(type,set,True,default_ignore_annotations)) if len(l) >= 1: return l[0] else: raise NoSuchAnnotation() class FunctionFeature(Feature): """Function feature, to be used with :class:`Morpheme`""" class Morpheme(AbstractStructureElement): """Morpheme element, represents one morpheme in morphological analysis, subtoken annotation element to be used in :class:`MorphologyLayer`""" def findspans(self, type,set=None): """Find span annotations of the specified type that include this morpheme""" if issubclass(type, AbstractAnnotationLayer): layerclass = type else: layerclass = ANNOTATIONTYPE2LAYERCLASS[type.ANNOTATIONTYPE] e = self while True: if not e.parent: break e = e.parent for layer in e.select(layerclass,set,False): for e2 in layer: if isinstance(e2, AbstractSpanAnnotation): if self in e2.wrefs(): yield e2 class Phoneme(AbstractStructureElement): """Phoneme element, represents one phone in phonetic analysis, subtoken annotation element to be used in :class:`PhonologyLayer`""" def findspans(self, type,set=None): #TODO: this is a copy of the method in Morpheme and Word, abstract into a separate class and inherit """Find span annotations of the specified type that include this phoneme. See :meth:`Word.findspans` for usage. """ if issubclass(type, AbstractAnnotationLayer): layerclass = type else: layerclass = ANNOTATIONTYPE2LAYERCLASS[type.ANNOTATIONTYPE] e = self while True: if not e.parent: break e = e.parent for layer in e.select(layerclass,set,False): for e2 in layer: if isinstance(e2, AbstractSpanAnnotation): if self in e2.wrefs(): yield e2 #class Subentity(AbstractSubtokenAnnotation): # """Subentity element, for named entities within a single token, subtoken annotation element to be used in SubentitiesLayer""" # ACCEPTED_DATA = (Feature,TextContent, Metric) # ANNOTATIONTYPE = AnnotationType.SUBENTITY # XMLTAG = 'subentity' class SyntaxLayer(AbstractAnnotationLayer): """Syntax Layer: Annotation layer for :class:`SyntacticUnit` span annotation elements""" class ChunkingLayer(AbstractAnnotationLayer): """Chunking Layer: Annotation layer for :class:`Chunk` span annotation elements""" class EntitiesLayer(AbstractAnnotationLayer): """Entities Layer: Annotation layer for :class:`Entity` span annotation elements. For named entities.""" class DependenciesLayer(AbstractAnnotationLayer): """Dependencies Layer: Annotation layer for :class:`Dependency` span annotation elements. For dependency relations.""" class MorphologyLayer(AbstractAnnotationLayer): """Morphology Layer: Annotation layer for :class:`Morpheme` subtoken annotation elements. For morphological analysis.""" class PhonologyLayer(AbstractAnnotationLayer): """Phonology Layer: Annotation layer for :class:`Phoneme` subtoken annotation elements.
For phonetic analysis.""" class CoreferenceLayer(AbstractAnnotationLayer): """Syntax Layer: Annotation layer for :class:`SyntacticUnit` span annotation elements""" class SemanticRolesLayer(AbstractAnnotationLayer): """Syntax Layer: Annotation layer for :class:`SemanticRole` span annotation elements""" class StatementLayer(AbstractAnnotationLayer): """Statement Layer: Annotation layer for :class:`Statement` span annotation elements, used for attribution annotation.""" class SentimentLayer(AbstractAnnotationLayer): """Sentiment Layer: Annotation layer for :class:`Sentiment` span annotation elements, used for sentiment analysis.""" class ObservationLayer(AbstractAnnotationLayer): """Observation Layer: Annotation layer for :class:`Observation` span annotation elements.""" class ComplexAlignmentLayer(AbstractAnnotationLayer): """Complex alignment layer""" ACCEPTED_DATA = (ComplexAlignment,Description,Correction) XMLTAG = 'complexalignments' ANNOTATIONTYPE = AnnotationType.COMPLEXALIGNMENT class HeadFeature(Feature): """Head feature, to be used within :class:`PosAnnotation`""" class PosAnnotation(AbstractTokenAnnotation): """Part-of-Speech annotation: a token annotation element""" class LemmaAnnotation(AbstractTokenAnnotation): """Lemma annotation: a token annotation element""" class LangAnnotation(AbstractExtendedTokenAnnotation): """Language annotation: an extended token annotation element""" #class PhonAnnotation(AbstractTokenAnnotation): #DEPRECATED in v0.9 # """Phonetic annotation: a token annotation element""" # ANNOTATIONTYPE = AnnotationType.PHON # ACCEPTED_DATA = (Feature,Description, Metric) # XMLTAG = 'phon' class DomainAnnotation(AbstractExtendedTokenAnnotation): """Domain annotation: an extended token annotation element""" class SynsetFeature(Feature): """Synset feature, to be used within :class:`Sense`""" class ActorFeature(Feature): """Actor feature, to be used within :class:`Event`""" class PolarityFeature(Feature): """Polarity feature, to be used within :class:`Sentiment`""" class StrengthFeature(Feature): """Strength feature, to be used within :class:`Sentiment`""" class BegindatetimeFeature(Feature): """Begindatetime feature, to be used within :class:`Event`""" class EnddatetimeFeature(Feature): """Enddatetime feature, to be used within :class:`Event`""" class StyleFeature(Feature): pass class Note(AbstractStructureElement): """Element used for notes, such as footnotes or warnings or notice blocks.""" class Definition(AbstractStructureElement): """Element used in :class:`Entry` for the portion that provides a definition for the entry.""" class Term(AbstractStructureElement): """A term, often used in contect of :class:`Entry`""" class Example(AbstractStructureElement): """Element that provides an example. Used for instance in the context of :class:`Entry`""" class Entry(AbstractStructureElement): """Represents an entry in a glossary/lexicon/dictionary.""" class TimeSegment(AbstractSpanAnnotation): """A time segment""" TimedEvent = TimeSegment #alias for FoLiA 0.8 compatibility class TimingLayer(AbstractAnnotationLayer): """Timing layer: Annotation layer for :class:`TimeSegment` span annotation elements. """ class SenseAnnotation(AbstractTokenAnnotation): """Sense annotation: a token annotation element""" class SubjectivityAnnotation(AbstractTokenAnnotation): """Subjectivity annotation/Sentiment analysis: a token annotation element""" class Quote(AbstractStructureElement): """Quote: a structure element. For quotes/citations. 
May hold :class:`Word`, :class:`Sentence` or :class:`Paragraph` data.""" def __init__(self, doc, *args, **kwargs): super(Quote,self).__init__(doc, *args, **kwargs) def resolveword(self, id): for child in self: r = child.resolveword(id) if r: return r return None def append(self, child, *args, **kwargs): #Quotes have some more complex ACCEPTED_DATA behaviour depending on what lever they are used on #Note that Sentences under quotes may occur if the parent of the quote is a sentence already insentence = len(list(self.ancestors(Sentence))) > 0 inparagraph = len(list(self.ancestors(Paragraph))) > 0 if inspect.isclass(child): if (insentence or inparagraph) and (child is Paragraph or child is Division): raise Exception("Can't add paragraphs or divisions to a quote when the quote is in a sentence or paragraph!") else: if (insentence or inparagraph) and (isinstance(child, Paragraph) or isinstance(child, Division)): raise Exception("Can't add paragraphs or divisions to a quote when the quote is in a sentence or paragraph!") return super(Quote, self).append(child, *args, **kwargs) def gettextdelimiter(self, retaintokenisation=False): #no text delimiter of itself, recurse into children to inherit delimiter for child in reversed(self): if isinstance(child, Sentence): return "" #if a quote ends in a sentence, we don't want any delimiter else: return child.gettextdelimiter(retaintokenisation) return self.TEXTDELIMITER class Sentence(AbstractStructureElement): """Sentence element. A structure element. Represents a sentence and holds all its words (:class:`Word`), and possibly other structure such as :class:`LineBreak`, :class:`Whitespace` and :class:`Quote`""" def __init__(self, doc, *args, **kwargs): """ Example:: sentence = paragraph.append( folia.Sentence) sentence.append( folia.Word, 'This') sentence.append( folia.Word, 'is') sentence.append( folia.Word, 'a') sentence.append( folia.Word, 'test', space=False) sentence.append( folia.Word, '.') Example:: sentence = folia.Sentence( doc, folia.Word(doc, 'This'), folia.Word(doc, 'is'), folia.Word(doc, 'a'), folia.Word(doc, 'test', space=False), folia.Word(doc, '.') ) paragraph.append(sentence) See also: :meth:`AbstractElement.__init__` """ super(Sentence,self).__init__(doc, *args, **kwargs) def resolveword(self, id): for child in self: r = child.resolveword(id) if r: return r return None def corrections(self): """Are there corrections in this sentence? Returns: bool """ return bool(self.select(Correction)) def paragraph(self): """Obtain the paragraph this sentence is a part of (None otherwise). Shortcut for :meth:`AbstractElement.ancestor`""" return self.ancestor(Paragraph) def division(self): """Obtain the division this sentence is a part of (None otherwise). Shortcut for :meth:`AbstractElement.ancestor`""" return self.ancestor(Division) def correctwords(self, originalwords, newwords, **kwargs): """Generic correction method for words. 
You most likely want to use the helper functions :meth:`Sentence.splitword` , :meth:`Sentence.mergewords`, :meth:`deleteword`, :meth:`insertword` instead""" for w in originalwords: if not isinstance(w, Word): raise Exception("Original word is not a Word instance: " + str(type(w))) elif w.sentence() != self: raise Exception("Original not found as member of sentence!") for w in newwords: if not isinstance(w, Word): raise Exception("New word is not a Word instance: " + str(type(w))) if 'suggest' in kwargs and kwargs['suggest']: del kwargs['suggest'] return self.correct(suggestion=newwords,current=originalwords, **kwargs) else: return self.correct(original=originalwords, new=newwords, **kwargs) def splitword(self, originalword, *newwords, **kwargs): """TODO: Write documentation""" if isstring(originalword): originalword = self.doc[u(originalword)] return self.correctwords([originalword], newwords, **kwargs) def mergewords(self, newword, *originalwords, **kwargs): """TODO: Write documentation""" return self.correctwords(originalwords, [newword], **kwargs) def deleteword(self, word, **kwargs): """TODO: Write documentation""" if isstring(word): word = self.doc[u(word)] return self.correctwords([word], [], **kwargs) def insertword(self, newword, prevword, **kwargs): """Inserts a word **as a correction** after an existing word. This method automatically computes the index of insertion and calls :meth:`AbstractElement.insert` Arguments: newword (:class:`Word`): The new word to insert prevword (:class:`Word`): The word to insert after Keyword Arguments: suggest (bool): Do a suggestion for correction rather than the default authoritive correction See also: :meth:`AbstractElement.insert` and :meth:`AbstractElement.getindex` If you do not want to do corrections """ if prevword: if isstring(prevword): prevword = self.doc[u(prevword)] if not prevword in self or not isinstance(prevword, Word): raise Exception("Previous word not found or not instance of Word!") if isinstance(newword, list) or isinstance(newword, tuple): if not all([ isinstance(x, Word) for x in newword ]): raise Exception("New word (iterable) constains non-Word instances!") elif not isinstance(newword, Word): raise Exception("New word no instance of Word!") kwargs['insertindex'] = self.getindex(prevword) + 1 else: kwargs['insertindex'] = 0 kwargs['nooriginal'] = True if isinstance(newword, list) or isinstance(newword, tuple): return self.correctwords([], newword, **kwargs) else: return self.correctwords([], [newword], **kwargs) def insertwordleft(self, newword, nextword, **kwargs): """Inserts a word **as a correction** before an existing word. Reverse of :meth:`Sentence.insertword`. 
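The other correction helpers follow the same pattern; a usage sketch (the word texts, the ``generate_id_in`` argument and the ``suggest`` flag are illustrative, and ``word_aboutit`` and ``word_prev`` are assumed to be existing :class:`Word` instances in this sentence)::

    # split one token into two, proposing it as a suggestion
    # rather than an authoritative correction
    sentence.splitword(word_aboutit,
                       folia.Word(doc, 'about', generate_id_in=sentence),
                       folia.Word(doc, 'it', generate_id_in=sentence),
                       suggest=True)

    # insert a brand new word after an existing one
    sentence.insertword(folia.Word(doc, 'new', generate_id_in=sentence), word_prev)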
""" if nextword: if isstring(nextword): nextword = self.doc[u(nextword)] if not nextword in self or not isinstance(nextword, Word): raise Exception("Next word not found or not instance of Word!") if isinstance(newword, list) or isinstance(newword, tuple): if not all([ isinstance(x, Word) for x in newword ]): raise Exception("New word (iterable) constains non-Word instances!") elif not isinstance(newword, Word): raise Exception("New word no instance of Word!") kwargs['insertindex'] = self.getindex(nextword) else: kwargs['insertindex'] = 0 kwargs['nooriginal'] = True if isinstance(newword, list) or isinstance(newword, tuple): return self.correctwords([], newword, **kwargs) else: return self.correctwords([], [newword], **kwargs) def gettextdelimiter(self, retaintokenisation=False): #no text delimiter of itself, recurse into children to inherit delimiter for child in reversed(self): if isinstance(child, Linebreak) or isinstance(child, Whitespace): return "" #if a sentence ends in a linebreak, we don't want any delimiter else: break return self.TEXTDELIMITER class Utterance(AbstractStructureElement): """Utterance element. A structure element for speech annotation.""" class Event(AbstractStructureElement): """Structural element representing events, often used in new media contexts for things such as tweets,chat messages and forum posts.""" class Caption(AbstractStructureElement): """Element used for captions for :class:`Figure` or :class:`Table`""" class Label(AbstractStructureElement): """Element used for labels. Mostly in within list item. Contains words.""" class ListItem(AbstractStructureElement): """Single element in a List. Structure element. Contained within :class:`List` element.""" class List(AbstractStructureElement): """Element for enumeration/itemisation. Structure element. Contains :class:`ListItem` elements.""" class Figure(AbstractStructureElement): """Element for the representation of a graphical figure. Structure element.""" def json(self, attribs = None, recurse=True,ignorelist=False): if self.src: if not attribs: attribs = {} attribs['src'] = self.src return super(Figure, self).json(attribs, recurse, ignorelist) def caption(self): try: caption = next(self.select(Caption)) return caption.text() except: raise NoSuchText class Head(AbstractStructureElement): """Head element; a structure element that acts as the header/title of a :class:`Division`. There may be only one per division. Often contains sentences (:class:`Sentence`) or Words (:class:`Word`).""" class Paragraph(AbstractStructureElement): """Paragraph element. A structure element. Represents a paragraph and holds all its sentences (and possibly other structure Whitespace and Quotes).""" class Cell(AbstractStructureElement): """A cell in a :class:`Row` in a :class:`Table`""" pass class Row(AbstractStructureElement): """A row in a :class:`Table`""" pass class TableHead(AbstractStructureElement): """Encapsulated the header of a table, contains :class:`Cell` elements""" pass class Table(AbstractStructureElement): """A table consisting of :class:`Row` elements that in turn consist of :class:`Cell` elements""" pass class Division(AbstractStructureElement): """Structure element representing some kind of division. Divisions may be nested at will, and may include almost all kinds of other structure elements.""" def head(self): for e in self.data: if isinstance(e, Head): return e raise NoSuchAnnotation() class Speech(AbstractStructureElement): """A full speech. This is a high-level element. 
This element may contain :class:`Division`,:class:`Paragraph`, class:`Sentence`, etc..""" # (both SPEAKABLE and PRINTABLE) class Text(AbstractStructureElement): """A full text. This is a high-level element (not to be confused with TextContent!). This element may contain :class:`Division`,:class:`Paragraph`, class:`Sentence`, etc..""" # (both SPEAKABLE and PRINTABLE) class ForeignData(AbstractElement): """The ForeignData element encapsulated data that is not in FoLiA but in a different format. Such data must use a different XML namespace and will be preserved as-is, that is the ``lxml.etree.Element`` instance is retained unmodified. No further interpretation takes place. """ def __init__(self, doc, *args, **kwargs): #pylint: disable=super-init-not-called self.data = [] if 'node' not in kwargs: raise ValueError("Expected a node= keyword argument for foreign-data") if not isinstance(kwargs['node'],ElementTree._Element): raise ValueError("foreign-data node should be ElementTree.Element instance, got " + str(type(kwargs['node']))) self.node = kwargs['node'] for subnode in self.node: self._checknamespace(subnode) self.doc = doc self.id = None self.auth = True self.next = None #chains foreigndata #do not call superconstructor def _checknamespace(self, node): #namespace must be foreign for subnode in node: if node.tag and node.tag.startswith('{'+NSFOLIA+'}'): raise ValueError("foreign-data may not include elements in the FoLiA namespace, a foreign XML namespace is mandatory") self._checknamespace(subnode) @classmethod def parsexml(Class, node, doc, **kwargs): return ForeignData(doc, node=node) def select(self, Class, set=None, recursive=True, ignore=True, node=None): #pylint: disable=bad-classmethod-argument,redefined-builtin """This is a dummy method that returns an empty generator, select() does not work on ForeignData""" #select can never descend into ForeignData, empty generator: return yield def xml(self, attribs = None,elements = None, skipchildren = False): """Returns the XML node (an lxml.etree.Element) that holds the foreign data""" return self.node @classmethod def relaxng(cls, includechildren=True,extraattribs = None, extraelements=None): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) return E.define( E.element(E.ref(name="any_content"), name=cls.XMLTAG), name=cls.XMLTAG, ns=NSFOLIA) #=================================================================================================================== class Query(object): """An XPath query on one or more FoLiA documents""" def __init__(self, files, expression): if isstring(files): self.files = [u(files)] else: assert hasattr(files,'__iter__') self.files = files self.expression = expression def __iter__(self): for filename in self.files: doc = Document(file=filename, mode=Mode.XPATH) for result in doc.xpath(self.expression): yield result class RegExp(object): def __init__(self, regexp): self.regexp = re.compile(regexp) def __eq__(self, value): return self.regexp.match(value) class Pattern(object): """ This class describes a pattern over words to be searched for. The :meth:`Document.findwords` method can subsequently be called with this pattern, and it will return all the words that match. 
An example will best illustrate this, first a trivial example of searching for one word::

    for match in doc.findwords( folia.Pattern('house') ):
        for word in match:
            print(word.id)
        print("----")

The same can be done for a sequence::

    for match in doc.findwords( folia.Pattern('a','big', 'house') ):
        for word in match:
            print(word.id)
        print("----")

The boolean value ``True`` acts as a wildcard, matching any word::

    for match in doc.findwords( folia.Pattern('a',True,'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Alternatively, and more constraining, you may also specify a tuple of alternatives::

    for match in doc.findwords( folia.Pattern('a',('big','small'),'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Or even a regular expression using the ``folia.RegExp`` class::

    for match in doc.findwords( folia.Pattern('a', folia.RegExp('b?g'),'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Rather than searching on the text content of the words, you can search on the classes of any kind of token annotation using the keyword argument ``matchannotation=``::

    for match in doc.findwords( folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation) ):
        for word in match:
            print(word.id, word.text())
        print("----")

The set can be restricted by adding the additional keyword argument ``matchannotationset=``. Case sensitivity, disabled by default, can be enabled by setting ``casesensitive=True``.

Things become even more interesting when different Patterns are combined. A match will have to satisfy all patterns::

    for match in doc.findwords( folia.Pattern('a', True, 'house'), folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation) ):
        for word in match:
            print(word.id, word.text())
        print("----")

The ``findwords()`` method can be instructed to also return left and/or right context for any match. This is done using the ``leftcontext=`` and ``rightcontext=`` keyword arguments, their values being the number of context words to include in each match. For instance, we can look for the word *house* and return its immediate neighbours as follows::

    for match in doc.findwords( folia.Pattern('house'), leftcontext=1, rightcontext=1 ):
        for word in match:
            print(word.id)
        print("----")

A match here would thus always consist of three words instead of just one.

Lastly, ``Pattern`` also supports variable-width gaps; the asterisk symbol has special meaning to this end::

    for match in doc.findwords( folia.Pattern('a','*','house') ):
        for word in match:
            print(word.id)
        print("----")

Unlike the pattern ``('a',True,'house')``, which by definition is a pattern of three words, the pattern in the example above will match gaps of any length (up to a certain built-in maximum), so this might include matches such as *a very nice house*.

Some remarks on these methods of querying are in order. These searches are pretty exhaustive and are done by simply iterating over all the words in the document. The entire document is loaded in memory and no special indices are involved. For single documents this is okay, but when iterating over a corpus of thousands of documents, this method is too slow, especially for real-time applications. For huge corpora, clever indexing and database management systems will be required. This however is beyond the scope of this library.
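The built-in maximum gap size is 10 tokens by default; it can be adjusted by passing the ``maxgapsize=`` keyword argument to ``findwords()``. For instance, to allow gaps of at most three words (an illustrative variation on the example above)::

    for match in doc.findwords( folia.Pattern('a','*','house'), maxgapsize=3 ):
        for word in match:
            print(word.id)
        print("----")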
""" def __init__(self, *args, **kwargs): if not all( ( (x is True or isinstance(x,RegExp) or isstring(x) or isinstance(x, list) or isinstance(x, tuple)) for x in args )): raise TypeError self.sequence = args if 'matchannotation' in kwargs: self.matchannotation = kwargs['matchannotation'] del kwargs['matchannotation'] else: self.matchannotation = None if 'matchannotationset' in kwargs: self.matchannotationset = kwargs['matchannotationset'] del kwargs['matchannotationset'] else: self.matchannotationset = None if 'casesensitive' in kwargs: self.casesensitive = bool(kwargs['casesensitive']) del kwargs['casesensitive'] else: self.casesensitive = False for key in kwargs.keys(): raise Exception("Unknown keyword parameter: " + key) if not self.casesensitive: if all( ( isstring(x) for x in self.sequence) ): self.sequence = [ u(x).lower() for x in self.sequence ] def __nonzero__(self): #Python 2.x return True def __bool__(self): return True def __len__(self): return len(self.sequence) def __getitem__(self, index): return self.sequence[index] def __getslice__(self, begin,end): return self.sequence[begin:end] def variablesize(self): return ('*' in self.sequence) def variablewildcards(self): wildcards = [] for i,x in enumerate(self.sequence): if x == '*': wildcards.append(i) return wildcards def __repr__(self): return repr(self.sequence) def resolve(self,size, distribution): """Resolve a variable sized pattern to all patterns of a certain fixed size""" if not self.variablesize(): raise Exception("Can only resize patterns with * wildcards") nrofwildcards = 0 for x in self.sequence: if x == '*': nrofwildcards += 1 assert (len(distribution) == nrofwildcards) wildcardnr = 0 newsequence = [] for x in self.sequence: if x == '*': newsequence += [True] * distribution[wildcardnr] wildcardnr += 1 else: newsequence.append(x) d = { 'matchannotation':self.matchannotation, 'matchannotationset':self.matchannotationset, 'casesensitive':self.casesensitive } yield Pattern(*newsequence, **d ) class NativeMetaData(object): def __init__(self, *args, **kwargs): self.data = {} self.order = [] for key, value in kwargs.items(): self[key] = value def __setitem__(self, key, value): exists = key in self.data if sys.version < '3': self.data[key] = unicode(value) else: self.data[key] = str(value) if not exists: self.order.append(key) def __iter__(self): for x in self.order: yield x def __contains__(self, x): return x in self.data def items(self): for key in self.order: yield key, self.data[key] def __len__(self): return len(self.data) def __getitem__(self, key): return self.data[key] def __delitem__(self,key): del self.data[key] self.order.remove(key) class Document(object): """This is the FoLiA Document and holds all its data in memory. All FoLiA elements have to be associated with a FoLiA document. Besides holding elements, the document may hold metadata including declarations, and an index of all IDs.""" IDSEPARATOR = '.' 
def __init__(self, *args, **kwargs): """Start/load a FoLiA document: There are four sources of input for loading a FoLiA document:: 1) Create a new document by specifying an *ID*:: doc = folia.Document(id='test') 2) Load a document from FoLiA or D-Coi XML file:: doc = folia.Document(file='/path/to/doc.xml') 3) Load a document from an XML string:: doc = folia.Document(string='....') 4) Load a document by passing a parse xml tree (lxml.etree): doc = folia.Document(tree=xmltree) Additionally, there are three modes that can be set with the ``mode=`` keyword argument: * folia.Mode.MEMORY - The entire FoLiA Document will be loaded into memory. This is the default mode and the only mode in which documents can be manipulated and saved again. * folia.Mode.XPATH - The full XML tree will still be loaded into memory, but conversion to FoLiA classes occurs only when queried. This mode can be used when the full power of XPath is required. Keyword Arguments: setdefinition (dict): A dictionary of set definitions, the key corresponds to the set name, the value is a SetDefinition instance loadsetdefinitions (bool): download and load set definitions (default: False) deepvalidation (bool): Do deep validation of the document (default: False), implies ``loadsetdefinitions`` preparsexmlcallback (function): Callback for a function taking one argument (``node``, an lxml node). Will be called whenever an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort parsing this element (and all its children) parsexmlcallback (function): Callback for a function taking one argument (``element``, a FoLiA element). Will be called whenever an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort adding this element (and all its children) debug (bool): Boolean to enable/disable debug """ self.version = FOLIAVERSION self.data = [] #will hold all texts (usually only one) self.annotationdefaults = {} self.annotations = [] #Ordered list of incorporated annotations ['token','pos', etc..] 
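#Illustrative sketch of the parsexmlcallback= keyword argument documented above
#(the callback name and the filtering rule are hypothetical): the callback is
#invoked for every parsed FoLiA element; returning None drops the element (and
#all its children), returning the element keeps it.
#
#   def drop_morphemes(element):
#       return None if isinstance(element, folia.Morpheme) else element
#
#   doc = folia.Document(file='/path/to/doc.xml', parsexmlcallback=drop_morphemes)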
#Add implicit declaration for TextContent self.annotations.append( (AnnotationType.TEXT,'undefined') ) self.annotationdefaults[AnnotationType.TEXT] = {'undefined': {} } #Add implicit declaration for PhonContent self.annotations.append( (AnnotationType.PHON,'undefined') ) self.annotationdefaults[AnnotationType.PHON] = {'undefined': {} } self.index = {} #all IDs go here self.declareprocessed = False # Will be set to True when declarations have been processed self.metadata = NativeMetaData() #will point to XML Element holding native metadata self.metadatatype = MetaDataType.NATIVE self.metadatafile = None #reference to external metadata file self.textclasses = set() #will contain the text classes found self.autodeclare = False #Automatic declarations in case of undeclared elements (will be enabled for DCOI, since DCOI has no declarations) if 'setdefinitions' in kwargs: self.setdefinitions = kwargs['setdefinitions'] #to re-use a shared store else: self.setdefinitions = {} #key: set name, value: SetDefinition instance (only used when deepvalidation=True) #The metadata fields FoLiA is directly aware of: self._title = self._date = self._publisher = self._license = self._language = None if 'debug' in kwargs: self.debug = kwargs['debug'] else: self.debug = False if 'verbose' in kwargs: self.verbose = kwargs['verbose'] else: self.verbose = False if 'mode' in kwargs: self.mode = int(kwargs['mode']) else: self.mode = Mode.MEMORY #Load all in memory if 'parentdoc' in kwargs: #for subdocuments assert isinstance(kwargs['parentdoc'], Document) self.parentdoc = kwargs['parentdoc'] else: self.parentdoc = None self.subdocs = {} #will hold all subdocs (sourcestring => document) , needed so the index can resolve IDs in subdocs self.standoffdocs = {} #will hold all standoffdocs (type => set => sourcestring => document) if 'external' in kwargs: self.external = kwargs['external'] else: self.external = False if self.external and not self.parentdoc: raise DeepValidationError("Document is marked as external and should not be loaded independently. 
However, no parentdoc= has been specified!") if 'loadsetdefinitions' in kwargs: self.loadsetdefinitions = bool(kwargs['loadsetdefinitions']) else: self.loadsetdefinitions = False if 'deepvalidation' in kwargs: self.deepvalidation = bool(kwargs['deepvalidation']) else: self.deepvalidation = False if self.deepvalidation: self.loadsetdefinitions = True if 'allowadhocsets' in kwargs: self.allowadhocsets = bool(kwargs['allowadhocsets']) else: if self.deepvalidation: self.allowadhocsets = False else: self.allowadhocsets = True if 'autodeclare' in kwargs: self.autodeclare = True if 'bypassleak' in kwargs: self.bypassleak = False #obsolete now if 'preparsexmlcallback' in kwargs: self.preparsexmlcallback = kwargs['parsexmlcallback'] else: self.preparsexmlcallback = None if 'parsexmlcallback' in kwargs: self.parsexmlcallback = kwargs['parsexmlcallback'] else: self.parsexmlcallback = None if 'id' in kwargs: isncname(kwargs['id']) self.id = kwargs['id'] elif 'file' in kwargs: self.filename = kwargs['file'] if self.filename[-4:].lower() == '.bz2': f = bz2.BZ2File(self.filename) contents = f.read() f.close() self.tree = xmltreefromstring(contents) del contents self.parsexml(self.tree.getroot()) elif self.filename[-3:].lower() == '.gz': f = gzip.GzipFile(self.filename) #pylint: disable=redefined-variable-type contents = f.read() f.close() self.tree = xmltreefromstring(contents) del contents self.parsexml(self.tree.getroot()) else: self.load(self.filename) elif 'string' in kwargs: self.tree = xmltreefromstring(kwargs['string']) del kwargs['string'] self.parsexml(self.tree.getroot()) if self.mode != Mode.XPATH: #XML Tree is now obsolete (only needed when partially loaded for xpath queries) self.tree = None elif 'tree' in kwargs: self.parsexml(kwargs['tree']) else: raise Exception("No ID, filename or tree specified") if self.mode != Mode.XPATH: #XML Tree is now obsolete (only needed when partially loaded for xpath queries), free memory self.tree = None #def __del__(self): # del self.index # for child in self.data: # del child # del self.data def load(self, filename): """Load a FoLiA XML file. Argument: filename (str): The file to load """ #if LXE and self.mode != Mode.XPATH: # #workaround for xml:id problem (disabled) # #f = open(filename) # #s = f.read().replace(' xml:id=', ' id=') # #f.close() # self.tree = ElementTree.parse(filename) #else: self.tree = xmltreefromfile(filename) self.parsexml(self.tree.getroot()) if self.mode != Mode.XPATH: #XML Tree is now obsolete (only needed when partially loaded for xpath queries) self.tree = None def items(self): """Returns a depth-first flat list of all items in the document""" l = [] for e in self.data: l += e.items() return l def xpath(self, query): """Run Xpath expression and parse the resulting elements. Don't forget to use the FoLiA namesapace in your expressions, using folia: or the short form f: """ for result in self.tree.xpath(query,namespaces={'f': 'http://ilk.uvt.nl/folia','folia': 'http://ilk.uvt.nl/folia' }): yield self.parsexml(result) def findwords(self, *args, **kwargs): for x in findwords(self,self.words,*args,**kwargs): yield x def save(self, filename=None): """Save the document to file. Arguments: * filename (str): The filename to save to. If not set (``None``, default), saves to the same file as loaded from. 
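Example (the output path is illustrative)::

    doc.save('/tmp/output.folia.xml.gz')   #a .gz or .bz2 extension produces a compressed file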
""" if not filename: filename = self.filename if not filename: raise Exception("No filename specified") if filename[-4:].lower() == '.bz2': f = bz2.BZ2File(filename,'wb') f.write(self.xmlstring().encode('utf-8')) f.close() elif filename[-3:].lower() == '.gz': f = gzip.GzipFile(filename,'wb') #pylint: disable=redefined-variable-type f.write(self.xmlstring().encode('utf-8')) f.close() else: f = io.open(filename,'w',encoding='utf-8') f.write(self.xmlstring()) f.close() def __len__(self): return len(self.data) def __nonzero__(self): #Python 2.x return True def __bool__(self): return True def __iter__(self): for text in self.data: yield text def __contains__(self, key): """Tests if the specified element ID is in the document index""" if key in self.index: return True elif self.subdocs: for subdoc in self.subdocs.values(): if key in subdoc: return True return False else: return False def __getitem__(self, key): """Obtain an element by ID from the document index. Example:: word = doc['example.p.4.s.10.w.3'] """ if isinstance(key, int): return self.data[key] else: try: return self.index[key] except KeyError: if self.subdocs: #perhaps the key is in one of our subdocs? for subdoc in self.subdocs.values(): try: return subdoc[key] except KeyError: pass else: raise KeyError("No such key: " + key) def append(self,text): """Add a text (or speech) to the document: Example 1:: doc.append(folia.Text) Example 2:: doc.append( folia.Text(doc, id='example.text') ) Example 3:: doc.append(folia.Speech) """ if text is Text: text = Text(self, id=self.id + '.text.' + str(len(self.data)+1) ) elif text is Speech: text = Speech(self, id=self.id + '.speech.' + str(len(self.data)+1) ) #pylint: disable=redefined-variable-type else: assert isinstance(text, Text) or isinstance(text, Speech) self.data.append(text) return text def add(self,text): """Alias for :meth:`Document.append`""" return self.append(text) def create(self, Class, *args, **kwargs): """Create an element associated with this Document. 
This method may be obsolete and removed later.""" return Class(self, *args, **kwargs) def xmldeclarations(self): """Internal method to generate XML nodes for all declarations""" l = [] E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={None: "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) for annotationtype, set in self.annotations: label = None #Find the 'label' for the declarations dynamically (aka: AnnotationType --> String) for key, value in vars(AnnotationType).items(): if value == annotationtype: label = key break #gather attribs if (annotationtype == AnnotationType.TEXT or annotationtype == AnnotationType.PHON) and set == 'undefined' and len(self.annotationdefaults[annotationtype][set]) == 0: #this is the implicit TextContent declaration, no need to output it explicitly continue attribs = {} if set and set != 'undefined': attribs['{' + NSFOLIA + '}set'] = set for key, value in self.annotationdefaults[annotationtype][set].items(): if key == 'annotatortype': if value == AnnotatorType.MANUAL: attribs['{' + NSFOLIA + '}' + key] = 'manual' elif value == AnnotatorType.AUTO: attribs['{' + NSFOLIA + '}' + key] = 'auto' elif key == 'datetime': attribs['{' + NSFOLIA + '}' + key] = value.strftime("%Y-%m-%dT%H:%M:%S") #proper iso-formatting elif value: attribs['{' + NSFOLIA + '}' + key] = value if label: l.append( makeelement(E,'{' + NSFOLIA + '}' + label.lower() + '-annotation', **attribs) ) else: raise Exception("Invalid annotation type") return l def jsondeclarations(self): """Return all declarations in a form ready to be serialised to JSON. Returns: list of dict """ l = [] for annotationtype, set in self.annotations: label = None #Find the 'label' for the declarations dynamically (aka: AnnotationType --> String) for key, value in vars(AnnotationType).items(): if value == annotationtype: label = key break #gather attribs if (annotationtype == AnnotationType.TEXT or annotationtype == AnnotationType.PHON) and set == 'undefined' and len(self.annotationdefaults[annotationtype][set]) == 0: #this is the implicit TextContent declaration, no need to output it explicitly continue jsonnode = {'annotationtype': label.lower()} if set and set != 'undefined': jsonnode['set'] = set for key, value in self.annotationdefaults[annotationtype][set].items(): if key == 'annotatortype': if value == AnnotatorType.MANUAL: jsonnode[key] = 'manual' elif value == AnnotatorType.AUTO: jsonnode[key] = 'auto' elif key == 'datetime': jsonnode[key] = value.strftime("%Y-%m-%dT%H:%M:%S") #proper iso-formatting elif value: jsonnode[key] = value if label: l.append( jsonnode ) else: raise Exception("Invalid annotation type") return l def xml(self): """Serialise the document to XML. 
Returns: lxml.etree.Element See also: :meth:`Document.xmlstring` """ E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={'xml' : "http://www.w3.org/XML/1998/namespace", 'xlink':"http://www.w3.org/1999/xlink"}) attribs = {} attribs['{http://www.w3.org/XML/1998/namespace}id'] = self.id if self.version: attribs['version'] = self.version else: attribs['version'] = FOLIAVERSION attribs['generator'] = 'pynlpl.formats.folia-v' + LIBVERSION metadataattribs = {} metadataattribs['{' + NSFOLIA + '}type'] = self.metadatatype if self.metadatafile: metadataattribs['{' + NSFOLIA + '}src'] = self.metadatafile e = E.FoLiA( E.metadata( E.annotations( *self.xmldeclarations() ), *self.xmlmetadata(), **metadataattribs ) , **attribs) for text in self.data: e.append(text.xml()) return e def json(self): """Serialise the document to a ``dict`` ready for serialisation to JSON. Example:: import json jsondoc = json.dumps(doc.json()) """ jsondoc = {'id': self.id, 'children': [], 'declarations': self.jsondeclarations() } if self.version: jsondoc['version'] = self.version else: jsondoc['version'] = FOLIAVERSION jsondoc['generator'] = 'pynlpl.formats.folia-v' + LIBVERSION for text in self.data: jsondoc['children'].append(text.json()) return jsondoc def xmlmetadata(self): """Internal method to serialize XML declarations""" E = ElementMaker(namespace="http://ilk.uvt.nl/folia",nsmap={None: "http://ilk.uvt.nl/folia", 'xml' : "http://www.w3.org/XML/1998/namespace"}) if self.metadatatype == MetaDataType.NATIVE: e = [] if not self.metadatafile: for key, value in self.metadata.items(): e.append(E.meta(value,id=key) ) return e else: if self.metadatafile: return [] #external elif self.metadata is not None: #in-document e = [] m = self.metadata while m is not None: e.append(m.xml()) m = m.next return e else: return [] def parsexmldeclarations(self, node): """Internal method to parse XML declarations""" if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Processing Annotation Declarations",file=stderr) self.declareprocessed = True for subnode in node: #pylint: disable=too-many-nested-blocks if not isinstance(subnode.tag, str): continue if subnode.tag[:25] == '{' + NSFOLIA + '}' and subnode.tag[-11:] == '-annotation': prefix = subnode.tag[25:][:-11] type = None if prefix.upper() in vars(AnnotationType): type = vars(AnnotationType)[prefix.upper()] else: raise Exception("Unknown declaration: " + subnode.tag) if 'set' in subnode.attrib and subnode.attrib['set']: set = subnode.attrib['set'] else: set = 'undefined' if (type,set) in self.annotations: if type == AnnotationType.TEXT: #explicit Text declaration, remove the implicit declaration: a = [] for t,s in self.annotations: if not (t == AnnotationType.TEXT and s == 'undefined'): a.append( (t,s) ) self.annotations = a #raise ValueError("Double declaration of " + subnode.tag + ", set '" + set + "' + is already declared") //doubles are okay says Ko else: self.annotations.append( (type, set) ) #Load set definition if set and self.loadsetdefinitions and set not in self.setdefinitions: if set[:7] == "http://" or set[:8] == "https://" or set[:6] == "ftp://": try: self.setdefinitions[set] = SetDefinition(set,verbose=self.verbose) #will raise exception on error except DeepValidationError: print("WARNING: Set " + set + " could not be downloaded, ignoring!",file=sys.stderr) #warning and ignore #Set defaults if type in self.annotationdefaults and set in self.annotationdefaults[type]: #handle duplicate. 
If ambiguous: remove defaults if 'annotator' in subnode.attrib: if not ('annotator' in self.annotationdefaults[type][set]): self.annotationdefaults[type][set]['annotator'] = subnode.attrib['annotator'] elif self.annotationdefaults[type][set]['annotator'] != subnode.attrib['annotator']: del self.annotationdefaults[type][set]['annotator'] if 'annotatortype' in subnode.attrib: if not ('annotatortype' in self.annotationdefaults[type][set]): self.annotationdefaults[type][set]['annotatortype'] = subnode.attrib['annotatortype'] elif self.annotationdefaults[type][set]['annotatortype'] != subnode.attrib['annotatortype']: del self.annotationdefaults[type][set]['annotatortype'] else: defaults = {} if 'annotator' in subnode.attrib: defaults['annotator'] = subnode.attrib['annotator'] if 'annotatortype' in subnode.attrib: if subnode.attrib['annotatortype'] == 'auto': defaults['annotatortype'] = AnnotatorType.AUTO else: defaults['annotatortype'] = AnnotatorType.MANUAL if 'datetime' in subnode.attrib: if isinstance(subnode.attrib['datetime'], datetime): defaults['datetime'] = subnode.attrib['datetime'] else: defaults['datetime'] = parse_datetime(subnode.attrib['datetime']) if not type in self.annotationdefaults: self.annotationdefaults[type] = {} self.annotationdefaults[type][set] = defaults if 'external' in subnode.attrib and subnode.attrib['external']: if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Loading external document: " + subnode.attrib['external'],file=stderr) if not type in self.standoffdocs: self.standoffdocs[type] = {} self.standoffdocs[type][set] = {} #check if it is already loaded, if multiple references are made to the same doc we reuse the instance standoffdoc = None for t in self.standoffdocs: for s in self.standoffdocs[t]: for source in self.standoffdocs[t][s]: if source == subnode.attrib['external']: standoffdoc = self.standoffdocs[t][s] break if standoffdoc: break if standoffdoc: break if not standoffdoc: if subnode.attrib['external'][:7] == 'http://' or subnode.attrib['external'][:8] == 'https://': #document is remote, download (in memory) try: f = urlopen(subnode.attrib['external']) except: raise DeepValidationError("Unable to download standoff document: " + subnode.attrib['external']) try: content = u(f.read()) except IOError: raise DeepValidationError("Unable to download standoff document: " + subnode.attrib['external']) f.close() standoffdoc = Document(string=content, parentdoc=self, setdefinitions=self.setdefinitions) elif os.path.exists(subnode.attrib['external']): #document is on disk: standoffdoc = Document(file=subnode.attrib['external'], parentdoc=self, setdefinitions=self.setdefinitions) else: #document not found raise DeepValidationError("Unable to find standoff document: " + subnode.attrib['external']) self.standoffdocs[type][set][subnode.attrib['external']] = standoffdoc standoffdoc.parentdoc = self if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found declared annotation " + subnode.tag + ". 
Defaults: " + repr(defaults),file=stderr) def setimdi(self, node): #OBSOLETE """OBSOLETE""" ns = {'imdi': 'http://www.mpi.nl/IMDI/Schema/IMDI'} self.metadatatype = MetaDataType.IMDI if LXE: self.metadata = ElementTree.tostring(node, xml_declaration=False, pretty_print=True, encoding='utf-8') else: self.metadata = ElementTree.tostring(node, encoding='utf-8') n = node.xpath('imdi:Session/imdi:Title', namespaces=ns) if n and n[0].text: self._title = n[0].text n = node.xpath('imdi:Session/imdi:Date', namespaces=ns) if n and n[0].text: self._date = n[0].text n = node.xpath('//imdi:Source/imdi:Access/imdi:Publisher', namespaces=ns) if n and n[0].text: self._publisher = n[0].text n = node.xpath('//imdi:Source/imdi:Access/imdi:Availability', namespaces=ns) if n and n[0].text: self._license = n[0].text n = node.xpath('//imdi:Languages/imdi:Language/imdi:ID', namespaces=ns) if n and n[0].text: self._language = n[0].text def declare(self, annotationtype, set, **kwargs): """Declare a new annotation type to be used in the document. Keyword arguments can be used to set defaults for any annotation of this type and set. Arguments: annotationtype: The type of annotation, this is conveyed by passing the corresponding annototion class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``. set (str): the set, should formally be a URL pointing to the set definition Keyword Arguments: annotator (str): Sets a default annotator annotatortype: Should be either ``AnnotatorType.MANUAL`` or ``AnnotatorType.AUTO``, indicating whether the annotation was performed manually or by an automated process. datetime (datetime.datetime): Sets the default datetime Example:: doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', annotator="mytagger", annotatortype=folia.AnnotatorType.AUTO) """ if (sys.version > '3' and not isinstance(set,str)) or (sys.version < '3' and not isinstance(set,(str,unicode))): raise ValueError("Set parameter for declare() must be a string") if inspect.isclass(annotationtype): annotationtype = annotationtype.ANNOTATIONTYPE if not (annotationtype, set) in self.annotations: self.annotations.append( (annotationtype,set) ) if set and self.loadsetdefinitions and not set in self.setdefinitions: if set[:7] == "http://" or set[:8] == "https://" or set[:6] == "ftp://": self.setdefinitions[set] = loadsetdefinition(set) #will raise exception on error if not annotationtype in self.annotationdefaults: self.annotationdefaults[annotationtype] = {} self.annotationdefaults[annotationtype][set] = kwargs def declared(self, annotationtype, set): """Checks if the annotation type is present (i.e. declared) in the document. Arguments: annotationtype: The type of annotation, this is conveyed by passing the corresponding annototion class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``. set (str): the set, should formally be a URL pointing to the set definition Example:: if doc.declared(folia.PosAnnotation, 'http://some/path/brown-tag-set'): .. Returns: bool """ if inspect.isclass(annotationtype): annotationtype = annotationtype.ANNOTATIONTYPE return ( (annotationtype,set) in self.annotations) def defaultset(self, annotationtype): """Obtain the default set for the specified annotation type. 
Arguments: annotationtype: The type of annotation, this is conveyed by passing the corresponding annototion class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``. Returns: the set (str) Raises: :class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type) """ if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE try: return list(self.annotationdefaults[annotationtype].keys())[0] except KeyError: raise NoDefaultError except IndexError: raise NoDefaultError def defaultannotator(self, annotationtype, set=None): """Obtain the default annotator for the specified annotation type and set. Arguments: annotationtype: The type of annotation, this is conveyed by passing the corresponding annototion class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``. set (str): the set, should formally be a URL pointing to the set definition Returns: the set (str) Raises: :class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type) """ if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE if not set: set = self.defaultset(annotationtype) try: return self.annotationdefaults[annotationtype][set]['annotator'] except KeyError: raise NoDefaultError def defaultannotatortype(self, annotationtype,set=None): """Obtain the default annotator type for the specified annotation type and set. Arguments: annotationtype: The type of annotation, this is conveyed by passing the corresponding annototion class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``. set (str): the set, should formally be a URL pointing to the set definition Returns: ``AnnotatorType.AUTO`` or ``AnnotatorType.MANUAL`` Raises: :class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type) """ if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE if not set: set = self.defaultset(annotationtype) try: return self.annotationdefaults[annotationtype][set]['annotatortype'] except KeyError: raise NoDefaultError def defaultdatetime(self, annotationtype,set=None): """Obtain the default datetime for the specified annotation type and set. Arguments: annotationtype: The type of annotation, this is conveyed by passing the corresponding annototion class (such as :class:`PosAnnotation` for example), or a member of :class:`AnnotationType`, such as ``AnnotationType.POS``. 
set (str): the set, should formally be a URL pointing to the set definition Returns: the set (str) Raises: :class:`NoDefaultError` if the annotation type does not exist or if there is ambiguity (multiple sets for the same type) """ if inspect.isclass(annotationtype) or isinstance(annotationtype,AbstractElement): annotationtype = annotationtype.ANNOTATIONTYPE if not set: set = self.defaultset(annotationtype) try: return self.annotationdefaults[annotationtype][set]['datetime'] except KeyError: raise NoDefaultError def title(self, value=None): """Get or set the document's title from/in the metadata No arguments: Get the document's title from metadata Argument: Set the document's title in metadata """ if not (value is None): if (self.metadatatype == MetaDataType.NATIVE): self.metadata['title'] = value else: self._title = value if (self.metadatatype == MetaDataType.NATIVE): if 'title' in self.metadata: return self.metadata['title'] else: return None else: return self._title def date(self, value=None): """Get or set the document's date from/in the metadata. No arguments: Get the document's date from metadata Argument: Set the document's date in metadata """ if not (value is None): if (self.metadatatype == MetaDataType.NATIVE): self.metadata['date'] = value else: self._date = value if (self.metadatatype == MetaDataType.NATIVE): if 'date' in self.metadata: return self.metadata['date'] else: return None else: return self._date def publisher(self, value=None): """No arguments: Get the document's publisher from metadata Argument: Set the document's publisher in metadata """ if not (value is None): if (self.metadatatype == MetaDataType.NATIVE): self.metadata['publisher'] = value else: self._publisher = value if (self.metadatatype == MetaDataType.NATIVE): if 'publisher' in self.metadata: return self.metadata['publisher'] else: return None else: return self._publisher def license(self, value=None): """No arguments: Get the document's license from metadata Argument: Set the document's license in metadata """ if not (value is None): if (self.metadatatype == MetaDataType.NATIVE): self.metadata['license'] = value else: self._license = value if (self.metadatatype == MetaDataType.NATIVE): if 'license' in self.metadata: return self.metadata['license'] else: return None else: return self._license def language(self, value=None): """No arguments: Get the document's language (ISO-639-3) from metadata Argument: Set the document's language (ISO-639-3) in metadata """ if not (value is None): if (self.metadatatype == MetaDataType.NATIVE): self.metadata['language'] = value else: self._language = value if (self.metadatatype == MetaDataType.NATIVE): if 'language' in self.metadata: return self.metadata['language'] else: return None else: return self._language def parsemetadata(self, node): """Internal method to parse metadata""" if 'type' in node.attrib: self.metadatatype = node.attrib['type'] else: #no type specified, default to native self.metadatatype = "native" if 'src' in node.attrib: self.metadatafile = node.attrib['src'] else: self.metadatafile = None if self.metadatatype == "native": self.metadata = NativeMetaData() else: self.metadata = None #will be set below for subnode in node: if subnode.tag == '{' + NSFOLIA + '}annotations': self.parsexmldeclarations(subnode) elif subnode.tag == '{' + NSFOLIA + '}meta': if self.metadatatype == "native": if subnode.text: self.metadata[subnode.attrib['id']] = subnode.text else: raise MetaDataError("Encountered a meta element but metadata type is not native!") elif subnode.tag 
== '{' + NSFOLIA + '}foreign-data': if self.metadatatype == "native": raise MetaDataError("Encountered a foreign-data element but metadata type is native!") elif self.metadata is not None: #multiple foreign-data elements, chain: e = self.metadata while e.next is not None: e = e.next e.next = ForeignData(self, node=subnode) else: self.metadata = ForeignData(self, node=subnode) elif subnode.tag == '{http://www.mpi.nl/IMDI/Schema/IMDI}METATRANSCRIPT': #backward-compatibility for old IMDI without foreign-key E = ElementMaker(namespace=NSFOLIA,nsmap={None: NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) self.metadatatype = "imdi" self.metadata = makeelement(E, '{'+NSFOLIA+'}foreign-data') self.metadata.append(subnode) def parsexml(self, node, ParentClass = None): """Internal method. This is the main XML parser, will invoke class-specific XML parsers.""" if (LXE and isinstance(node,ElementTree._ElementTree)) or (not LXE and isinstance(node, ElementTree.ElementTree)): #pylint: disable=protected-access node = node.getroot() elif isstring(node): node = xmltreefromstring(node).getroot() if node.tag.startswith('{' + NSFOLIA + '}'): foliatag = node.tag[nslen:] if foliatag == "FoLiA": if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found FoLiA document",file=stderr) try: self.id = node.attrib['{http://www.w3.org/XML/1998/namespace}id'] except KeyError: try: self.id = node.attrib['XMLid'] except KeyError: try: self.id = node.attrib['id'] except KeyError: raise Exception("FoLiA Document has no ID!") if 'version' in node.attrib: self.version = node.attrib['version'] if checkversion(self.version) > 0: print("WARNING!!! Document uses a newer version of FoLiA than this library! (" + self.version + " vs " + FOLIAVERSION + "). Any possible subsequent failures in parsing or processing may probably be attributed to this. Upgrade pynlpl to remedy this.",file=sys.stderr) else: self.version = None if 'external' in node.attrib: self.external = (node.attrib['external'] == 'yes') if self.external and not self.parentdoc: raise DeepValidationError("Document is marked as external and should not be loaded independently. 
However, no parentdoc= has been specified!") for subnode in node: if subnode.tag == '{' + NSFOLIA + '}metadata': self.parsemetadata(subnode) elif (subnode.tag == '{' + NSFOLIA + '}text' or subnode.tag == '{' + NSFOLIA + '}speech') and self.mode == Mode.MEMORY: if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found Text",file=stderr) e = self.parsexml(subnode) if e is not None: self.data.append(e) else: #generic handling (FoLiA) if not foliatag in XML2CLASS: raise Exception("Unknown FoLiA XML tag: " + foliatag) Class = XML2CLASS[foliatag] return Class.parsexml(node,self) elif node.tag == '{' + NSDCOI + '}DCOI': if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found DCOI document",file=stderr) self.autodeclare = True try: self.id = node.attrib['{http://www.w3.org/XML/1998/namespace}id'] except KeyError: try: self.id = node.attrib['id'] except KeyError: try: self.id = node.attrib['XMLid'] except KeyError: raise Exception("D-Coi Document has no ID!") for subnode in node: if subnode.tag == '{http://www.mpi.nl/IMDI/Schema/IMDI}METATRANSCRIPT': self.metadatatype = MetaDataType.IMDI self.setimdi(subnode) elif subnode.tag == '{' + NSDCOI + '}text': if self.debug >= 1: print("[PyNLPl FoLiA DEBUG] Found Text",file=stderr) e = self.parsexml(subnode) if e is not None: self.data.append( e ) elif node.tag.startswith('{' + NSDCOI + '}'): #generic handling (D-Coi) if node.tag[nslendcoi:] in XML2CLASS: Class = XML2CLASS[node.tag[nslendcoi:]] return Class.parsexml(node,self) elif node.tag[nslendcoi:][0:3] == 'div': #support for div0, div1, etc: Class = Division return Class.parsexml(node,self) elif node.tag[nslendcoi:] == 'item': #support for listitem Class = ListItem return Class.parsexml(node,self) elif node.tag[nslendcoi:] == 'figDesc': #support for description in figures Class = Description return Class.parsexml(node,self) else: raise Exception("Unknown DCOI XML tag: " + node.tag) else: raise Exception("Unknown FoLiA XML tag: " + node.tag) def select(self, Class, set=None, recursive=True, ignore=True): """See :meth:`AbstractElement.select`""" if self.mode == Mode.MEMORY: for t in self.data: if Class.__name__ == 'Text': yield t else: for e in t.select(Class,set,recursive,ignore): yield e def count(self, Class, set=None, recursive=True,ignore=True): """See :meth:`AbstractElement.count`""" if self.mode == Mode.MEMORY: s = 0 for t in self.data: s += sum( 1 for e in t.select(Class,recursive,True ) ) return s def paragraphs(self, index = None): """Return a generator of all paragraphs found in the document. If an index is specified, return the n'th paragraph only (starting at 0)""" if index is None: return self.select(Paragraph) else: if index < 0: index = sum(t.count(Paragraph) for t in self.data) + index for t in self.data: for i,e in enumerate(t.select(Paragraph)) : if i == index: return e raise IndexError def sentences(self, index = None): """Return a generator of all sentence found in the document. Except for sentences in quotes. If an index is specified, return the n'th sentence only (starting at 0)""" if index is None: return self.select(Sentence,None,True,[Quote]) else: if index < 0: index = sum(t.count(Sentence,None,True,[Quote]) for t in self.data) + index for t in self.data: for i,e in enumerate(t.select(Sentence,None,True,[Quote])) : if i == index: return e raise IndexError def words(self, index = None): """Return a generator of all active words found in the document. Does not descend into annotation layers, alternatives, originals, suggestions. 
If an index is specified, return the n'th word only (starting at 0)""" if index is None: return self.select(Word,None,True,default_ignore_structure) else: if index < 0: index = sum(t.count(Word,None,True,default_ignore_structure) for t in self.data) + index for t in self.data: for i, e in enumerate(t.select(Word,None,True,default_ignore_structure)): if i == index: return e raise IndexError def text(self, cls='current', retaintokenisation=False): """Returns the text of the entire document (returns a unicode instance) See also: :meth:`AbstractElement.text` """ #backward compatibility, old versions didn't have cls as first argument, so if a boolean is passed first we interpret it as the 2nd: if cls is True or cls is False: retaintokenisation = cls cls = 'current' s = "" for c in self.data: if s: s += "\n\n\n" try: s += c.text(cls, retaintokenisation) except NoSuchText: continue return s def xmlstring(self): """Return the XML representation of the document as a string.""" s = ElementTree.tostring(self.xml(), xml_declaration=True, pretty_print=True, encoding='utf-8') if sys.version < '3': if isinstance(s, str): s = unicode(s,'utf-8') #pylint: disable=undefined-variable else: if isinstance(s,bytes): s = str(s,'utf-8') s = s.replace('ns0:','') #ugly patch to get rid of namespace prefix s = s.replace(':ns0','') return s def __unicode__(self): """Returns the text of the entire document""" return self.text() def __str__(self): """Returns the text of the entire document""" return self.text() def __ne__(self, other): return not (self == other) def __eq__(self, other): if len(self.data) != len(other.data): if self.debug: print("[PyNLPl FoLiA DEBUG] Equality check - Documents have unequal amount of children",file=stderr) return False for e,e2 in zip(self.data,other.data): if e != e2: return False return True #============================================================================== class Corpus: """A corpus of various FoLiA documents. Yields a Document on each iteration. Suitable for sequential processing.""" def __init__(self,corpusdir, extension = 'xml', restrict_to_collection = "", conditionf=lambda x: True, ignoreerrors=False, **kwargs): self.corpusdir = corpusdir self.extension = extension self.restrict_to_collection = restrict_to_collection self.conditionf = conditionf self.ignoreerrors = ignoreerrors self.kwargs = kwargs def __iter__(self): if not self.restrict_to_collection: for f in glob.glob(os.path.join(self.corpusdir,"*." + self.extension)): if self.conditionf(f): try: yield Document(file=f, **self.kwargs ) except Exception as e: #pylint: disable=broad-except print("Error, unable to parse " + f + ": " + e.__class__.__name__ + " - " + str(e),file=stderr) if not self.ignoreerrors: raise for d in glob.glob(os.path.join(self.corpusdir,"*")): if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)): for f in glob.glob(os.path.join(d ,"*." + self.extension)): if self.conditionf(f): try: yield Document(file=f, **self.kwargs) except Exception as e: #pylint: disable=broad-except print("Error, unable to parse " + f + ": " + e.__class__.__name__ + " - " + str(e),file=stderr) if not self.ignoreerrors: raise class CorpusFiles(Corpus): """A corpus of various FoLiA documents. Yields the filenames on each iteration.""" def __iter__(self): if not self.restrict_to_collection: for f in glob.glob(os.path.join(self.corpusdir,"*." 
+ self.extension)): if self.conditionf(f): try: yield f except Exception as e: #pylint: disable=broad-except print("Error, unable to parse " + f+ ": " + e.__class__.__name__ + " - " + str(e),file=stderr) if not self.ignoreerrors: raise for d in glob.glob(os.path.join(self.corpusdir,"*")): if (not self.restrict_to_collection or self.restrict_to_collection == os.path.basename(d)) and (os.path.isdir(d)): for f in glob.glob(os.path.join(d, "*." + self.extension)): if self.conditionf(f): try: yield f except Exception as e: #pylint: disable=broad-except print("Error, unable to parse " + f+ ": " + e.__class__.__name__ + " - " + str(e),file=stderr) if not self.ignoreerrors: raise class CorpusProcessor(object): """Processes a corpus of various FoLiA documents using a parallel processing. Calls a user-defined function with the three-tuple (filename, args, kwargs) for each file in the corpus. The user-defined function is itself responsible for instantiating a FoLiA document! args and kwargs, as received by the custom function, are set through the run() method, which yields the result of the custom function on each iteration.""" def __init__(self,corpusdir, function, threads = None, extension = 'xml', restrict_to_collection = "", conditionf=lambda x: True, maxtasksperchild=100, preindex = False, ordered=True, chunksize = 1): self.function = function self.threads = threads #If set to None, will use all available cores by default self.corpusdir = corpusdir self.extension = extension self.restrict_to_collection = restrict_to_collection self.conditionf = conditionf self.ignoreerrors = True self.maxtasksperchild = maxtasksperchild #This should never be set too high due to lxml leaking memory!!! self.preindex = preindex self.ordered = ordered self.chunksize = chunksize if preindex: self.index = list(CorpusFiles(self.corpusdir, self.extension, self.restrict_to_collection, self.conditionf, True)) self.index.sort() def __len__(self): if self.preindex: return len(self.index) else: return ValueError("Can only retrieve length if instantiated with preindex=True") def execute(self): for _ in self.run(): pass def run(self, *args, **kwargs): if not self.preindex: self.index = CorpusFiles(self.corpusdir, self.extension, self.restrict_to_collection, self.conditionf, True) #generator pool = multiprocessing.Pool(self.threads,None,None, self.maxtasksperchild) if self.ordered: return pool.imap( self.function, ( (filename, args, kwargs) for filename in self.index), self.chunksize) else: return pool.imap_unordered( self.function, ( (filename, args, kwargs) for filename in self.index), self.chunksize) #pool.close() def __iter__(self): return self.run() def relaxng_declarations(): E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) for key in vars(AnnotationType).keys(): if key[0] != '_': yield E.element( E.optional( E.attribute(name='set')) , E.optional(E.attribute(name='annotator')) , E.optional( E.attribute(name='annotatortype') ) , E.optional( E.attribute(name='datetime') ) , name=key.lower() + '-annotation') def relaxng(filename=None): """Generates a RelaxNG Schema for FoLiA. Optionally saves it to file. 
Args: filename (str): Save the schema to the following filename Returns: lxml.ElementTree: The schema """ E = ElementMaker(namespace="http://relaxng.org/ns/structure/1.0",nsmap={None:'http://relaxng.org/ns/structure/1.0' , 'folia': NSFOLIA, 'xml' : "http://www.w3.org/XML/1998/namespace"}) grammar = E.grammar( E.start( E.element( #FoLiA E.attribute(name='id',ns="http://www.w3.org/XML/1998/namespace"), E.optional( E.attribute(name='version') ), E.optional( E.attribute(name='generator') ), E.element( #metadata E.optional(E.attribute(name='type')), E.optional(E.attribute(name='src')), E.element( E.zeroOrMore( E.choice( *relaxng_declarations() ) ) ,name='annotations'), E.zeroOrMore( E.element(E.attribute(name='id'), E.text(), name='meta'), ), E.zeroOrMore( E.ref(name="foreign-data"), ), #E.optional( # E.ref(name='METATRANSCRIPT') #), name='metadata', #ns=NSFOLIA, ), E.interleave( E.zeroOrMore( E.ref(name='text'), ), E.zeroOrMore( E.ref(name='speech'), ), ), name='FoLiA', ns = NSFOLIA ) ), #definitions needed for ForeignData (allow any content) - see http://www.microhowto.info/howto/match_arbitrary_content_using_relax_ng.html E.define( E.interleave(E.zeroOrMore(E.ref(name="any_element")),E.text()), name="any_content"), E.define( E.element(E.anyName(), E.zeroOrMore(E.ref(name="any_attribute")), E.zeroOrMore(E.ref(name="any_content"))), name="any_element"), E.define( E.attribute(E.anyName()), name="any_attribute"), #Definition for allowing alien-namespace attributes on any element E.define( E.zeroOrMore(E.attribute(E.anyName(getattr(E,'except')(E.nsName(),E.nsName(ns=""),E.nsName(ns="http://www.w3.org/XML/1998/namespace"),E.nsName(ns="http://www.w3.org/1999/xlink"))))), name="allow_foreign_attributes"), datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes", ) done = {} for c in globals().values(): if 'relaxng' in dir(c): if c.relaxng and c.XMLTAG and not c.XMLTAG in done: done[c.XMLTAG] = True definition = c.relaxng() grammar.append( definition ) if c.XMLTAG == 'item': #nasty backward-compatibility hack to allow deprecated listitem element (actually called item) definition_alias = c.relaxng() definition_alias.set('name','listitem') definition_alias[0].set('name','listitem') grammar.append( definition_alias ) #for e in relaxng_imdi(): # grammar.append(e) if filename: if sys.version < '3': f = io.open(filename,'w',encoding='utf-8') else: f = io.open(filename,'wb') if LXE: if sys.version < '3': f.write( ElementTree.tostring(relaxng(),pretty_print=True).replace("","\n\n") ) else: f.write( ElementTree.tostring(relaxng(),pretty_print=True).replace(b"",b"\n\n") ) else: f.write( ElementTree.tostring(relaxng()).replace("","\n\n") ) f.close() return grammar def findwords(doc, worditerator, *args, **kwargs): if 'leftcontext' in kwargs: leftcontext = int(kwargs['leftcontext']) del kwargs['leftcontext'] else: leftcontext = 0 if 'rightcontext' in kwargs: rightcontext = int(kwargs['rightcontext']) del kwargs['rightcontext'] else: rightcontext = 0 if 'maxgapsize' in kwargs: maxgapsize = int(kwargs['maxgapsize']) del kwargs['maxgapsize'] else: maxgapsize = 10 for key in kwargs.keys(): raise Exception("Unknown keyword parameter: " + key) matchcursor = 0 #shortcut for when no Pattern is passed, make one on the fly if len(args) == 1 and not isinstance(args[0], Pattern): if not isinstance(args[0], list) and not isinstance(args[0], tuple): args[0] = [args[0]] args[0] = Pattern(*args[0]) unsetwildcards = False variablewildcards = None prevsize = -1 #sanity check for i, pattern in enumerate(args): if not 
isinstance(pattern, Pattern): raise TypeError("You must pass instances of Sequence to findwords") if prevsize > -1 and len(pattern) != prevsize: raise Exception("If multiple patterns are provided, they must all have the same length!") if pattern.variablesize(): if not variablewildcards and i > 0: unsetwildcards = True else: if variablewildcards and pattern.variablewildcards() != variablewildcards: raise Exception("If multiple patterns are provided with variable wildcards, then these wildcards must all be in the same positions!") variablewildcards = pattern.variablewildcards() elif variablewildcards: unsetwildcards = True prevsize = len(pattern) if unsetwildcards: #one pattern determines a fixed length whilst others are variable, rewrite all to fixed length #converting multi-span * wildcards into single-span 'True' wildcards for pattern in args: if pattern.variablesize(): pattern.sequence = [ True if x == '*' else x for x in pattern.sequence ] if variablewildcards: #pylint: disable=too-many-nested-blocks #one or more items have a * wildcard, which may span multiple tokens. Resolve this to a wider range of simpler patterns #we're not commited to a particular size, expand to various ones for size in range(len(variablewildcards), maxgapsize+1): for distribution in pynlpl.algorithms.sum_to_n(size, len(variablewildcards)): #gap distributions, (amount) of 'True' wildcards patterns = [] for pattern in args: if pattern.variablesize(): patterns += list(pattern.resolve(size,distribution)) else: patterns.append( pattern ) for match in findwords(doc, worditerator,*patterns, **{'leftcontext':leftcontext,'rightcontext':rightcontext}): yield match else: patterns = args #pylint: disable=redefined-variable-type buffers = [] for word in worditerator(): buffers.append( [] ) #Add a new empty buffer for every word match = [None] * len(buffers) for pattern in patterns: #find value to match against if not pattern.matchannotation: value = word.text() else: if pattern.matchannotationset: items = list(word.select(pattern.matchannotation, pattern.matchannotationset, True, [Original, Suggestion, Alternative])) else: try: set = doc.defaultset(pattern.matchannotation.ANNOTATIONTYPE) items = list(word.select(pattern.matchannotation, set, True, [Original, Suggestion, Alternative] )) except KeyError: continue if len(items) == 1: value = items[0].cls else: continue if not pattern.casesensitive: value = value.lower() for i, buffer in enumerate(buffers): if match[i] is False: continue matchcursor = len(buffer) match[i] = (value == pattern.sequence[matchcursor] or pattern.sequence[matchcursor] is True or (isinstance(pattern.sequence[matchcursor], tuple) and value in pattern.sequence[matchcursor])) for buffer, matches in list(zip(buffers, match)): if matches: buffer.append(word) #add the word if len(buffer) == len(pattern.sequence): yield buffer[0].leftcontext(leftcontext) + buffer + buffer[-1].rightcontext(rightcontext) buffers.remove(buffer) else: buffers.remove(buffer) #remove buffer class Reader(object): """Streaming FoLiA reader. The reader allows you to read a FoLiA Document without holding the whole tree structure in memory. The document will be read and the elements you seek returned as they are found. If you are querying a corpus of large FoLiA documents for a specific structure, then it is strongly recommend to use the Reader rather than the standard Document!""" def __init__(self, filename, target, *args, **kwargs): """Read a FoLiA document in a streaming fashion. 
You select a specific target element and all occurrences of this element, including all contents (so all elements within), will be returned. Arguments: * ``filename``: The filename of the document to read * ``target``: The FoLiA element(s) you want to read (with everything contained in its scope). Passed as a class. For example: ``folia.Sentence``, or a tuple of multiple element classes. Can also be set to ``None`` to return all elements, but that would load the full tree structure into memory. """ self.target = target if not (isinstance(self.target, tuple) or isinstance(self.target, list) or issubclass(self.target, AbstractElement)): raise ValueError("Target must be subclass of FoLiA element") if 'bypassleak' in kwargs: self.bypassleak = False self.stream = io.open(filename,'rb') self.initdoc() def findwords(self, *args, **kwargs): self.target = Word for x in findwords(self.doc,self.__iter__,*args,**kwargs): yield x def initdoc(self): self.doc = None metadata = False for action, node in ElementTree.iterparse(self.stream, events=("start","end")): if action == "start" and node.tag == "{" + NSFOLIA + "}FoLiA": if '{http://www.w3.org/XML/1998/namespace}id' in node.attrib: id = node.attrib['{http://www.w3.org/XML/1998/namespace}id'] self.doc = Document(id=id) if 'version' in node.attrib: self.doc.version = node.attrib['version'] if action == "end" and node.tag == "{" + NSFOLIA + "}metadata": if not self.doc: raise MalformedXMLError("Metadata found, but no document? Impossible") metadata = True self.doc.parsemetadata(node) break if not self.doc: raise MalformedXMLError("No FoLiA Document found!") elif not metadata: raise MalformedXMLError("No metadata found!") self.stream.seek(0) def __iter__(self): """Iterating over a Reader instance will cause the FoLiA document to be read. This is a generator yielding instances of the object you specified""" if not isinstance(self.target, tuple) or isinstance(self.target,list): target = "{" + NSFOLIA + "}" + self.target.XMLTAG Class = self.target multitargets = False else: multitargets = True for action, node in ElementTree.iterparse(self.stream, events=("end",), tag=target): if not multitargets or (multitargets and node.tag.startswith('{' + NSFOLIA + '}')): if not multitargets: Class = XML2CLASS[node.tag[nslen:]] if not multitargets or (multitargets and Class in self.targets): element = Class.parsexml(node, self.doc) node.clear() #clean up children # Also eliminate now-empty references from the root node to # elem (http://www.ibm.com/developerworks/xml/library/x-hiperfparse/) #for ancestor in node.xpath('ancestor-or-self::*'): while node.getprevious() is not None: del node.getparent()[0] # clean up preceding siblings yield element def __del__(self): self.stream.close() def isncname(name): #not entirely according to specs http://www.w3.org/TR/REC-xml/#NT-Name , but simplified: for i, c in enumerate(name): if i == 0: if not c.isalpha(): raise ValueError('Invalid XML NCName identifier: ' + name + ' (at position ' + str(i+1)+')') else: if not c.isalnum() and not (c in ['-','_','.']): raise ValueError('Invalid XML NCName identifier: ' + name + ' (at position ' + str(i+1)+')') return True def validate(filename,schema=None,deep=False): if not os.path.exists(filename): raise IOError("No such file") try: try: doc = ElementTree.parse(filename, ElementTree.XMLParser(collect_ids=False) ) except TypeError: doc = ElementTree.parse(filename, ElementTree.XMLParser() ) #older lxml, may leak! 
except: raise MalformedXMLError("Malformed XML!") #See if there's inline IMDI and strip it off prior to validation (validator doesn't do IMDI) m = doc.xpath('//folia:metadata', namespaces={'f': 'http://ilk.uvt.nl/folia','folia': 'http://ilk.uvt.nl/folia' }) if m: metadata = m[0] m = metadata.find('{http://www.mpi.nl/IMDI/Schema/IMDI}METATRANSCRIPT') if m is not None: metadata.remove(m) if not schema: schema = ElementTree.RelaxNG(relaxng()) schema.assertValid(doc) #will raise exceptions if deep: doc = Document(tree=doc, deepvalidation=True) #================================= FOLIA SPECIFICATION ========================================================== #foliaspec:header #This file was last updated according to the FoLiA specification for version 1.4.0 on 2016-12-09 14:31:07, using foliaspec.py #Code blocks after a foliaspec comment (until the next newline) are automatically generated. **DO NOT EDIT THOSE** and **DO NOT REMOVE ANY FOLIASPEC COMMENTS** !!! #foliaspec:structurescope:STRUCTURESCOPE #Structure scope above the sentence level, used by next() and previous() methods STRUCTURESCOPE = (Sentence, Paragraph, Division, ListItem, Text, Event, Caption, Head,) #foliaspec:annotationtype_xml_map #A mapping from annotation types to xml tags (strings) ANNOTATIONTYPE2XML = { AnnotationType.ALIGNMENT: "alignment" , AnnotationType.CHUNKING: "chunk" , AnnotationType.COMPLEXALIGNMENT: "complexalignment" , AnnotationType.COREFERENCE: "coreferencechain" , AnnotationType.CORRECTION: "correction" , AnnotationType.DEFINITION: "def" , AnnotationType.DEPENDENCY: "dependency" , AnnotationType.DIVISION: "div" , AnnotationType.DOMAIN: "domain" , AnnotationType.ENTITY: "entity" , AnnotationType.ENTRY: "entry" , AnnotationType.ERRORDETECTION: "errordetection" , AnnotationType.EVENT: "event" , AnnotationType.EXAMPLE: "ex" , AnnotationType.FIGURE: "figure" , AnnotationType.GAP: "gap" , AnnotationType.LANG: "lang" , AnnotationType.LEMMA: "lemma" , AnnotationType.LINEBREAK: "br" , AnnotationType.LIST: "list" , AnnotationType.METRIC: "metric" , AnnotationType.MORPHOLOGICAL: "morpheme" , AnnotationType.NOTE: "note" , AnnotationType.OBSERVATION: "observation" , AnnotationType.PARAGRAPH: "p" , AnnotationType.PART: "part" , AnnotationType.PHON: "ph" , AnnotationType.PHONOLOGICAL: "phoneme" , AnnotationType.POS: "pos" , AnnotationType.PREDICATE: "predicate" , AnnotationType.SEMROLE: "semrole" , AnnotationType.SENSE: "sense" , AnnotationType.SENTENCE: "s" , AnnotationType.SENTIMENT: "sentiment" , AnnotationType.STATEMENT: "statement" , AnnotationType.STRING: "str" , AnnotationType.SUBJECTIVITY: "subjectivity" , AnnotationType.SYNTAX: "su" , AnnotationType.TABLE: "table" , AnnotationType.TERM: "term" , AnnotationType.TEXT: "t" , AnnotationType.STYLE: "t-style" , AnnotationType.TIMESEGMENT: "timesegment" , AnnotationType.UTTERANCE: "utt" , AnnotationType.WHITESPACE: "whitespace" , AnnotationType.TOKEN: "w" , } #foliaspec:string_class_map XML2CLASS = { "aref": AlignReference, "alignment": Alignment, "alt": Alternative, "altlayers": AlternativeLayers, "caption": Caption, "cell": Cell, "chunk": Chunk, "chunking": ChunkingLayer, "comment": Comment, "complexalignment": ComplexAlignment, "complexalignments": ComplexAlignmentLayer, "content": Content, "coreferencechain": CoreferenceChain, "coreferences": CoreferenceLayer, "coreferencelink": CoreferenceLink, "correction": Correction, "current": Current, "def": Definition, "dependencies": DependenciesLayer, "dependency": Dependency, "dep": DependencyDependent, "desc": Description, 
"div": Division, "domain": DomainAnnotation, "entities": EntitiesLayer, "entity": Entity, "entry": Entry, "errordetection": ErrorDetection, "event": Event, "ex": Example, "external": External, "feat": Feature, "figure": Figure, "foreign-data": ForeignData, "gap": Gap, "head": Head, "hd": Headspan, "label": Label, "lang": LangAnnotation, "lemma": LemmaAnnotation, "br": Linebreak, "list": List, "item": ListItem, "metric": Metric, "morpheme": Morpheme, "morphology": MorphologyLayer, "new": New, "note": Note, "observation": Observation, "observations": ObservationLayer, "original": Original, "p": Paragraph, "part": Part, "ph": PhonContent, "phoneme": Phoneme, "phonology": PhonologyLayer, "pos": PosAnnotation, "predicate": Predicate, "quote": Quote, "ref": Reference, "relation": Relation, "row": Row, "semrole": SemanticRole, "semroles": SemanticRolesLayer, "sense": SenseAnnotation, "s": Sentence, "sentiment": Sentiment, "sentiments": SentimentLayer, "source": Source, "speech": Speech, "statement": Statement, "statements": StatementLayer, "str": String, "subjectivity": SubjectivityAnnotation, "suggestion": Suggestion, "su": SyntacticUnit, "syntax": SyntaxLayer, "table": Table, "tablehead": TableHead, "target": Target, "term": Term, "text": Text, "t": TextContent, "t-correction": TextMarkupCorrection, "t-error": TextMarkupError, "t-gap": TextMarkupGap, "t-str": TextMarkupString, "t-style": TextMarkupStyle, "timesegment": TimeSegment, "timing": TimingLayer, "utt": Utterance, "whitespace": Whitespace, "w": Word, "wref": WordReference, } XML2CLASS['listitem'] = ListItem #backward compatibility for erroneous old FoLiA versions (XML tag is 'item' now, consistent with manual) #foliaspec:annotationtype_layerclass_map ANNOTATIONTYPE2LAYERCLASS = { AnnotationType.CHUNKING: ChunkingLayer , AnnotationType.COMPLEXALIGNMENT: ComplexAlignmentLayer , AnnotationType.COREFERENCE: CoreferenceLayer , AnnotationType.DEPENDENCY: DependenciesLayer , AnnotationType.ENTITY: EntitiesLayer , AnnotationType.MORPHOLOGICAL: MorphologyLayer , AnnotationType.OBSERVATION: ObservationLayer , AnnotationType.PHONOLOGICAL: PhonologyLayer , AnnotationType.SEMROLE: SemanticRolesLayer , AnnotationType.SENTIMENT: SentimentLayer , AnnotationType.STATEMENT: StatementLayer , AnnotationType.SYNTAX: SyntaxLayer , AnnotationType.TIMESEGMENT: TimingLayer , AnnotationType.PREDICATE: SemanticRolesLayer } #foliaspec:default_ignore #Default ignore list for the select() method, do not descend into these default_ignore = ( Original, Suggestion, Alternative, AlternativeLayers, ForeignData,) #foliaspec:default_ignore_annotations #Default ignore list for token annotation default_ignore_annotations = ( Original, Suggestion, Alternative, AlternativeLayers, MorphologyLayer, PhonologyLayer,) #foliaspec:default_ignore_structure #Default ignore list for structure annotation default_ignore_structure = ( Original, Suggestion, Alternative, AlternativeLayers, AbstractAnnotationLayer,) #foliaspec:defaultproperties #Default properties which all elements inherit AbstractElement.ACCEPTED_DATA = (Description, Comment,) AbstractElement.ANNOTATIONTYPE = None AbstractElement.AUTH = True AbstractElement.AUTO_GENERATE_ID = False AbstractElement.OCCURRENCES = 0 AbstractElement.OCCURRENCES_PER_SET = 0 AbstractElement.OPTIONAL_ATTRIBS = None AbstractElement.PHONCONTAINER = False AbstractElement.PRIMARYELEMENT = True AbstractElement.PRINTABLE = False AbstractElement.REQUIRED_ATTRIBS = None AbstractElement.REQUIRED_DATA = None AbstractElement.SETONLY = False 
AbstractElement.SPEAKABLE = False AbstractElement.SUBSET = None AbstractElement.TEXTCONTAINER = False AbstractElement.TEXTDELIMITER = None AbstractElement.XLINK = False AbstractElement.XMLTAG = None #foliaspec:setelementproperties #Sets all element properties for all elements #------ AbstractAnnotationLayer ------- AbstractAnnotationLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData,) AbstractAnnotationLayer.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N,) AbstractAnnotationLayer.PRINTABLE = False AbstractAnnotationLayer.SETONLY = True AbstractAnnotationLayer.SPEAKABLE = False #------ AbstractCorrectionChild ------- AbstractCorrectionChild.ACCEPTED_DATA = (AbstractSpanAnnotation, AbstractStructureElement, AbstractTokenAnnotation, Comment, Correction, Description, ForeignData, Metric, PhonContent, String, TextContent,) AbstractCorrectionChild.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N,) AbstractCorrectionChild.PRINTABLE = True AbstractCorrectionChild.SPEAKABLE = True AbstractCorrectionChild.TEXTDELIMITER = None #------ AbstractExtendedTokenAnnotation ------- #------ AbstractSpanAnnotation ------- AbstractSpanAnnotation.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, ForeignData, Metric,) AbstractSpanAnnotation.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) AbstractSpanAnnotation.PRINTABLE = True AbstractSpanAnnotation.SPEAKABLE = True #------ AbstractSpanRole ------- AbstractSpanRole.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,) AbstractSpanRole.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.N, Attrib.DATETIME,) #------ AbstractStructureElement ------- AbstractStructureElement.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part,) AbstractStructureElement.AUTO_GENERATE_ID = True AbstractStructureElement.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) AbstractStructureElement.PRINTABLE = True AbstractStructureElement.REQUIRED_ATTRIBS = None AbstractStructureElement.SPEAKABLE = True AbstractStructureElement.TEXTDELIMITER = "\n\n" #------ AbstractTextMarkup ------- AbstractTextMarkup.ACCEPTED_DATA = (AbstractTextMarkup, Comment, Description,) AbstractTextMarkup.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) AbstractTextMarkup.PRIMARYELEMENT = False AbstractTextMarkup.PRINTABLE = True AbstractTextMarkup.TEXTCONTAINER = True AbstractTextMarkup.TEXTDELIMITER = "" AbstractTextMarkup.XLINK = True #------ AbstractTokenAnnotation ------- AbstractTokenAnnotation.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, Metric,) AbstractTokenAnnotation.OCCURRENCES_PER_SET = 1 AbstractTokenAnnotation.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) AbstractTokenAnnotation.REQUIRED_ATTRIBS = (Attrib.CLASS,) #------ ActorFeature ------- ActorFeature.SUBSET = "actor" ActorFeature.XMLTAG = None #------ 
AlignReference ------- AlignReference.XMLTAG = "aref" #------ Alignment ------- Alignment.ACCEPTED_DATA = (AlignReference, Comment, Description, Feature, ForeignData, Metric,) Alignment.ANNOTATIONTYPE = AnnotationType.ALIGNMENT Alignment.LABEL = "Alignment" Alignment.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) Alignment.PRINTABLE = False Alignment.REQUIRED_ATTRIBS = None Alignment.SPEAKABLE = False Alignment.XLINK = True Alignment.XMLTAG = "alignment" #------ Alternative ------- Alternative.ACCEPTED_DATA = (AbstractTokenAnnotation, Comment, Correction, Description, ForeignData, MorphologyLayer, PhonologyLayer,) Alternative.AUTH = False Alternative.LABEL = "Alternative" Alternative.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) Alternative.PRINTABLE = False Alternative.REQUIRED_ATTRIBS = None Alternative.SPEAKABLE = False Alternative.XMLTAG = "alt" #------ AlternativeLayers ------- AlternativeLayers.ACCEPTED_DATA = (AbstractAnnotationLayer, Comment, Description, ForeignData,) AlternativeLayers.AUTH = False AlternativeLayers.LABEL = "Alternative Layers" AlternativeLayers.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) AlternativeLayers.PRINTABLE = False AlternativeLayers.REQUIRED_ATTRIBS = None AlternativeLayers.SPEAKABLE = False AlternativeLayers.XMLTAG = "altlayers" #------ BegindatetimeFeature ------- BegindatetimeFeature.SUBSET = "begindatetime" BegindatetimeFeature.XMLTAG = None #------ Caption ------- Caption.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Gap, Linebreak, Metric, Part, PhonContent, Reference, Sentence, String, TextContent, Whitespace,) Caption.LABEL = "Caption" Caption.OCCURRENCES = 1 Caption.XMLTAG = "caption" #------ Cell ------- Cell.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Entry, Event, Example, Feature, ForeignData, Gap, Head, Linebreak, Metric, Note, Paragraph, Part, Reference, Sentence, String, TextContent, Whitespace, Word,) Cell.LABEL = "Cell" Cell.TEXTDELIMITER = " | " Cell.XMLTAG = "cell" #------ Chunk ------- Chunk.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,) Chunk.ANNOTATIONTYPE = AnnotationType.CHUNKING Chunk.LABEL = "Chunk" Chunk.XMLTAG = "chunk" #------ ChunkingLayer ------- ChunkingLayer.ACCEPTED_DATA = (Chunk, Comment, Correction, Description, ForeignData,) ChunkingLayer.ANNOTATIONTYPE = AnnotationType.CHUNKING ChunkingLayer.PRIMARYELEMENT = False ChunkingLayer.XMLTAG = "chunking" #------ Comment ------- Comment.LABEL = "Comment" Comment.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N,) Comment.XMLTAG = "comment" #------ ComplexAlignment ------- ComplexAlignment.ACCEPTED_DATA = (Alignment, Comment, Description, Feature, ForeignData, Metric,) ComplexAlignment.ANNOTATIONTYPE = AnnotationType.COMPLEXALIGNMENT ComplexAlignment.LABEL = "Complex Alignment" ComplexAlignment.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, 
Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) ComplexAlignment.PRINTABLE = False ComplexAlignment.REQUIRED_ATTRIBS = None ComplexAlignment.SPEAKABLE = False ComplexAlignment.XMLTAG = "complexalignment" #------ ComplexAlignmentLayer ------- ComplexAlignmentLayer.ACCEPTED_DATA = (Comment, ComplexAlignment, Correction, Description, ForeignData,) ComplexAlignmentLayer.ANNOTATIONTYPE = AnnotationType.COMPLEXALIGNMENT ComplexAlignmentLayer.PRIMARYELEMENT = False ComplexAlignmentLayer.XMLTAG = "complexalignments" #------ Content ------- Content.LABEL = "Gap Content" Content.OCCURRENCES = 1 Content.XMLTAG = "content" #------ CoreferenceChain ------- CoreferenceChain.ACCEPTED_DATA = (AlignReference, Alignment, Comment, CoreferenceLink, Description, Feature, ForeignData, Metric,) CoreferenceChain.ANNOTATIONTYPE = AnnotationType.COREFERENCE CoreferenceChain.LABEL = "Coreference Chain" CoreferenceChain.REQUIRED_DATA = (CoreferenceLink,) CoreferenceChain.XMLTAG = "coreferencechain" #------ CoreferenceLayer ------- CoreferenceLayer.ACCEPTED_DATA = (Comment, CoreferenceChain, Correction, Description, ForeignData,) CoreferenceLayer.ANNOTATIONTYPE = AnnotationType.COREFERENCE CoreferenceLayer.PRIMARYELEMENT = False CoreferenceLayer.XMLTAG = "coreferences" #------ CoreferenceLink ------- CoreferenceLink.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, LevelFeature, Metric, ModalityFeature, TimeFeature, WordReference,) CoreferenceLink.ANNOTATIONTYPE = AnnotationType.COREFERENCE CoreferenceLink.LABEL = "Coreference Link" CoreferenceLink.PRIMARYELEMENT = False CoreferenceLink.XMLTAG = "coreferencelink" #------ Correction ------- Correction.ACCEPTED_DATA = (Comment, Current, Description, ErrorDetection, Feature, ForeignData, Metric, New, Original, Suggestion,) Correction.ANNOTATIONTYPE = AnnotationType.CORRECTION Correction.LABEL = "Correction" Correction.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) Correction.PRINTABLE = True Correction.SPEAKABLE = True Correction.TEXTDELIMITER = None Correction.XMLTAG = "correction" #------ Current ------- Current.OCCURRENCES = 1 Current.OPTIONAL_ATTRIBS = None Current.XMLTAG = "current" #------ Definition ------- Definition.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, Figure, ForeignData, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Word,) Definition.ANNOTATIONTYPE = AnnotationType.DEFINITION Definition.LABEL = "Definition" Definition.XMLTAG = "def" #------ DependenciesLayer ------- DependenciesLayer.ACCEPTED_DATA = (Comment, Correction, Dependency, Description, ForeignData,) DependenciesLayer.ANNOTATIONTYPE = AnnotationType.DEPENDENCY DependenciesLayer.PRIMARYELEMENT = False DependenciesLayer.XMLTAG = "dependencies" #------ Dependency ------- Dependency.ACCEPTED_DATA = (AlignReference, Alignment, Comment, DependencyDependent, Description, Feature, ForeignData, Headspan, Metric,) Dependency.ANNOTATIONTYPE = AnnotationType.DEPENDENCY Dependency.LABEL = "Dependency" Dependency.REQUIRED_DATA = (DependencyDependent, Headspan,) Dependency.XMLTAG = "dependency" #------ DependencyDependent ------- DependencyDependent.LABEL = "Dependent" DependencyDependent.OCCURRENCES = 1 
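#NOTE (illustrative comment, not part of the generated foliaspec specification):
#REQUIRED_DATA, as in Dependency.REQUIRED_DATA = (DependencyDependent, Headspan,)
#above, lists child elements that must be present for the element to be valid.
#A dependency relation is therefore built with both a head span and a dependent
#span, roughly as follows (hypothetical document ``doc`` and word instances):
#
#   layer = sentence.append(DependenciesLayer)
#   layer.append(Dependency,
#                Headspan(doc, headword),
#                DependencyDependent(doc, dependentword),
#                cls="nsubj")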
DependencyDependent.XMLTAG = "dep" #------ Description ------- Description.LABEL = "Description" Description.OCCURRENCES = 1 Description.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N,) Description.XMLTAG = "desc" #------ Division ------- Division.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Entry, Event, Example, Feature, Figure, ForeignData, Gap, Head, Linebreak, List, Metric, Note, Paragraph, Part, PhonContent, Quote, Reference, Sentence, Table, TextContent, Utterance, Whitespace,) Division.ANNOTATIONTYPE = AnnotationType.DIVISION Division.LABEL = "Division" Division.TEXTDELIMITER = "\n\n\n" Division.XMLTAG = "div" #------ DomainAnnotation ------- DomainAnnotation.ANNOTATIONTYPE = AnnotationType.DOMAIN DomainAnnotation.LABEL = "Domain" DomainAnnotation.OCCURRENCES_PER_SET = 0 DomainAnnotation.XMLTAG = "domain" #------ EnddatetimeFeature ------- EnddatetimeFeature.SUBSET = "enddatetime" EnddatetimeFeature.XMLTAG = None #------ EntitiesLayer ------- EntitiesLayer.ACCEPTED_DATA = (Comment, Correction, Description, Entity, ForeignData,) EntitiesLayer.ANNOTATIONTYPE = AnnotationType.ENTITY EntitiesLayer.PRIMARYELEMENT = False EntitiesLayer.XMLTAG = "entities" #------ Entity ------- Entity.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,) Entity.ANNOTATIONTYPE = AnnotationType.ENTITY Entity.LABEL = "Entity" Entity.XMLTAG = "entity" #------ Entry ------- Entry.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Definition, Description, Example, Feature, ForeignData, Metric, Part, Term,) Entry.ANNOTATIONTYPE = AnnotationType.ENTRY Entry.LABEL = "Entry" Entry.XMLTAG = "entry" #------ ErrorDetection ------- ErrorDetection.ANNOTATIONTYPE = AnnotationType.ERRORDETECTION ErrorDetection.LABEL = "Error Detection" ErrorDetection.OCCURRENCES_PER_SET = 0 ErrorDetection.XMLTAG = "errordetection" #------ Event ------- Event.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, ActorFeature, Alignment, Alternative, AlternativeLayers, BegindatetimeFeature, Comment, Correction, Description, Division, EnddatetimeFeature, Event, Example, Feature, Figure, ForeignData, Head, Linebreak, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,) Event.ANNOTATIONTYPE = AnnotationType.EVENT Event.LABEL = "Event" Event.XMLTAG = "event" #------ Example ------- Example.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, Figure, ForeignData, Linebreak, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,) Example.ANNOTATIONTYPE = AnnotationType.EXAMPLE Example.LABEL = "Example" Example.XMLTAG = "ex" #------ External ------- External.ACCEPTED_DATA = (Comment, Description,) External.AUTH = True External.LABEL = "External" External.OPTIONAL_ATTRIBS = None External.PRINTABLE = True External.REQUIRED_ATTRIBS = (Attrib.SRC,) External.SPEAKABLE = False External.XMLTAG = "external" #------ Feature ------- Feature.LABEL = "Feature" Feature.XMLTAG = "feat" #------ Figure ------- Figure.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Caption, Comment, Correction, 
Description, Feature, ForeignData, Metric, Part, Sentence, String, TextContent,) Figure.ANNOTATIONTYPE = AnnotationType.FIGURE Figure.LABEL = "Figure" Figure.SPEAKABLE = False Figure.TEXTDELIMITER = "\n\n" Figure.XMLTAG = "figure" #------ ForeignData ------- ForeignData.XMLTAG = "foreign-data" #------ FunctionFeature ------- FunctionFeature.SUBSET = "function" FunctionFeature.XMLTAG = None #------ Gap ------- Gap.ACCEPTED_DATA = (Comment, Content, Description, Feature, ForeignData, Metric, Part,) Gap.ANNOTATIONTYPE = AnnotationType.GAP Gap.LABEL = "Gap" Gap.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME,) Gap.XMLTAG = "gap" #------ Head ------- Head.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Event, Feature, ForeignData, Gap, Linebreak, Metric, Part, PhonContent, Reference, Sentence, String, TextContent, Whitespace, Word,) Head.LABEL = "Head" Head.OCCURRENCES = 1 Head.TEXTDELIMITER = "\n\n" Head.XMLTAG = "head" #------ HeadFeature ------- HeadFeature.SUBSET = "head" HeadFeature.XMLTAG = None #------ Headspan ------- Headspan.LABEL = "Head" Headspan.OCCURRENCES = 1 Headspan.XMLTAG = "hd" #------ Label ------- Label.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, PhonContent, Reference, String, TextContent, Word,) Label.LABEL = "Label" Label.XMLTAG = "label" #------ LangAnnotation ------- LangAnnotation.ANNOTATIONTYPE = AnnotationType.LANG LangAnnotation.LABEL = "Language" LangAnnotation.XMLTAG = "lang" #------ LemmaAnnotation ------- LemmaAnnotation.ANNOTATIONTYPE = AnnotationType.LEMMA LemmaAnnotation.LABEL = "Lemma" LemmaAnnotation.XMLTAG = "lemma" #------ LevelFeature ------- LevelFeature.SUBSET = "level" LevelFeature.XMLTAG = None #------ Linebreak ------- Linebreak.ANNOTATIONTYPE = AnnotationType.LINEBREAK Linebreak.LABEL = "Linebreak" Linebreak.TEXTDELIMITER = "" Linebreak.XMLTAG = "br" #------ List ------- List.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Caption, Comment, Correction, Description, Event, Feature, ForeignData, ListItem, Metric, Note, Part, PhonContent, Reference, String, TextContent,) List.ANNOTATIONTYPE = AnnotationType.LIST List.LABEL = "List" List.TEXTDELIMITER = "\n\n" List.XMLTAG = "list" #------ ListItem ------- ListItem.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Event, Feature, ForeignData, Gap, Label, Linebreak, List, Metric, Note, Part, PhonContent, Reference, Sentence, String, TextContent, Whitespace,) ListItem.LABEL = "List Item" ListItem.TEXTDELIMITER = "\n" ListItem.XMLTAG = "item" #------ Metric ------- Metric.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, ValueFeature,) Metric.ANNOTATIONTYPE = AnnotationType.METRIC Metric.LABEL = "Metric" Metric.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.N, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME, Attrib.SPEAKER,) Metric.XMLTAG = "metric" #------ ModalityFeature ------- ModalityFeature.SUBSET = "modality" ModalityFeature.XMLTAG = None #------ Morpheme ------- Morpheme.ACCEPTED_DATA = (AbstractAnnotationLayer, 
AbstractTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, FunctionFeature, Metric, Morpheme, Part, PhonContent, String, TextContent,) Morpheme.ANNOTATIONTYPE = AnnotationType.MORPHOLOGICAL Morpheme.LABEL = "Morpheme" Morpheme.TEXTDELIMITER = "" Morpheme.XMLTAG = "morpheme" #------ MorphologyLayer ------- MorphologyLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Morpheme,) MorphologyLayer.ANNOTATIONTYPE = AnnotationType.MORPHOLOGICAL MorphologyLayer.PRIMARYELEMENT = False MorphologyLayer.XMLTAG = "morphology" #------ New ------- New.OCCURRENCES = 1 New.OPTIONAL_ATTRIBS = None New.XMLTAG = "new" #------ Note ------- Note.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Example, Feature, Figure, ForeignData, Head, Linebreak, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Whitespace, Word,) Note.ANNOTATIONTYPE = AnnotationType.NOTE Note.LABEL = "Note" Note.XMLTAG = "note" #------ Observation ------- Observation.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, WordReference,) Observation.ANNOTATIONTYPE = AnnotationType.OBSERVATION Observation.LABEL = "Observation" Observation.XMLTAG = "observation" #------ ObservationLayer ------- ObservationLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Observation,) ObservationLayer.ANNOTATIONTYPE = AnnotationType.OBSERVATION ObservationLayer.PRIMARYELEMENT = False ObservationLayer.XMLTAG = "observations" #------ Original ------- Original.AUTH = False Original.OCCURRENCES = 1 Original.OPTIONAL_ATTRIBS = None Original.XMLTAG = "original" #------ Paragraph ------- Paragraph.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Entry, Event, Example, Feature, Figure, ForeignData, Gap, Head, Linebreak, List, Metric, Note, Part, PhonContent, Quote, Reference, Sentence, String, TextContent, Whitespace, Word,) Paragraph.ANNOTATIONTYPE = AnnotationType.PARAGRAPH Paragraph.LABEL = "Paragraph" Paragraph.TEXTDELIMITER = "\n\n" Paragraph.XMLTAG = "p" #------ Part ------- Part.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, AbstractStructureElement, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part,) Part.ANNOTATIONTYPE = AnnotationType.PART Part.LABEL = "Part" Part.TEXTDELIMITER = None Part.XMLTAG = "part" #------ PhonContent ------- PhonContent.ACCEPTED_DATA = (Comment, Description,) PhonContent.ANNOTATIONTYPE = AnnotationType.PHON PhonContent.LABEL = "Phonetic Content" PhonContent.OCCURRENCES = 0 PhonContent.OPTIONAL_ATTRIBS = (Attrib.CLASS, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME,) PhonContent.PHONCONTAINER = True PhonContent.PRINTABLE = False PhonContent.SPEAKABLE = True PhonContent.XMLTAG = "ph" #------ Phoneme ------- Phoneme.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, FunctionFeature, Metric, Part, PhonContent, Phoneme, String, TextContent,) Phoneme.ANNOTATIONTYPE = AnnotationType.PHONOLOGICAL Phoneme.LABEL = "Phoneme" Phoneme.TEXTDELIMITER = "" Phoneme.XMLTAG = "phoneme" #------ PhonologyLayer ------- PhonologyLayer.ACCEPTED_DATA = 
(Comment, Correction, Description, ForeignData, Phoneme,) PhonologyLayer.ANNOTATIONTYPE = AnnotationType.PHONOLOGICAL PhonologyLayer.PRIMARYELEMENT = False PhonologyLayer.XMLTAG = "phonology" #------ PolarityFeature ------- PolarityFeature.SUBSET = "polarity" PolarityFeature.XMLTAG = None #------ PosAnnotation ------- PosAnnotation.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, HeadFeature, Metric,) PosAnnotation.ANNOTATIONTYPE = AnnotationType.POS PosAnnotation.LABEL = "Part-of-Speech" PosAnnotation.XMLTAG = "pos" #------ Predicate ------- Predicate.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, SemanticRole, WordReference,) Predicate.ANNOTATIONTYPE = AnnotationType.PREDICATE Predicate.LABEL = "Predicate" Predicate.XMLTAG = "predicate" #------ Quote ------- Quote.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Feature, ForeignData, Gap, Metric, Paragraph, Part, Quote, Sentence, String, TextContent, Utterance, Word,) Quote.LABEL = "Quote" Quote.XMLTAG = "quote" #------ Reference ------- Reference.ACCEPTED_DATA = (AbstractAnnotationLayer, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Paragraph, Part, PhonContent, Quote, Sentence, String, TextContent, Utterance, Word,) Reference.LABEL = "Reference" Reference.TEXTDELIMITER = None Reference.XMLTAG = "ref" #------ Relation ------- Relation.LABEL = "Relation" Relation.OCCURRENCES = 1 Relation.XMLTAG = "relation" #------ Row ------- Row.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Cell, Comment, Correction, Description, Feature, ForeignData, Metric, Part,) Row.LABEL = "Table Row" Row.TEXTDELIMITER = "\n" Row.XMLTAG = "row" #------ SemanticRole ------- SemanticRole.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, Metric, WordReference,) SemanticRole.ANNOTATIONTYPE = AnnotationType.SEMROLE SemanticRole.LABEL = "Semantic Role" SemanticRole.REQUIRED_ATTRIBS = (Attrib.CLASS,) SemanticRole.XMLTAG = "semrole" #------ SemanticRolesLayer ------- SemanticRolesLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Predicate, SemanticRole,) SemanticRolesLayer.ANNOTATIONTYPE = AnnotationType.SEMROLE SemanticRolesLayer.PRIMARYELEMENT = False SemanticRolesLayer.XMLTAG = "semroles" #------ SenseAnnotation ------- SenseAnnotation.ACCEPTED_DATA = (Comment, Description, Feature, ForeignData, Metric, SynsetFeature,) SenseAnnotation.ANNOTATIONTYPE = AnnotationType.SENSE SenseAnnotation.LABEL = "Semantic Sense" SenseAnnotation.OCCURRENCES_PER_SET = 0 SenseAnnotation.XMLTAG = "sense" #------ Sentence ------- Sentence.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Entry, Event, Example, Feature, ForeignData, Gap, Linebreak, Metric, Note, Part, PhonContent, Quote, Reference, String, TextContent, Whitespace, Word,) Sentence.ANNOTATIONTYPE = AnnotationType.SENTENCE Sentence.LABEL = "Sentence" Sentence.TEXTDELIMITER = " " Sentence.XMLTAG = "s" #------ Sentiment ------- Sentiment.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, Metric, PolarityFeature, Source, StrengthFeature, Target, WordReference,) Sentiment.ANNOTATIONTYPE = AnnotationType.SENTIMENT Sentiment.LABEL = "Sentiment" 
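#NOTE (illustrative comment, not part of the generated foliaspec specification):
#TEXTDELIMITER is the string placed between the text of sibling elements when
#text() is composed from children. With the values in this specification, words
#within a sentence are joined with a space (Word.TEXTDELIMITER = " "), sentences
#within a paragraph likewise (Sentence.TEXTDELIMITER = " "), and paragraphs are
#separated by a blank line (Paragraph.TEXTDELIMITER = "\n\n"). So for a
#hypothetical two-sentence paragraph:
#
#   paragraph.text()   # -> "This is sentence one. This is sentence two."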
Sentiment.XMLTAG = "sentiment" #------ SentimentLayer ------- SentimentLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Sentiment,) SentimentLayer.ANNOTATIONTYPE = AnnotationType.SENTIMENT SentimentLayer.PRIMARYELEMENT = False SentimentLayer.XMLTAG = "sentiments" #------ Source ------- Source.LABEL = "Source" Source.OCCURRENCES = 1 Source.XMLTAG = "source" #------ Speech ------- Speech.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Entry, Event, Example, External, Feature, ForeignData, Gap, List, Metric, Note, Paragraph, Part, PhonContent, Quote, Reference, Sentence, String, TextContent, Utterance, Word,) Speech.LABEL = "Speech Body" Speech.TEXTDELIMITER = "\n\n\n" Speech.XMLTAG = "speech" #------ Statement ------- Statement.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Headspan, Metric, Relation, Source, WordReference,) Statement.ANNOTATIONTYPE = AnnotationType.STATEMENT Statement.LABEL = "Statement" Statement.XMLTAG = "statement" #------ StatementLayer ------- StatementLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, Statement,) StatementLayer.ANNOTATIONTYPE = AnnotationType.STATEMENT StatementLayer.PRIMARYELEMENT = False StatementLayer.XMLTAG = "statements" #------ StrengthFeature ------- StrengthFeature.SUBSET = "strength" StrengthFeature.XMLTAG = None #------ String ------- String.ACCEPTED_DATA = (AbstractExtendedTokenAnnotation, Alignment, Comment, Correction, Description, Feature, ForeignData, Metric, PhonContent, TextContent,) String.ANNOTATIONTYPE = AnnotationType.STRING String.LABEL = "String" String.OCCURRENCES = 0 String.OPTIONAL_ATTRIBS = (Attrib.ID, Attrib.CLASS, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME, Attrib.N, Attrib.SRC, Attrib.BEGINTIME, Attrib.ENDTIME,) String.PRINTABLE = True String.XMLTAG = "str" #------ StyleFeature ------- StyleFeature.SUBSET = "style" StyleFeature.XMLTAG = None #------ SubjectivityAnnotation ------- SubjectivityAnnotation.ANNOTATIONTYPE = AnnotationType.SUBJECTIVITY SubjectivityAnnotation.LABEL = "Subjectivity/Sentiment" SubjectivityAnnotation.XMLTAG = "subjectivity" #------ Suggestion ------- Suggestion.AUTH = False Suggestion.OCCURRENCES = 0 Suggestion.XMLTAG = "suggestion" #------ SynsetFeature ------- SynsetFeature.SUBSET = "synset" SynsetFeature.XMLTAG = None #------ SyntacticUnit ------- SyntacticUnit.ACCEPTED_DATA = (AlignReference, Alignment, Comment, Description, Feature, ForeignData, Metric, SyntacticUnit, WordReference,) SyntacticUnit.ANNOTATIONTYPE = AnnotationType.SYNTAX SyntacticUnit.LABEL = "Syntactic Unit" SyntacticUnit.XMLTAG = "su" #------ SyntaxLayer ------- SyntaxLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, SyntacticUnit,) SyntaxLayer.ANNOTATIONTYPE = AnnotationType.SYNTAX SyntaxLayer.PRIMARYELEMENT = False SyntaxLayer.XMLTAG = "syntax" #------ Table ------- Table.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, Row, TableHead,) Table.ANNOTATIONTYPE = AnnotationType.TABLE Table.LABEL = "Table" Table.XMLTAG = "table" #------ TableHead ------- TableHead.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, Row,) TableHead.LABEL = 
"Table Header" TableHead.XMLTAG = "tablehead" #------ Target ------- Target.LABEL = "Target" Target.OCCURRENCES = 1 Target.XMLTAG = "target" #------ Term ------- Term.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Event, Feature, Figure, ForeignData, Gap, List, Metric, Paragraph, Part, PhonContent, Reference, Sentence, String, Table, TextContent, Utterance, Word,) Term.ANNOTATIONTYPE = AnnotationType.TERM Term.LABEL = "Term" Term.XMLTAG = "term" #------ Text ------- Text.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Division, Entry, Event, Example, External, Feature, Figure, ForeignData, Gap, List, Metric, Note, Paragraph, Part, PhonContent, Quote, Reference, Sentence, String, Table, TextContent, Word,) Text.LABEL = "Text Body" Text.TEXTDELIMITER = "\n\n\n" Text.XMLTAG = "text" #------ TextContent ------- TextContent.ACCEPTED_DATA = (AbstractTextMarkup, Comment, Description, Linebreak,) TextContent.ANNOTATIONTYPE = AnnotationType.TEXT TextContent.LABEL = "Text" TextContent.OCCURRENCES = 0 TextContent.OPTIONAL_ATTRIBS = (Attrib.CLASS, Attrib.ANNOTATOR, Attrib.CONFIDENCE, Attrib.DATETIME,) TextContent.PRINTABLE = True TextContent.SPEAKABLE = False TextContent.TEXTCONTAINER = True TextContent.XLINK = True TextContent.XMLTAG = "t" #------ TextMarkupCorrection ------- TextMarkupCorrection.ANNOTATIONTYPE = AnnotationType.CORRECTION TextMarkupCorrection.PRIMARYELEMENT = False TextMarkupCorrection.XMLTAG = "t-correction" #------ TextMarkupError ------- TextMarkupError.ANNOTATIONTYPE = AnnotationType.ERRORDETECTION TextMarkupError.PRIMARYELEMENT = False TextMarkupError.XMLTAG = "t-error" #------ TextMarkupGap ------- TextMarkupGap.ANNOTATIONTYPE = AnnotationType.GAP TextMarkupGap.PRIMARYELEMENT = False TextMarkupGap.XMLTAG = "t-gap" #------ TextMarkupString ------- TextMarkupString.ANNOTATIONTYPE = AnnotationType.STRING TextMarkupString.PRIMARYELEMENT = False TextMarkupString.XMLTAG = "t-str" #------ TextMarkupStyle ------- TextMarkupStyle.ANNOTATIONTYPE = AnnotationType.STYLE TextMarkupStyle.PRIMARYELEMENT = True TextMarkupStyle.XMLTAG = "t-style" #------ TimeFeature ------- TimeFeature.SUBSET = "time" TimeFeature.XMLTAG = None #------ TimeSegment ------- TimeSegment.ACCEPTED_DATA = (ActorFeature, AlignReference, Alignment, BegindatetimeFeature, Comment, Description, EnddatetimeFeature, Feature, ForeignData, Metric, WordReference,) TimeSegment.ANNOTATIONTYPE = AnnotationType.TIMESEGMENT TimeSegment.LABEL = "Time Segment" TimeSegment.XMLTAG = "timesegment" #------ TimingLayer ------- TimingLayer.ACCEPTED_DATA = (Comment, Correction, Description, ForeignData, TimeSegment,) TimingLayer.ANNOTATIONTYPE = AnnotationType.TIMESEGMENT TimingLayer.PRIMARYELEMENT = False TimingLayer.XMLTAG = "timing" #------ Utterance ------- Utterance.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractExtendedTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Gap, Metric, Note, Part, PhonContent, Quote, Reference, Sentence, String, TextContent, Word,) Utterance.ANNOTATIONTYPE = AnnotationType.UTTERANCE Utterance.LABEL = "Utterance" Utterance.TEXTDELIMITER = " " Utterance.XMLTAG = "utt" #------ ValueFeature ------- ValueFeature.SUBSET = "value" ValueFeature.XMLTAG = None #------ Whitespace ------- Whitespace.ANNOTATIONTYPE = AnnotationType.WHITESPACE 
Whitespace.LABEL = "Whitespace" Whitespace.TEXTDELIMITER = "" Whitespace.XMLTAG = "whitespace" #------ Word ------- Word.ACCEPTED_DATA = (AbstractAnnotationLayer, AbstractTokenAnnotation, Alignment, Alternative, AlternativeLayers, Comment, Correction, Description, Feature, ForeignData, Metric, Part, PhonContent, Reference, String, TextContent,) Word.ANNOTATIONTYPE = AnnotationType.TOKEN Word.LABEL = "Word/Token" Word.TEXTDELIMITER = " " Word.XMLTAG = "w" #------ WordReference ------- WordReference.XMLTAG = "wref" #EOF PyNLPl-1.1.2/pynlpl/formats/cql.py0000644000175000001440000002361512527312166017637 0ustar proyconusers00000000000000#--------------------------------------------------------------- # PyNLPl - Corpus Query Language (CQL) # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://proycon.github.com/folia # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Parser and interpreter for a basic subset of the Corpus Query Language # # Licensed under GPLv3 # #---------------------------------------------------------------- from __future__ import print_function, unicode_literals, division, absolute_import from pynlpl.fsa import State, NFA import re import sys OPERATORS = ('=','!=') MAXINTERVAL = 99 class SyntaxError(Exception): pass class ValueExpression(object): def __init__(self, values): self.values = values #disjunction @staticmethod def parse(s,i): values = "" assert s[i] == '"' i += 1 while not (s[i] == '"' and s[i-1] != "\\"): values += s[i] i += 1 values = values.split("|") return ValueExpression(values), i+1 def __len__(self): return len(self.values) def __iter__(self): for x in self.values: yield x def __getitem__(self,index): return self.values[index] class AttributeExpression(object): def __init__(self, attribute, operator, valueexpression): self.attribute = attribute self.operator = operator self.valueexpr = valueexpression @staticmethod def parse(s,i): while s[i] == " ": i +=1 if s[i] == '"': #no attribute and no operator, use defaults: attribute = "word" operator = "=" else: attribute = "" while s[i] not in (' ','!','>','<','='): attribute += s[i] i += 1 if not attribute: raise SyntaxError("Expected attribute name, none found") operator = "" while s[i] in (' ','!','>','<','='): if s[i] != ' ': operator += s[i] i += 1 if operator not in OPERATORS: raise SyntaxError("Expected operator, got '" + operator + "'") if s[i] != '"': raise SyntaxError("Expected start of value expression (doublequote) in position " + str(i) + ", got " + s[i]) valueexpr, i = ValueExpression.parse(s,i) return AttributeExpression(attribute,operator, valueexpr), i class TokenExpression(object): def __init__(self, attribexprs=[], interval=None): self.attribexprs = attribexprs self.interval = interval @staticmethod def parse(s,i): attribexprs = [] while s[i] == " ": i +=1 if s[i] == '"': attribexpr,i = AttributeExpression.parse(s,i) attribexprs.append(attribexpr) elif s[i] == "[": i += 1 while True: while s[i] == " ": i +=1 if s[i] == "&": attribexpr,i = AttributeExpression.parse(s,i+1) attribexprs.append(attribexpr) elif s[i] == "]": i += 1 break elif not attribexprs: attribexpr,i = AttributeExpression.parse(s,i) attribexprs.append(attribexpr) else: raise SyntaxError("Unexpected char whilst parsing token expression, position " + str(i) + ": " + s[i]) else: raise SyntaxError("Expected token expression starting with either \" or [, got: " + s[i]) if i == len(s): interval = None #end of query! 
elif s[i] == "{": #interval expression, find end: interval = None for j in range(i+1, len(s)): if s[j] == "}": interval = s[i+1:j] if interval is None: raise SyntaxError("Interval expression started but no end-brace found") i += len(interval) + 2 try: if ',' in interval: interval = tuple(int(x) for x in interval.split(",")) if len(interval) != 2: raise SyntaxError("Invalid interval: " + interval) elif '-' in interval: #alternative interval = tuple(int(x) for x in interval.split("-")) if len(interval) != 2: raise SyntaxError("Invalid interval: " + interval) else: interval = (int(interval),int(interval)) except ValueError: raise SyntaxError("Invalid interval: " + interval) elif s[i] == "?": interval = (0,1) i += 1 elif s[i] == "+": interval = (1,MAXINTERVAL) i += 1 elif s[i] == "*": interval = (0,MAXINTERVAL) i += 1 else: interval = None return TokenExpression(attribexprs,interval),i def __len__(self): return len(self.attribexprs) def __iter__(self): for x in self.attribexprs: yield x def __getitem__(self,index): return self.attribexprs[index] def nfa(self, nextstate): """Returns an initial state for an NFA""" if self.interval: mininterval, maxinterval = self.interval #pylint: disable=unpacking-non-sequence nextstate2 = nextstate for i in range(maxinterval): state = State(transitions=[(self,self.match, nextstate2)]) if i+1> mininterval: if nextstate is not nextstate2: state.transitions.append((self,self.match, nextstate)) if maxinterval == MAXINTERVAL: state.epsilon.append(state) break nextstate2 = state return state else: state = State(transitions=[(self,self.match, nextstate)]) return state def match(self, value): match = False for _, attribexpr in enumerate(self): annottype = attribexpr.attribute if annottype == 'text': annottype = 'word' if attribexpr.operator == "!=": negate = True elif attribexpr.operator == "=": negate = False else: raise Exception("Unexpected operator " + attribexpr.operator) if len(attribexpr.valueexpr) > 1: expr = re.compile("^(" + "|".join(attribexpr.valueexpr) + ")$") else: expr = re.compile("^" + attribexpr.valueexpr[0] + '$') match = (expr.match(value[annottype]) is not None) if negate: match = not match if not match: return False return True class Query(object): def __init__(self, s): self.tokenexprs = [] i = 0 l = len(s) while i < l: if s[i] == " ": i += 1 else: tokenexpr,i = TokenExpression.parse(s,i) self.tokenexprs.append(tokenexpr) def __len__(self): return len(self.tokenexprs) def __iter__(self): for x in self.tokenexprs: yield x def __getitem__(self,index): return self.tokenexprs[index] def nfa(self): """convert the expression into an NFA""" finalstate = State(final=True) nextstate = finalstate for tokenexpr in reversed(self): state = tokenexpr.nfa(nextstate) nextstate = state return NFA(state) def __call__(self, tokens, debug=False): """Execute the CQL expression, pass a list of tokens/annotations using keyword arguments: word, pos, lemma, etc""" if not tokens: raise Exception("Pass a list of tokens/annotation using keyword arguments! 
(word,pos,lemma, or others)") #convert the expression into an NFA nfa = self.nfa() if debug: print(repr(nfa), file=sys.stderr) return list(nfa.find(tokens,debug)) def cql2fql(cq): fq = "SELECT FOR SPAN " if not isinstance(cq, Query): cq = Query(cq) for i, token in enumerate(cq): if i > 0: fq += " & " fq += "w" if token.interval: fq += " {" + str(token.interval[0]) + "," + str(token.interval[1])+ "} " else: fq += " " if token.attribexprs: fq += "WHERE " for j, attribexpr in enumerate(token): if j > 0: fq += " AND " fq += "(" if attribexpr.operator == "!=": operator = "NOTMATCHES" elif attribexpr.operator == "=": operator = "MATCHES" else: raise Exception("Invalid operator: " + attribexpr.operator) if attribexpr.attribute in ("word","text"): if len(attribexpr.valueexpr) > 1: fq += "text " + operator + " \"^(" + "|".join(attribexpr.valueexpr) + ")$\" " else: fq += "text " + operator + " \"^" + attribexpr.valueexpr[0] + "$\" " else: annottype = attribexpr.attribute if annottype == "tag": annottype = "pos" elif annottype == "lempos": raise Exception("lempos not supported in CQL to FQL conversion, use pos and lemma separately") fq += annottype + " HAS class " if len(attribexpr.valueexpr) > 1: fq += operator + " \"^(" + "|".join(attribexpr.valueexpr) + ")$\" " else: fq += operator + " \"^" + attribexpr.valueexpr[0] + "$\" " fq += ")" return fq PyNLPl-1.1.2/pynlpl/tagger.py0000755000175000001440000003223612445064173016661 0ustar proyconusers00000000000000#! /usr/bin/env python # -*- coding: utf8 -*- ############################################################### # PyNLPl - FreeLing Library # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Radboud University Nijmegen # # Licensed under GPLv3 # # Generic Tagger interface for PoS-tagging and lemmatisation, # offers an interface to various software # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout import io import codecs import json import getopt import subprocess class Tagger(object): def __init__(self, *args): global WSDDIR self.tagger = None self.mode = args[0] if args[0] == "file": if len(args) != 2: raise Exception("Syntax: file:[filename]") self.tagger = codecs.open(args[1],'r','utf-8') elif args[0] == "frog": if len(args) != 3: raise Exception("Syntax: frog:[host]:[port]") from pynlpl.clients.frogclient import FrogClient port = int(args[2]) self.tagger = FrogClient(args[1],port) elif args[0] == "freeling": if len(args) != 3: raise Exception("Syntax: freeling:[host]:[port]") from pynlpl.clients.freeling import FreeLingClient host = args[1] port = int(args[2]) self.tagger = FreeLingClient(host,port) elif args[0] == "corenlp": if len(args) != 1: raise Exception("Syntax: corenlp") import corenlp print("Initialising Stanford Core NLP",file=stderr) self.tagger = corenlp.StanfordCoreNLP() elif args[0] == 'treetagger': if not len(args) == 2: raise Exception("Syntax: treetagger:[treetagger-bin]") self.tagger = args[1] elif args[0] == "durmlex": if not len(args) == 2: raise Exception("Syntax: durmlex:[filename]") print("Reading durm lexicon: ", args[1],file=stderr) self.mode = "lookup" self.tagger = {} f = codecs.open(args[1],'r','utf-8') for line in f: fields = 
line.split('\t') wordform = fields[0].lower() lemma = fields[4].split('.')[0] self.tagger[wordform] = (lemma, 'n') f.close() print("Loaded ", len(self.tagger), " wordforms",file=stderr) elif args[0] == "oldlex": if not len(args) == 2: raise Exception("Syntax: oldlex:[filename]") print("Reading OLDLexique: ", args[1],file=stderr) self.mode = "lookup" self.tagger = {} f = codecs.open(args[1],'r','utf-8') for line in f: fields = line.split('\t') wordform = fields[0].lower() lemma = fields[1] if lemma == '=': lemma == fields[0] pos = fields[2][0].lower() self.tagger[wordform] = (lemma, pos) print("Loaded ", len(self.tagger), " wordforms",file=stderr) f.close() else: raise Exception("Invalid mode: " + args[0]) def __iter__(self): if self.mode != 'file': raise Exception("Iteration only possible in file mode") line = self.tagger.next() newwords = [] postags = [] lemmas = [] for item in line: word,lemma,pos = item.split('|') newwords.append(word) postags.append(pos) lemmas.append(lemma) yield newwords, postags, lemmas def reset(self): if self.mode == 'file': self.tagger.seek(0) def process(self, words, debug=False): if self.mode == 'file': line = self.tagger.next() newwords = [] postags = [] lemmas = [] for item in line.split(' '): if item.strip(): try: word,lemma,pos = item.split('|') except: raise Exception("Unable to parse word|lemma|pos in " + item) newwords.append(word) postags.append(pos) lemmas.append(lemma) return newwords, postags, lemmas elif self.mode == "frog": newwords = [] postags = [] lemmas = [] for fields in self.tagger.process(' '.join(words)): word,lemma,morph,pos = fields[:4] newwords.append(word) postags.append(pos) lemmas.append(lemma) return newwords, postags, lemmas elif self.mode == "freeling": postags = [] lemmas = [] for fields in self.tagger.process(words, debug): word, lemma,pos = fields[:3] postags.append(pos) lemmas.append(lemma) return words, postags, lemmas elif self.mode == "corenlp": data = json.loads(self.tagger.parse(" ".join(words))) words = [] postags = [] lemmas = [] for sentence in data['sentences']: for word, worddata in sentence['words']: words.append(word) lemmas.append(worddata['Lemma']) postags.append(worddata['PartOfSpeech']) return words, postags, lemmas elif self.mode == 'lookup': postags = [] lemmas = [] for word in words: try: lemma, pos = self.tagger[word.lower()] lemmas.append(lemma) postags.append(pos) except KeyError: lemmas.append(word) postags.append('?') return words, postags, lemmas elif self.mode == 'treetagger': s = " ".join(words) s = u(s) p = subprocess.Popen([self.tagger], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) (out, err) = p.communicate(s.encode('utf-8')) newwords = [] postags = [] lemmas = [] for line in out.split('\n'): line = line.strip() if line: fields = line.split('\t') newwords.append( unicode(fields[0],'utf-8') ) postags.append( unicode(fields[1],'utf-8') ) lemmas.append( unicode(fields[2],'utf-8') ) if p.returncode != 0: print(err,file=stderr) raise OSError('TreeTagger failed') return newwords, postags, lemmas else: raise Exception("Unknown mode") def treetagger_tag(self, f_in, f_out,oneperline=False, debug=False): def flush(sentences): if sentences: print("Processing " + str(len(sentences)) + " lines",file=stderr) for sentence in sentences: out = "" p = subprocess.Popen([self.tagger], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) (results, err) = p.communicate("\n".join(sentences).encode('utf-8')) for line in results.split('\n'): line = 
line.strip() if line: fields = line.split('\t') word = fields[0] pos = fields[1] lemma = fields[2] if oneperline: if out: out += "\n" out += word + "\t" + lemma + "\t" + pos else: if out: out += " " if '|' in word: word = word.replace('|','_') if '|' in lemma: lemma = lemma.replace('|','_') if '|' in pos: pos = pos.replace('|','_') out += word + "|" + lemma + "|" + pos if pos[0] == '$': out = u(out) f_out.write(out + "\n") if oneperline: f_out.write("\n") out = "" if out: out = u(out) f_out.write(out + "\n") if oneperline: f_out.write("\n") #buffered tagging sentences = [] linenum = 0 for line in f_in: linenum += 1 print(" Buffering input @" + str(linenum),file=stderr) line = line.strip() if not line or ('.' in line[:-1] or '?' in line[:-1] or '!' in line[:-1]) or (line[-1] != '.' and line[-1] != '?' and line[-1] != '!'): flush(sentences) sentences = [] if not line.strip(): f_out.write("\n") if oneperline: f_out.write("\n") sentences.append(line) flush(sentences) def tag(self, f_in, f_out,oneperline=False, debug=False): if self.mode == 'treetagger': self.treetagger_tag(f_in, f_out,oneperline=False, debug=False) else: linenum = 0 for line in f_in: linenum += 1 print(" Tagger input @" + str(linenum),file=stderr) if line.strip(): words = line.strip().split(' ') words, postags, lemmas = self.process(words, debug) out = "" for word, pos, lemma in zip(words,postags, lemmas): if word is None: word = "" if lemma is None: lemma = "?" if pos is None: pos = "?" if oneperline: if out: out += "\n" out += word + "\t" + lemma + "\t" + pos else: if out: out += " " if '|' in word: word = word.replace('|','_') if '|' in lemma: lemma = lemma.replace('|','_') if '|' in pos: pos = pos.replace('|','_') out += word + "|" + lemma + "|" + pos if not isinstance(out, unicode): out = unicode(out, 'utf-8') f_out.write(out + "\n") if oneperline: f_out.write("\n") else: f_out.write("\n") def usage(): print("tagger.py -c [conf] -f [input-filename] -o [output-filename]",file=stderr) if __name__ == "__main__": try: opts, args = getopt.getopt(sys.argv[1:], "f:c:o:D") except getopt.GetoptError as err: # print help information and exit: print(str(err),file=stderr) usage() sys.exit(2) taggerconf = None filename = None outfilename = None oneperline = False debug = False for o, a in opts: if o == "-c": taggerconf = a elif o == "-f": filename = a elif o == '-o': outfilename =a elif o == '-l': oneperline = True elif o == '-D': debug = True else: print >>sys.stderr,"Unknown option: ", o sys.exit(2) if not taggerconf: print("ERROR: Specify a tagger configuration with -c",file=stderr) sys.exit(2) if not filename: print("ERROR: Specify a filename with -f",file=stderr) sys.exit(2) if outfilename: f_out = io.open(outfilename,'w',encoding='utf-8') else: f_out = stdout; f_in = io.open(filename,'r',encoding='utf-8') tagger = Tagger(*taggerconf.split(':')) tagger.tag(f_in, f_out, oneperline, debug) f_in.close() if outfilename: f_out.close() PyNLPl-1.1.2/pynlpl/datatypes.py0000644000175000001440000004372612445064173017411 0ustar proyconusers00000000000000#--------------------------------------------------------------- # PyNLPl - Data Types # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Based in large part on MIT licensed code from # AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html # Peter Norvig # # Licensed under GPLv3 # #---------------------------------------------------------------- """This library contains 
various extra data types, based to a certain extend on MIT-licensed code from Peter Norvig, AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html""" from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import random import bisect import array from sys import version as PYTHONVERSION class Queue(object): #from AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html """Queue is an abstract class/interface. There are three types: Python List: A Last In First Out Queue (no Queue object necessary). FIFOQueue(): A First In First Out Queue. PriorityQueue(lt): Queue where items are sorted by lt, (default <). Each type supports the following methods and functions: q.append(item) -- add an item to the queue q.extend(items) -- equivalent to: for item in items: q.append(item) q.pop() -- return the top item from the queue len(q) -- number of items in q (also q.__len()).""" def extend(self, items): """Append all elements from items to the queue""" for item in items: self.append(item) #note: A Python list is a LIFOQueue / Stack class FIFOQueue(Queue): #adapted from AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html """A First-In-First-Out Queue""" def __init__(self, data = []): self.data = data self.start = 0 def append(self, item): self.data.append(item) def __len__(self): return len(self.data) - self.start def extend(self, items): """Append all elements from items to the queue""" self.data.extend(items) def pop(self): """Retrieve the next element in line, this will remove it from the queue""" e = self.data[self.start] self.start += 1 if self.start > 5 and self.start > len(self.data)//2: self.data = self.data[self.start:] self.start = 0 return e class PriorityQueue(Queue): #Heavily adapted/extended, originally from AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html """A queue in which the maximum (or minumum) element is returned first, as determined by either an external score function f (by default calling the objects score() method). If minimize=True, the item with minimum f(x) is returned first; otherwise is the item with maximum f(x) or x.score(). length can be set to an integer > 0. Items will only be added to the queue if they're better or equal to the worst scoring item. If set to zero, length is unbounded. blockworse can be set to true if you want to prohibit adding worse-scoring items to the queue. Only items scoring better than the *BEST* one are added. blockequal can be set to false if you also want to prohibit adding equally-scoring items to the queue. 
(Both parameters default to False) """ def __init__(self, data =[], f = lambda x: x.score, minimize=False, length=0, blockworse=False, blockequal=False,duplicates=True): self.data = [] self.f = f self.minimize=minimize self.length = length self.blockworse=blockworse self.blockequal=blockequal self.duplicates= duplicates self.bestscore = None for item in data: self.append(item) def append(self, item): """Adds an item to the priority queue (in the right place), returns True if successfull, False if the item was blocked (because of a bad score)""" f = self.f(item) if callable(f): score = f() else: score = f if not self.duplicates: for s, i in self.data: if s == score and item == i: #item is a duplicate, don't add it return False if self.length and len(self.data) == self.length: #Fixed-length priority queue, abort when queue is full and new item scores worst than worst scoring item. if self.minimize: worstscore = self.data[-1][0] if score >= worstscore: return False else: worstscore = self.data[0][0] if score <= worstscore: return False if self.blockworse and self.bestscore != None: if self.minimize: if score > self.bestscore: return False else: if score < self.bestscore: return False if self.blockequal and self.bestscore != None: if self.bestscore == score: return False if (self.bestscore == None) or (self.minimize and score < self.bestscore) or (not self.minimize and score > self.bestscore): self.bestscore = score bisect.insort(self.data, (score, item)) if self.length: #fixed length queue: queue is now too long, delete worst items while len(self.data) > self.length: if self.minimize: del self.data[-1] else: del self.data[0] return True def __exists__(self, item): return (item in self.data) def __len__(self): return len(self.data) def __iter__(self): """Iterate over all items, in order from best to worst!""" if self.minimize: f = lambda x: x else: f = reversed for score, item in f(self.data): yield item def __getitem__(self, i): """Item 0 is always the best item!""" if isinstance(i, slice): indices = i.indices(len(self)) if self.minimize: return PriorityQueue([ self.data[j][1] for j in range(*indices) ],self.f, self.minimize, self.length, self.blockworse, self.blockequal) else: return PriorityQueue([ self.data[(-1 * j) - 1][1] for j in range(*indices) ],self.f, self.minimize, self.length, self.blockworse, self.blockequal) else: if self.minimize: return self.data[i][1] else: return self.data[(-1 * i) - 1][1] def pop(self): """Retrieve the next element in line, this will remove it from the queue""" if self.minimize: return self.data.pop(0)[1] else: return self.data.pop()[1] def score(self, i): """Return the score for item x (cheap lookup), Item 0 is always the best item""" if self.minimize: return self.data[i][0] else: return self.data[(-1 * i) - 1][0] def prune(self, n): """prune all but the first (=best) n items""" if self.minimize: self.data = self.data[:n] else: self.data = self.data[-1 * n:] def randomprune(self,n): """prune down to n items at random, disregarding their score""" self.data = random.sample(self.data, n) def stochasticprune(self,n): """prune down to n items, chance of an item being pruned is reverse proportional to its score""" raise NotImplemented def prunebyscore(self, score, retainequalscore=False): """Deletes all items below/above a certain score from the queue, depending on whether minimize is True or False. Note: It is recommended (more efficient) to use blockworse=True / blockequal=True instead! 
Preventing the addition of 'worse' items.""" if retainequalscore: if self.minimize: f = lambda x: x[0] <= score else: f = lambda x: x[0] >= score else: if self.minimize: f = lambda x: x[0] < score else: f = lambda x: x[0] > score self.data = filter(f, self.data) def __eq__(self, other): return (self.data == other.data) and (self.minimize == other.minimize) def __repr__(self): return repr(self.data) def __add__(self, other): """Priority queues can be added up, as long as they all have minimize or maximize (rather than mixed). In case of fixed-length queues, the FIRST queue in the operation will be authorative for the fixed lengthness of the result!""" assert (isinstance(other, PriorityQueue) and self.minimize == other.minimize) return PriorityQueue(self.data + other.data, self.f, self.minimize, self.length, self.blockworse, self.blockequal) class Tree(object): """Simple tree structure. Nodes are themselves trees.""" def __init__(self, value = None, children = None): self.parent = None self.value = value if not children: self.children = None else: for c in children: self.append(c) def leaf(self): """Is this a leaf node or not?""" return not self.children def __len__(self): if not self.children: return 0 else: return len(self.children) def __bool__(self): return True def __iter__(self): """Iterate over all items in the tree""" for c in self.children: return c def append(self, item): """Add an item to the Tree""" if not isinstance(item, Tree): return ValueError("Can only append items of type Tree") if not self.children: self.children = [] item.parent = self self.children.append(item) def __getitem__(self, index): """Retrieve a specific item, by index, from the Tree""" assert isinstance(index,int) try: return self.children[index] except: raise def __str__(self): return str(self.value) def __unicode__(self): #Python 2.x return u(self.value) class Trie(object): """Simple trie structure. 
Nodes are themselves tries, values are stored on the edges, not the nodes.""" def __init__(self, sequence = None): self.parent = None self.children = None self.value = None if sequence: self.append(sequence) def leaf(self): """Is this a leaf node or not?""" return not self.children def root(self): """Returns True if this is the root of the Trie""" return not self.parent def __len__(self): if not self.children: return 0 else: return len(self.children) def __bool__(self): return True def __iter__(self): if self.children: for key in self.children.keys(): yield key def items(self): if self.children: for key, trie in self.children.items(): yield key, trie def __setitem__(self, key, subtrie): if not isinstance(subtrie, Trie): return ValueError("Can only set items of type Trie, got " + str(type(subtrie))) if not self.children: self.children = {} subtrie.value = key subtrie.parent = self self.children[key] = subtrie def append(self, sequence): if not sequence: return self if not self.children: self.children = {} if not (sequence[0] in self.children): self.children[sequence[0]] = Trie() return self.children[sequence[0]].append( sequence[1:] ) else: return self.children[sequence[0]].append( sequence[1:] ) def find(self, sequence): if not sequence: return self elif self.children and sequence[0] in self.children: return self.children[sequence[0]].find(sequence[1:]) else: return False def __contains__(self, key): return (key in self.children) def __getitem__(self, key): try: return self.children[key] except: raise def size(self): """Size is number of nodes under the trie, including the current node""" if self.children: return sum( ( c.size() for c in self.children.values() ) ) + 1 else: return 1 def path(self): """Returns the path to the current node""" if self.parent: return (self,) + self.parent.path() else: return (self,) def depth(self): """Returns the depth of the current node""" if self.parent: return 1 + self.parent.depth() else: return 1 def sequence(self): if self.parent: if self.value: return (self.value,) + self.parent.sequence() else: return self.parent.sequence() else: return (self,) def walk(self, leavesonly=True, maxdepth=None, _depth = 0): """Depth-first search, walking through trie, returning all encounterd nodes (by default only leaves)""" if self.children: if not maxdepth or (maxdepth and _depth < maxdepth): for key, child in self.children.items(): if child.leaf(): yield child else: for results in child.walk(leavesonly, maxdepth, _depth + 1): yield results FIXEDGAP = 128 DYNAMICGAP = 129 if PYTHONVERSION > '3': #only available for Python 3 class Pattern: def __init__(self, data, classdecoder=None): assert isinstance(data, bytes) self.data = data self.classdecoder = classdecoder @staticmethod def fromstring(s, classencoder): #static data = b'' for s in s.split(): data += classencoder[s] return Pattern(data) def __str__(self): s = "" for cls in self: s += self.classdecoder[int.from_bytes(cls)] def iterbytes(self, begin=0, end=0): i = 0 l = len(self.data) n = 0 if (end != begin): slice = True else: slice = False while i < l: size = self.data[i] if (size < 128): #everything from 128 onward is reserved for markers if not slice: yield self.data[i+1:i+1+size] else: n += 1 if n >= begin and n < end: yield self.data[i+1:i+1+size] i += 1 + size else: raise ValueError("Size >= 128") def __iter__(self): for b in self.iterbytes(): yield Pattern(b, self.classdecoder) def __bytes__(self): return self.data def __len__(self): """Return n""" i = 0 l = len(self.data) n = 0 while i < l: size = 
self.data[i] if (size < 128): n += 1 i += 1 + size else: raise ValueError("Size >= 128") def __getitem__(self, index): assert isinstance(index, int) for b in self.iterbytes(index,index+1): return Pattern(b, self.classdecoder) def __getslice__(self, begin, end): slice = b'' for b in self.iterbytes(begin,end): slice += b return slice def __add__(self, other): assert isinstance(other, Pattern) return Pattern(self.data + other.data, self.classdecoder) def __eq__(self, other): return self.data == other.data class PatternSet: def __init__(self): self.data = set() def add(self, pattern): self.data.add(pattern.data) def remove(self, pattern): self.data.remove(pattern.data) def __len__(self): return len(self.data) def __bool__(self): return len(self.data) > 0 def __contains__(self, pattern): return pattern.data in self.data def __iter__(self): for patterndata in self.data: yield Pattern(patterndata) class PatternMap: def __init__(self, default=None): self.data = {} self.default = default def __getitem__(self, pattern): assert isinstance(pattern, Pattern) if not self.default is None: try: return self.data[pattern.data] except KeyError: return self.default else: return self.data[pattern.data] def __setitem__(self, pattern, value): self.data[pattern.data] = value def __delitem__(self, pattern): del self.data[pattern.data] def __len__(self): return len(self.data) def __bool__(self): return len(self.data) > 0 def __contains__(self, pattern): return pattern.data in self.data def __iter__(self): for patterndata in self.data: yield Pattern(patterndata) def items(self): for patterndata, value in self.data.items(): yield Pattern(patterndata), value #class SuffixTree(object): # def __init__(self): # self.data = {} # # # def append(self, seq): # if len(seq) > 1: # for item in seq: # self.append(item) # else: # # # def compile(self, s): PyNLPl-1.1.2/pynlpl/net.py0000644000175000001440000001172312770321315016164 0ustar proyconusers00000000000000#-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Network utilities # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Generic Server for Language Models # #---------------------------------------------------------------- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u,b import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout from twisted.internet import protocol, reactor # will fail on Python 3 for now from twisted.protocols import basic import shlex class GWSNetProtocol(basic.LineReceiver): def connectionMade(self): print("Client connected", file=stderr) self.factory.connections += 1 if self.factory.connections < 1: self.transport.loseConnection() else: self.sendLine(b("READY")) def lineReceived(self, line): try: if sys.version >= '3' and isinstance(line,bytes): print("Client in: " + str(line,'utf-8'),file=stderr) else: print("Client in: " + line,file=stderr) except UnicodeDecodeError: print("Client in: (unicodeerror)",file=stderr) if sys.version < '3': if isinstance(line,unicode): self.factory.processprotocol.transport.write(line.encode('utf-8')) else: self.factory.processprotocol.transport.write(line) self.factory.processprotocol.transport.write(b('\n')) else: 
self.factory.processprotocol.transport.write(b(line) + b('\n')) self.factory.processprotocol.currentclient = self def connectionLost(self, reason): self.factory.connections -= 1 if self.factory.processprotocol.currentclient == self: self.factory.processprotocol.currentclient = None class GWSFactory(protocol.ServerFactory): protocol = GWSNetProtocol def __init__(self, processprotocol): self.connections = 0 self.processprotocol = processprotocol class GWSProcessProtocol(protocol.ProcessProtocol): def __init__(self, printstderr=True, sendstderr= False, filterout = None, filtererr = None): self.currentclient = None self.printstderr = printstderr self.sendstderr = sendstderr if not filterout: self.filterout = lambda x: x else: self.filterout = filterout if not filtererr: self.filtererr = lambda x: x else: self.filtererr = filtererr def connectionMade(self): pass def outReceived(self, data): try: if sys.version >= '3' and isinstance(data,bytes): print("Process out " + str(data, 'utf-8'),file=stderr) else: print("Process out " + data,file=stderr) except UnicodeDecodeError: print("Process out (unicodeerror)",file=stderr) print("DEBUG:", repr(b(data).strip().split(b('\n')))) for line in b(data).strip().split(b('\n')): line = self.filterout(line.strip()) if self.currentclient and line: self.currentclient.sendLine(b(line)) def errReceived(self, data): try: if sys.version >= '3' and isinstance(data,bytes): print("Process err " + str(data,'utf-8'), file=sys.stderr) else: print("Process err " + data,file=stderr) except UnicodeDecodeError: print("Process out (unicodeerror)",file=stderr) if self.printstderr and data: print(data.strip(),file=stderr) for line in b(data).strip().split(b('\n')): line = self.filtererr(line.strip()) if self.sendstderr and self.currentclient and line: self.currentclient.sendLine(b(line)) def processExited(self, reason): print("Process exited",file=stderr) def processEnded(self, reason): print("Process ended",file=stderr) if self.currentclient: self.currentclient.transport.loseConnection() reactor.stop() class GenericWrapperServer: """Generic Server around a stdin/stdout based CLI tool. Only accepts one client at a time to prevent concurrency issues !!!!!""" def __init__(self, cmdline, port, printstderr= True, sendstderr= False, filterout = None, filtererr = None): gwsprocessprotocol = GWSProcessProtocol(printstderr, sendstderr, filterout, filtererr) cmdline = shlex.split(cmdline) reactor.spawnProcess(gwsprocessprotocol, cmdline[0], cmdline) gwsfactory = GWSFactory(gwsprocessprotocol) reactor.listenTCP(port, gwsfactory) reactor.run() PyNLPl-1.1.2/pynlpl/__init__.py0000644000175000001440000000102213024723325017124 0ustar proyconusers00000000000000"""PyNLPl, pronounced as "pineapple", is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for example the computation of n-grams, frequency lists and distributions, language models. There are also more complex data types, such as Priority Queues, and search algorithms, such as Beam Search. The library is divided into several packages and modules. It is designed for Python 2.6 and upwards. 
Including Python 3.""" VERSION = "1.1.2" PyNLPl-1.1.2/pynlpl/tools/0000755000175000001440000000000013024723552016162 5ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/tools/frogwrapper.py0000755000175000001440000002530712565300351021101 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #Frog Wrapper with XML input and FoLiA output support from __future__ import print_function, unicode_literals, division, absolute_import import getopt import lxml.etree import sys import os import codecs if __name__ == "__main__": sys.path.append(sys.path[0] + '/../..') os.environ['PYTHONPATH'] = sys.path[0] + '/../..' import pynlpl.formats.folia as folia from pynlpl.clients.frogclient import FrogClient def legacyout(i, word,lemma,morph,pos): if word: out = str(i + 1) + "\t" + word + "\t" + lemma + "\t" + morph + "\t" + pos print(out.encode('utf-8')) else: print() def usage(): print >>sys.stderr,"frogwrapper.py [options]" print >>sys.stderr,"------------------------------------------------------" print >>sys.stderr,"Input file:" print >>sys.stderr,"\t--txt=[file] Plaintext input" print >>sys.stderr,"\t--xml=[file] XML Input" print >>sys.stderr,"\t--folia=[file] FoLiA XML Input" print >>sys.stderr,"Frog settings:" print >>sys.stderr,"\t-p [port] Port the Frog server is running on" print >>sys.stderr,"Output type:" print >>sys.stderr,"\t--id=[ID] ID for outputted FoLiA XML Document" print >>sys.stderr,"\t--legacy Use legacy columned output instead of FoLiA" print >>sys.stderr,"\t-o Write output to input file (only works for --folia)" print >>sys.stderr,"XML Input:" print >>sys.stderr,"\t--selectsen=[expr] Use xpath expression to select sentences" print >>sys.stderr,"\t--selectpar=[expr] Use xpath expression to select paragraphs" print >>sys.stderr,"\t--idattrib=[attrb] Copy ID from this attribute" print >>sys.stderr,"Text Input:" print >>sys.stderr,"\t-N No structure" print >>sys.stderr,"\t-S One sentence per line (strict)" print >>sys.stderr,"\t-P One paragraph per line" print >>sys.stderr,"\t-I Value in first column (tab seperated) is ID!" 
print >>sys.stderr,"\t-E [encoding] Encoding of input file (default: utf-8)" try: opts, files = getopt.getopt(sys.argv[1:], "hSPINEp:o", ["txt=","xml=", "folia=","id=",'legacy','tok','selectsen=','selectpar=','idattrib=']) except getopt.GetoptError as err: # print help information and exit: print(str(err)) usage() sys.exit(1) textfile = xmlfile = foliafile = None foliaid = 'UNTITLED' legacy = None tok = False idinfirstcolumn = False encoding = 'utf-8' mode='s' xpathselect = '' idattrib='' port = None save = False for o, a in opts: if o == "-h": usage() sys.exit(0) elif o == "-I": idinfirstcolumn = True elif o == "-S": mode = 's' elif o == "-P": mode = 'p' elif o == "-p": port = int(a) elif o == "-N": mode = 'n' elif o == "-E": encoding = a elif o == "--selectsen": mode='s' xpathselect = a elif o == "--selectpar": mode='p' xpathselect = a elif o == "--idattrib": idattrib = a elif o == "--txt": textfile = a elif o == "--xml": xmlfile = a elif o == "--folia": foliafile = a elif o == "--id": foliaid = a #ID elif o == "-o": save = True elif o == "--legacy": legacy = True elif o == "--tok": tok = True else: print >>sys.stderr, "ERROR: Unknown option:",o sys.exit(1) if not port: print >> sys.stderr,"ERROR: No port specified to connect to Frog server" sys.exit(2) elif (not textfile and not xmlfile and not foliafile): print >> sys.stderr,"ERROR: Specify a file with either --txt, --xml or --folia" sys.exit(2) elif xmlfile and not xpathselect: print >> sys.stderr,"ERROR: You need to specify --selectsen or --selectpar when using --xml" sys.exit(2) frogclient = FrogClient('localhost',port) idmap = [] data = [] if textfile: f = codecs.open(textfile, 'r', encoding) for line in f.readlines(): if idinfirstcolumn: id, line = line.split('\t',1) idmap.append(id.strip()) else: idmap.append(None) data.append(line.strip()) f.close() if xmlfile: xmldoc = lxml.etree.parse(xmlfile) for node in xmldoc.xpath(xpathselect): if idattrib: if idattrib in node.attrib: idmap.append(node.attrib[idattrib]) else: print >>sys.stderr,"WARNING: Attribute " + idattrib + " not found on node!" idmap.append(None) else: idmap.append(None) data.append(node.text) if foliafile: foliadoc = folia.Document(file=foliafile) if not foliadoc.declared(folia.AnnotationType.TOKEN): foliadoc.declare(folia.AnnotationType.TOKEN, set='http://ilk.uvt.nl/folia/sets/ucto-nl.foliaset', annotator='Frog',annotatortype=folia.AnnotatorType.AUTO) if not foliadoc.declared(folia.AnnotationType.POS): foliadoc.declare(folia.AnnotationType.POS, set='http://ilk.uvt.nl/folia/sets/cgn-legacy.foliaset', annotator='Frog',annotatortype=folia.AnnotatorType.AUTO) if not foliadoc.declared(folia.AnnotationType.LEMMA): foliadoc.declare(folia.AnnotationType.LEMMA, set='http://ilk.uvt.nl/folia/sets/mblem-nl.foliaset', annotator='Frog',annotatortype=folia.AnnotatorType.AUTO) foliadoc.language('nld') text = foliadoc.data[-1] for p in foliadoc.paragraphs(): found_s = False for s in p.sentences(): found_w = False for w in s.words(): found_w = True found_s = True if found_w: #pass tokenised sentence words = s.words() response = frogclient.process(" ".join([unicode(w) for w in words])) for i, (word, lemma, morph, pos) in enumerate(response): if legacy: legacyout(i,word,lemma,morph,pos) if unicode(words[i]) == word: if lemma: words[i].append( folia.LemmaAnnotation(foliadoc, cls=lemma) ) if pos: words[i].append( folia.PosAnnotation(foliadoc, cls=pos) ) else: print >>sys.stderr,"WARNING: Out of sync after calling Frog! 
", i, word else: #pass untokenised sentence try: sentext = s.text() except folia.NoSuchText: continue response = frogclient.process(sentext) for i, (word, lemma, morph, pos) in enumerate(response): if legacy: legacyout(i,word,lemma,morph,pos) if word: w = folia.Word(foliadoc, text=word, generate_id_in=s) if lemma: w.append( folia.LemmaAnnotation(foliadoc, cls=lemma) ) if pos: w.append( folia.PosAnnotation(foliadoc, cls=pos) ) s.append(w) if not found_s: #pass paragraph try: partext = p.text() except folia.NoSuchText: continue s = folia.Sentence(foliadoc, generate_id_in=p) response = frogclient.process(partext) for i, (word, lemma, morph, pos) in enumerate(response): if (not word or i == len(response) - 1) and len(s) > 0: #gap or end of response: terminate sentence p.append(s) s = folia.Sentence(foliadoc, generate_id_in=p) elif word: w = folia.Word(foliadoc, text=word, generate_id_in=s) if lemma: w.append( folia.LemmaAnnotation(foliadoc, cls=lemma) ) if pos: w.append( folia.PosAnnotation(foliadoc, cls=pos) ) s.append(w) else: foliadoc = folia.Document(id=foliaid) foliadoc.declare(folia.AnnotationType.TOKEN, set='http://ilk.uvt.nl/folia/sets/ucto-nl.foliaset', annotator='Frog',annotatortype=folia.AnnotatorType.AUTO) foliadoc.declare(folia.AnnotationType.POS, set='http://ilk.uvt.nl/folia/sets/cgn-legacy.foliaset', annotator='Frog',annotatortype=folia.AnnotatorType.AUTO) foliadoc.declare(folia.AnnotationType.LEMMA, set='http://ilk.uvt.nl/folia/sets/mblem-nl.foliaset', annotator='Frog',annotatortype=folia.AnnotatorType.AUTO) foliadoc.language('nld') text = folia.Text(foliadoc, id=foliadoc.id + '.text.1') foliadoc.append(text) curid = None for (fragment, id) in zip(data,idmap): if mode == 's' or mode == 'n': if id: s = folia.Sentence(foliadoc, id=id) else: s = folia.Sentence(foliadoc, generate_id_in=text) elif mode == 'p': if id: p = folia.Paragraph(foliadoc, id=id) else: p = folia.Paragraph(foliadoc, generate_id_in=text) s = folia.Sentence(foliadoc, generate_id_in=p) curid = s.id response = frogclient.process(fragment) for i, (word, lemma, morph, pos) in enumerate(response): if legacy: legacyout(i,word,lemma,morph,pos) continue if word: w = folia.Word(foliadoc, text=word, generate_id_in=s) if lemma: w.append( folia.LemmaAnnotation(foliadoc, cls=lemma) ) if pos: w.append( folia.PosAnnotation(foliadoc, cls=pos) ) s.append(w) if (not word or i == len(response) - 1) and len(s) > 0: #gap or end of response: terminate sentence if mode == 'p': p.append(s) if (i == len(response) - 1): text.append(p) elif mode == 'n' or (mode == 's' and i == len(response) - 1): text.append(s) elif mode == 's': continue if i < len(response) - 1: #not done yet? 
#create new sentence if mode == 'p': s = folia.Sentence(foliadoc, generate_id_in=p) elif mode == 'n' and id: #no id for this unforeseen sentence, make something up s = folia.Sentence(foliadoc, id=curid+'.X') print("WARNING: Sentence found that was not in original",file=sys.stderr) if not legacy: print(foliadoc.xmlstring()) if save and foliafile: foliadoc.save() PyNLPl-1.1.2/pynlpl/tools/sonar2folia.py0000755000175000001440000000721712565300351020762 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- #--------------------------------------------------------------- # PyNLPl - Conversion script for converting SoNaR/D-Coi from D-Coi XML to FoLiA XML # by Maarten van Gompel, ILK, Tilburg University # http://ilk.uvt.nl/~mvgompel # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #---------------------------------------------------------------- # Usage: sonar2folia.py sonar-input-dir output-dir nr-of-threads from __future__ import print_function, unicode_literals, division, absolute_import import sys import os if __name__ == "__main__": sys.path.append(sys.path[0] + '/../..') os.environ['PYTHONPATH'] = sys.path[0] + '/../..' import pynlpl.formats.folia as folia import pynlpl.formats.sonar as sonar from multiprocessing import Pool, Process import datetime import codecs def process(data): i, filename = data category = os.path.basename(os.path.dirname(filename)) progress = round((i+1) / float(len(index)) * 100,1) print("#" + str(i+1) + " " + filename + ' ' + datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') + ' ' + str(progress) + '%',file=sys.stderr) try: doc = folia.Document(file=filename) except Exception as e: print("ERROR loading " + filename + ":" + str(e),file=sys.stderr) return False filename = filename.replace(sonardir,'') if filename[0] == '/': filename = filename[1:] if filename[-4:] == '.pos': filename = filename[:-4] if filename[-4:] == '.tok': filename = filename[:-4] if filename[-4:] == '.ilk': filename = filename[:-4] #Load document prior to tokenisation try: pretokdoc = folia.Document(file=sonardir + '/' + filename) except: print("WARNING unable to load pretokdoc " + filename,file=sys.stderr) pretokdoc = None if pretokdoc: for p2 in pretokdoc.paragraphs(): try: p = doc[p2.id] except: print("ERROR: Paragraph " + p2.id + " not found. 
Tokenised and pre-tokenised versions out of sync?",file=sys.stderr) continue if p2.text: p.text = p2.text try: os.mkdir(foliadir + os.path.dirname(filename)) except: pass try: doc.save(foliadir + filename) except: print("ERROR saving " + foliadir + filename,file=sys.stderr) try: f = codecs.open(foliadir + filename.replace('.xml','.tok.txt'),'w','utf-8') f.write(unicode(doc)) f.close() except: print("ERROR saving " + foliadir + filename.replace('.xml','.tok.txt'),file=sys.stderr) sys.stdout.flush() sys.stderr.flush() return True def outputexists(filename, sonardir, foliadir): filename = filename.replace(sonardir,'') if filename[0] == '/': filename = filename[1:] if filename[-4:] == '.pos': filename = filename[:-4] if filename[-4:] == '.tok': filename = filename[:-4] if filename[-4:] == '.ilk': filename = filename[:-4] return os.path.exists(foliadir + filename) if __name__ == '__main__': sonardir = sys.argv[1] foliadir = sys.argv[2] threads = int(sys.argv[3]) if foliadir[-1] != '/': foliadir += '/' try: os.mkdir(foliadir[:-1]) except: pass print("Building index...") index = list(enumerate([ x for x in sonar.CorpusFiles(sonardir,'pos', "", lambda x: True, True) if not outputexists(x, sonardir, foliadir) ])) print("Processing...") p = Pool(threads) p.map(process, index ) PyNLPl-1.1.2/pynlpl/tools/freqlist.py0000755000175000001440000000456412667243276020415 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- ############################################################### # PyNLPl - Frequency List Generator # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import argparse import sys import io from pynlpl.statistics import FrequencyList, Distribution from pynlpl.textprocessors import Windower, crude_tokenizer def main(): parser = argparse.ArgumentParser(description="Generate an n-gram frequency list", formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser.add_argument('-n','--ngramsize', help="N-gram size", type=int, action='store',default=1) parser.add_argument('-i','--caseinsensitive', help="Case insensitive", action="store_true") parser.add_argument('-e','--encoding', help="Character encoding", type=str, action='store',default='utf-8') parser.add_argument('files', type=str, nargs='+', help="The data sets to sample from, must be of equal size (i.e., same number of lines)") args = parser.parse_args() if not args.files: print("No files specified", file=sys.stderr) sys.exit(1) freqlist = FrequencyList(None, args.caseinsensitive) for filename in args.files: f = io.open(filename,'r',encoding=args.encoding) for line in f: if args.ngramsize > 1: freqlist.append(Windower(crude_tokenizer(line),args.ngramsize)) else: freqlist.append(crude_tokenizer(line)) f.close() dist = Distribution(freqlist) for type, count in freqlist: if isinstance(type,tuple) or isinstance(type,list): type = " ".join(type) s = type + "\t" + str(count) + "\t" + str(dist[type]) + "\t" + str(dist.information(type)) print(s) print("Tokens: ", freqlist.tokens(),file=sys.stderr) print("Types: ", len(freqlist),file=sys.stderr) print("Type-token ratio: ", freqlist.typetokenratio(),file=sys.stderr) print("Entropy: ", dist.entropy(),file=sys.stderr) if __name__ == '__main__': main() 
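#----------------------------------------------------------------------
# Editorial usage sketch (not part of the original distribution): the
# freqlist.py tool above is a thin command-line wrapper around
# pynlpl.statistics.FrequencyList and Distribution; the same classes can
# be used directly from Python. The filename 'corpus.txt' is a placeholder.
import io
from pynlpl.statistics import FrequencyList, Distribution
from pynlpl.textprocessors import crude_tokenizer

freqlist = FrequencyList()
with io.open('corpus.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # count all (unigram) tokens on this line
        freqlist.append(crude_tokenizer(line))

dist = Distribution(freqlist)
for wordtype, count in freqlist:
    # absolute count plus the value the Distribution assigns to this type
    print(wordtype, count, dist[wordtype])
print("Tokens:", freqlist.tokens())
print("Entropy:", dist.entropy())
#----------------------------------------------------------------------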
PyNLPl-1.1.2/pynlpl/tools/sonarlemmafreqlist.py0000755000175000001440000000220012565300351022436 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- from __future__ import print_function, unicode_literals, division, absolute_import import sys import os if __name__ == "__main__": sys.path.append(sys.path[0] + '/../..') os.environ['PYTHONPATH'] = sys.path[0] + '/../..' from pynlpl.formats.sonar import CorpusFiles, Corpus from pynlpl.statistics import FrequencyList sonardir = sys.argv[1] freqlist = FrequencyList() lemmapos_freqlist = FrequencyList() poshead_freqlist = FrequencyList() pos_freqlist = FrequencyList() for i, doc in enumerate(Corpus(sonardir)): print("#" + str(i) + " Processing " + doc.filename,file=sys.stderr) for word, id, pos, lemma in doc: freqlist.count(word) if lemma and pos: poshead = pos.split('(')[0] lemmapos_freqlist.count(lemma+'.'+poshead) poshead_freqlist.count(poshead) pos_freqlist.count(pos) freqlist.save('sonarfreqlist.txt') lemmapos_freqlist.save('sonarlemmaposfreqlist.txt') poshead_freqlist.save('sonarposheadfreqlist.txt') pos_freqlist.save('sonarposfreqlist.txt') print(unicode(freqlist).encode('utf-8')) PyNLPl-1.1.2/pynlpl/tools/sampler.py0000755000175000001440000000436612667243276020227 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- ############################################################### # PyNLPl - Sampler # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # # This tool can be used to split a file (or multiple interdependent # files, such as a parallel corpus) into a train, test and development # set. # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import argparse import sys import random from pynlpl.evaluation import filesampler def main(): parser = argparse.ArgumentParser(description="Extracts random samples from datasets, supports multiple parallel datasets (such as parallel corpora), provided that corresponding data is on the same line.", formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser.add_argument('-t','--testsetsize', help="Test set size (lines)", type=float, action='store',default=0) parser.add_argument('-d','--devsetsize', help="Development set size (lines)", type=float, action='store',default=0) parser.add_argument('-T','--trainsetsize', help="Training set size (lines), leave unassigned (0) to automatically use all of the remaining data", type=float, action='store',default=0) parser.add_argument('-S','--seed', help="Seed for random number generator", type=int, action='store',default=0) parser.add_argument('files', type=str, nargs='+', help="The data sets to sample from, must be of equal size (i.e., same number of lines)") args = parser.parse_args() if args.seed: random.seed(args.seed) if args.testsetsize == 0: print("ERROR: Specify at least a testset size!",file=sys.stderr) sys.exit(2) try: if not args.files: print("ERROR: Specify at least one file!",file=sys.stderr) sys.exit(2) except: print("ERROR: Specify at least one file!",file=sys.stderr) sys.exit(2) filesampler(args.files, args.testsetsize, args.devsetsize, args.trainsetsize) if __name__ == '__main__': main() PyNLPl-1.1.2/pynlpl/tools/__init__.py0000644000175000001440000000000012445064173020264 0ustar 
proyconusers00000000000000PyNLPl-1.1.2/pynlpl/tools/computepmi.py0000755000175000001440000001272212445064173020730 0ustar proyconusers00000000000000#!/usr/bin/env python3 from __future__ import print_function, unicode_literals, division, absolute_import import argparse import sys from math import log from collections import defaultdict def pmi(sentences1, sentences2,discount = 0): jointcount = len(sentences1 & sentences2) - discount if jointcount <= 0: return None return log( jointcount / (len(sentences1) * len(sentences2))), jointcount+discount def npmi(sentences1, sentences2,discount=0): jointcount = len(sentences1 & sentences2) - discount if jointcount <= 0: return None return log( jointcount / (len(sentences1) * len(sentences2))) / -log(jointcount), jointcount+discount def main(): parser = argparse.ArgumentParser(description="Simple cooccurence computation", formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser.add_argument('-f','--inputtext', type=str,help="Input file (plaintext, tokenised, utf-8, one sentence per line)", action='store',default="",required=True) parser.add_argument('-s','--sorted', help="Output sorted by co-occurrence score", action='store_true',default=False) parser.add_argument('-t','--threshold', help="Joined occurrence threshold, do not consider words occuring less than this", type=int, action='store',default=1) parser.add_argument('-a','--adjacency', help="Compute the adjacency fraction (how many co-occurrence are immediate bigrams)", action='store_true',default=False) parser.add_argument('-A','--discountadjacency', help="Do not take immediately adjacent fragments (bigrams) into account when computing mutual information (requires -a)", action='store_true',default=False) parser.add_argument('--pmi',help="Compute pointwise mutual information", action='store_true',default=False) parser.add_argument('--npmi',help="Compute normalised pointwise mutual information", action='store_true',default=False) parser.add_argument('--jaccard',help="Compute jaccard similarity coefficient", action='store_true',default=False) parser.add_argument('--dice',help="Compute dice coefficient", action='store_true',default=False) args = parser.parse_args() if not args.pmi and not args.npmi and not args.jaccard and not args.dice: args.pmi = True count = defaultdict(int) cooc = defaultdict(lambda: defaultdict(int)) adjacent = defaultdict(lambda: defaultdict(int)) total = 0 f = open(args.inputtext,'r',encoding='utf-8') for i, line in enumerate(f): sentence = i + 1 if sentence % 1000 == 0: print("Indexing @" + str(sentence),file=sys.stderr) if line: words = list(enumerate(line.split())) for pos, word in words: count[word] += 1 total += 1 for pos2, word2 in words: if pos2 > pos: cooc[word][word2] += 1 if args.adjacency and pos2 == pos + len(word.split()): adjacent[word][word2] += 1 f.close() l = len(cooc) output = [] for i, (word, coocdata) in enumerate(cooc.items()): print("Computing mutual information @" + str(i+1) + "/" + str(l) + ": \"" + word + "\" , co-occurs with " + str(len(coocdata)) + " words",file=sys.stderr) for word2, jointcount in coocdata.items(): if jointcount> args.threshold: if args.adjacency and word in adjacent and word2 in adjacent[word]: adjcount = adjacent[word][word2] else: adjcount = 0 if args.discountadjacency: discount = adjcount else: discount = 0 if args.pmi: score = log( ((jointcount-discount)/total) / ((count[word]/total) * (count[word2]/total))) elif args.npmi: score = log( ((jointcount-discount)/total) / ((count[word]/total) * (count[word2]/total))) / 
-log((jointcount-discount)/total) elif args.jaccard or args.dice: score = (jointcount-discount) / (count[word] + count[word2] - (jointcount - discount) ) if args.dice: score = 2*score / (1+score) if args.sorted: outputdata = (word,word2,score, jointcount, adjcount, adjcount / jointcount if args.adjacency else None) output.append(outputdata) else: if args.adjacency: print(word + "\t" + word2 + "\t" + str(score) + "\t" + str(jointcount) + "\t" + str(adjcount) + "\t" + str(adjcount / jointcount)) else: print(word + "\t" + word2 + "\t" + str(score) + "\t" + str(jointcount)) if args.sorted: print("Outputting " + str(len(output)) + " pairs",file=sys.stderr) if args.adjacency: print("#WORD\tWORD2\tSCORE\tJOINTCOUNT\tBIGRAMCOUNT\tBIGRAMRATIO") else: print("#WORD\tWORD2\tSCORE\tJOINTCOUNT") if args.npmi: sign = 1 else: sign = -1 for word,word2,score,jointcount,adjcount, adjratio in sorted(output, key=lambda x: sign * x[2]): if args.adjacency: print(word + "\t" + word2 + "\t" + str(score) + "\t" + str(jointcount) + "\t" + str(adjcount) + "\t" + str(adjratio) ) else: print(word + "\t" + word2 + "\t" + str(score) + "\t" + str(jointcount)) if __name__ == '__main__': main() PyNLPl-1.1.2/pynlpl/tools/reflow.py0000755000175000001440000000076012445064173020043 0ustar proyconusers00000000000000#! /usr/bin/env python # -*- coding: utf8 -*- from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import sys import io import getopt from pynlpl.textprocessors import ReflowText def main(): for filename in sys.argv[1:]: f = io.open(filename, 'r', encoding='utf-8') for line in ReflowText(f): print(line) f.close() if __name__ == '__main__': main() PyNLPl-1.1.2/pynlpl/tools/foliasplitcgnpostags.py0000755000175000001440000000271212565300351022775 0ustar proyconusers00000000000000#!/usr/bin/env python3 #-*- coding:utf-8 -*- from __future__ import print_function, unicode_literals, division, absolute_import import glob import sys import os if __name__ == "__main__": sys.path.append(sys.path[0] + '/../..') os.environ['PYTHONPATH'] = sys.path[0] + '/../..'
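# Editorial note: like the other scripts in pynlpl/tools, the two lines above
# append the package root (two directories up from this script) to sys.path and
# export it via PYTHONPATH, so that the in-tree pynlpl package remains importable
# below when the script is run straight from an unpacked source tree rather than
# from an installed copy.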
from pynlpl.formats import folia from pynlpl.formats import cgn import lxml.etree def process(target): print("Processing " + target) if os.path.isdir(target): print("Descending into directory " + target) for f in glob.glob(target + '/*'): process(f) elif os.path.isfile(target) and target[-4:] == '.xml': print("Loading " + target) try: doc = folia.Document(file=target) except lxml.etree.XMLSyntaxError: print("UNABLE TO LOAD " + target + " (XML SYNTAX ERROR!)",file=sys.stderr) return None changed = False for word in doc.words(): try: pos = word.annotation(folia.PosAnnotation) except folia.NoSuchAnnotation: continue try: word.replace( cgn.parse_cgn_postag(pos.cls) ) changed = True except cgn.InvalidTagException: print("WARNING: INVALID TAG " + pos.cls,file=sys.stderr) continue if changed: print("Saving...") doc.save() target = sys.argv[1] process(target) PyNLPl-1.1.2/pynlpl/tools/phrasetableserver.py0000755000175000001440000000150012445064173022257 0ustar proyconusers00000000000000#!/usr/bin/env python #-*- coding:utf-8 -*- ############################################################### # PyNLPl - Phrase Table Server # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # ############################################################### import sys import os if __name__ == "__main__": sys.path.append(sys.path[0] + '/../..') os.environ['PYTHONPATH'] = sys.path[0] + '/../..' from pynlpl.formats.moses import PhraseTable, PhraseTableServer if len(sys.argv) != 3: print >>sys.stderr,"Syntax: phrasetableserver.py phrasetable port" sys.exit(2) else: port = int(sys.argv[2]) PhraseTableServer(PhraseTable(sys.argv[1]), port) PyNLPl-1.1.2/pynlpl/docs/0000755000175000001440000000000013024723552015752 5ustar proyconusers00000000000000PyNLPl-1.1.2/pynlpl/docs/conf.py0000644000175000001440000001450213024723323017247 0ustar proyconusers00000000000000# -*- coding: utf-8 -*- # # PyNLPl documentation build configuration file, created by # sphinx-quickstart on Tue Jul 6 22:07:20 2010. # # This file is execfile()d with the current directory set to its containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import sys, os # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. #sys.path.append(os.path.abspath('.')) sys.path.append(os.path.abspath('../../')) from pynlpl import VERSION # -- General configuration ----------------------------------------------------- # Add any Sphinx extension module names here, as strings. They can be extensions # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon','sphinx.ext.autosummary'] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8' # The master toctree document. master_doc = 'index' # General information about the project. 
project = u'PyNLPl' copyright = u'2016, Maarten van Gompel' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = VERSION # The full version, including alpha/beta/rc tags. release = VERSION # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of documents that shouldn't be included in the build. #unused_docs = [] # List of directories, relative to source directory, that shouldn't be searched # for source files. exclude_trees = ['_build'] # The reST default role (used for this markup: `text`) to use for all documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. #add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # -- Options for HTML output --------------------------------------------------- # The theme to use for HTML and HTML Help pages. Major themes that come with # Sphinx are currently 'default' and 'sphinxdoc'. html_theme = 'default' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". # html_static_path = ['_static'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_use_modindex = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. 
#html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = '' # Output file base name for HTML help builder. # htmlhelp_basename = 'pynlpl' # -- Options for LaTeX output -------------------------------------------------- # The paper size ('letter' or 'a4'). latex_paper_size = 'a4' # The font size ('10pt', '11pt' or '12pt'). #latex_font_size = '10pt' # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, documentclass [howto/manual]). latex_documents = [ ('index', 'pynlpl.tex', u'PyNLPl Documentation', u'Maarten van Gompel', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # Additional stuff for the LaTeX preamble. #latex_preamble = '' # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_use_modindex = True autosummary_generate = True PyNLPl-1.1.2/pynlpl/evaluation.py0000644000175000001440000006134612771005267017561 0ustar proyconusers00000000000000############################################################### # PyNLPl - Evaluation Library # by Maarten van Gompel (proycon) # http://ilk.uvt.nl/~mvgompel # Induction for Linguistic Knowledge Research Group # Universiteit van Tilburg # # Licensed under GPLv3 # # This is a Python library with classes and functions for evaluation # and experiments . # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout import io from pynlpl.statistics import FrequencyList from collections import defaultdict try: import numpy as np except ImportError: np = None import subprocess import itertools import time import random import copy import datetime import os.path def auc(x, y, reorder=False): #from sklearn, http://scikit-learn.org, licensed under BSD License """Compute Area Under the Curve (AUC) using the trapezoidal rule This is a general fuction, given points on a curve. For computing the area under the ROC-curve, see :func:`auc_score`. Parameters ---------- x : array, shape = [n] x coordinates. y : array, shape = [n] y coordinates. reorder : boolean, optional (default=False) If True, assume that the curve is ascending in the case of ties, as for an ROC curve. If the curve is non-ascending, the result will be wrong. 
Returns ------- auc : float Examples -------- >>> import numpy as np >>> from sklearn import metrics >>> y = np.array([1, 1, 2, 2]) >>> pred = np.array([0.1, 0.4, 0.35, 0.8]) >>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2) >>> metrics.auc(fpr, tpr) 0.75 See also -------- auc_score : Computes the area under the ROC curve """ if np is None: raise ImportError("No numpy installed") # XXX: Consider using ``scipy.integrate`` instead, or moving to # ``utils.extmath`` if not isinstance(x, np.ndarray): x = np.array(x) if not isinstance(x, np.ndarray): y = np.array(y) if x.shape[0] < 2: raise ValueError('At least 2 points are needed to compute' ' area under curve, but x.shape = %s' % x.shape) if reorder: # reorder the data points according to the x axis and using y to # break ties x, y = np.array(sorted(points for points in zip(x, y))).T h = np.diff(x) else: h = np.diff(x) if np.any(h < 0): h *= -1 assert not np.any(h < 0), ("Reordering is not turned on, and " "The x array is not increasing: %s" % x) area = np.sum(h * (y[1:] + y[:-1])) / 2.0 return area class ProcessFailed(Exception): pass class ConfusionMatrix(FrequencyList): """Confusion Matrix""" def __str__(self): """Print Confusion Matrix in table form""" o = "== Confusion Matrix == (hor: goals, vert: observations)\n\n" keys = sorted( set( ( x[1] for x in self._count.keys()) ) ) linemask = "%20s" cells = [''] for keyH in keys: l = len(keyH) if l < 4: l = 4 elif l > 15: l = 15 linemask += " %" + str(l) + "s" cells.append(keyH) linemask += "\n" o += linemask % tuple(cells) for keyV in keys: linemask = "%20s" cells = [keyV] for keyH in keys: l = len(keyH) if l < 4: l = 4 elif l > 15: l = 15 linemask += " %" + str(l) + "d" try: count = self._count[(keyH, keyV)] except: count = 0 cells.append(count) linemask += "\n" o += linemask % tuple(cells) return o class ClassEvaluation(object): def __init__(self, goals = [], observations = [], missing = {}, encoding ='utf-8'): assert len(observations) == len(goals) self.observations = copy.copy(observations) self.goals = copy.copy(goals) self.classes = set(self.observations + self.goals) self.tp = defaultdict(int) self.fp = defaultdict(int) self.tn = defaultdict(int) self.fn = defaultdict(int) self.missing = missing self.encoding = encoding self.computed = False if self.observations: self.compute() def append(self, goal, observation): self.goals.append(goal) self.observations.append(observation) self.classes.add(goal) self.classes.add(observation) self.computed = False def precision(self, cls=None, macro=False): if not self.computed: self.compute() if cls: if self.tp[cls] + self.fp[cls] > 0: return self.tp[cls] / (self.tp[cls] + self.fp[cls]) else: #return float('nan') return 0 else: if len(self.observations) > 0: if macro: return sum( ( self.precision(x) for x in set(self.goals) ) ) / len(set(self.classes)) else: return sum( ( self.precision(x) for x in self.goals ) ) / len(self.goals) else: #return float('nan') return 0 def recall(self, cls=None, macro=False): if not self.computed: self.compute() if cls: if self.tp[cls] + self.fn[cls] > 0: return self.tp[cls] / (self.tp[cls] + self.fn[cls]) else: #return float('nan') return 0 else: if len(self.observations) > 0: if macro: return sum( ( self.recall(x) for x in set(self.goals) ) ) / len(set(self.classes)) else: return sum( ( self.recall(x) for x in self.goals ) ) / len(self.goals) else: #return float('nan') return 0 def specificity(self, cls=None, macro=False): if not self.computed: self.compute() if cls: if self.tn[cls] + self.fp[cls] > 0: 
return self.tn[cls] / (self.tn[cls] + self.fp[cls]) else: #return float('nan') return 0 else: if len(self.observations) > 0: if macro: return sum( ( self.specificity(x) for x in set(self.goals) ) ) / len(set(self.classes)) else: return sum( ( self.specificity(x) for x in self.goals ) ) / len(self.goals) else: #return float('nan') return 0 def accuracy(self, cls=None): if not self.computed: self.compute() if cls: if self.tp[cls] + self.tn[cls] + self.fp[cls] + self.fn[cls] > 0: return (self.tp[cls]+self.tn[cls]) / (self.tp[cls] + self.tn[cls] + self.fp[cls] + self.fn[cls]) else: #return float('nan') return 0 else: if len(self.observations) > 0: return sum( ( self.tp[x] for x in self.tp ) ) / len(self.observations) else: #return float('nan') return 0 def fscore(self, cls=None, beta=1, macro=False): if not self.computed: self.compute() if cls: prec = self.precision(cls) rec = self.recall(cls) if prec * rec > 0: return (1 + beta*beta) * ((prec * rec) / (beta*beta * prec + rec)) else: #return float('nan') return 0 else: if len(self.observations) > 0: if macro: return sum( ( self.fscore(x,beta) for x in set(self.goals) ) ) / len(set(self.classes)) else: return sum( ( self.fscore(x,beta) for x in self.goals ) ) / len(self.goals) else: #return float('nan') return 0 def tp_rate(self, cls=None, macro=False): if not self.computed: self.compute() if cls: if self.tp[cls] > 0: return self.tp[cls] / (self.tp[cls] + self.fn[cls]) else: return 0 else: if len(self.observations) > 0: if macro: return sum( ( self.tp_rate(x) for x in set(self.goals) ) ) / len(set(self.classes)) else: return sum( ( self.tp_rate(x) for x in self.goals ) ) / len(self.goals) else: #return float('nan') return 0 def fp_rate(self, cls=None, macro=False): if not self.computed: self.compute() if cls: if self.fp[cls] > 0: return self.fp[cls] / (self.tn[cls] + self.fp[cls]) else: return 0 else: if len(self.observations) > 0: if macro: return sum( ( self.fp_rate(x) for x in set(self.goals) ) ) / len(set(self.classes)) else: return sum( ( self.fp_rate(x) for x in self.goals ) ) / len(self.goals) else: #return float('nan') return 0 def auc(self, cls=None, macro=False): if not self.computed: self.compute() if cls: tpr = self.tp_rate(cls) fpr = self.fp_rate(cls) return auc([0,fpr,1], [0,tpr,1]) else: if len(self.observations) > 0: if macro: return sum( ( self.auc(x) for x in set(self.goals) ) ) / len(set(self.classes)) else: return sum( ( self.auc(x) for x in self.goals ) ) / len(self.goals) else: #return float('nan') return 0 def __iter__(self): for g,o in zip(self.goals, self.observations): yield g,o def compute(self): self.tp = defaultdict(int) self.fp = defaultdict(int) self.tn = defaultdict(int) self.fn = defaultdict(int) for cls, count in self.missing.items(): self.fn[cls] = count for goal, observation in self: if goal == observation: self.tp[observation] += 1 elif goal != observation: self.fp[observation] += 1 self.fn[goal] += 1 l = len(self.goals) + sum(self.missing.values()) for o in self.classes: self.tn[o] = l - self.tp[o] - self.fp[o] - self.fn[o] self.computed = True def confusionmatrix(self, casesensitive =True): return ConfusionMatrix(zip(self.goals, self.observations), casesensitive) def outputmetrics(self): o = "Accuracy: " + str(self.accuracy()) + "\n" o += "Samples: " + str(len(self.goals)) + "\n" o += "Correct: " + str(sum( ( self.tp[x] for x in set(self.goals)) ) ) + "\n" o += "Recall (microav): "+ str(self.recall()) + "\n" o += "Recall (macroav): "+ str(self.recall(None,True)) + "\n" o += "Precision (microav): " + 
str(self.precision()) + "\n" o += "Precision (macroav): "+ str(self.precision(None,True)) + "\n" o += "Specificity (microav): " + str(self.specificity()) + "\n" o += "Specificity (macroav): "+ str(self.specificity(None,True)) + "\n" o += "F-score1 (microav): " + str(self.fscore()) + "\n" o += "F-score1 (macroav): " + str(self.fscore(None,1,True)) + "\n" return o def __str__(self): if not self.computed: self.compute() o = "%-15s TP\tFP\tTN\tFN\tAccuracy\tPrecision\tRecall(TPR)\tSpecificity(TNR)\tF-score\n" % ("") for cls in sorted(set(self.classes)): cls = u(cls) o += "%-15s %d\t%d\t%d\t%d\t%4f\t%4f\t%4f\t%4f\t%4f\n" % (cls, self.tp[cls], self.fp[cls], self.tn[cls], self.fn[cls], self.accuracy(cls), self.precision(cls), self.recall(cls),self.specificity(cls), self.fscore(cls) ) return o + "\n" + self.outputmetrics() def __unicode__(self): #Python 2.x return str(self) class AbstractExperiment(object): def __init__(self, inputdata = None, **parameters): self.inputdata = inputdata self.parameters = self.defaultparameters() for parameter, value in parameters.items(): self.parameters[parameter] = value self.process = None self.creationtime = datetime.datetime.now() self.begintime = self.endtime = 0 def defaultparameters(self): return {} def duration(self): if self.endtime and self.begintime: return self.endtime - self.begintime else: return 0 def start(self): """Start as a detached subprocess, immediately returning execution to caller.""" raise Exception("Not implemented yet, make sure to overload the start() method in your Experiment class") def done(self, warn=True): """Is the subprocess done?""" if not self.process: raise Exception("Not implemented yet or process not started yet, make sure to overload the done() method in your Experiment class") self.process.poll() if self.process.returncode == None: return False elif self.process.returncode > 0: raise ProcessFailed() else: self.endtime = datetime.datetime.now() return True def run(self): if hasattr(self,'start'): self.start() self.wait() else: raise Exception("Not implemented yet, make sure to overload the run() method!") def startcommand(self, command, cwd, stdout, stderr, *arguments, **parameters): argdelimiter=' ' printcommand = True cmd = command if arguments: cmd += ' ' + " ".join([ u(x) for x in arguments]) if parameters: for key, value in parameters.items(): if key == 'argdelimiter': argdelimiter = value elif key == 'printcommand': printcommand = value elif isinstance(value, bool) and value == True: cmd += ' ' + key elif key[-1] != '=': cmd += ' ' + key + argdelimiter + str(value) else: cmd += ' ' + key + str(value) if printcommand: print("STARTING COMMAND: " + cmd, file=stderr) self.begintime = datetime.datetime.now() if not cwd: self.process = subprocess.Popen(cmd, shell=True,stdout=stdout,stderr=stderr) else: self.process = subprocess.Popen(cmd, shell=True,cwd=cwd,stdout=stdout,stderr=stderr) #pid = process.pid #os.waitpid(pid, 0) #wait for process to finish return self.process def wait(self): while not self.done(): time.sleep(1) pass def score(self): raise Exception("Not implemented yet, make sure to overload the score() method") def delete(self): pass def sample(self, size): """Return a sample of the input data""" raise Exception("Not implemented yet, make sure to overload the sample() method") class ExperimentPool(object): def __init__(self, size): self.size = size self.queue = [] self.running = [] def append(self, experiment): assert isinstance(experiment, AbstractExperiment) self.queue.append( experiment ) def __len__(self): 
return len(self.queue) def __iter__(self): return iter(self.queue) def start(self, experiment): experiment.start() self.running.append( experiment ) def poll(self, haltonerror=True): done = [] for experiment in self.running: try: if experiment.done(): done.append( experiment ) except ProcessFailed: print("ERROR: One experiment in the pool failed: " + repr(experiment.inputdata) + repr(experiment.parameters), file=stderr) if haltonerror: raise else: done.append( experiment ) for experiment in done: self.running.remove( experiment ) return done def run(self, haltonerror=True): while True: #check how many processes are done done = self.poll(haltonerror) for experiment in done: yield experiment #start new processes while self.queue and len(self.running) < self.size: self.start( self.queue.pop(0) ) if not self.queue and not self.running: break class WPSParamSearch(object): """ParamSearch with support for Wrapped Progressive Sampling""" def __init__(self, experimentclass, inputdata, size, parameterscope, poolsize=1, sizefunc=None, prunefunc=None, constraintfunc = None, delete=True): #parameterscope: {'parameter':[values]} self.ExperimentClass = experimentclass self.inputdata = inputdata self.poolsize = poolsize #0 or 1: sequential execution (uses experiment.run() ), >1: parallel execution using ExperimentPool (uses experiment.start() ) self.maxsize = size self.delete = delete #delete intermediate experiments if self.maxsize == -1: self.sizefunc = lambda x,y: self.maxsize else: if sizefunc != None: self.sizefunc = sizefunc else: self.sizefunc = lambda i, maxsize: round((maxsize/100.0)*i*i) #prunefunc should return a number between 0 and 1, indicating how much is pruned. (for example: 0.75 prunes three/fourth of all combinations, retaining only 25%) if prunefunc != None: self.prunefunc = prunefunc else: self.prunefunc = lambda i: 0.5 if constraintfunc != None: self.constraintfunc = constraintfunc else: self.constraintfunc = lambda x: True #compute all parameter combinations: if isinstance(parameterscope, dict): verboseparameterscope = [ self._combine(x,y) for x,y in parameterscope.items() ] else: verboseparameterscope = [ self._combine(x,y) for x,y in parameterscope ] self.parametercombinations = [ (x,0) for x in itertools.product(*verboseparameterscope) if self.constraintfunc(dict(x)) ] #generator def _combine(self,name, values): #TODO: can't we do this inline in a list comprehension? 
l = [] for value in values: l.append( (name, value) ) return l def searchbest(self): solution = None for s in iter(self): solution = s return solution[0] def test(self,i=None): #sample size elements from inputdata if i is None or self.maxsize == -1: data = self.inputdata else: size = int(self.sizefunc(i, self.maxsize)) if size > self.maxsize: return [] data = self.ExperimentClass.sample(self.inputdata, size) #run on ALL available parameter combinations and retrieve score newparametercombinations = [] if self.poolsize <= 1: #Don't use experiment pool, sequential execution for parameters,score in self.parametercombinations: experiment = self.ExperimentClass(data, **dict(parameters)) experiment.run() newparametercombinations.append( (parameters, experiment.score()) ) if self.delete: experiment.delete() else: #Use experiment pool, parallel execution pool = ExperimentPool(self.poolsize) for parameters,score in self.parametercombinations: pool.append( self.ExperimentClass(data, **dict(parameters)) ) for experiment in pool.run(False): newparametercombinations.append( (experiment.parameters, experiment.score()) ) if self.delete: experiment.delete() return newparametercombinations def __iter__(self): i = 0 while True: i += 1 newparametercombinations = self.test(i) #prune the combinations, keeping only the best prune = int(round(self.prunefunc(i) * len(newparametercombinations))) self.parametercombinations = sorted(newparametercombinations, key=lambda v: v[1])[prune:] yield [ x[0] for x in self.parametercombinations ] if len(self.parametercombinations) <= 1: break class ParamSearch(WPSParamSearch): """A simpler version of ParamSearch without Wrapped Progressive Sampling""" def __init__(self, experimentclass, inputdata, parameterscope, poolsize=1, constraintfunc = None, delete=True): #parameterscope: {'parameter':[values]} prunefunc = lambda x: 0 super(ParamSearch, self).__init__(experimentclass, inputdata, -1, parameterscope, poolsize, None,prunefunc, constraintfunc, delete) def __iter__(self): for parametercombination, score in sorted(self.test(), key=lambda v: v[1]): yield parametercombination, score def filesampler(files, testsetsize = 0.1, devsetsize = 0, trainsetsize = 0, outputdir = '', encoding='utf-8'): """Extract a training set, test set and optimally a development set from one file, or multiple *interdependent* files (such as a parallel corpus). It is assumed each line contains one instance (such as a word or sentence for example).""" if not isinstance(files, list): files = list(files) total = 0 for filename in files: f = io.open(filename,'r', encoding=encoding) count = 0 for line in f: count += 1 f.close() if total == 0: total = count elif total != count: raise Exception("Size mismatch, when multiple files are specified they must contain the exact same amount of lines! (" +str(count) + " vs " + str(total) +")") #support for relative values: if testsetsize < 1: testsetsize = int(total * testsetsize) if devsetsize < 1 and devsetsize > 0: devsetsize = int(total * devsetsize) if testsetsize >= total or devsetsize >= total or testsetsize + devsetsize >= total: raise Exception("Test set and/or development set too large! 
No samples left for training set!") trainset = {} testset = {} devset = {} for i in range(1,total+1): trainset[i] = True for i in random.sample(trainset.keys(), int(testsetsize)): testset[i] = True del trainset[i] if devsetsize > 0: for i in random.sample(trainset.keys(), int(devsetsize)): devset[i] = True del trainset[i] if trainsetsize > 0: newtrainset = {} for i in random.sample(trainset.keys(), int(trainsetsize)): newtrainset[i] = True trainset = newtrainset for filename in files: if not outputdir: ftrain = io.open(filename + '.train','w',encoding=encoding) else: ftrain = io.open(outputdir + '/' + os.path.basename(filename) + '.train','w',encoding=encoding) if not outputdir: ftest = io.open(filename + '.test','w',encoding=encoding) else: ftest = io.open(outputdir + '/' + os.path.basename(filename) + '.test','w',encoding=encoding) if devsetsize > 0: if not outputdir: fdev = io.open(filename + '.dev','w',encoding=encoding) else: fdev = io.open(outputdir + '/' + os.path.basename(filename) + '.dev','w',encoding=encoding) f = io.open(filename,'r',encoding=encoding) for linenum, line in enumerate(f): if linenum+1 in trainset: ftrain.write(line) elif linenum+1 in testset: ftest.write(line) elif devsetsize > 0 and linenum+1 in devset: fdev.write(line) f.close() ftrain.close() ftest.close() if devsetsize > 0: fdev.close() PyNLPl-1.1.2/pynlpl/statistics.py0000644000175000001440000005323312466637746017616 0ustar proyconusers00000000000000############################################################### # PyNLPp - Statistics & Information Theory Library # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Also contains MIT licensed code from # AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html # Peter Norvig # # Licensed under GPLv3 # ############################################################### """This is a Python library containing classes for Statistic and Information Theoretical computations. 
It also contains some code from Peter Norvig, AI: A Modern Appproach : http://aima.cs.berkeley.edu/python/utils.html""" from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import u, isstring import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout import io import math import random import operator from collections import Counter class FrequencyList(object): """A frequency list (implemented using dictionaries)""" def __init__(self, tokens = None, casesensitive = True, dovalidation = True): self._count = Counter() self._ranked = {} self.total = 0 #number of tokens self.casesensitive = casesensitive self.dovalidation = dovalidation if tokens: self.append(tokens) def load(self, filename): """Load a frequency list from file (in the format produced by the save method)""" f = io.open(filename,'r',encoding='utf-8') for line in f: data = line.strip().split("\t") type, count = data[:2] self.count(type,count) f.close() def save(self, filename, addnormalised=False): """Save a frequency list to file, can be loaded later using the load method""" f = io.open(filename,'w',encoding='utf-8') for line in self.output("\t", addnormalised): f.write(line + '\n') f.close() def _validate(self,type): if isinstance(type,list): type = tuple(type) if isinstance(type,tuple): if not self.casesensitive: return tuple([x.lower() for x in type]) else: return type else: if not self.casesensitive: return type.lower() else: return type def append(self,tokens): """Add a list of tokens to the frequencylist. This method will count them for you.""" for token in tokens: self.count(token) def count(self, type, amount = 1): """Count a certain type. The counter will increase by the amount specified (defaults to one)""" if self.dovalidation: type = self._validate(type) if self._ranked: self._ranked = None if type in self._count: self._count[type] += amount else: self._count[type] = amount self.total += amount def sum(self): """Returns the total amount of tokens""" return self.total def _rank(self): if not self._ranked: self._ranked = self._count.most_common() def __iter__(self): """Iterate over the frequency lists, in order (frequent to rare). This is a generator that yields (type, count) pairs. The first time you iterate over the FrequencyList, the ranking will be computed. For subsequent calls it will be available immediately, unless the frequency list changed in the meantime.""" self._rank() for type, count in self._ranked: yield type, count def items(self): """Returns an *unranked* list of (type, count) pairs. 
Use this only if you are not interested in the order.""" for type, count in self._count.items(): yield type, count def __getitem__(self, type): if self.dovalidation: type = self._validate(type) try: return self._count[type] except KeyError: return 0 def __setitem__(self, type, value): """alias for count, but can only be called once""" if self.dovalidation: type = self._validate(type) if not type in self._count: self.count(type,value) else: raise ValueError("This type is already set!") def __delitem__(self, type): if self.dovalidation: type = self._validate(type) del self._count[type] if self._ranked: self._ranked = None def typetokenratio(self): """Computes the type/token ratio""" return len(self._count) / float(self.total) def __len__(self): """Returns the total amount of types""" return len(self._count) def tokens(self): """Returns the total amount of tokens""" return self.total def mode(self): """Returns the type that occurs the most frequently in the frequency list""" self._rank() return self._ranked[0][0] def p(self, type): """Returns the probability (relative frequency) of the token""" if self.dovalidation: type = self._validate(type) return self._count[type] / float(self.total) def __eq__(self, otherfreqlist): return (self.total == otherfreqlist.total and self._count == otherfreqlist._count) def __contains__(self, type): """Checks if the specified type is in the frequency list""" if self.dovalidation: type = self._validate(type) return type in self._count def __add__(self, otherfreqlist): """Multiple frequency lists can be added together""" assert isinstance(otherfreqlist,FrequencyList) product = FrequencyList(None,) for type, count in self.items(): product.count(type,count) for type, count in otherfreqlist.items(): product.count(type,count) return product def output(self,delimiter = '\t', addnormalised=False): """Print a representation of the frequency list""" for type, count in self: if isinstance(type,tuple) or isinstance(type,list): if addnormalised: yield " ".join((u(x) for x in type)) + delimiter + str(count) + delimiter + str(count/self.total) else: yield " ".join((u(x) for x in type)) + delimiter + str(count) elif isstring(type): if addnormalised: yield type + delimiter + str(count) + delimiter + str(count/self.total) else: yield type + delimiter + str(count) else: if addnormalised: yield str(type) + delimiter + str(count) + delimiter + str(count/self.total) else: yield str(type) + delimiter + str(count) def __repr__(self): return repr(self._count) def __unicode__(self): #Python 2 return str(self) def __str__(self): return "\n".join(self.output()) def values(self): return self._count.values() def dict(self): return self._count #class FrequencyTrie: # def __init__(self): # self.data = Tree() # # def count(self, sequence): # # # self.data.append( Tree(item) ) class Distribution(object): """A distribution can be created over a FrequencyList or a plain dictionary with numeric values. It will be normalized automatically. This implemtation uses dictionaries/hashing""" def __init__(self, data, base = 2): self.base = base #logarithmic base: can be set to 2, 10 or math.e (or anything else). 
when set to None, it's set to e automatically self._dist = {} if isinstance(data, FrequencyList): for type, count in data.items(): self._dist[type] = count / data.total elif isinstance(data, dict) or isinstance(data, list): if isinstance(data, list): self._dist = {} for key,value in data: self._dist[key] = float(value) else: self._dist = data total = sum(self._dist.values()) if total < 0.999 or total > 1.000: #normalize again for key, value in self._dist.items(): self._dist[key] = value / total else: raise Exception("Can't create distribution") self._ranked = None def _rank(self): if not self._ranked: self._ranked = sorted(self._dist.items(),key=lambda x: x[1], reverse=True ) def information(self, type): """Computes the information content of the specified type: -log_e(p(X))""" if not self.base: return -math.log(self._dist[type]) else: return -math.log(self._dist[type], self.base) def poslog(self, type): """alias for information content""" return self.information(type) def entropy(self, base = 2): """Compute the entropy of the distribution""" entropy = 0 if not base and self.base: base = self.base for type in self._dist: if not base: entropy += self._dist[type] * -math.log(self._dist[type]) else: entropy += self._dist[type] * -math.log(self._dist[type], base) return entropy def perplexity(self, base=2): return base ** self.entropy(base) def mode(self): """Returns the type that occurs the most frequently in the probability distribution""" self._rank() return self._ranked[0][0] def maxentropy(self, base = 2): """Compute the maximum entropy of the distribution: log_e(N)""" if not base and self.base: base = self.base if not base: return math.log(len(self._dist)) else: return math.log(len(self._dist), base) def __len__(self): """Returns the number of types""" return len(self._dist) def __getitem__(self, type): """Return the probability for this type""" return self._dist[type] def __iter__(self): """Iterate over the *ranked* distribution, returns (type, probability) pairs""" self._rank() for type, p in self._ranked: yield type, p def items(self): """Returns an *unranked* list of (type, prob) pairs. 
Use this only if you are not interested in the order.""" for type, count in self._dist.items(): yield type, count def output(self,delimiter = '\t', freqlist = None): """Generator yielding formatted strings expressing the time and probabily for each item in the distribution""" for type, prob in self: if freqlist: if isinstance(type,list) or isinstance(type, tuple): yield " ".join(type) + delimiter + str(freqlist[type]) + delimiter + str(prob) else: yield type + delimiter + str(freqlist[type]) + delimiter + str(prob) else: if isinstance(type,list) or isinstance(type, tuple): yield " ".join(type) + delimiter + str(prob) else: yield type + delimiter + str(prob) def __unicode__(self): return str(self) def __str__(self): return "\n".join(self.output()) def __repr__(self): return repr(self._dist) def keys(self): return self._dist.keys() def values(self): return self._dist.values() class MarkovChain(object): def __init__(self, startstate, endstate = None): self.nodes = set() self.edges_out = {} self.startstate = startstate self.endstate = endstate def settransitions(self, state, distribution): self.nodes.add(state) if not isinstance(distribution, Distribution): distribution = Distribution(distribution) self.edges_out[state] = distribution self.nodes.update(distribution.keys()) def __iter__(self): for state, distribution in self.edges_out.items(): yield state, distribution def __getitem__(self, state): for distribution in self.edges_out[state]: yield distribution def size(self): return len(self.nodes) def accessible(self,fromstate, tostate): """Is state tonode directly accessible (in one step) from state fromnode? (i.e. is there an edge between the nodes). If so, return the probability, else zero""" if (not (fromstate in self.nodes)) or (not (tostate in self.nodes)) or not (fromstate in self.edges_out): return 0 if tostate in self.edges_out[fromstate]: return self.edges_out[fromstate][tostate] else: return 0 def communicates(self,fromstate, tostate, maxlength=999999): """See if a node communicates (directly or indirectly) with another. Returns the probability of the *shortest* path (probably, but not necessarily the highest probability)""" if (not (fromstate in self.nodes)) or (not (tostate in self.nodes)): return 0 assert (fromstate != tostate) def _test(node,length,prob): if length > maxlength: return 0 if node == tostate: prob *= self.edges_out[node][tostate] return True for child in self.edges_out[node].keys(): if not child in visited: visited.add(child) if child == tostate: return prob * self.edges_out[node][tostate] else: r = _test(child, length+1, prob * self.edges_out[node][tostate]) if r: return r return 0 visited = set(fromstate) return _test(fromstate,1,1) def p(self, sequence, subsequence=True): """Returns the probability of the given sequence or subsequence (if subsequence=True, default).""" if sequence[0] != self.startstate: if isinstance(sequence, tuple): sequence = (self.startstate,) + sequence else: sequence = (self.startstate,) + tuple(sequence) if self.endstate: if sequence[-1] != self.endstate: if isinstance(sequence, tuple): sequence = sequence + (self.endstate,) else: sequence = tuple(sequence) + (self.endstate,) prevnode = None prob = 1 for node in sequence: if prevnode: try: prob *= self.edges_out[prevnode][node] except: return 0 return prob def __contains__(self, sequence): """Is the given sequence generated by the markov model? 
Does not work for subsequences!""" return bool(self.p(sequence,False)) def reducible(self): #TODO: implement raise NotImplementedError class HiddenMarkovModel(MarkovChain): def __init__(self, startstate, endstate = None): self.observablenodes = set() self.edges_toobservables = {} super(HiddenMarkovModel, self).__init__(startstate,endstate) def setemission(self, state, distribution): self.nodes.add(state) if not isinstance(distribution, Distribution): distribution = Distribution(distribution) self.edges_toobservables[state] = distribution self.observablenodes.update(distribution.keys()) def print_dptable(self, V): print(" ",end="",file=stdout) for i in range(len(V)): print("%7s" % ("%d" % i),end="",file=stdout) print(file=stdout) for y in V[0].keys(): print("%.5s: " % y, end="",file=stdout) for t in range(len(V)): print("%.7s" % ("%f" % V[t][y]),end="",file=stdout) print(file=stdout) #Adapted from: http://en.wikipedia.org/wiki/Viterbi_algorithm def viterbi(self,observations, doprint=False): #states, start_p, trans_p, emit_p): V = [{}] #Viterbi matrix path = {} # Initialize base cases (t == 0) for node in self.edges_out[self.startstate].keys(): try: V[0][node] = self.edges_out[self.startstate][node] * self.edges_toobservables[node][observations[0]] path[node] = [node] except KeyError: pass #will be 0, don't store # Run Viterbi for t > 0 for t in range(1,len(observations)): V.append({}) newpath = {} for node in self.nodes: column = [] for prevnode in V[t-1].keys(): try: column.append( (V[t-1][prevnode] * self.edges_out[prevnode][node] * self.edges_toobservables[node][observations[t]], prevnode ) ) except KeyError: pass #will be 0 if column: (prob, state) = max(column) V[t][node] = prob newpath[node] = path[state] + [node] # Don't need to remember the old paths path = newpath if doprint: self.print_dptable(V) if not V[len(observations) - 1]: return (0,[]) else: (prob, state) = max([(V[len(observations) - 1][node], node) for node in V[len(observations) - 1].keys()]) return (prob, path[state]) # ********************* Common Functions ****************************** def product(seq): """Return the product of a sequence of numerical values. >>> product([1,2,6]) 12 """ if len(seq) == 0: return 0 else: product = 1 for x in seq: product *= x return product # All below functions are mathematical functions from AI: A Modern Approach, see: http://aima.cs.berkeley.edu/python/utils.html def histogram(values, mode=0, bin_function=None): #from AI: A Modern Appproach """Return a list of (value, count) pairs, summarizing the input values. Sorted by increasing value, or if mode=1, by decreasing count. If bin_function is given, map it over values first.""" if bin_function: values = map(bin_function, values) bins = {} for val in values: bins[val] = bins.get(val, 0) + 1 if mode: return sorted(bins.items(), key=lambda v: v[1], reverse=True) else: return sorted(bins.items()) def log2(x): #from AI: A Modern Appproach """Base 2 logarithm. >>> log2(1024) 10.0 """ return math.log(x, 2) def mode(values): #from AI: A Modern Appproach """Return the most common value in the list of values. >>> mode([1, 2, 3, 2]) 2 """ return histogram(values, mode=1)[0][0] def median(values): #from AI: A Modern Appproach """Return the middle value, when the values are sorted. If there are an odd number of elements, try to average the middle two. If they can't be averaged (e.g. they are strings), choose one at random. 
>>> median([10, 100, 11]) 11 >>> median([1, 2, 3, 4]) 2.5 """ n = len(values) values = sorted(values) if n % 2 == 1: return values[n//2] else: middle2 = values[(n//2)-1:(n//2)+1] try: return mean(middle2) except TypeError: return random.choice(middle2) def mean(values): #from AI: A Modern Approach """Return the arithmetic average of the values.""" return sum(values) / len(values) def stddev(values, meanval=None): #from AI: A Modern Approach """The standard deviation of a set of values. Pass in the mean if you already know it.""" if meanval == None: meanval = mean(values) return math.sqrt( sum([(x - meanval)**2 for x in values]) / (len(values)-1) ) def dotproduct(X, Y): #from AI: A Modern Approach """Return the sum of the element-wise product of vectors x and y. >>> dotproduct([1, 2, 3], [1000, 100, 10]) 1230 """ return sum([x * y for x, y in zip(X, Y)]) def vector_add(a, b): #from AI: A Modern Approach """Component-wise addition of two vectors. >>> vector_add((0, 1), (8, 9)) (8, 10) """ return tuple(map(operator.add, a, b)) def normalize(numbers, total=1.0): #from AI: A Modern Approach """Multiply each number by a constant such that the sum is 1.0 (or total). >>> normalize([1,2,1]) [0.25, 0.5, 0.25] """ k = total / sum(numbers) return [k * n for n in numbers] ########################################################################################### def levenshtein(s1, s2, maxdistance=9999): """Computes the levenshtein distance between two strings. Adapted from: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python""" l1 = len(s1) l2 = len(s2) if l1 < l2: return levenshtein(s2, s1) if not s1: return len(s2) #If the words differ too much in length (and we have a low maxdistance), we needn't bother computing the distance: if l1 > l2 + maxdistance: return maxdistance+1 previous_row = list(range(l2 + 1)) for i, c1 in enumerate(s1): current_row = [i + 1] for j, c2 in enumerate(s2): insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer deletions = current_row[j] + 1 # than s2 substitutions = previous_row[j] + (c1 != c2) current_row.append(min(insertions, deletions, substitutions)) if current_row[-1] > maxdistance: return current_row[-1] previous_row = current_row return previous_row[-1]
PyNLPl-1.1.2/pynlpl/common.py
#!/usr/bin/env python #-*- coding:utf-8 -*- ############################################################### # PyNLPl - Common functions # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Licensed under GPLv3 # # This contains very common functions and language extensions # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import import datetime from sys import stderr, version ## From http://code.activestate.com/recipes/413486/ (r7) def Enum(*names): ##assert names, "Empty enums are not supported" # <- Don't like empty enums? Uncomment!
class EnumClass(object): __slots__ = names def __iter__(self): return iter(constants) def __len__(self): return len(constants) def __getitem__(self, i): return constants[i] def __repr__(self): return 'Enum' + str(names) def __str__(self): return 'enum ' + str(constants) class EnumValue(object): __slots__ = ('__value') def __init__(self, value): self.__value = value Value = property(lambda self: self.__value) EnumType = property(lambda self: EnumType) def __hash__(self): return hash(self.__value) def __cmp__(self, other): # C fans might want to remove the following assertion # to make all enums comparable by ordinal value {;)) assert self.EnumType is other.EnumType, "Only values from the same enum are comparable" return cmp(self.__value, other.__value) def __invert__(self): return constants[maximum - self.__value] def __bool__(self): return bool(self.__value) def __nonzero__(self): return bool(self.__value) #Python 2.x def __repr__(self): return str(names[self.__value]) maximum = len(names) - 1 constants = [None] * len(names) for i, each in enumerate(names): val = EnumValue(i) setattr(EnumClass, each, val) constants[i] = val constants = tuple(constants) EnumType = EnumClass() return EnumType def u(s, encoding = 'utf-8', errors='strict'): #ensure s is properly unicode.. wrapper for python 2.6/2.7, if version < '3': #ensure the object is unicode if isinstance(s, unicode): return s else: return unicode(s, encoding,errors=errors) else: #will work on byte arrays if isinstance(s, str): return s else: return str(s,encoding,errors=errors) def b(s): #ensure s is bytestring if version < '3': #ensure the object is unicode if isinstance(s, str): return s else: return s.encode('utf-8') else: #will work on byte arrays if isinstance(s, bytes): return s else: return s.encode('utf-8') def isstring(s): #Is this a proper string? return isinstance(s, str) or (version < '3' and isinstance(s, unicode)) def log(msg, **kwargs): """Generic log method. Will prepend timestamp. 
Keyword arguments: system - Name of the system/module indent - Integer denoting the desired level of indentation streams - List of streams to output to stream - Stream to output to (singleton version of streams) """ if 'debug' in kwargs: if 'currentdebug' in kwargs: if kwargs['currentdebug'] < kwargs['debug']: return False else: return False #no currentdebug passed, assuming no debug mode and thus skipping message s = "[" + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "] " if 'system' in kwargs: s += "[" + kwargs['system'] + "] " if 'indent' in kwargs: s += ("\t" * int(kwargs['indent'])) s += u(msg) if s[-1] != '\n': s += '\n' if 'streams' in kwargs: streams = kwargs['streams'] elif 'stream' in kwargs: streams = [kwargs['stream']] else: streams = [stderr] for stream in streams: stream.write(s) return s
PyNLPl-1.1.2/pynlpl/build/
PyNLPl-1.1.2/pynlpl/build/pynlpl/
PyNLPl-1.1.2/pynlpl/build/pynlpl/algorithms.py
############################################################### # PyNLPl - Algorithms # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Licensed under GPLv3 # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import def sum_to_n(n, size, limit=None): #from http://stackoverflow.com/questions/2065553/python-get-all-numbers-that-add-up-to-a-number """Produce all lists of `size` positive integers in decreasing order that add up to `n`.""" if size == 1: yield [n] return if limit is None: limit = n start = (n + size - 1) // size stop = min(limit, n - size + 1) + 1 for i in range(start, stop): for tail in sum_to_n(n - i, size - 1, i): yield [i] + tail def consecutivegaps(n, leftmargin = 0, rightmargin = 0): """Compute all possible single consecutive gaps in any sequence of the specified length. Returns (beginindex, length) tuples. Runs in O(n(n+1) / 2) time.
Argument is the length of the sequence rather than the sequence itself""" begin = leftmargin while begin < n: length = (n - rightmargin) - begin while length > 0: yield (begin, length) length -= 1 begin += 1 def bytesize(n): """Return the required size in bytes to encode the specified integer""" for i in range(1, 1000): if n < 2**(8*i): return i PyNLPl-1.1.2/pynlpl/textprocessors.py0000644000175000001440000003660712656166755020537 0ustar proyconusers00000000000000# -*- coding: utf8 -*- ############################################################### # PyNLPl - Text Processors # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Licensed under GPLv3 # # This is a Python library containing text processors # ############################################################### from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import from pynlpl.common import isstring import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout import unicodedata import string import io import array import re from itertools import permutations from pynlpl.statistics import FrequencyList from pynlpl.formats import folia from pynlpl.algorithms import bytesize WHITESPACE = [" ", "\t", "\n", "\r","\v","\f"] EOSMARKERS = ('.','?','!','。',';','؟','。','?','!','।','։','՞','።','᙮','។','៕') REGEXP_URL = re.compile(r"^(?:(?:https?):(?:(?://)|(?:\\\\))|www\.)(?:[\w\d:#@%/;$()~_?\+-=\\\.&](?:#!)?)*") REGEXP_MAIL = re.compile(r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+(?:\.[a-zA-Z]+)+") #email TOKENIZERRULES = (REGEXP_URL, REGEXP_MAIL) class Windower(object): """Moves a sliding window over a list of tokens, upon iteration in yields all n-grams of specified size in a tuple. Example without markers: >>> for ngram in Windower("This is a test .",3, None, None): ... print(" ".join(ngram)) This is a is a test a test . Example with default markers: >>> for ngram in Windower("This is a test .",3): ... print(" ".join(ngram)) This This is This is a is a test a test . test . . """ def __init__(self, tokens, n=1, beginmarker = "", endmarker = ""): """ Constructor for Windower :param tokens: The tokens to iterate over. Should be an itereable. Strings will be split on spaces automatically. :type tokens: iterable :param n: The size of the n-grams to extract :type n: integer :param beginmarker: The marker for the beginning of the sentence, defaults to "". Set to None if no markers are desired. :type beginmarker: string or None :param endmarker: The marker for the end of the sentence, defaults to "". Set to None if no markers are desired. 
:type endmarker: string or None """ if isinstance(tokens, str) or (sys.version < '3' and isinstance(tokens, unicode)): self.tokens = tuple(tokens.split()) else: self.tokens = tuple(tokens) assert isinstance(n, int) self.n = n self.beginmarker = beginmarker self.endmarker = endmarker def __len__(self): """Returns the number of n-grams in the data (quick computation without iteration) Without markers: >>> len(Windower("This is a test .",3, None, None)) 3 >>> len(Windower("This is a test .",2, None, None)) 4 >>> len(Windower("This is a test .",1, None, None)) 5 With default markers: >>> len(Windower("This is a test .",3)) 7 """ c = (len(self.tokens) - self.n) + 1 if self.beginmarker: c += self.n-1 if self.endmarker: c += self.n-1 return c def __iter__(self): """Yields an n-gram (tuple) at each iteration""" l = len(self.tokens) if self.beginmarker: beginmarker = (self.beginmarker), #tuple if self.endmarker: endmarker = (self.endmarker), #tuple for i in range(-(self.n - 1),l): begin = i end = i + self.n if begin >= 0 and end <= l: yield tuple(self.tokens[begin:end]) elif begin < 0 and end > l: if not self.beginmarker or not self.endmarker: continue else: yield tuple(((begin * -1) * beginmarker ) + self.tokens + ((end - l) * endmarker )) elif begin < 0: if not self.beginmarker: continue else: yield tuple(((begin * -1) * beginmarker ) + self.tokens[0:end]) elif end > l: if not self.endmarker: continue else: yield tuple(self.tokens[begin:] + ((end - l) * endmarker)) class MultiWindower(object): "Extract n-grams of various configurations from a sequence" def __init__(self,tokens, min_n = 1, max_n = 9, beginmarker=None, endmarker=None): if isinstance(tokens, str) or (sys.version < '3' and isinstance(tokens, unicode)): self.tokens = tuple(tokens.split()) else: self.tokens = tuple(tokens) assert isinstance(min_n, int) assert isinstance(max_n, int) self.min_n = min_n self.max_n = max_n self.beginmarker = beginmarker self.endmarker = endmarker def __iter__(self): for n in range(self.min_n, self.max_n + 1): for ngram in Windower(self.tokens,n, self.beginmarker, self.endmarker): yield ngram class ReflowText(object): """Attempts to re-flow a text that has arbitrary line endings in it. Also undoes hyphenisation""" def __init__(self, stream, filternontext=True): self.stream = stream self.filternontext = filternontext def __iter__(self): eosmarkers = ('.',':','?','!','"',"'","„","”","’") emptyline = 0 buffer = "" for line in self.stream: line = line.strip() if line: if emptyline: if buffer: yield buffer yield "" emptyline = 0 buffer = "" if buffer: buffer += ' ' if (line[-1] in eosmarkers): buffer += line yield buffer buffer = "" emptyline = 0 elif len(line) > 2 and line[-1] == '-' and line[-2].isalpha(): #undo hyphenisation buffer += line[:-1] else: if self.filternontext: hastext = False for c in line: if c.isalpha(): hastext = True break else: hastext = True if hastext: buffer += line else: emptyline += 1 #print "BUFFER=[" + buffer.encode('utf-8') + "] emptyline=" + str(emptyline) if buffer: yield buffer def calculate_overlap(haystack, needle, allowpartial=True): """Calculate the overlap between two sequences. Yields (overlap, placement) tuples (multiple because there may be multiple overlaps!). 
The former is the part of the sequence that overlaps, and the latter is -1 if the overlap is on the left side, 0 if it is a subset, 1 if it overlaps on the right side, 2 if its an identical match""" needle = tuple(needle) haystack = tuple(haystack) solutions = [] #equality check if needle == haystack: return [(needle, 2)] if allowpartial: minl =1 else: minl = len(needle) for l in range(minl,min(len(needle), len(haystack))+1): #print "LEFT-DEBUG", l,":", needle[-l:], " vs ", haystack[:l] #print "RIGHT-DEBUG", l,":", needle[:l], " vs ", haystack[-l:] #Search for overlap left (including partial overlap!) if needle[-l:] == haystack[:l]: #print "LEFT MATCH" solutions.append( (needle[-l:], -1) ) #Search for overlap right (including partial overlap!) if needle[:l] == haystack[-l:]: #print "RIGHT MATCH" solutions.append( (needle[:l], 1) ) if len(needle) <= len(haystack): options = list(iter(Windower(haystack,len(needle),beginmarker=None,endmarker=None))) for option in options[1:-1]: if option == needle: #print "SUBSET MATCH" solutions.append( (needle, 0) ) return solutions class Tokenizer(object): """A tokenizer and sentence splitter, which acts on a file/stream-like object and when iterating over the object it yields a lists of tokens (in case the sentence splitter is active (default)), or a token (if the sentence splitter is deactivated). """ def __init__(self, stream, splitsentences=True, onesentenceperline=False, regexps=TOKENIZERRULES): """ Constructor for Tokenizer :param stream: An iterable or file-object containing the data to tokenize :type stream: iterable or file-like object :param splitsentences: Enable sentence splitter? (default=_True_) :type splitsentences: bool :param onesentenceperline: Assume input has one sentence per line? (default=_False_) :type onesentenceperline: bool :param regexps: Regular expressions to use as tokeniser rules in tokenisation (default=_pynlpl.textprocessors.TOKENIZERRULES_) :type regexps: Tuple/list of regular expressions to use in tokenisation """ self.stream = stream self.regexps = regexps self.splitsentences=splitsentences self.onesentenceperline = onesentenceperline def __iter__(self): buffer = "" for line in self.stream: line = line.strip() if line: if buffer: buffer += "\n" buffer += line if (self.onesentenceperline or not line) and buffer: if self.splitsentences: yield split_sentences(tokenize(buffer)) else: for token in tokenize(buffer, self.regexps): yield token buffer = "" if buffer: if self.splitsentences: yield split_sentences(tokenize(buffer)) else: for token in tokenize(buffer, self.regexps): yield token def tokenize(text, regexps=TOKENIZERRULES): """Tokenizes a string and returns a list of tokens :param text: The text to tokenise :type text: string :param regexps: Regular expressions to use as tokeniser rules in tokenisation (default=_pynlpl.textprocessors.TOKENIZERRULES_) :type regexps: Tuple/list of regular expressions to use in tokenisation :rtype: Returns a list of tokens Examples: >>> for token in tokenize("This is a test."): ... print(token) This is a test . """ for i,regexp in list(enumerate(regexps)): if isstring(regexp): regexps[i] = re.compile(regexp) tokens = [] begin = 0 for i, c in enumerate(text): if begin > i: continue elif i == begin: m = False for regexp in regexps: m = regexp.findall(text[i:i+300]) if m: tokens.append(m[0]) begin = i + len(m[0]) break if m: continue if c in string.punctuation or c in WHITESPACE: prev = text[i-1] if i > 0 else "" next = text[i+1] if i < len(text)-1 else "" if (c == '.' 
or c == ',') and prev.isdigit() and next.isdigit(): #punctuation in between numbers, keep as one token pass elif (c == "'" or c == "`") and prev.isalpha() and next.isalpha(): #quote in between chars, keep... pass elif c not in WHITESPACE and next == c: #group clusters of identical punctuation together continue elif c == '\r' and prev == '\n': #ignore begin = i+1 continue else: token = text[begin:i] if token: tokens.append(token) if c not in WHITESPACE: tokens.append(c) #anything but spaces and newlines (i.e. punctuation) counts as a token too begin = i + 1 #set the begin cursor if begin <= len(text) - 1: token = text[begin:] tokens.append(token) return tokens def crude_tokenizer(text): """Replaced by tokenize(). Alias""" return tokenize(text) #backwards-compatibility, not so crude anymore def tokenise(text, regexps=TOKENIZERRULES): #for the British """Alias for the British""" return tokenize(text) def is_end_of_sentence(tokens,i ): # is this an end-of-sentence marker? ... and is this either # the last token or the next token is NOT an end of sentence # marker as well? (to deal with ellipsis etc) return tokens[i] in EOSMARKERS and (i == len(tokens) - 1 or not tokens[i+1] in EOSMARKERS) def split_sentences(tokens): """Split sentences (based on tokenised data), returns sentences as a list of lists of tokens, each sentence is a list of tokens""" begin = 0 for i, token in enumerate(tokens): if is_end_of_sentence(tokens, i): yield tokens[begin:i+1] begin = i+1 if begin <= len(tokens)-1: yield tokens[begin:] def strip_accents(s, encoding= 'utf-8'): """Strip characters with diacritics and return a flat ascii representation""" if sys.version < '3': if isinstance(s,unicode): return unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore') else: return unicodedata.normalize('NFKD', unicode(s,encoding)).encode('ASCII', 'ignore') else: if isinstance(s,bytes): s = str(s,encoding) return str(unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore'),'ascii') def swap(tokens, maxdist=2): """Perform a swap operation on a sequence of tokens, exhaustively swapping all tokens up to the maximum specified distance. This is a subset of all permutations.""" assert maxdist >= 2 tokens = list(tokens) if maxdist > len(tokens): maxdist = len(tokens) l = len(tokens) for i in range(0,l - 1): for permutation in permutations(tokens[i:i+maxdist]): if permutation != tuple(tokens[i:i+maxdist]): newtokens = tokens[:i] newtokens += permutation newtokens += tokens[i+maxdist:] yield newtokens if maxdist == len(tokens): break def find_keyword_in_context(tokens, keyword, contextsize=1): """Find a keyword in a particular sequence of tokens, and return the local context. Contextsize is the number of words to the left and right. 
The keyword may consist of multiple words, in which case it should be passed as a tuple or list""" if isinstance(keyword, (tuple, list)): l = len(keyword) else: keyword = (keyword,) l = 1 n = l + contextsize*2 focuspos = contextsize for ngram in Windower(tokens,n,None,None): if ngram[focuspos:focuspos+l] == keyword: yield ngram[:focuspos], ngram[focuspos:focuspos+l], ngram[focuspos+l:] if __name__ == "__main__": import doctest doctest.testmod()
PyNLPl-1.1.2/pynlpl/search.py
#--------------------------------------------------------------- # PyNLPl - Search Algorithms # by Maarten van Gompel # Centre for Language Studies # Radboud University Nijmegen # http://www.github.com/proycon/pynlpl # proycon AT anaproy DOT nl # # Licensed under GPLv3 # #---------------------------------------------------------------- """This module contains various search algorithms.""" from __future__ import print_function from __future__ import unicode_literals from __future__ import division from __future__ import absolute_import #from pynlpl.common import u import sys if sys.version < '3': from codecs import getwriter stderr = getwriter('utf-8')(sys.stderr) stdout = getwriter('utf-8')(sys.stdout) else: stderr = sys.stderr stdout = sys.stdout from pynlpl.datatypes import FIFOQueue, PriorityQueue from collections import deque from bisect import bisect_left class AbstractSearchState(object): def __init__(self, parent = None, cost = 0): self.parent = parent self.cost = cost def test(self, goalstates = None): """Checks whether this state is a valid goal state, returns a boolean. If no goalstate is defined, then all states will test positively, this is what you usually want for optimisation problems.""" if goalstates: return (self in goalstates) else: return True #raise Exception("Classes derived from AbstractSearchState must define a test() method!") def score(self): """Should return a heuristic value.
This needs to be set if you plan to used an informed search algorithm.""" raise Exception("Classes derived from AbstractSearchState must define a score() method if used in informed search algorithms!") def expand(self): """Generates successor states, implement your custom operators in the derived method.""" raise Exception("Classes derived from AbstractSearchState must define an expand() method!") def __eq__(self): """Implement an equality test in the derived method, based only on the state's content (not its path etc!)""" raise Exception("Classes derived from AbstractSearchState must define an __eq__() method!") def __lt__(self, other): assert isinstance(other, AbstractSearchState) return self.score() < other.score() def __gt__(self, other): assert isinstance(other, AbstractSearchState) return self.score() > other.score() def __hash__(self): """Return a unique hash for this state, based on its ID""" raise Exception("Classes derived from AbstractSearchState must define a __hash__() method if the search space is a graph and visited nodes to be are stored in memory!") def depth(self): if not self.parent: return 0 else: return self.parent.depth() + 1 #def __len__(self): # return len(self.path()) def path(self): if not self.parent: return [self] else: return self.parent.path() + [self] def pathcost(self): if not self.parent: return self.cost else: return self.parent.pathcost() + self.cost #def __cmp__(self, other): # if self.score < other.score: # return -1 # elif self.score > other.score: # return 1 # else: # return 0 class AbstractSearch(object): #not a real search, just a base class for DFS and BFS def __init__(self, **kwargs): """For graph-searches graph=True is required (default), otherwise the search may loop forever. For tree-searches, set tree=True for better performance""" self.usememory = True self.poll = lambda x: x.pop self.maxdepth = False #unlimited self.minimize = False #minimize rather than maximize the score function? default: no self.keeptraversal = False self.goalstates = None self.exhaustive = False #only some subclasses use this self.traversed = 0 #Count of number of nodes visited self.solutions = 0 #Counts the number of solutions self.debug = 0 for key, value in kwargs.items(): if key == 'graph': self.usememory = value #search space is a graph? memory required to keep visited states elif key == 'tree': self.usememory = not value; #search space is a tree? memory not required elif key == 'poll': self.poll = value #function elif key == 'maxdepth': self.maxdepth = value elif key == 'minimize': self.minimize = value elif key == 'maximize': self.minimize = not value elif key == 'keeptraversal': #remember entire traversal? 
self.keeptraversal = value elif key == 'goal' or key == 'goals': if isinstance(value, list) or isinstance(value, tuple): self.goalstates = value else: self.goalstates = [value] elif key == 'exhaustive': self.exhaustive = True elif key == 'debug': self.debug = value self._visited = {} self._traversal = [] self.incomplete = False self.traversed = 0 def reset(self): self._visited = {} self._traversal = [] self.incomplete = False self.traversed = 0 #Count of all visited nodes self.solutions = 0 #Counts the number of solutions found def traversal(self): """Returns all visited states (only when keeptraversal=True), note that this is not equal to the path, but contains all states that were checked!""" if self.keeptraversal: return self._traversal else: raise Exception("No traversal available, algorithm not started with keeptraversal=True!") def traversalsize(self): """Returns the number of nodes visited (also when keeptravel=False). Note that this is not equal to the path, but contains all states that were checked!""" return self.traversed def visited(self, state): if self.usememory: return (hash(state) in self._visited) else: raise Exception("No memory kept, algorithm not started with graph=True!") def __iter__(self): """Generator yielding *all* valid goalstates it can find,""" n = 0 while len(self.fringe) > 0: n += 1 if self.debug: print("\t[pynlpl debug] *************** ITERATION #" + str(n) + " ****************",file=stderr) if self.debug: print("\t[pynlpl debug] FRINGE: ", self.fringe,file=stderr) state = self.poll(self.fringe)() if self.debug: try: print("\t[pynlpl debug] CURRENT STATE (depth " + str(state.depth()) + "): " + str(state),end="",file=stderr) except AttributeError: print("\t[pynlpl debug] CURRENT STATE: " + str(state),end="",file=stderr) print(" hash="+str(hash(state)),file=stderr) try: print(" score="+str(state.score()),file=stderr) except: pass #If node not visited before (or no memory kept): if not self.usememory or (self.usememory and not hash(state) in self._visited): #Evaluate the current state self.traversed += 1 if state.test(self.goalstates): if self.debug: print("\t[pynlpl debug] Valid goalstate, yielding",file=stderr) yield state elif self.debug: print("\t[pynlpl debug] (no goalstate, not yielding)",file=stderr) #Expand the specified state and add to the fringe #if self.debug: print >>stderr,"\t[pynlpl debug] EXPANDING:" statecount = 0 for i, s in enumerate(state.expand()): statecount += 1 if self.debug >= 2: print("\t[pynlpl debug] (Iteration #" + str(n) +") Expanded state #" + str(i+1) + ", adding to fringe: " + str(s),end="",file=stderr) try: print(s.score(),file=stderr) except: print("ERROR SCORING!",file=stderr) pass if not self.maxdepth or s.depth() <= self.maxdepth: self.fringe.append(s) else: if self.debug: print("\t[pynlpl debug] (Iteration #" + str(n) +") Not adding to fringe, maxdepth exceeded",file=stderr) self.incomplete = True if self.debug: print("\t[pynlpl debug] Expanded " + str(statecount) + " states, offered to fringe",file=stderr) if self.keeptraversal: self._traversal.append(state) if self.usememory: self._visited[hash(state)] = True self.prune(state) #calls prune method else: if self.debug: print("\t[pynlpl debug] State already visited before, not expanding again...(hash="+str(hash(state))+")",file=stderr) if self.debug: print("\t[pynlpl debug] Search complete: " + str(self.solutions) + " solution(s), " + str(self.traversed) + " states traversed in " + str(n) + " rounds",file=stderr) def searchfirst(self): """Returns the very first result 
(regardless of it being the best or not!)""" for solution in self: return solution def searchall(self): """Returns a list of all solutions""" return list(iter(self)) def searchbest(self): """Returns the single best result (if multiple have the same score, the first match is returned)""" finalsolution = None bestscore = None for solution in self: if bestscore == None: bestscore = solution.score() finalsolution = solution elif self.minimize: score = solution.score() if score < bestscore: bestscore = score finalsolution = solution elif not self.minimize: score = solution.score() if score > bestscore: bestscore = score finalsolution = solution return finalsolution def searchtop(self,n=10): """Return the top n best resulta (or possibly less if not enough is found)""" solutions = PriorityQueue([], lambda x: x.score, self.minimize, length=n, blockworse=False, blockequal=False,duplicates=False) for solution in self: solutions.append(solution) return solutions def searchlast(self,n=10): """Return the last n results (or possibly less if not found). Note that the last results are not necessarily the best ones! Depending on the search type.""" solutions = deque([], n) for solution in self: solutions.append(solution) return solutions def prune(self, state): """Pruning method is called AFTER expansion of each node""" #pruning nothing by default pass class DepthFirstSearch(AbstractSearch): def __init__(self, state, **kwargs): assert isinstance(state, AbstractSearchState) self.fringe = [ state ] super(DepthFirstSearch,self).__init__(**kwargs) class BreadthFirstSearch(AbstractSearch): def __init__(self, state, **kwargs): assert isinstance(state, AbstractSearchState) self.fringe = FIFOQueue([state]) super(BreadthFirstSearch,self).__init__(**kwargs) class IterativeDeepening(AbstractSearch): def __init__(self, state, **kwargs): assert isinstance(state, AbstractSearchState) self.state = state self.kwargs = kwargs self.traversed = 0 def __iter__(self): self.traversed = 0 d = 0 while not 'maxdepth' in self.kwargs or d <= self.kwargs['maxdepth']: dfs = DepthFirstSearch(self.state, **self.kwargs) self.traversed += dfs.traversalsize() for match in dfs: yield match if dfs.incomplete: d +=1 else: break def traversal(self): #TODO: add raise Exception("not implemented yet") def traversalsize(self): return self.traversed class BestFirstSearch(AbstractSearch): def __init__(self, state, **kwargs): super(BestFirstSearch,self).__init__(**kwargs) assert isinstance(state, AbstractSearchState) self.fringe = PriorityQueue([state], lambda x: x.score, self.minimize, length=0, blockworse=False, blockequal=False,duplicates=False) class BeamSearch(AbstractSearch): """Local beam search algorithm""" def __init__(self, states, beamsize, **kwargs): if isinstance(states, AbstractSearchState): states = [states] else: assert all( ( isinstance(x, AbstractSearchState) for x in states) ) self.beamsize = beamsize if 'eager' in kwargs: self.eager = kwargs['eager'] else: self.eager = False super(BeamSearch,self).__init__(**kwargs) self.incomplete = True self.duplicates = kwargs['duplicates'] if 'duplicates' in kwargs else False self.fringe = PriorityQueue(states, lambda x: x.score, self.minimize, length=0, blockworse=False, blockequal=False,duplicates= self.duplicates) def __iter__(self): """Generator yielding *all* valid goalstates it can find""" i = 0 while len(self.fringe) > 0: i +=1 if self.debug: print("\t[pynlpl debug] *************** STARTING ROUND #" + str(i) + " ****************",file=stderr) b = 0 #Create a new empty fixed-length 
priority queue (this implies there will be pruning if more items are offered than it can hold!) successors = PriorityQueue([], lambda x: x.score, self.minimize, length=self.beamsize, blockworse=False, blockequal=False,duplicates= self.duplicates) while len(self.fringe) > 0: b += 1 if self.debug: print("\t[pynlpl debug] *************** ROUND #" + str(i) + " BEAM# " + str(b) + " ****************",file=stderr) #if self.debug: print >>stderr,"\t[pynlpl debug] FRINGE: ", self.fringe state = self.poll(self.fringe)() if self.debug: try: print("\t[pynlpl debug] CURRENT STATE (depth " + str(state.depth()) + "): " + str(state),end="",file=stderr) except AttributeError: print("\t[pynlpl debug] CURRENT STATE: " + str(state),end="",file=stderr) print(" hash="+str(hash(state)),file=stderr) try: print(" score="+str(state.score()),file=stderr) except: pass if not self.usememory or (self.usememory and not hash(state) in self._visited): self.traversed += 1 #Evaluate state if state.test(self.goalstates): if self.debug: print("\t[pynlpl debug] Valid goalstate, yielding",file=stderr) self.solutions += 1 #counts the number of solutions yield state elif self.debug: print("\t[pynlpl debug] (no goalstate, not yielding)",file=stderr) if self.eager: score = state.score() #Expand the specified state and offer to the fringe statecount = offers = 0 for j, s in enumerate(state.expand()): statecount += 1 if self.debug >= 2: print("\t[pynlpl debug] (Round #" + str(i) +" Beam #" + str(b) + ") Expanded state #" + str(j+1) + ", offering to successor pool: " + str(s),end="",file=stderr) try: print(s.score(),end="",file=stderr) except: print("ERROR SCORING!",end="",file=stderr) pass if not self.maxdepth or s.depth() <= self.maxdepth: if not self.eager: #use all successors (even worse ones than the current state) offers += 1 accepted = successors.append(s) else: #use only equal or better successors if s.score() >= score: offers += 1 accepted = successors.append(s) else: accepted = False if self.debug >= 2: if accepted: print(" ACCEPTED",file=stderr) else: print(" REJECTED",file=stderr) else: if self.debug >= 2: print(" REJECTED, MAXDEPTH EXCEEDED.",file=stderr) elif self.debug: print("\t[pynlpl debug] Not offered to successor pool, maxdepth exceeded",file=stderr) if self.debug: print("\t[pynlpl debug] Expanded " + str(statecount) + " states, " + str(offers) + " offered to successor pool",file=stderr) if self.keeptraversal: self._traversal.append(state) if self.usememory: self._visited[hash(state)] = True self.prune(state) #calls prune method (does nothing by default in this search!!!) else: if self.debug: print("\t[pynlpl debug] State already visited before, not expanding again... 
(hash=" + str(hash(state)) +")",file=stderr) #AFTER EXPANDING ALL NODES IN THE FRINGE/BEAM: #set fringe for next round self.fringe = successors #Pruning is implicit, successors was a fixed-size priority queue if self.debug: print("\t[pynlpl debug] (Round #" + str(i) + ") Implicitly pruned with beamsize " + str(self.beamsize) + "...",file=stderr) #self.fringe.prune(self.beamsize) if self.debug: print(" (" + str(offers) + " to " + str(len(self.fringe)) + " items)",file=stderr) if self.debug: print("\t[pynlpl debug] Search complete: " + str(self.solutions) + " solution(s), " + str(self.traversed) + " states traversed in " + str(i) + " rounds with " + str(b) + " beams",file=stderr) class EarlyEagerBeamSearch(AbstractSearch): """A beam search that prunes early (after each state expansion) and eagerly (weeding out worse successors)""" def __init__(self, state, beamsize, **kwargs): assert isinstance(state, AbstractSearchState) self.beamsize = beamsize super(EarlyEagerBeamSearch,self).__init__(**kwargs) self.fringe = PriorityQueue(state, lambda x: x.score, self.minimize, length=0, blockworse=False, blockequal=False,duplicates= kwargs['duplicates'] if 'duplicates' in kwargs else False) self.incomplete = True def prune(self, state): if self.debug: l = len(self.fringe) print("\t[pynlpl debug] pruning with beamsize " + str(self.beamsize) + "...",end="",file=stderr) self.fringe.prunebyscore(state.score(), retainequalscore=True) self.fringe.prune(self.beamsize) if self.debug: print(" (" + str(l) + " to " + str(len(self.fringe)) + " items)",file=stderr) class BeamedBestFirstSearch(BeamSearch): """Best first search with a beamsize (non-optimal!)""" def prune(self, state): if self.debug: l = len(self.fringe) print("\t[pynlpl debug] pruning with beamsize " + str(self.beamsize) + "...",end="",file=stderr) self.fringe.prune(self.beamsize) if self.debug: print(" (" + str(l) + " to " + str(len(self.fringe)) + " items)",file=stderr) class StochasticBeamSearch(BeamSearch): def prune(self, state): if self.debug: l = len(self.fringe) print("\t[pynlpl debug] pruning with beamsize " + str(self.beamsize) + "...",end="",file=stderr) if not self.exhaustive: self.fringe.prunebyscore(state.score(), retainequalscore=True) self.fringe.stochasticprune(self.beamsize) if self.debug: print(" (" + str(l) + " to " + str(len(self.fringe)) + " items)",file=stderr) class HillClimbingSearch(AbstractSearch): #TODO: TEST """(identical to beamsearch with beam 1, but implemented differently)""" def __init__(self, state, **kwargs): assert isinstance(state, AbstractSearchState) super(HillClimbingSearch,self).__init__(**kwargs) self.fringe = PriorityQueue([state], lambda x: x.score, self.minimize, length=0, blockworse=True, blockequal=False,duplicates=False) #From http://stackoverflow.com/questions/212358/binary-search-in-python def binary_search(a, x, lo=0, hi=None): # can't use a to specify default for hi hi = hi if hi is not None else len(a) # hi defaults to len(a) pos = bisect_left(a,x,lo,hi) # find insertion position return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end PyNLPl-1.1.2/README.rst0000644000175000001440000000616713024723552015205 0ustar proyconusers00000000000000PyNLPl - Python Natural Language Processing Library ===================================================== .. image:: https://travis-ci.org/proycon/pynlpl.svg?branch=master :target: https://travis-ci.org/proycon/pynlpl .. 
image:: http://readthedocs.org/projects/pynlpl/badge/?version=latest
   :target: http://pynlpl.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: http://applejack.science.ru.nl/lamabadge.php/pynlpl
   :target: http://applejack.science.ru.nl/languagemachines/

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

The library is divided into several packages and modules. It works on Python 2.7 as well as Python 3.

The following modules are available:

- ``pynlpl.datatypes`` - Extra datatypes (priority queues, patterns, tries)
- ``pynlpl.evaluation`` - Evaluation & experiment classes (parameter search, wrapped progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool)
- ``pynlpl.formats.cgn`` - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags
- ``pynlpl.formats.folia`` - Extensive library for reading and manipulating documents in `FoLiA `_ format (Format for Linguistic Annotation).
- ``pynlpl.formats.fql`` - Extensive library for the FoLiA Query Language (FQL), built on top of ``pynlpl.formats.folia``. FQL is currently documented `here `__.
- ``pynlpl.formats.cql`` - Parser for the Corpus Query Language (CQL), as also used by Corpus Workbench and Sketch Engine. Contains a convertor to FQL.
- ``pynlpl.formats.giza`` - Module for reading GIZA++ word alignment data
- ``pynlpl.formats.moses`` - Module for reading Moses phrase-translation tables
- ``pynlpl.formats.sonar`` - Largely obsolete module for pre-releases of the SoNaR corpus; use ``pynlpl.formats.folia`` instead
- ``pynlpl.formats.timbl`` - Module for reading Timbl output (consider using `python-timbl `_ instead though)
- ``pynlpl.lm.lm`` - Module for simple language models and a reader for ARPA language model data (as used by SRILM)
- ``pynlpl.search`` - Various search algorithms (breadth-first, depth-first, beam search, hill climbing, A*, and various variants of each)
- ``pynlpl.statistics`` - Frequency lists, Levenshtein distance, common statistics and information-theory functions
- ``pynlpl.textprocessors`` - Simple tokeniser, n-gram extraction

API Documentation can be found `here `__.
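
A minimal usage sketch follows, tying a few of these modules together. It is illustrative only: the sample sentence is made up, and the snippet simply assumes the package is installed so that ``pynlpl`` can be imported; all classes and functions used are the ones defined in ``pynlpl.statistics`` and ``pynlpl.textprocessors`` above::

    from pynlpl.statistics import FrequencyList, Distribution, levenshtein
    from pynlpl.textprocessors import Windower, tokenize

    # Tokenise a made-up sample sentence
    tokens = tokenize("This is a simple test sentence, this is a test.")

    # Build a case-insensitive frequency list over the tokens
    freqlist = FrequencyList(tokens, casesensitive=False)
    print("Most frequent type:", freqlist.mode())
    print("p(test) =", freqlist.p("test"))

    # Turn the counts into a normalised probability distribution
    dist = Distribution(freqlist)
    print("Entropy:", dist.entropy(), "(max:", dist.maxentropy(), ")")

    # Slide a window over the tokens to extract bigrams (no begin/end markers)
    for bigram in Windower(tokens, 2, None, None):
        print(" ".join(bigram))

    # Levenshtein distance between two strings
    print(levenshtein("language", "languages"))

The same ``FrequencyList`` can also be written to and read back from a tab-separated file with its ``save()`` and ``load()`` methods, and ``Windower`` accepts begin/end markers if sentence boundaries should be padded.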