bzr-search-1.7.0~bzr94/BUGS0000644000000000000000000000057011024711446013463 0ustar 00000000000000For current bugs and to file bugs, please see launchpad: https://launchpad.net/bzr-search. Some key caveats though (not bugs per se): - memory scaling: Full text indexing currently requires a significant amount of memory. To index the history of 'bzr' requires nearly 200MB of memory (revno 3494). Larger trees are exceedingly likely to require as-much or more. bzr-search-1.7.0~bzr94/COPYING0000644000000000000000000004310511022666054014036 0ustar 00000000000000 GNU GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc. 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) year name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker. , 1 April 1989 Ty Coon, President of Vice This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License. bzr-search-1.7.0~bzr94/DESIGN0000644000000000000000000001425711027421666013710 0ustar 00000000000000This document contains notes about the design of bzr search. Naming documents ++++++++++++++++ I plan to use a single namespace for all documents: revisions, inventories and file texts, branch tags etc. Currently this is: ('r', '', revision_id) ('f', file_id, revision_id) What to index +++++++++++++ I think we can sensibly index revisions, tree-shapes, and file texts. One very interesting question is how to index deltas - for instance should searching for 'foo bar' return only the document-versions where 'foo bar' was introduced? Or those where it was merged (as bzr annotate does), or any document-version where 'foo bar' is present at all ? In the short term, I'm punting on a real answer to this because it requires serious thought, and I want to deliver something for people to play with, so I'm going to simply annotate every text, and take only the terms from lines ascribed to version being indexed. This will require some care to avoid either being pathologically slow, or blowing out memory on large trees. The current file text indexing solution uses make_mpdifs. This should be efficient as it has been tuned for bundle use. It allows us to only index new-in-a-version terms (though still at line granularity). Indexing bzr.dev stabilises at 180MB for me. (Very) large trees have been observed to need 1GB of ram. If thrashing occurs, change the group size in inde.py down from 2500 revisions - the memory drop is roughly proportional to that figure. Indexing non-text ================= I think it would be good to have a filtering capability so that e.g. openoffice files can generate term-lists. This relates to the 'index deltas' question in that external filters are likely n Indexing paths ============== There are several use cases for path indices: * Mapping fileid, revisionid search hits back to paths for display during hits on documents. This appears to need a concordance of the entire inventory state for every file hit we can produce. * Searching for a path ("show me the file README.txt") * Accelerating per-file log (by identifying path-change events which the file graph index elides, and-or Mapping fileid, revisionids to paths ------------------------------------ If we insert path ('p', '', PATH) as a document, and as terms insert the fileid, revisionid pair as a phrase, then we can search for the phrase fileid, revisionid to return the paths that match - and by virtue of our data model that will be unique. Indexing Phrases ================ For now, a naive approach is being used: There is a phrase-index for N-term phrases. It contains tuples of terms, mapping to a posting list for the phrase. For generic any-N phrase support we need a N->phrase_N_index mapping. However to bootstrap the process the simplest approach of adding a start,length tuple to the names value has been used. Possible future improvements to this are to: * Use termid tuples rather than terms (to save space) How to index ++++++++++++ To allow incremental index operations, indices should be committed to regularly during the index process. To allow cheap 'what is not indexed' queries, we can use revision ids as a 'indexed' flag in the index. So probably within we should index by file_id (for annotation efficiency) and across we effectively index topologically (though I think partial indices are a really good idea to allow automatic-updates). Where to store indices ++++++++++++++++++++++ There are many possible places to store indices. Ideally they get deleted when a branch is deleted, can be shared between branches, update automatically etc. For now, I'm going to store them in .bzr/bzr-search when the bzrdir is a MetaDir, and just refuse to index other branches. Storing in bzr-search is a bit of an abstraction violation, but we have a split-out structure for a reason, and by prefixing it with bzr-search I leave 'search' available for a 'core' version of this feature. When indexing a bzr-svn branch (actual svn branch, not a branch created *from* svn), the index is stored in $bzrconfig/bzr-search/svn-lookaside/SVN_UUID/BRANCHPATH. Search engine to use ++++++++++++++++++++ I looked at: pylucence (java dependency, hard to integrate into our VCS), xapwrap (abandonware), lupy(abandonware). So in the interested of expendiency, I'm writing a trivial boolean term index for this project. Disk Format +++++++++++ I've modelled this after a pack repository. So we have a lock dir to control commits. A names file listing the indices. Second cut - in progress ======================== Each name in the names file is a pack and the start, length coordinates in the pack for the three top level indices that make up a component. These are the revisions index, the documents index and the terms index. These can be recreated by scanning a component's pack, if they were to get lost. The pack contains these top level indices and a set of posting list indices. Posting list ------------ A GraphIndex(0, 1) (doc_id -> "") This allows set based querying to identify document ids that are matches. If we have already selected (say) ids 45 and 200, there is no need to examine all the document ids in a particular posting list. Document ids ------------ A GraphIndex(0, 1) (doc_id -> doc_key) Document ids are a proxy for a specific document in the corpus. Actual corpus identifiers are long (10's of bytes), so using a simple serial number for each document we refer keeps the index size compact. Once a search is complete the selected document ids are resolved into document keys within the branch - and they can then be used to show the results to users. Terms index ----------- A GraphIndex(0, 1) (term,) -> document_count, start, end of the terms posting list in the pack. This allows the construction of a posting list index object for the term, which is then used to whittle down/expand the set of documents to return. The document count is used in one fairly cheap optimisation - we select the term with the smallest number of references to start our search, in the hope that that will reduce the total IO commensurately. Revisions index --------------- (revision,) -> nothing This index is used purely to detect whether a revision is already indexed - if it is we don't need to index it again. bzr-search-1.7.0~bzr94/NEWS0000644000000000000000000001137111663335516013511 0ustar 00000000000000------------------------ bzr-search Release Notes ------------------------ .. contents:: IN DEVELOPMENT -------------- NOTES WHEN UPGRADING: CHANGES: * Hits were returning overly large summaries due to behaviour tweaks in bzrlib. (Robert Collins, #461957) * Selftest compatibility with bzr 2.0.0 (Robert Collins, #461951, #461952) * 'StaticTuple' is no longer shown in search suggestion descriptions. (Robert Collins) * Requires bzr.dev with the log filtering patch for full functionality. users not using bzr 1.6b3 or later. (Robert Collins) * New index format '2' used by default, which uses BTree indices. This adds a dependency on bzrlib 1.7 for the BTree support code. To upgrade a search index, remove the .bzr/bzr-search folder and reindex. (Robert Collins) FEATURES: * ``bzr log -m`` will now use a search index for message searches that can be converted into a bzr search. While not yet optimised it is about three times faster than normal log with a message regex. Currently only ``\bword\b`` searches can be accelerated this way - which is a word matching regex search. (Robert Collins) * Access to an index over a Bazaar smart server connection will now use HPSS custom commands, for better performance. (Jelmer Vernooij) IMPROVEMENTS: * Compatibility with python 2.6 (as long as bzrlib is also compatible.) (Matt Nordhoff, Robert Collins) * Compatibility with split-inventory repositories (requires a bzrlib that supports them). (Robert Collins) * Avoiding triggering loading of unused bzrlib xml serializers. (Jelmer Vernooij) * Will now index revision committers, authors and bugs. (#320236, Gary van der Merwe) BUGFIXES: * Bug 293906 caused by changes in bzrlib has been fixed. This bug caused suggestions to fail in some circumstances. (Robert Collins) * Does not break when bzr-svn is present and incompatible with bzr. (Gary van der Merwe) * Fix compatibility with newer versions of bzr which support searching on specific properties in `bzr log`. (Jelmer Vernooij, #844806) * When a 0 revision pack is added and the existing repo is empty, do not crash. This is probably a stacked branch situation but we don't actually know how to reproduce. (Martin Pool, #627202) * Works on bzr 2.2 (temporarily broken due to CombinedGraphIndex API changes) (Robert Collins). API BREAKS: TESTING: INTERNALS: 1.6 --- NOTES WHEN UPGRADING: CHANGES: * Requires bzr 1.6b3. There is a new pre-1.6 branch of bzr-search for users not using bzr 1.6b3 or later. (Robert Collins) FEATURES: * New command ``index`` to create a search index for a branch. This indexes the revisions (commit message only) and file texts. (Robert Collins) * New command ``search`` to search a search index from within bzr. When given -s this will perform a suggestion search, looking for possible search terms starting with the current search. (Robert Collins) IMPROVEMENTS: * Giving ``-`` as the first letter of a search term will exclude hits for it from the search results. e.g. ``bzr search -- foo -bar``. (Robert Collins) BUGFIXES: * Handles file ids and paths containing any of '"&<> - characters that were xml escaped to place in the xml atttributes of serialised inventories. (Robert Collins) API BREAKS: TESTING: INTERNALS: * Added ``index.open_index_url`` and ``index.index_url`` convenience functions for obtaining an index and creating/updating an index. (Robert Collins) * Added ``index.init_index`` to create new indices. (Robert Collins) * Added ``Index.index_revisions`` to request indexing of revisions. (Robert Collins) * Added ``Index.indexed_revisions`` to report on indexed revisions. (Robert Collins) * Added ``Index.search`` to perform simple set based searches for terms. (Robert Collins) * New modules: ``commands``, ``errors``, ``index``. These contain the console ui, exceptions, and the search index core respectively. (Robert Collins) * New module ``inventory`` containing ``paths_from_ids``, a helper for efficient extraction of paths from inventory xml files without creating a full Inventory object. This is approximately 5 times faster than creating the full object. (Robert Collins) * New module ``transport`` containing ``FileView`` to map a packs contents as a transport object, allowing bzr indices to be stored in a pack. (Robert Collins) * New subclass of ``GraphIndex``, ``SuggestableGraphIndex`` used for generating search suggestions/recommendations. (Robert Collins) bzr-search-1.7.0~bzr94/README0000644000000000000000000000217611023657660013672 0ustar 00000000000000search, a bzr plugin for searching within bzr branches/repositories. Copyright (C) 2008 Robert Collins This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA The bzr search plugin +++++++++++++++++++++ Introduction ============ search generates indices of bzr revisions which can then be searched quickly. It has no external dependencies other than bzr itself. Commands ======== `bzr search TERM` will return the list of documents TERM occurs in. `bzr index [URL]` will create an index of a branch. Documentation ============= See `bzr help search` or `bzr help plugins/search`. bzr-search-1.7.0~bzr94/__init__.py0000644000000000000000000000742611666545510015131 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """search is a bzr plugin for searching bzr content. Commands ======== `bzr search TERM` will return the list of documents TERM occurs in. `bzr index [URL]` will create an index of a branch. Documentation ============= See `bzr help search` or `bzr help plugins/search`. """ # Relative because at __init__ time the module does not exist. from bzrlib.branch import Branch from bzrlib import log from bzrlib.commands import plugin_cmds import index from bzrlib.smart.request import request_handlers as smart_request_handlers for command in [ 'index', 'search', ]: plugin_cmds.register_lazy("cmd_" + command, [], "bzrlib.plugins.search.commands") version_info = (1, 7, 0, 'dev', 0) def auto_index_branch(result): """Handled for the post_change_branch_tip hook to update a search index.""" from bzrlib.plugins.search import errors try: search_index = index.open_index_branch(result.branch) except errors.NoSearchIndex: return search_index.index_branch(result.branch, result.new_revid) def _install_hooks(): """Install the hooks this plugin uses.""" Branch.hooks.install_named_hook('post_change_branch_tip', auto_index_branch, "index") _install_hooks() if getattr(log, 'log_adapters', None): # disable the regex search when bzr-search is active index._original_make_search_filter = log._make_search_filter log.log_adapters.insert(log.log_adapters.index(log._make_search_filter), index.make_disable_search_filter) log.log_adapters.remove(index._original_make_search_filter) log._make_search_filter = index.make_disable_search_filter # provide bzr-search based searches log.log_adapters.insert(log.log_adapters.index(log._make_revision_objects), index.make_log_search_filter) smart_request_handlers.register_lazy( "Branch.open_index", 'bzrlib.plugins.search.remote', 'SmartServerBranchRequestOpenIndex', info='read') smart_request_handlers.register_lazy( "Branch.init_index", 'bzrlib.plugins.search.remote', 'SmartServerBranchRequestInitIndex', info='semi') smart_request_handlers.register_lazy( "Index.index_revisions", 'bzrlib.plugins.search.remote', 'SmartServerIndexRequestIndexRevisions', info='idem') smart_request_handlers.register_lazy( "Index.indexed_revisions", 'bzrlib.plugins.search.remote', 'SmartServerIndexRequestIndexedRevisions', info='read') smart_request_handlers.register_lazy( "Index.suggest", 'bzrlib.plugins.search.remote', 'SmartServerIndexRequestSuggest', info='read') smart_request_handlers.register_lazy( "Index.search", 'bzrlib.plugins.search.remote', 'SmartServerIndexRequestSearch', info='read') def test_suite(): # Thunk across to load_tests for niceness with older bzr versions from bzrlib.tests import TestLoader loader = TestLoader() return loader.loadTestsFromModuleNames(['bzrlib.plugins.search']) def load_tests(standard_tests, module, loader): standard_tests.addTests(loader.loadTestsFromModuleNames( ['bzrlib.plugins.search.tests'])) return standard_tests bzr-search-1.7.0~bzr94/commands.py0000644000000000000000000000550411273673661015171 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """The command console user interface for bzr search.""" import bzrlib.commands from bzrlib.option import Option from bzrlib.plugins.search import errors from bzrlib.plugins.search import index as _mod_index from bzrlib.transport import get_transport class cmd_index(bzrlib.commands.Command): """Create or update a search index. This locates documents in bzr at a given url and creates a search index for that url. """ _see_also = ['search'] takes_args = ['url?'] def run(self, url=None): if url is None: url = "." trans = get_transport(url) _mod_index.index_url(trans.base) class cmd_search(bzrlib.commands.Command): """Perform a search within bzr history. This locates documents that match the query and reports them to the console. """ encoding_type = 'replace' _see_also = ['index'] takes_options = [Option('suggest', short_name='s', help="Suggest possible terms to complete the search."), Option('directory', short_name='d', type=unicode, help='Branch to search rather than the one in the current directory.'), ] takes_args = ['query+'] def run(self, query_list=[], suggest=False, directory="."): trans = get_transport(directory) index = _mod_index.open_index_url(trans.base) # XXX: Have a query translator etc. query = [(query_item,) for query_item in query_list] index._branch.lock_read() try: if suggest: terms = index.suggest(query) terms = [tuple(term) for term in terms] terms.sort() self.outf.write("Suggestions: %s\n" % terms) else: seen_count = 0 for result in index.search(query): self.outf.write(result.document_name()) self.outf.write(" Summary: '%s'\n" % result.summary()) seen_count += 1 if seen_count == 0: raise errors.NoMatch(query_list) finally: index._branch.unlock() bzr-search-1.7.0~bzr94/errors.py0000644000000000000000000000305411666545510014677 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Error objects for search functions.""" from bzrlib.errors import BzrError class CannotIndex(BzrError): """Raised when a particular control dir class is unrecognised.""" _fmt = "Cannot index %(thing)r, it is not a known control dir type." def __init__(self, thing): self.thing = thing class NoSearchIndex(BzrError): """Raised when there is no search index for a url.""" _fmt = "No search index present for %(url)r. Please see 'bzr help index'." def __init__(self, url): self.url = url class NoMatch(BzrError): """Raised by the ui when no searches are found. The library functions are generators and raising exceptions there is ugly. """ _fmt = "No matches were found for the search %(search)s." def __init__(self, search): self.search = search bzr-search-1.7.0~bzr94/index.py0000644000000000000000000020254711700414646014474 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """The core logic for search.""" from bisect import bisect_left from itertools import chain import math import re from bzrlib import branch as _mod_branch from bzrlib import (ui, trace) from bzrlib.btree_index import BTreeGraphIndex, BTreeBuilder from bzrlib.bzrdir import BzrDirMeta1 import bzrlib.config from bzrlib.errors import ( NotBranchError, NoSuchFile, UnknownFormatError, UnknownSmartMethod, BadIndexKey, ) from bzrlib.index import CombinedGraphIndex, GraphIndex, InMemoryGraphIndex from bzrlib.lockdir import LockDir try: from bzrlib.osutils import md5 except ImportError: from md5 import new as md5 from bzrlib.osutils import chunks_to_lines from bzrlib.pack import ContainerWriter from bzrlib.plugins.search import errors from bzrlib.plugins.search.inventory import paths_from_ids from bzrlib.plugins.search.transport import FileView from bzrlib.multiparent import NewText from bzrlib.revision import NULL_REVISION _xml_serializers = None def xml_serializers(): global _xml_serializers if _xml_serializers is not None: return _xml_serializers _xml_serializers = [] try: from bzrlib.xml4 import _Serializer_v4 _xml_serializers.append(_Serializer_v4) except ImportError: pass try: from bzrlib.xml5 import Serializer_v5 _xml_serializers.append(Serializer_v5) except ImportError: pass try: from bzrlib.xml6 import Serializer_v6 _xml_serializers.append(Serializer_v6) except ImportError: pass try: from bzrlib.xml7 import Serializer_v7 _xml_serializers.append(Serializer_v7) except ImportError: pass try: from bzrlib.xml8 import Serializer_v8 _xml_serializers.append(Serializer_v8) except ImportError: pass return _xml_serializers from bzrlib.transport import get_transport from bzrlib.tsort import topo_sort _FORMAT_1 = 'bzr-search search folder 1\n' _FORMAT_2 = 'bzr-search search folder 2\n' # _FORMATS definitions are the end of the module, so that they can use index # subclasses. _tokeniser_re = None def _ensure_regexes(): global _tokeniser_re if _tokeniser_re is None: # NB: Perhaps we want to include non-ascii, or is there some unicode # magic to generate good terms? (Known to be a hard problem, but this # is sufficient for an indexer that may not even live a week!) _tokeniser_re = re.compile("[^A-Za-z0-9_]") def init_index(branch, format_number=2): """Initialise an index on branch. :param format_number: The index format to create. Currently 1 by default. """ if isinstance(branch.bzrdir, BzrDirMeta1): transport = branch.bzrdir.transport transport.mkdir('bzr-search') index_transport = transport.clone('bzr-search') elif branch._format.network_name() == 'subversion': # We can't write to the 'bzrdir' as it is virtual uuid = branch.repository.uuid branch_path = branch.get_branch_path() config = bzrlib.config.config_dir() transport = get_transport(bzrlib.config.config_dir()) path = 'bzr-search/svn-lookaside/' + uuid + '/' + branch_path paths = path.split('/') for path in paths: transport = transport.clone(path) transport.ensure_base() index_transport = transport elif getattr(branch.bzrdir, "_call", None) is not None: # FIXME 2011-11-17 JRV: Is there a better way to probe # for smart server branches ? from bzrlib.plugins.search.remote import RemoteIndex try: return RemoteIndex.init(branch) except UnknownSmartMethod: raise errors.CannotIndex(branch) else: raise errors.CannotIndex(branch) lockdir = LockDir(index_transport, 'names-lock') lockdir.create() lockdir.lock_write() try: if format_number == 1: format = _FORMAT_1 elif format_number == 2: format = _FORMAT_2 else: raise Exception("unknown format number %s" % format_number) index_transport.put_bytes('format', format) names_list = _FORMATS[format][0](0, 1) index_transport.put_file('names', names_list.finish()) index_transport.mkdir('obsolete') index_transport.mkdir('indices') index_transport.mkdir('upload') finally: lockdir.unlock() return open_index_url(branch.bzrdir.root_transport.base) def index_url(url): """Create or update an index at url. :param url: The url to index. :return: The resulting search index. """ branch = _mod_branch.Branch.open(url) branch.lock_read() try: _last_revid = branch.last_revision() try: index = open_index_url(url) index.index_branch(branch, _last_revid) except errors.NoSearchIndex: index = init_index(branch) graph = branch.repository.get_graph() searcher = graph._make_breadth_first_searcher([_last_revid]) revs_to_index = set() while True: try: next_revs, ghosts = searcher.next_with_ghosts() except StopIteration: break revs_to_index.update(next_revs) if NULL_REVISION in revs_to_index: revs_to_index.remove(NULL_REVISION) index.index_revisions(branch, revs_to_index) finally: branch.unlock() return index def open_index_url(url): """Open a search index at url. :param url: The url to open the index from. :return: A search index. :raises: NoSearchIndex if no index can be located. """ try: branch = _mod_branch.Branch.open(url) except NotBranchError: raise errors.NoSearchIndex(url) return open_index_branch(branch) def open_index_branch(branch): """Open a search index at a branch. This could do look-aside stuff for svn branches etc in the future. :param branch: The branch to get an index for. :raises: NoSearchIndex if no index can be located. """ if branch._format.network_name() == 'subversion': # We can't write to the 'bzrdir' as it is virtual uuid = branch.repository.uuid branch_path = branch.get_branch_path() config = bzrlib.config.config_dir() transport = get_transport(bzrlib.config.config_dir()) path = 'bzr-search/svn-lookaside/' + uuid + '/' + branch_path transport = transport.clone(path) commits_only = False elif getattr(branch.bzrdir, "_call", None) is not None: # FIXME 2011-11-17 JRV: Is there a better way to probe # for smart server branches ? from bzrlib.plugins.search.remote import RemoteIndex try: return RemoteIndex.open(branch) except UnknownSmartMethod: # Fall back to traditional methods... transport = branch.bzrdir.transport.clone('bzr-search') commits_only = False else: transport = branch.bzrdir.transport.clone('bzr-search') commits_only = False return Index(transport, branch, commits_only=commits_only) # XXX: This wants to be a PackCollection subclass with RepositoryPackCollection # being a sibling. For now though, copy and paste FTW. class Index(object): """A bzr content index. :ivar _format: The format tuple - see _FORMATS. """ def __init__(self, index_transport, branch, commits_only=False): """Create an index stored at index_transport. :param index_transport: The path where the index data should be stored. :param branch: The branch this Index is indexing. :param commits_only: If True, when indexing only attempt to index commits, not file texts. Useful for foreign formats (often commits are the most mature part of such plugins), or for some projects where file contents may not be useful to index. """ self._transport = index_transport try: format = self._transport.get_bytes('format') except NoSuchFile: raise errors.NoSearchIndex(index_transport) self._upload_transport = self._transport.clone('upload') self._obsolete_transport = self._transport.clone('obsolete') self._indices_transport = self._transport.clone('indices') try: self._format = _FORMATS[format] except KeyError: raise UnknownFormatError(format, 'bzr-search') self._orig_names = {} self._current_names = {} self._revision_indices = [] self._term_doc_indices = {} self._revision_index = CombinedGraphIndex(self._revision_indices) # because terms may occur in many component indices, we don't use a # CombinedGraphIndex for grouping the term indices or doc indices. self._lock = LockDir(index_transport, 'names-lock') self._branch = branch self._commits_only = commits_only def _add_terms(self, index, terms): """Add a set of term posting lists to a in progress index. A term is a single index key (e.g. ('first',)). A posting list is an iterable of full index keys (e.g. ('r', '', REVID) for a revision, or ('t', FILEID, REVID) for a file text.) :param index: A ComponentIndexBuilder. :param terms: An iterable of term -> posting list. """ for term, posting_list in terms: index.add_term(term, posting_list) def all_terms(self): """Return an iterable of all the posting lists in the index. :return: An iterator of (term -> document ids). """ self._refresh_indices() result = {} for value, component in self._current_names.values(): terms = component.all_terms() for term, posting_list in terms.iteritems(): result.setdefault(term, set()).update(posting_list) return result.iteritems() def _document_ids_to_keys(self, document_ids): """Expand document ids to keys. :param document_ids: An iterable of (index, doc_id) tuples. :result: An iterable of document keys. """ indices = {} # group by index for index, doc_id in document_ids: doc_ids = indices.setdefault(index, set()) doc_ids.add((doc_id,)) for index, doc_ids in indices.items(): doc_index = self._term_doc_indices[index] for node in doc_index.iter_entries(doc_ids): yield tuple(node[2].split(' ', 2)) def index_branch(self, branch, tip_revision): """Index revisions from a branch. :param branch: The branch to index. :param tip_revision: The tip of the branch. """ branch.lock_read() try: graph = branch.repository.get_graph() searcher = graph._make_breadth_first_searcher([tip_revision]) self._refresh_indices() revision_index = self._revision_index revs_to_index = set() while True: try: next_revs, ghosts = searcher.next_with_ghosts() except StopIteration: break else: rev_keys = [(rev,) for rev in next_revs] indexed_revs = set([node[1][0] for node in revision_index.iter_entries(rev_keys)]) unindexed_revs = next_revs - indexed_revs searcher.stop_searching_any(indexed_revs) revs_to_index.update(unindexed_revs) if NULL_REVISION in revs_to_index: revs_to_index.remove(NULL_REVISION) self.index_revisions(branch, revs_to_index) finally: branch.unlock() def index_revisions(self, branch, revisions_to_index): """Index some revisions from branch. :param branch: A branch to index. :param revisions_to_index: A set of revision ids to index. """ branch.lock_read() try: outer_bar = ui.ui_factory.nested_progress_bar() try: return self._index_revisions(branch, revisions_to_index, outer_bar) finally: outer_bar.finished() finally: branch.unlock() def _index_revisions(self, locked_branch, revisions_to_index, outer_bar): """Helper for indexed_revisions.""" if not revisions_to_index: return _ensure_regexes() graph = locked_branch.repository.get_graph() parent_map = graph.get_parent_map(revisions_to_index) order = topo_sort(parent_map) order_dict = {} for pos, revid in enumerate(order): order_dict[revid] = pos # 5000 uses 1GB on a mysql tree. # Testing shows 1500 or so is a sweet spot for bzr, 2500 for python - ideally this wouldn't matter. # Interesting only 2 times reduction in memory was observed every down # at a group of 50, though it does slowly grow as it increases. group_size = 2000 groups = len(order) / group_size + 1 for offset in range(groups): outer_bar.update("Indexing...", offset, groups) revision_group = order[offset * group_size:(offset + 1) * group_size] builder = ComponentIndexBuilder(self._format) # here: index texts # here: index inventory/paths # here: index revisions steps = ui.ui_factory.nested_progress_bar() try: steps.update("Indexing texts", 0, 4) if not self._commits_only: terms = self._terms_for_texts(locked_branch.repository, revision_group) self._add_terms(builder, terms) steps.update("Indexing paths", 1, 4) terms = self._terms_for_file_terms( locked_branch.repository, terms, order_dict) self._add_terms(builder, terms) steps.update("Indexing commits", 2, 4) terms = self._terms_for_revs(locked_branch.repository, revision_group) self._add_terms(builder, terms) for rev_id in revision_group: builder.add_revision(rev_id) steps.update("Saving group", 3, 4) self._add_index(builder) except BadIndexKey as bKey: trace.warning("%s, indexing incomplete" % str(bKey)) finally: steps.finished() def _add_index(self, builder, to_remove=None, allow_pack=True): """Add a new component index to the list of indices. :param builder: A component builder supporting the upload_index call. :param to_remove: An optional iterable of components to remove. :param allow_pack: Whether an auto pack is permitted by this operation. """ # The index name is the md5sum of the revision index serialised form. index_name, index_value, elements = builder.upload_index( self._upload_transport) if index_name in self._current_names: raise Exception("md5 collision! rad! %s" % index_name) # The component is uploaded, we only need to rename to activate. self._lock.lock_write() try: self._refresh_indices(to_remove=to_remove) if index_name in self._current_names: raise Exception( "md5 collision with concurrent writer! rad! %s" % index_name) # Serialise the index list new_names = self._format[0](0, 1) new_names.add_node((index_name,), index_value, ()) for name, (value, index) in self._current_names.items(): new_names.add_node((name,), value, ()) # Now, as the last step, rename the new index into place and update # the disk list of names. for element in elements: self._upload_transport.rename(element, '../indices/' + element) self._transport.put_file('names', new_names.finish()) index = ComponentIndex(self._format, index_name, index_value, self._indices_transport) self._orig_names[index_name] = (index_value, index) # Cleanup obsoleted if needed, if we are removing things. if to_remove: self._obsolete_transport.delete_multi( self._obsolete_transport.list_dir('.')) finally: self._lock.unlock() # Move any no-longer-referenced packs out of the indices to the # obsoleted area. if to_remove: for component in to_remove: relpath = component.name + '.pack' self._indices_transport.rename(relpath, '../obsolete/' + relpath) # Add in-memory self._add_index_to_memory(index_name, index_value, index) # Its safely inserted. Trigger a pack ? if not allow_pack: return total_revisions = self._revision_index.key_count() max_components = int(math.ceil(math.log(max(1, total_revisions), 2))) if max_components < 1: max_components = 1 excess = len(self._current_names) - max_components if excess < 1: return old_components = [] for name, (value, component) in self._current_names.iteritems(): old_components.append((component.revision_index.key_count(), name)) old_components.sort() del old_components[excess + 1:] components = [self._current_names[name][1] for length, name in old_components] # Note: we don't recurse here because of two things: # B) we don't want to regress infinitely; a flag to _add_index would do # this. # C) We need to remove components too. combiner = ComponentCombiner(self._format, components, self._upload_transport) self._add_index(combiner, to_remove=components, allow_pack=False) def _add_index_to_memory(self, name, value, index): """Add an index (with meta-value value) to the in-memory index list.""" self._current_names[name] = (value, index) # Note This insert at one less than the end, not quite the same as # append :(. self._revision_index.insert_index(-1, index.revision_index) self._term_doc_indices[index] = index.document_index def indexed_revisions(self): """Return the revision_keys that this index contains terms for.""" self._refresh_indices() for node in self._revision_index.iter_all_entries(): yield node[1] def _refresh_indices(self, to_remove=None): """Merge on-disk index lists into the memory top level index list. :param to_remove: An optional list of components to remove from memory even if they are still listed on disk. """ names = self._format[1](self._transport, 'names', None) new_names = {} merged_names = {} deleted_names = set() if to_remove: for component in to_remove: deleted_names.add(component.name) added_names = set() same_names = set() for node in names.iter_all_entries(): name = node[1][0] value = node[2] new_names[name] = [value, None] for name in new_names: if name not in self._orig_names: added_names.add(name) elif name in self._current_names: same_names.add(name) else: # in our last read; not in memory anymore: deleted_names.add(name) # XXX perhaps cross-check the size? for name in added_names: # TODO: byte length of the indices here. value = new_names[name][0] component = ComponentIndex(self._format, name, value, self._indices_transport) self._add_index_to_memory(name, value, component) for name in deleted_names: self._remove_component_from_memory(name) self._orig_names = new_names def _remove_component_from_memory(self, name): """Remove the component name from the index list in memory.""" index = self._current_names[name][1] del self._term_doc_indices[index] pos = self._revision_index._indices.index(index.revision_index) self._revision_index._indices.pop(pos) self._revision_index._index_names.pop(pos) del self._current_names[name] def _search_work(self, termlist): """Core worker logic for performing searches. :param termlist: An iterable of terms to search for. :return: An iterator over (component, normalised_termlist, matching_document_keys). Components where the query does not hit anytthing are not included in the iterator. Using an empty query results in all components being returned but no document keys being listed for each component. """ _ensure_regexes() self._refresh_indices() # Use a set to remove duplicates new_termlist = set() exclude_terms = set() for term in termlist: if term[0][0] == '-': # exclude this term exclude_terms.add((term[0][1:],) + term[1:]) else: new_termlist.add(term) # remove duplicates that were included *and* excluded termlist = new_termlist - exclude_terms term_keys = [None, set(), set()] for term in termlist: term_keys[len(term)].add(term) for term in exclude_terms: term_keys[len(term)].add(term) for value, component in self._current_names.values(): term_index = component.term_index # TODO: push into Component found_term_count = 0 # TODO: use dequeues? term_info = [] exclude_info = [] for node in chain(term_index.iter_entries(term_keys[1]), component.term_2_index.iter_entries(term_keys[2])): term_id, posting_count, posting_start, posting_length = \ node[2].split(" ") info = (int(posting_count), term_id, int(posting_start), int(posting_length)) if node[1] not in exclude_terms: term_info.append(info) found_term_count += 1 else: exclude_info.append(info) excluded = 1 if not termlist: yield component, termlist, None continue if len(term_info) != len(termlist): # One or more terms missing - no hits are possible. continue # load the first document list: term_info.sort() _, term_id, posting_start, posting_length = term_info.pop(0) posting_stop = posting_start + posting_length post_name = "term_list." + term_id filemap = {post_name:(posting_start, posting_stop)} view = FileView(self._indices_transport, component.name + '.pack', filemap) post_index = self._format[1](view, post_name, posting_length) common_doc_keys = set([node[1] for node in post_index.iter_all_entries()]) # Now we whittle down the nodes we need - still going in sorted # order. (possibly doing concurrent reduction would be better). while common_doc_keys and term_info: common_doc_keys = self._select_doc_keys(common_doc_keys, term_info.pop(0), component) if common_doc_keys: # exclude from largest-first, which should give us less # exclusion steps. exclude_info.sort(reverse=True) while common_doc_keys and exclude_info: common_doc_keys.difference_update(self._select_doc_keys( common_doc_keys, exclude_info.pop(0), component)) yield component, termlist, common_doc_keys def search(self, termlist): """Trivial set-based search of the index. :param termlist: A list of terms. :return: An iterator of SearchResults for documents indexed by all terms in the termlist. """ found_documents = [] if not termlist: return for component, termlist, common_doc_keys in self._search_work(termlist): common_doc_ids = [key[0] for key in common_doc_keys] found_documents = [(component, doc_id) for doc_id in common_doc_ids] for doc_key in self._document_ids_to_keys(found_documents): if doc_key[0] == 'f': # file text yield FileTextHit(self, self._branch.repository, doc_key[1:3], termlist) elif doc_key[0] == 'r': # revision yield RevisionHit(self._branch.repository, doc_key[2:3]) elif doc_key[0] == 'p': # path yield PathHit(doc_key[2]) else: raise Exception("unknown doc type %r" % (doc_key,)) def _select_doc_keys(self, key_filter, term_info, component): """Select some document keys from a term. :param key_filter: An iterable of document keys to constrain the search. :param term_info: The index metadata about the terms posting list. :param component: The component being searched within. """ _, term_id, posting_start, posting_length = term_info posting_stop = posting_start + posting_length post_name = "term_list." + term_id filemap = {post_name:(posting_start, posting_stop)} view = FileView(self._indices_transport, component.name + '.pack', filemap) post_index = self._format[1](view, post_name, posting_length) return set([node[1] for node in post_index.iter_entries(key_filter)]) def suggest(self, termlist): """Generate suggestions for extending a search. :param termlist: A list of terms. :return: An iterator of terms that start with the last search term in termlist, and match the rest of the search. """ found_documents = [] if not termlist: return suggest_term = termlist[-1] suggestions = set() for component, termlist, common_doc_keys in self._search_work(termlist[:-1]): if len(suggest_term) == 1: suggest_index = component.term_index else: suggest_index = component.term_2_index for node in suggest_index.iter_entries_starts_with(suggest_term): suggestion = node[1] if common_doc_keys: term_id, posting_count, posting_start, posting_length = \ node[2].split(" ") posting_count = int(posting_count) posting_start = int(posting_start) posting_length = int(posting_length) posting_stop = posting_start + posting_length post_name = "term_list." + term_id filemap = {post_name:(posting_start, posting_stop)} view = FileView(self._indices_transport, component.name + '.pack', filemap) post_index = self._format[1](view, post_name, posting_length) common_doc_keys = set([node[1] for node in post_index.iter_entries(common_doc_keys)]) if len(common_doc_keys): # This suggestion matches other terms in the qury: suggestions.add(suggestion) else: suggestions.add(suggestion) return suggestions def _terms_for_file_terms(self, repository, file_terms, order_dict): """Generate terms for the path of every file_id, revision_id in terms. :param repository: The repository to access inventories from. :param terms: Text terms than have been inserted. :param order_dict: A mapping from revision id to order from the topological sort prepared for the indexing operation. :return: An iterable of (term, posting_list) for the file_id, revision_id pairs mentioned in terms. """ terms = {} # What revisions do we need inventories for: revision_ids = {} for term, posting_list in file_terms: for post in posting_list: if post[0] != 'f': raise ValueError("Unknown post type for %r" % post) fileids = revision_ids.setdefault(post[2], set()) fileids.add(post[1]) order = list(revision_ids) order.sort(key=lambda revid:order_dict.get(revid)) group_size = 100 groups = len(order) / group_size + 1 bar = ui.ui_factory.nested_progress_bar() try: for offset in range(groups): bar.update("Extract revision paths", offset, groups) inventory_group = order[offset * group_size:(offset + 1) * group_size] serializer = repository._serializer if type(serializer) in xml_serializers(): # Fast path for flat-file serializers. group_keys = [(revid,) for revid in inventory_group] stream = repository.inventories.get_record_stream( group_keys, 'unordered', True) for record in stream: bytes = record.get_bytes_as('fulltext') revision_id = record.key[-1] path_dict = paths_from_ids(bytes, serializer, revision_ids[revision_id]) for file_id, path in path_dict.iteritems(): terms[(file_id, revision_id)] = [('p', '', path)] else: # Public api way - 5+ times slower on xml inventories for inventory in repository.iter_inventories(inventory_group): revision_id = inventory.revision_id for file_id in revision_ids[revision_id]: path = inventory.id2path(file_id).encode('utf8') terms[(file_id, revision_id)] = [('p', '', path)] finally: bar.finished() return terms.iteritems() def _terms_for_revs(self, repository, revision_ids): """Generate the posting list for the revision texts of revision_ids. :param revision_ids: An iterable of revision_ids. :return: An iterable of (term, posting_list) for the revision texts (not the inventories or user texts) of revision_ids. """ terms = {} for revision in repository.get_revisions(revision_ids): # its a revision, second component is ignored, third is id. document_key = ('r', '', revision.revision_id) # components of a revision: # parents - not indexed (but we could) # commit message (done) # author (done) # committer (done) # properties (todo - names only?) # bugfixes (done) # other filters? message_utf8 = revision.message.encode('utf8') commit_terms = set(_tokeniser_re.split(message_utf8)) # Note: this does not provide any specific support for searching on # the domain of authors. for author in [revision.committer]+revision.get_apparent_authors(): name, email = bzrlib.config.parse_username(author) commit_terms.add(email.encode('utf8')) name_utf8 = name.encode('utf8') commit_terms.update(set(_tokeniser_re.split(name_utf8))) # Note: it might be nice to turn these back into lp:xxx and so # forth where we can do so. See QBzr for prior art. for bug in revision.properties.get('bugs', '').splitlines(): bug_url = bug.split(' ')[0] if isinstance(bug_url, unicode): bug_url = bug_url.encode('utf') commit_terms.add(bug_url) for term in commit_terms: if not term: continue posting_list = terms.setdefault((term,), set()) posting_list.add(document_key) return terms.iteritems() def _terms_for_texts(self, repository, revision_ids): """Generate the posting list for the file texts of revision_ids. :param revision_ids: An iterable of revision_ids. :return: An iterable of (term, posting_list) for the revision texts (not the inventories or user texts) of revision_ids. """ terms = {} files = {} for item in repository.item_keys_introduced_by(revision_ids): if item[0] != 'file': continue # partitions the files by id, to avoid serious memory overload. file_versions = files.setdefault(item[1], set()) for file_version in item[2]: file_versions.add((item[1], file_version)) for file_id, file_keys in files.iteritems(): file_keys = list(file_keys) group_size = 100 groups = len(file_keys) / group_size + 1 for offset in range(groups): file_key_group = file_keys[offset * group_size:(offset + 1) * group_size] for diff, key in zip(repository.texts.make_mpdiffs(file_key_group), file_key_group): document_key = ('f',) + key for hunk in diff.hunks: if type(hunk) == NewText: for line in hunk.lines: line_terms = _tokeniser_re.split(line) for term in line_terms: if not term: continue posting_list = terms.setdefault((term,), set()) posting_list.add(document_key) return terms.items() class FileTextHit(object): """A match found during a search in a file text.""" def __init__(self, index, repository, text_key, termlist): """Create a FileTextHit. :param index: The index the search result is from, to look up the path of the hit. NB :param repository: A repository to extract revisions from. :param text_key: The text_key that was hit. :param termlist: The query that was issued, used for generating summaries. """ self.index = index self.repository = repository self.text_key = text_key self.termlist = termlist def document_name(self): """The name of the document found, for human consumption.""" # Perhaps need to utf_decode this? path = self.index.search((self.text_key,)).next() return "%s in revision '%s'." % (path.document_name(), self.text_key[1]) def summary(self): """Get a summary of the hit, for display to users.""" lines = self.repository.iter_files_bytes([ (self.text_key[0], self.text_key[1], "")]).next()[1] lines = chunks_to_lines(lines) # We could look for the best match, try to get context, line numbers # etc. This is complex - what if 'foo' is on line 1 and 'bar' on line # 54. # NB: This does not handle phrases correctly - but - make it work. flattened_terms = set([' '.join(term) for term in self.termlist]) for line in lines: line_terms = set(_tokeniser_re.split(line)) if len(line_terms.intersection(flattened_terms)) > 0: return line[:-1].decode('utf8', 'replace') raise ValueError("no match? wtf? %r" % lines) class PathHit(object): """A match found during a search in a file path.""" def __init__(self, path_utf8): """Create a PathHit. :param path_utf8: The path (utf8 encoded). """ self.path_utf8 = path_utf8 def document_name(self): """The name of the document found, for human consumption.""" return self.path_utf8.decode("utf8") def summary(self): """Get a summary of the hit.""" return self.document_name() class RevisionHit(object): """A match found during a search in a revision object.""" def __init__(self, repository, revision_key): """Create a RevisionHit. :param repository: A repository to extract revisions from. :param revision_key: The revision_key that was hit. """ self.repository = repository self.revision_key = revision_key def document_name(self): """The name of the document found, for human consumption.""" # Perhaps need to utf_decode this? return "Revision id '%s'." % self.revision_key[0] def summary(self): """Get a summary of the revision.""" # Currently, just the commit first line. revision = self.repository.get_revision(self.revision_key[-1]) return revision.message.splitlines()[0] class ComponentIndex(object): """A single component in the aggregate search index. Components are a single pack containing: The relevant files are: - an index listing indexed revisions (name.rix) - an index mapping terms to posting lists (name.tix) - an index mapping document ids to document keys (name.dix) - A posting-list per term (name.N) listing the document ids the term indexes. The index implementation is selected from the format tuple. """ def __init__(self, format, name, value, transport): """Create a ComponentIndex. :param format: The format object for this bzr-search folder. :param name: The name of the index. :param value: The value string from the names list for this component. """ lengths = value.split(' ') lengths = [int(length) for length in lengths] filemap = { "revisions": (lengths[0], lengths[0] + lengths[1]), "terms": (lengths[2], lengths[2] + lengths[3]), "documents": (lengths[4], lengths[4] + lengths[5]), "terms_2": (lengths[6], lengths[6] + lengths[7]), } self._format = format view = FileView(transport, name + '.pack', filemap) rev_index = self._format[1](view, "revisions", lengths[1]) term_index = self._format[1](view, "terms", lengths[3]) term_2_index = self._format[1](view, "terms_2", lengths[7]) doc_index = self._format[1](view, "documents", lengths[5]) self.revision_index = rev_index self.term_index = term_index self.term_2_index = term_2_index self.document_index = doc_index self.name = name self.transport = transport def all_terms(self): """As per Index, but for a single component.""" result = {} for node in chain(self.term_index.iter_all_entries(), self.term_2_index.iter_all_entries()): # XXX: Duplicated logic with search(). term = node[1] term_id, posting_count, posting_start, posting_length = \ node[2].split(" ") posting_count = int(posting_count) posting_start = int(posting_start) posting_length = int(posting_length) posting_stop = posting_start + posting_length post_name = "term_list." + term_id filemap = {post_name:(posting_start, posting_stop)} view = FileView(self.transport, self.name + '.pack', filemap) post_index = self._format[1](view, post_name, posting_length) doc_ids = set([node[1] for node in post_index.iter_all_entries()]) posting_list = set(self._document_ids_to_keys(doc_ids)) result[term] = posting_list return result def _document_ids_to_keys(self, doc_ids): """Expand document ids to keys. :param document_ids: An iterable of doc_id tuples. :result: An iterable of document keys. """ indices = {} for node in self.document_index.iter_entries(doc_ids): yield tuple(node[2].split(' ', 2)) def indexed_revisions(self): """Return the revision_keys that this index contains terms for.""" for node in self.revision_index.iter_all_entries(): yield node[1] class ComponentCreator(object): """Base class for classes that create ComponentIndices.""" def add_document(self, document_key): """Add a document key to the index. :param document_key: A document key e.g. ('r', '', 'some-rev-id'). :return: The document id allocated within this index. """ if document_key in self._document_ids: return self._document_ids[document_key] next_id = str(self.document_index.key_count()) self.document_index.add_node((next_id,), "%s %s %s" % document_key, ()) self._document_ids[document_key] = next_id return next_id def _add_index_to_pack(self, index, name, writer, index_bytes=None): """Add an index to a pack. This ensures the index is encoded as plain bytes in the pack allowing arbitrary readvs. :param index: The index to write to the pack. :param name: The name of the index in the pack. :param writer: a ContainerWriter. :param index_bytes: Optional - the contents of the serialised index. :return: A start, length tuple for reading the index back from the pack. """ if index_bytes is None: index_file = index.finish() index_bytes = index_file.read() del index_file pos, size = writer.add_bytes_record(index_bytes, [(name,)]) length = len(index_bytes) offset = size - length start = pos + offset return start, length class ComponentIndexBuilder(ComponentCreator): """Creates a component index.""" def __init__(self, format): self.document_index = format[0](0, 1) self._document_ids = {} self.terms = {} self.revision_index = format[0](0, 1) self.posting_lists = {} self._format = format def add_term(self, term, posting_list): """Add a term to the index. :param term: A term, e.g. ('foo',). :param posting_list: A list of the document_key's that this term indexes. :return: None. """ if type(term) != tuple: raise ValueError("terms need to be tuples %r" % term) for component in term: if type(component) != str: raise ValueError( "terms must be bytestrings at this layer %r" % term) term_id = self.term_id(term) if term_id is None: term_id = str(len(self.terms)) self.terms[term] = term_id self.posting_lists[term_id] = set() existing_posting_list = self.posting_lists[term_id] for document_key in posting_list: existing_posting_list.add(self.add_document(document_key)) def add_revision(self, revision_id): """List a revision as having been indexed by this index.""" self.revision_index.add_node((revision_id,), '', ()) def posting_list(self, term): """Return an iterable of document ids for term. Unindexed terms return an empty iterator. """ term_id = self.term_id(term) if term_id is None: return [] else: return self.posting_lists[term_id] def term_id(self, term): """Return the term id of term. :param term: The term to get an id for. :return: None for a term not in the component, otherwise the string term id. """ try: return self.terms[term] except KeyError: return None def upload_index(self, upload_transport): """Upload the index in preparation for insertion. :param upload_transport: The transport to upload to. :return: The index name, the value for the names list, and a list of the filenames that comprise the index. """ # Upload preparatory to renaming into place. # write to disc. index_file = self.revision_index.finish() index_bytes = index_file.read() del index_file index_name = md5(index_bytes).hexdigest() write_stream = upload_transport.open_write_stream(index_name + ".pack") writer = ContainerWriter(write_stream.write) writer.begin() rev_start, rev_length = self._add_index_to_pack(self.revision_index, "revisions", writer, index_bytes) del index_bytes # generate a new term index with the length of the serialised posting # lists. term_indices = {} term_indices[1] = self._format[0](0, 1) term_indices[2] = self._format[0](0, 2) for term, term_id in self.terms.iteritems(): posting_list = self.posting_lists[term_id] post_index = self._format[0](0, 1) for doc_id in posting_list: post_index.add_node((doc_id,), "", ()) posting_name = "term_list." + term_id start, length = self._add_index_to_pack(post_index, posting_name, writer) # The below can be factored out and reused with the # ComponentCombiner if we get rid of self.terms and use terms # directly until we serialise the posting lists, rather than # assigning ids aggressively. # How many document ids, and the range for the file view when we # read the pack later. term_value = "%s %d %d %d" % (term_id, len(posting_list), start, length) term_indices[len(term)].add_node(term, term_value, ()) term_start, term_length = self._add_index_to_pack(term_indices[1], "terms", writer) term_2_start, term_2_length = self._add_index_to_pack(term_indices[2], "terms2", writer) doc_start, doc_length = self._add_index_to_pack(self.document_index, "documents", writer) writer.end() write_stream.close() index_value = "%d %d %d %d %d %d %d %d" % (rev_start, rev_length, term_start, term_length, doc_start, doc_length, term_2_start, term_2_length) elements = [index_name + ".pack"] return index_name, index_value, elements class ComponentCombiner(ComponentCreator): """Combines components into a new single larger component.""" def __init__(self, format, components, transport): """Create a combiner. :param format: The format of component to create. :param components: An iterable of components. :param transport: A transport to upload the combined component to. :return: A tuple - the component name, the value for the names file, and the elements list for the component. """ self.components = list(components) self.transport = transport self._format = format def _copy_documents(self): """Copy the document references from components to a new component. This popules self.component_docid with the mappings from each component's document ids to the output document ids. """ self._document_ids = {} self.document_index = self._format[0](0, 1) self.component_docids = {} for component in self.components: component_docs = {} self.component_docids[component] = component_docs for node in component.document_index.iter_all_entries(): # duplication with _document_ids_to_keys document_key = tuple(node[2].split(' ', 2)) doc_id = self.add_document(document_key) # Map from the old doc id to the new doc it component_docs[node[1]] = doc_id self.doc_start, self.doc_length = self._add_index_to_pack( self.document_index, "documents", self.writer) # Clear out used objects del self._document_ids del self.document_index def _copy_posting_lists(self): """Copy the posting lists from components to the new component. This uses self.component_docid to map document ids across efficiently, and self.terms to determine what to copy from. It populates self.term_index as it progresses. """ term_indices = {1:self._format[0](0, 1), 2:self._format[0](0, 2) } for term, posting_lists in self.terms.iteritems(): posting_list = set() for component, posting_line in posting_lists: elements = posting_line.split(' ') _, term_id, posting_start, posting_length = elements posting_start = int(posting_start) posting_length = int(posting_length) posting_stop = posting_start + posting_length post_name = "term_list." + term_id filemap = {post_name:(posting_start, posting_stop)} view = FileView(component.transport, component.name + '.pack', filemap) post_index = self._format[1](view, post_name, posting_length) doc_mapping = self.component_docids[component] for node in post_index.iter_all_entries(): posting_list.add(doc_mapping[node[1]]) post_index = self._format[0](0, 1) for doc_id in posting_list: post_index.add_node((doc_id,), '', ()) term_id = str(term_indices[1].key_count() + term_indices[2].key_count()) start, length = self._add_index_to_pack( post_index, "term_list." + term_id, self.writer) # How many document ids, and the range for the file view when we # read the pack later. term_value = "%s %d %d %d" % (term_id, len(posting_list), start, length) term_indices[len(term)].add_node(term, term_value, ()) self.term_indices = term_indices # Clear out used objects del self.terms del self.component_docids def _copy_revisions(self): """Copy the revisions from components to a new component. This also creates the writer. """ # Merge revisions: revisions = set() for component in self.components: for node in component.revision_index.iter_all_entries(): revisions.add(node[1]) revision_index = self._format[0](0, 1) for revision in revisions: revision_index.add_node(revision, '', ()) index_file = revision_index.finish() index_bytes = index_file.read() del index_file self.index_name = md5(index_bytes).hexdigest() self.write_stream = self.transport.open_write_stream( self.index_name + ".pack") self.writer = ContainerWriter(self.write_stream.write) self.writer.begin() self.rev_start, self.rev_length = self._add_index_to_pack( revision_index, "revisions", self.writer, index_bytes) def combine(self): """Combine the components.""" # Note on memory pressue: deleting the source index caches # as soon as they are copied would reduce memory pressure. self._copy_revisions() self._copy_documents() self._scan_terms() self._copy_posting_lists() self.term_start, self.term_length = self._add_index_to_pack( self.term_indices[1], "terms", self.writer) self.term_2_start, self.term_2_length = self._add_index_to_pack( self.term_indices[2], "terms2", self.writer) self.writer.end() self.write_stream.close() index_value = "%d %d %d %d %d %d %d %d" % (self.rev_start, self.rev_length, self.term_start, self.term_length, self.doc_start, self.doc_length, self.term_2_start, self.term_2_length) elements = [self.index_name + ".pack"] return self.index_name, index_value, elements def _scan_terms(self): """Scan the terms in all components to prepare to copy posting lists.""" self.terms = {} for component in self.components: for node in chain(component.term_index.iter_all_entries(), component.term_2_index.iter_all_entries()): term = node[1] posting_info = node[2] term_set = self.terms.setdefault(term, set()) term_set.add((component, posting_info)) def upload_index(self, upload_transport): """Thunk for use by Index._add_index.""" self.transport = upload_transport return self.combine() class SuggestableGraphIndex(GraphIndex): """A subclass of GraphIndex which adds starts_with searches. These searches are used for providing suggestions. """ def iter_entries_starts_with(self, key): """Iterate over nodes which match key. The first length()-1 terms in key must match exactly, and the last term in key is used as a starts_with test. :param key: The key to search with. """ # Make it work: # Partly copied from iter_entries() # PERFORMANCE TODO: parse and bisect all remaining data at some # threshold of total-index processing/get calling layers that expect to # read the entire index to use the iter_all_entries method instead. half_page = self._transport.recommended_page_size() // 2 # For when we don't know the length to permit bisection, or when the # index is fully buffered in ram. if self._size is None or self._nodes is not None: if len(key) > 1: candidates = self.iter_entries_prefix([key[:-1] + (None,)]) else: candidates = self.iter_all_entries() for candidate in candidates: if candidate[1][-1].startswith(key[-1]): yield candidate else: # Bisect to find the start. # TODO: If we know a reasonable upper bound we could do one IO for # the remainder. # loop parsing more until wwe have one range covering the # suggestions. step = self._size //2 search = [(step, key)] found = self._lookup_keys_via_location(search) while True: step = step // 2 if found[0][1] not in [-1, 1]: # We can now figure out where to start answering from. break search = [(found[0][0][0] + step * found[0][1], key)] found = self._lookup_keys_via_location(search) while True: if self._nodes: # we triggered a full read - everything is in _nodes now. for result in self.iter_entries_starts_with(key): yield result return lower_index = self._parsed_key_index(key) parsed_range = self._parsed_key_map[lower_index] last_key = parsed_range[1] if last_key[:-1] > key[:-1]: # enough is parsed break if last_key[:-1] == key[:-1]: if (last_key[-1] > key[-1] and not last_key[-1].startswith(key[-1])): # enough is parsed break hi_parsed = self._parsed_byte_map[lower_index][1] if hi_parsed == self._size: # all parsed break next_probe = hi_parsed + half_page - 1 if lower_index + 1 < len(self._parsed_byte_map): next_bottom = self._parsed_byte_map[lower_index +1][0] if next_bottom <= next_probe: # read before the parsed area next_probe = next_bottom - 800 self._read_and_parse([(next_probe, 800)]) # Now, scan for all keys in the potential range, and test them for # being candidates, yielding if they are. if self.node_ref_lists: raise ValueError("TODO: implement resolving of reference lists" " on starts_with searches.") lower_index = self._parsed_key_index(key) parsed_range = self._parsed_byte_map[lower_index] for offset, candidate_node in self._keys_by_offset.iteritems(): if offset < parsed_range[0] or offset >= parsed_range[1]: continue candidate_key = candidate_node[0] if (candidate_key[:-1] == key[:-1] and candidate_key[-1].startswith(key[-1])): if self.node_ref_lists: value, refs = self._bisect_nodes[candidate_key] yield (self, candidate_key, value, self._resolve_references(refs)) else: value = self._bisect_nodes[candidate_key] yield (self, candidate_key, value) class SuggestableBTreeGraphIndex(BTreeGraphIndex): """A subclass of BTreeGraphIndex which adds starts_with searches. These searches are used for providing suggestions. """ def iter_entries_starts_with(self, key): """Iterate over nodes which match key. The first length()-1 terms in key must match exactly, and the last term in key is used as a starts_with test. :param key: The key to search with. """ if not self.key_count(): return # The lowest node to read in the next row. low_index = 0 # the highest node to read in the next row. high_index = 0 # Loop down the rows, setting low_index to the lowest node in the row # that we need to read, and high_index to the highest. key_prefix = key[:-1] key_suffix = key[-1] lower_key = key higher_key = key_prefix + (key_suffix[:-1] + chr(ord(key_suffix[-1]) + 1),) for row_pos, next_row_start in enumerate(self._row_offsets[1:-1]): # find the lower node and higher node bounding the suggestion range node_indices = set([low_index, high_index]) nodes = self._get_internal_nodes(node_indices) # Lower edge low_node = nodes[low_index] position = bisect_left(low_node.keys, lower_key) node_offset = next_row_start + low_node.offset low_index = node_offset + position # Higher edge high_node = nodes[high_index] position = bisect_left(high_node.keys, higher_key) node_offset = next_row_start + high_node.offset high_index = node_offset + position # We should now be at the _LeafNodes node_indices = range(low_index, high_index + 1) # TODO: We may *not* want to always read all the nodes in one # big go. Consider setting a max size on this. group_size = 100 groups = len(node_indices) / group_size + 1 for offset in range(groups): node_group = node_indices[offset * group_size:(offset + 1) * group_size] nodes = self._get_leaf_nodes(node_group) for node in nodes.values(): # TODO bisect the edge nodes? / find boundaries and so skip # some work. items = sorted(node.items()) low_value = (key, ()) start_pos = bisect_left(items, low_value) for pos in xrange(start_pos, len(items)): node_key, (value, refs) = items[pos] if node_key[:-1] != key_prefix: # Shouldn't happen, but may. continue if not node_key[-1].startswith(key_suffix): # A node that doesn't match if node_key[-1] > key_suffix: # and is after: stop break else: # was before the search start point. continue if self.node_ref_lists: yield (self, node_key, value, refs) else: yield (self, node_key, value) _original_make_search_filter = None def query_from_match(match): """Create a query from a 'bzr log' match dictionary or string. :param match: Dictionary mapping properties to user search strings :return: None or a bzr-search query """ if match is None: return None if isinstance(match, basestring): # Older versions of bzr just provided match as a plain string return query_from_regex(match) if match.keys() == ['']: return query_from_regex(match['']) # FIXME: Support searching on other properties as well return None def make_disable_search_filter(branch, generate_delta, match, log_rev_iterator): """Disable search filtering if bzr-search will be active. This filter replaces the default search filter, using the original filter if a bzr-search filter cannot be used. :param branch: The branch being logged. :param generate_delta: Whether to generate a delta for each revision. :param match: Dictionary mapping property names to user search strings :param log_rev_iterator: An input iterator containing all revisions that could be displayed, in lists. :return: An iterator over ((rev_id, revno, merge_depth), rev, delta). """ try: open_index_branch(branch) query = query_from_match(match) if query: return log_rev_iterator except errors.NoSearchIndex: pass return _original_make_search_filter(branch, generate_delta, match, log_rev_iterator) def make_log_search_filter(branch, generate_delta, match, log_rev_iterator): """Filter revisions by using a search index. This filter looks up revids in the search index along with the search string, if the search string regex can be converted into a bzr-search query. :param branch: The branch being logged. :param generate_delta: Whether to generate a delta for each revision. :param match: Dictionary mapping property names to user search strings :param log_rev_iterator: An input iterator containing all revisions that could be displayed, in lists. :return: An iterator over ((rev_id, revno, merge_depth), rev, delta). """ # Can we possibly search on this regex? query = query_from_match(match) if not query: return log_rev_iterator try: index = open_index_branch(branch) except errors.NoSearchIndex: return log_rev_iterator return _filter_log(index, query, log_rev_iterator) def _filter_log(index, query, log_rev_iterator): """Filter log_rev_iterator's revision ids on query in index.""" rev_ids = set() # TODO: we could lazy evaluate the search, for each revision we see - this # would allow searches that hit everything to be less-than-completely # evaluated before the first result is shown. OTOH knowing a miss will # require reading the entire search anyhow. Note that we can do better - # if we looked up the document id of the revision, we could search explicitly # for the document id in the search up front, and do many small searches. This is # likely better in terms of memory use. Needs refactoring etc. for result in index.search(query): if type(result) != RevisionHit: continue rev_ids.add(result.revision_key[0]) for batch in log_rev_iterator: new_revs = [] for item in batch: if item[0][0] in rev_ids: new_revs.append(item) yield new_revs def query_from_regex(regex): """Convert a regex into a bzr-search query.""" # Most trivial implementation ever if not regex: return None if regex.count("\\b") != 2: return None regex = regex[2:-2] if regex.count("\\b") != 0: return None # Any additional whitespace implies something we can't search on: _ensure_regexes() if _tokeniser_re.search(regex): return None return [(regex,)] _FORMATS = { # format: index builder, index reader, index deletes _FORMAT_1:(InMemoryGraphIndex, SuggestableGraphIndex, False), _FORMAT_2:(BTreeBuilder, SuggestableBTreeGraphIndex, True) } bzr-search-1.7.0~bzr94/inventory.py0000644000000000000000000001267611465377274015441 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Inventory related helpers for indexing.""" import re from bzrlib import lazy_regex from bzrlib.lazy_import import lazy_import lazy_import(globals(), """ from bzrlib import xml_serializer """) _file_ids_name_regex = lazy_regex.lazy_compile( r'file_id="(?P[^"]+)"' r'(?:.* name="(?P[^"]*)")?' r'(?:.* parent_id="(?P[^"]+)")?' ) _escape_re = lazy_regex.lazy_compile("[&'\"<>]") _unescape_re = lazy_regex.lazy_compile("&|'|"|<|>") _unescape_map = { '&': '&', "'": "'", """: '"', "<": '<', ">": '>', } def _unescape_replace(match, map=_unescape_map): return map[match.group()] escape_map = { "&":'&', "'":"'", # FIXME: overkill "\"":""", "<":"<", ">":">", } def _escape_replace(match, map=escape_map): return map[match.group()] def paths_from_ids(xml_inventory, serializer, file_ids): """Extract the paths for some file_ids from xml_inventory.""" if not serializer.support_altered_by_hack: raise ValueError("Cannot process with serializer %r" % serializer) search = _file_ids_name_regex.search # escaped ids to match against the xml: escaped_to_raw_ids = {} for file_id in file_ids: escaped_to_raw_ids[_escape_re.sub(_escape_replace, file_id)] = file_id unresolved_ids = set(escaped_to_raw_ids) # TODO: only examine lines we need to, break early, track unprocessed found_ids = {} id_paths = {} result = {} if type(xml_inventory) == str: xml_inventory = xml_inventory.splitlines() for line in xml_inventory: match = search(line) if match is None: continue file_id, name, parent_id = match.group('file_id', 'name', 'parent_id') if name is None and parent_id is None: # format 5 root name = '' found_ids[file_id] = (name, parent_id) if parent_id is None: # no parent, stash its name now to avoid special casing # later. path = _unescape_re.sub(_unescape_replace, name) id_paths[file_id] = path if file_id in unresolved_ids: result[escaped_to_raw_ids[file_id]] = path needed_ids = set(unresolved_ids) while needed_ids: # --- # lookup_ids_here # --- missing_ids = set() for file_id in needed_ids: name, parent_id = found_ids.get(file_id, (None, None)) if name is None: # Unresolved id itself missing_ids.add(file_id) else: # We have resolved it, do we have its parent if parent_id is not None and parent_id not in found_ids: # No, search for it missing_ids.add(parent_id) if missing_ids == needed_ids: # We didn't find anything on this pass raise Exception("Did not find ids %s" % missing_ids) needed_ids = missing_ids # We have looked up the path-to-root for all asked ids, # now to resolve it while unresolved_ids: wanted_file_id = unresolved_ids.pop() path = id_paths.get(wanted_file_id) if path is not None: result[escaped_to_raw_ids[wanted_file_id]] = path continue lookup_stack = [wanted_file_id] lookup_names = [] # May be looked up already while lookup_stack: file_id = lookup_stack[-1] name, parent_id = found_ids[file_id] parent_path = id_paths.get(parent_id, None) if parent_path is None: # recurse: lookup_stack.append(parent_id) lookup_names.append(name) else: # resolve: path = _unescape_re.sub(_unescape_replace, name) if parent_path: parent_path = parent_path + '/' + path else: parent_path = path id_paths[file_id] = parent_path if file_id == wanted_file_id: result[escaped_to_raw_ids[file_id]] = parent_path lookup_stack.pop(-1) while lookup_stack: file_id = lookup_stack.pop(-1) path = _unescape_re.sub(_unescape_replace, lookup_names.pop(-1)) if parent_path: parent_path = parent_path + '/' + path else: parent_path = path id_paths[file_id] = parent_path if file_id == wanted_file_id: result[escaped_to_raw_ids[file_id]] = parent_path return result bzr-search-1.7.0~bzr94/remote.py0000644000000000000000000002366211663455313014663 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2011 Jelmer Vernooij # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Smart server integration for bzr-search.""" from bzrlib import remote from bzrlib.bzrdir import BzrDir from bzrlib.errors import ( ErrorFromSmartServer, UnexpectedSmartServerResponse, ) from bzrlib.smart.branch import ( SmartServerBranchRequest, ) from bzrlib.smart.request import SuccessfulSmartServerResponse from bzrlib.plugins.search import errors, index def _encode_termlist(termlist): return ["\0".join([k.encode('utf-8') for k in term]) for term in termlist] def _decode_termlist(termlist): return [tuple([k.decode('utf-8') for k in term.split('\0')]) for term in termlist] class RemoteIndex(object): """Index accessed over a smart server.""" def __init__(self, client, path, branch=None): self._client = client self._path = path self._branch = branch def _call(self, method, *args, **err_context): try: return self._client.call(method, *args) except ErrorFromSmartServer, err: self._translate_error(err, **err_context) def _call_expecting_body(self, method, *args, **err_context): try: return self._client.call_expecting_body(method, *args) except ErrorFromSmartServer, err: self._translate_error(err, **err_context) def _call_with_body_bytes(self, method, args, body_bytes, **err_context): try: return self._client.call_with_body_bytes(method, args, body_bytes) except ErrorFromSmartServer, err: self._translate_error(err, **err_context) def _call_with_body_bytes_expecting_body(self, method, args, body_bytes, **err_context): try: return self._client.call_with_body_bytes_expecting_body( method, args, body_bytes) except errors.ErrorFromSmartServer, err: self._translate_error(err, **err_context) def _translate_error(self, err, **context): remote._translate_error(err, index=self, **context) @classmethod def open(cls, branch): # This might raise UnknownSmartMethod, # but the caller should handle that. response = branch._call("Branch.open_index", branch._remote_path()) if response == ('no', ): raise errors.NoSearchIndex(branch.user_transport) if response != ('yes', ): raise UnexpectedSmartServerResponse(response) return RemoteIndex(branch._client, branch._remote_path(), branch) @classmethod def init(cls, branch): response = branch._call("Branch.init_index", branch._remote_path()) if response != ('ok', ): raise errors.UnexpectedSmartServerResponse(response) return RemoteIndex(branch._client, branch._remote_path(), branch) def index_branch(self, branch, tip_revision): """Index revisions from a branch. :param branch: The branch to index. :param tip_revision: The tip of the branch. """ self.index_revisions(branch, [tip_revision]) def index_revisions(self, branch, revisions_to_index): """Index some revisions from branch. :param branch: A branch to index. :param revisions_to_index: A set of revision ids to index. """ body = "\n".join(revisions_to_index) response = self._call_with_body_bytes( 'Index.index_revisions', (self._path, branch._remote_path(),), body) if response != ('ok', ): raise errors.UnexpectedSmartServerResponse(response) def indexed_revisions(self): """Return the revision_keys that this index contains terms for.""" response, handler = self._call_expecting_body( 'Index.indexed_revisions', self._path) if response != ('ok', ): raise errors.UnexpectedSmartServerResponse(response) byte_stream = handler.read_streamed_body() data = "" for bytes in byte_stream: data += bytes lines = data.split("\n") data = lines.pop() for revid in lines: yield (revid, ) def search(self, termlist): """Trivial set-based search of the index. :param termlist: A list of terms. :return: An iterator of SearchResults for documents indexed by all terms in the termlist. """ index._ensure_regexes() response, handler = self._call_expecting_body('Index.search', self._path, _encode_termlist(termlist)) if response != ('ok', ): raise errors.UnexpectedSmartServerResponse(response) byte_stream = handler.read_streamed_body() data = "" ret = [] for bytes in byte_stream: data += bytes lines = data.split("\n") data = lines.pop() for l in lines: if l[0] == 'r': hit = index.RevisionHit(self._branch.repository, (l[1:], )) elif l[0] == 't': hit = index.FileTextHit(self, self._branch.repository, tuple(l[1:].split("\0")), termlist) elif l[0] == 'p': hit = index.PathHit(l[1:]) else: raise AssertionError("Unknown hit kind %r" % l[0]) # We can't yield, since the caller might try to look up results # over the same medium. ret.append(hit) return iter(ret) def suggest(self, termlist): """Generate suggestions for extending a search. :param termlist: A list of terms. :return: An iterator of terms that start with the last search term in termlist, and match the rest of the search. """ response = self._call('Index.suggest', self._path, _encode_termlist(termlist)) if response[0] != 'ok': raise UnexpectedSmartServerResponse(response) return [(suggestion.decode('utf-8'),) for suggestion in response[1]] class SmartServerBranchRequestOpenIndex(SmartServerBranchRequest): """Open an index file.""" def do_with_branch(self, branch): """open an index.""" try: idx = index.open_index_branch(branch) except errors.NoSearchIndex: return SuccessfulSmartServerResponse(('no', )) else: return SuccessfulSmartServerResponse(('yes', )) class SmartServerBranchRequestInitIndex(SmartServerBranchRequest): """Create an index.""" def do_with_branch(self, branch, format=None): """Create an index.""" if format is None: idx = index.init_index(branch) else: idx = index.init_index(branch, format) return SuccessfulSmartServerResponse(('ok', )) class SmartServerIndexRequest(SmartServerBranchRequest): """Base class for index requests.""" def do_with_branch(self, branch, *args): idx = index.open_index_branch(branch) return self.do_with_index(idx, *args) def do_with_index(self, index, *args): raise NotImplementedError(self.do_with_index) class SmartServerIndexRequestIndexRevisions(SmartServerIndexRequest): """Index a set of revisions.""" def do_body(self, body_bytes): revids = body_bytes.split("\n") self._index.index_revisions(self._branch, revids) return SuccessfulSmartServerResponse(('ok', )) def do_with_index(self, index, branch_path): self._index = index transport = self.transport_from_client_path(branch_path) controldir = BzrDir.open_from_transport(transport) if controldir.get_branch_reference() is not None: raise errors.NotBranchError(transport.base) self._branch = controldir.open_branch(ignore_fallbacks=True) # Indicate we want a body return None class SmartServerIndexRequestIndexedRevisions(SmartServerIndexRequest): """Retrieve the set of revisions in the index.""" def body_stream(self, index): for revid in index.indexed_revisions(): yield "%s\n" % "\0".join(revid) def do_with_index(self, index): return SuccessfulSmartServerResponse(('ok', ), body_stream=self.body_stream(index)) class SmartServerIndexRequestSuggest(SmartServerIndexRequest): """Suggest alternative terms.""" def do_with_index(self, index, termlist): suggestions = index.suggest(_decode_termlist(termlist)) return SuccessfulSmartServerResponse( ('ok', [suggestion.encode('utf-8') for (suggestion,) in suggestions])) class SmartServerIndexRequestSearch(SmartServerIndexRequest): """Search for terms.""" def body_stream(self, results): for hit in results: if isinstance(hit, index.FileTextHit): yield "t%s\0%s\n" % hit.text_key elif isinstance(hit, index.RevisionHit): yield "r%s\n" % hit.revision_key[0] elif isinstance(hit, index.PathHit): yield "p%s\n" % hit.path_utf8 else: raise AssertionError("Unknown hit type %r" % hit) def do_with_index(self, index, termlist): results = index.search(_decode_termlist(termlist)) return SuccessfulSmartServerResponse( ('ok',), body_stream=self.body_stream(results)) bzr-search-1.7.0~bzr94/setup.py0000755000000000000000000000124011055211112014475 0ustar 00000000000000#!/usr/bin/env python2.4 from distutils.core import setup bzr_plugin_name = 'search' bzr_plugin_version = (1, 7, 0, 'dev', 0) bzr_commands = ['index', 'search'] bzr_minimum_version = (1, 6, 0) if __name__ == '__main__': setup(name="bzr search", version="1.7.0dev0", description="bzr search plugin.", author="Robert Collins", author_email="bazaar@lists.canonical.com", license = "GNU GPL v2", url="https://launchpad.net/bzr-search", packages=['bzrlib.plugins.search', 'bzrlib.plugins.search.tests', ], package_dir={'bzrlib.plugins.search': '.'}) bzr-search-1.7.0~bzr94/tests/0000755000000000000000000000000011022666054014142 5ustar 00000000000000bzr-search-1.7.0~bzr94/transport.py0000644000000000000000000000665511026204373015416 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Transport facilities to support the index engine. The primary class here is FileView, an adapter for exposing a number of files in a pack (with identity encoding only!) such that they can be accessed via readv. """ from cStringIO import StringIO class FileView(object): """An adapter from a pack file to multiple smaller readvable files. A typical use for this is to embed GraphIndex objects in a pack and then use this to allow the GraphIndex logic to readv while actually reading from the pack. Currently only the get and readv methods are supported, all the rest of the transport interface will raise AttributeError - this is deliberate to catch unexpected uses. """ def __init__(self, backing_transport, backing_file, file_map): """Create a FileView. :param backing_transport: The transport the pack file is located on. :param backing_file: The url fragment name of the pack file. :param file_map: A dict from file url fragments, to byte ranges in the pack file. Pack file header and trailer overhead should not be included in these ranges. """ self._backing_transport = backing_transport self._backing_file = backing_file self._file_map = file_map def get(self, relpath): """See Transport.get.""" start, stop = self._file_map[relpath] length = stop - start _, bytes = self._backing_transport.readv(self._backing_file, [(start, length)]).next() return StringIO(bytes) def readv(self, relpath, offsets, adjust_for_latency=False, upper_limit=None): """See Transport.readv. This adapter will clip results back to the range defined by the file_map. """ base, upper_limit = self._file_map[relpath] # adjust offsets new_offsets = [] for offset, length in offsets: new_offsets.append((offset + base, length)) for offset, data in self._backing_transport.readv(self._backing_file, new_offsets, adjust_for_latency=adjust_for_latency, upper_limit=upper_limit): if offset + len(data) > upper_limit: upper_trim = len(data) + offset - upper_limit else: upper_trim = None if offset < base: lower_trim = base - offset offset = base else: lower_trim = 0 data = data[lower_trim:upper_trim] offset = offset - base yield offset, data def recommended_page_size(self): """See Transport.recommended_page_size.""" return self._backing_transport.recommended_page_size() bzr-search-1.7.0~bzr94/tests/__init__.py0000644000000000000000000000221511663264666016271 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for the search bzr plugin.""" def load_tests(standard_tests, module, loader): test_modules = [ 'blackbox', 'errors', 'index', 'inventory', 'remote', 'transport', ] standard_tests.addTests(loader.loadTestsFromModuleNames( ['bzrlib.plugins.search.tests.test_' + name for name in test_modules])) return standard_tests bzr-search-1.7.0~bzr94/tests/test_blackbox.py0000644000000000000000000001150311044464143017337 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for the commands supplied by search.""" from bzrlib.plugins.search.index import ( index_url, init_index, open_index_url, ) from bzrlib.tests import TestCaseWithTransport class TestSearch(TestCaseWithTransport): def test_no_parameters_error(self): self.run_bzr_error(['needs one or more'], ['search']) def test_no_index_error(self): self.run_bzr_error(['No search index'], ['search', 'robert']) def test_no_hits_error(self): branch = self.make_branch('.') init_index(branch) self.run_bzr_error(['No matches'], ['search', 'robert']) def test_simple_hits(self): tree = self.make_branch_and_tree('.') init_index(tree.branch) rev_id1 = tree.commit('first post') index_url(self.get_url('.')) index = open_index_url(self.get_url('.')) out, err = self.run_bzr(['search', 'post']) self.assertEqual('', err) self.assertEqual("Revision id '%s'. Summary: 'first post'\n" % rev_id1, out) def test_simple_exclusion(self): tree = self.make_branch_and_tree('.') init_index(tree.branch) rev_id1 = tree.commit('first post') rev_id2 = tree.commit('second post') index_url(self.get_url('.')) index = open_index_url(self.get_url('.')) out, err = self.run_bzr(['search', '--', 'post', '-first']) self.assertEqual('', err) self.assertEqual("Revision id '%s'. Summary: 'second post'\n" % rev_id2, out) def test_directory_option(self): tree = self.make_branch_and_tree('otherdir') init_index(tree.branch) rev_id1 = tree.commit('first post') index_url(self.get_url('otherdir')) out, err = self.run_bzr(['search', '-d', 'otherdir', 'post']) self.assertEqual('', err) self.assertEqual("Revision id '%s'. Summary: 'first post'\n" % rev_id1, out) def test_summary_first_line(self): tree = self.make_branch_and_tree('.') init_index(tree.branch) rev_id1 = tree.commit('this message\nhas two lines') index_url(self.get_url('.')) index = open_index_url(self.get_url('.')) out, err = self.run_bzr(['search', 'two']) self.assertEqual('', err) self.assertEqual("Revision id '%s'. Summary: 'this message'\n" % rev_id1, out) def test_search_suggestions_works(self): # Bare bones - no ui love as yet: tree = self.make_branch_and_tree('.') init_index(tree.branch) rev_id1 = tree.commit('this message\nhas two lines') rev_id2 = tree.commit('this message does not\n') index_url(self.get_url('.')) index = open_index_url(self.get_url('.')) out, err = self.run_bzr(['search', '-s', 'tw']) self.assertEqual('', err) self.assertEqual("Suggestions: [('two',)]\n", out) out, err = self.run_bzr(['search', '-s', 't']) self.assertEqual('', err) self.assertEqual("Suggestions: [('this',), ('two',)]\n", out) out, err = self.run_bzr(['search', '-s', 'too']) self.assertEqual('', err) self.assertEqual("Suggestions: []\n", out) class TestIndex(TestCaseWithTransport): def test_index_branch(self): branch = self.make_branch('a-branch') out, error = self.run_bzr(['index', 'a-branch']) self.assertEqual('', error) self.assertEqual('', out) def test_index_branch_content(self): tree = self.make_branch_and_tree('a-branch') tree.commit('a search term') out, error = self.run_bzr(['index', 'a-branch']) self.assertEqual('', error) self.assertEqual('', out) self.assertSubset(set([('a',), ('search',), ('term',)]), set(dict(open_index_url('a-branch').all_terms()))) def test_index_no_branch(self): self.run_bzr_error(['Not a branch'], ['index', '.']) def test_index_pwd_branch(self): tree = self.make_branch_and_tree('a-branch') out, error = self.run_bzr(['index'], working_dir='a-branch') self.assertEqual('', error) self.assertEqual('', out) open_index_url(self.get_url('a-branch')) bzr-search-1.7.0~bzr94/tests/test_errors.py0000644000000000000000000000303711023055404017062 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for error formatting.""" from bzrlib.plugins.search import errors from bzrlib.tests import TestCaseWithTransport class TestErrors(TestCaseWithTransport): def test_cannot_index(self): error = errors.CannotIndex('a branch') self.assertEqualDiff( "Cannot index 'a branch', it is not a known control dir type.", str(error)) def test_no_index_error(self): error = errors.NoSearchIndex('a url') self.assertEqualDiff( "No search index present for 'a url'. Please see 'bzr help index'.", str(error)) def test_no_match(self): error = errors.NoMatch(['a', 'search', 'here']) self.assertEqualDiff( "No matches were found for the search ['a', 'search', 'here'].", str(error)) bzr-search-1.7.0~bzr94/tests/test_index.py0000644000000000000000000011272111632144527016671 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for the index layer.""" from bzrlib import version_info as bzrlib_version from bzrlib.errors import NotBranchError, UnknownFormatError from bzrlib.btree_index import BTreeGraphIndex, BTreeBuilder from bzrlib.index import InMemoryGraphIndex, GraphIndex from bzrlib import log from bzrlib.plugins import search from bzrlib.plugins.search import errors, index from bzrlib.tests import ( condition_isinstance, multiply_tests, split_suite_by_condition, TestCaseWithTransport, ) def load_tests(basic_tests, module, test_loader): """Parameterise the class tests to test all formats.""" component_tests, other_tests = split_suite_by_condition(basic_tests, condition_isinstance(( TestComponentIndexBuilder, TestComponentCombiner))) graph_suggestion, other_tests = split_suite_by_condition(other_tests, condition_isinstance(TestGraphIndexSuggestions)) scenarios = [(format_string[:-1], {'format':format}) for format_string, format in index._FORMATS.items()] multiply_tests(component_tests, scenarios, other_tests) scenarios = [ ("GraphIndex", {'format': (InMemoryGraphIndex, index.SuggestableGraphIndex)}), ("BTree", {'format': (BTreeBuilder, index.SuggestableBTreeGraphIndex)})] multiply_tests(graph_suggestion, scenarios, other_tests) return other_tests class TestIndex(TestCaseWithTransport): def test_init_index_default(self): branch = self.make_branch('foo') search_index = index.init_index(branch) # We should have some basic files on disk, and a valid index returned. self.assertIsInstance(search_index, index.Index) transport = self.get_transport('foo/.bzr/bzr-search') # We expect two files: # - format, containing 'bzr-search search folder 1\n' # - a names file, which is an empty GraphIndex self.assertEqual('bzr-search search folder 2\n', transport.get_bytes('format')) names_list = BTreeGraphIndex(transport, 'names', None) self.assertEqual([], list(names_list.iter_all_entries())) # And a number of empty directories self.assertTrue(transport.has('obsolete')) self.assertTrue(transport.has('upload')) self.assertTrue(transport.has('indices')) def test_init_index_1(self): branch = self.make_branch('foo') search_index = index.init_index(branch, 1) # We should have some basic files on disk, and a valid index returned. self.assertIsInstance(search_index, index.Index) transport = self.get_transport('foo/.bzr/bzr-search') # We expect two files: # - format, containing 'bzr-search search folder 1\n' # - a names file, which is an empty GraphIndex self.assertEqual('bzr-search search folder 1\n', transport.get_bytes('format')) names_list = GraphIndex(transport, 'names', None) self.assertEqual([], list(names_list.iter_all_entries())) # And a number of empty directories self.assertTrue(transport.has('obsolete')) self.assertTrue(transport.has('upload')) self.assertTrue(transport.has('indices')) def test_init_index_2(self): branch = self.make_branch('foo') search_index = index.init_index(branch, 2) # We should have some basic files on disk, and a valid index returned. self.assertIsInstance(search_index, index.Index) transport = self.get_transport('foo/.bzr/bzr-search') # We expect two files: # - format, containing 'bzr-search search folder 1\n' # - a names file, which is an empty GraphIndex self.assertEqual('bzr-search search folder 2\n', transport.get_bytes('format')) names_list = BTreeGraphIndex(transport, 'names', None) self.assertEqual([], list(names_list.iter_all_entries())) # And a number of empty directories self.assertTrue(transport.has('obsolete')) self.assertTrue(transport.has('upload')) self.assertTrue(transport.has('indices')) def test_init_index_unindexable(self): # any non-metadir will do here: branch = self.make_branch('foo', format='weave') self.assertRaises(errors.CannotIndex, index.init_index, branch) def test_open_no_index_error(self): err = self.assertRaises(errors.NoSearchIndex, index.open_index_url, self.get_url()) self.assertEqual(self.get_url(), err.url) def test_open_index_wrong_format_errors(self): branch = self.make_branch('foo', format='pack-0.92') search_index = index.init_index(branch) transport = self.get_transport('foo/.bzr/bzr-search') transport.put_bytes('format', 'garbage\n') self.assertRaises(UnknownFormatError, index.Index, transport, branch) def test_open_index_missing_format_raises_NoSearchIndex(self): branch = self.make_branch('foo', format='pack-0.92') transport = self.get_transport('foo/.bzr/bzr-search') transport.mkdir('.') self.assertRaises(errors.NoSearchIndex, index.Index, transport, branch) def test_index_url_not_branch(self): self.assertRaises(NotBranchError, index.index_url, self.get_url()) def test_index_url_returns_index(self): branch = self.make_branch('foo') search_index = index.index_url(self.get_url('foo')) self.assertIsInstance(search_index, index.Index) def test_index_url_does_index(self): tree = self.make_branch_and_tree('foo') revid = tree.commit('first post') rev_index = index.index_url(self.get_url('foo')) self.assertEqual(set([(revid,)]), set(rev_index.indexed_revisions())) def test_index_url_is_incremental(self): tree = self.make_branch_and_tree('foo') # two initial commits (as we want to avoid the first autopack) revid1 = tree.commit('1') revid2 = tree.commit('2') rev_index = index.index_url(self.get_url('foo')) self.assertEqual(set([(revid1,), (revid2,)]), set(rev_index.indexed_revisions())) base_names = rev_index._current_names.keys() self.assertEqual(1, len(base_names)) revid3 = tree.commit('3') rev_index = index.index_url(self.get_url('foo')) self.assertEqual(set([(revid1,), (revid2,), (revid3,)]), set(rev_index.indexed_revisions())) new_names = rev_index._current_names.keys() self.assertSubset(base_names, new_names) self.assertEqual(2, len(new_names)) # The new index should only have revid3 in it. new_name = list(set(new_names) - set(base_names))[0] new_component = rev_index._current_names[new_name][1] self.assertEqual([(revid3,)], [node[1] for node in new_component.revision_index.iter_all_entries()]) def test_index_combining(self): # After inserting 1 revision, we get one pack, # After 2 we should still have 1, but also two discards # 3 should map to 2 packs, as should 4 (but with 2 discard) # To test: we create four revisions: tree = self.make_branch_and_tree('foo') tree.add(['README.txt'], ['an-id'], ['file']) tree.put_file_bytes_non_atomic('an-id', "file") revid1 = tree.commit('1') revid2 = tree.commit('2') revid3 = tree.commit('3') revid4 = tree.commit('4') rev_index = index.init_index(tree.branch) def get_names(): return [name + '.pack' for name in rev_index._current_names] # index one revision rev_index.index_revisions(tree.branch, [revid1]) self.assertEqual(1, len(rev_index._current_names)) names = get_names() # index the second, should pack rev_index.index_revisions(tree.branch, [revid2]) self.assertEqual(1, len(rev_index._current_names)) obsolete_names = rev_index._obsolete_transport.list_dir('.') # There should be two - the old name and one more. self.assertSubset(names, obsolete_names) self.assertEqual(2, len(obsolete_names)) names = get_names() # index the third, should not pack, and not clean obsoletes, and leave # the existing pack in place. rev_index.index_revisions(tree.branch, [revid3]) self.assertEqual(2, len(rev_index._current_names)) self.assertEqual(obsolete_names, rev_index._obsolete_transport.list_dir('.')) # new names should be the pack for revid3 new_names = set(get_names()) - set(names) self.assertEqual(1, len(new_names)) # index the fourth, which should pack the new name and the fourth one # stil leaving the previous one untouched, should clean obsoletes and # put what was new on three into it rev_index.index_revisions(tree.branch, [revid4]) self.assertEqual(2, len(rev_index._current_names)) obsolete_names = rev_index._obsolete_transport.list_dir('.') # the revid3 pack should have been obsoleted self.assertSubset(new_names, obsolete_names) self.assertEqual(2, len(obsolete_names)) new_names = set(get_names()) - set(names) self.assertEqual(1, len(new_names)) self.assertEqual({ ("1",):set([('r', '', revid1)]), ("2",):set([('r', '', revid2)]), ("3",):set([('r', '', revid3)]), ("4",):set([('r', '', revid4)]), ('jrandom@example.com',): set([('r', '', revid1), ('r', '', revid2), ('r', '', revid3), ('r', '', revid4)]), ('an-id', revid1):set([('p', '', 'README.txt')]), ("file",):set([('f', 'an-id', revid1)]), }, dict(rev_index.all_terms())) self.assertEqual(set([(revid1,), (revid2,), (revid3,), (revid4,)]), set(rev_index.indexed_revisions())) class TestIndexRevisions(TestCaseWithTransport): """Tests for indexing of a set of revisions.""" def test_empty_one_revision(self): # Hugish smoke test - really want smaller units of testing... tree = self.make_branch_and_tree('') tree.add(['README.txt'], ['an-id'], ['file']) tree.put_file_bytes_non_atomic('an-id', "This is the first commit to this working tree.\n" ) rev_index = index.init_index(tree.branch) # The double-space is a cheap smoke test for the tokeniser. bugs = "http://bugtrack.org/1234\nhttp://bugtrack.org/5678" revid = tree.commit('first post', committer="Joe Soap ", authors=["Foo Baa "], revprops={'bugs':bugs}) rev_index.index_revisions(tree.branch, [revid]) self.assertEqual(set([(revid,)]), set(rev_index.indexed_revisions())) # reopen - it should retain the indexed revisions. rev_index = index.open_index_url('') self.assertEqual(set([(revid,)]), set(rev_index.indexed_revisions())) # The terms posting-lists for a simple commit should be: # The date (TODO, needs some thought on how to represent a date term) # The commiter name, email, commit message, bug fixes, properties # paths present # content of documents. expected_terms = { ('first',): set([('r', '', revid), ('f', 'an-id', revid)]), ('post',): set([('r', '', revid)]), ("This",): set([('f', 'an-id', revid)]), ("is",): set([('f', 'an-id', revid)]), ("the",): set([('f', 'an-id', revid)]), ("commit",): set([('f', 'an-id', revid)]), ("to",): set([('f', 'an-id', revid)]), ("this",): set([('f', 'an-id', revid)]), ("working",): set([('f', 'an-id', revid)]), ("tree",): set([('f', 'an-id', revid)]), ('an-id', revid): set([('p', '', 'README.txt')]), ('Baa',): set([('r', '', revid)]), ('Foo',): set([('r', '', revid)]), ('Joe',): set([('r', '', revid)]), ('Soap',): set([('r', '', revid)]), ('foo@example.com',): set([('r', '', revid)]), ('joe@acme.com',): set([('r', '', revid)]), ('http://bugtrack.org/1234',): set([('r', '', revid)]), ('http://bugtrack.org/5678',): set([('r', '', revid)]), } all_terms = {} for term, posting_list in rev_index.all_terms(): all_terms[term] = set(posting_list) self.assertEqual(expected_terms, all_terms) def test_deleted_path_not_indexed_format_1(self): tree = self.make_branch_and_tree('') rev_index = index.init_index(tree.branch, 1) tree.add(['README.txt'], ['an-id'], ['file']) tree.put_file_bytes_non_atomic('an-id', "content.\n") revid = tree.commit('add') tree.remove(['README.txt']) revid2 = tree.commit('delete') rev_index.index_revisions(tree.branch, [revid, revid2]) self.assertEqual(set([(revid,), (revid2,)]), set(rev_index.indexed_revisions())) rev_index = index.open_index_url('') all_terms = {} for term, posting_list in rev_index.all_terms(): all_terms[term] = set(posting_list) # A deleted path is indexed at the point of deletion, and format one # does not support this, so must not have a posting list for it. self.assertFalse(('an-id', revid2) in all_terms) return # To test for presence, we would look for: self.assertSubset([('an-id', revid2)], all_terms) self.assertEqual(set([('p', '', 'README.txt')]), all_terms[('an-id', revid2)]) def test_knit_snapshots_not_indexed(self): # knit snapshots are a contributing factor to getting too-many hits. # instead only new lines should really be considered. # Setup - knits do not expose where snapshots occur, so to test # this we create three versions of a file, which differ nearly entirely # between serial versions. This should trigger the heuristics on # aggregate size causing the third one to be a snapshot; it should not # be indexed with content matching the lines carried across from the # first or second commits. # Need a knit-compression using format to test: tree = self.make_branch_and_tree('', format="1.9") rev_index, revid3 = self.make_indexes_deltas_fixture(tree) tree.lock_read() self.assertEqual('fulltext', tree.branch.repository.texts._index.get_method(('an-id', revid3))) tree.unlock() self.assertIndexedDeltas(tree, rev_index, revid3) def test_2a_indexes_deltas(self): # With 2a formats we should be indexing deltas. # Setup - All 2a commits are full text, so its pretty simple, we just # reuse the setup for test_knit_snapshots_not_indexed but do not make # any assertions about the storage of the texts. tree = self.make_branch_and_tree('', format="2a") rev_index, revid3 = self.make_indexes_deltas_fixture(tree) self.assertIndexedDeltas(tree, rev_index, revid3) def make_indexes_deltas_fixture(self, tree): """Setup a tree with tree commits to be indexed.""" tree.add(['README.txt'], ['an-id'], ['file']) tree.put_file_bytes_non_atomic('an-id', "small\ncontent\n") rev_index = index.init_index(tree.branch) tree.commit('') tree.put_file_bytes_non_atomic('an-id', "more\nmore\ncontent\nmore\nmore\nmore\n") tree.commit('') tree.put_file_bytes_non_atomic('an-id', "other\nother\ncontent\nother\nother\nother\n") revid3 = tree.commit('') return rev_index, revid3 def assertIndexedDeltas(self, tree, rev_index, revid3): """Assert that tree's text get indexed using deltas not full texts.""" rev_index.index_revisions(tree.branch, [revid3]) self.assertEqual(set([(revid3,)]), set(rev_index.indexed_revisions())) rev_index = index.open_index_url('') expected_terms = { ('an-id', revid3): set([('p', '', 'README.txt')]), ('jrandom@example.com',): set([('r', '', revid3)]), ('other',): set([('f', 'an-id', revid3)]), } all_terms = {} for term, posting_list in rev_index.all_terms(): all_terms[term] = set(posting_list) self.assertEqual(expected_terms, all_terms) class TestSearching(TestCaseWithTransport): def test_search_no_hits(self): tree = self.make_branch_and_tree('') rev_index = index.init_index(tree.branch) # No exception because its a generator (and thus not guaranteed to run # to completion). self.assertEqual([], list(rev_index.search([('missing_term',)]))) def test_search_trivial(self): tree = self.make_branch_and_tree('tree') rev_index = index.init_index(tree.branch) # The double-space is a cheap smoke test for the tokeniser. revid = tree.commit('first post') rev_index.index_revisions(tree.branch, [revid]) results = list(rev_index.search([('post',)])) self.assertEqual(1, len(results)) self.assertIsInstance(results[0], index.RevisionHit) self.assertEqual((revid,), results[0].revision_key) def test_search_trivial_exclude(self): tree = self.make_branch_and_tree('tree') rev_index = index.init_index(tree.branch) # The double-space is a cheap smoke test for the tokeniser. revid1 = tree.commit('first post') revid2 = tree.commit('second post') rev_index.index_revisions(tree.branch, [revid1, revid2]) results = list(rev_index.search([('post',), ('-first',)])) self.assertEqual(1, len(results)) self.assertIsInstance(results[0], index.RevisionHit) self.assertEqual((revid2,), results[0].revision_key) def test_search_only_exclude(self): tree = self.make_branch_and_tree('tree') rev_index = index.init_index(tree.branch) # The double-space is a cheap smoke test for the tokeniser. revid1 = tree.commit('first post') revid2 = tree.commit('second post') rev_index.index_revisions(tree.branch, [revid1, revid2]) self.assertRaises(TypeError, list, rev_index.search([('-first',)])) self.knownFailure('exclude-only searches not implemented') results = list(rev_index.search([('-first',)])) self.assertEqual(1, len(results)) self.assertIsInstance(results[0], index.RevisionHit) self.assertEqual((revid2,), results[0].revision_key) def test_suggestions_trivial(self): tree = self.make_branch_and_tree('tree') rev_index = index.init_index(tree.branch) revid = tree.commit('first') rev_index.index_branch(tree.branch, revid) # f matches self.assertEqual([('first',)], list(rev_index.suggest([('f',)]))) self.assertEqual([('first',)], list(rev_index.suggest([('fi',)]))) self.assertEqual([('first',)], list(rev_index.suggest([('fir',)]))) self.assertEqual([('first',)], list(rev_index.suggest([('fir',)]))) self.assertEqual([('first',)], list(rev_index.suggest([('firs',)]))) self.assertEqual([('first',)], list(rev_index.suggest([('first',)]))) self.assertEqual([], list(rev_index.suggest([('firste',)]))) def test_suggestions_two_terms(self): """With two terms only matching suggestions are made.""" tree = self.make_branch_and_tree('tree') rev_index = index.init_index(tree.branch) revid = tree.commit('term suggestion') rev_index.index_branch(tree.branch, revid) # suggesting ('term',), ('suggest',) matches suggestion, # and suggestion ('missing',), ('suggest',) matches nothing. self.assertEqual([('suggestion',)], list(rev_index.suggest([('term',), ('suggest',)]))) self.assertEqual([], list(rev_index.suggest([('missing',), ('suggest',)]))) class TestResults(TestCaseWithTransport): def test_TextHit(self): tree = self.make_branch_and_tree('tree') search_index = index.init_index(tree.branch) tree.add(['README.txt'], ['an-id'], ['file']) tree.put_file_bytes_non_atomic('an-id', "This is the \nfirst commit \nto this working tree.\n" ) rev_id1 = tree.commit('commit') search_index.index_branch(tree.branch, rev_id1) query = [('commit',)] result = index.FileTextHit(search_index, tree.branch.repository, ('an-id', rev_id1), query) tree.lock_read() self.addCleanup(tree.unlock) self.assertEqualDiff( u"README.txt in revision '%s'." % (rev_id1), result.document_name()) self.assertEqual(('an-id', rev_id1), result.text_key) self.assertEqual('first commit ', result.summary()) def test_RevisionHit(self): tree = self.make_branch_and_tree('tree') rev_id1 = tree.commit('a multi\nline message') result = index.RevisionHit(tree.branch.repository, (rev_id1,)) tree.lock_read() self.addCleanup(tree.unlock) self.assertEqualDiff(u"Revision id '%s'." % rev_id1, result.document_name()) self.assertEqual((rev_id1,), result.revision_key) self.assertEqual('a multi', result.summary()) class TestComponentIndexBuilder(TestCaseWithTransport): def test_documents(self): builder = index.ComponentIndexBuilder(self.format) self.assertEqual("0", builder.add_document(('r', '', 'revid'))) self.assertEqual("1", builder.add_document(('r', '', 'other-revid'))) self.assertEqual("0", builder.add_document(('r', '', 'revid'))) doc_index = builder.document_index nodes = sorted(list(doc_index.iter_all_entries())) self.assertEqual([(doc_index, ("0",), "r revid"), (doc_index, ("1",), "r other-revid")], nodes) def test_posting_list(self): builder = index.ComponentIndexBuilder(self.format) # adding a term adds its documents builder.add_term(("term1",), [('r', '', 'revid'), ('r', '', 'other-revid')]) doc_index = builder.document_index nodes = sorted(list(doc_index.iter_all_entries())) self.assertEqual([(doc_index, ("0",), "r revid"), (doc_index, ("1",), "r other-revid")], nodes) # and the term refers to document ids self.assertEqual(set(["0", "1"]), set(builder.posting_list(("term1",)))) # adding a new term adds unique documents builder.add_term(("term2",), [('r', '', 'revid'), ('r', '', 'third-revid')]) nodes = sorted(list(doc_index.iter_all_entries())) # and refers to the correct ids self.assertEqual([(doc_index, ("0",), "r revid"), (doc_index, ("1",), "r other-revid"), (doc_index, ("2",), "r third-revid")], nodes) self.assertEqual(set(["0", "1"]), set(builder.posting_list(("term1",)))) self.assertEqual(set(["0", "2"]), set(builder.posting_list(("term2",)))) # adding a term twice extends the posting list rather than replacing it # or erroring. builder.add_term(("term1",), [('r', '', 'revid'), ('r', '', 'fourth-revid')]) nodes = sorted(list(doc_index.iter_all_entries())) # and refers to the correct ids self.assertEqual([(doc_index, ("0",), "r revid"), (doc_index, ("1",), "r other-revid"), (doc_index, ("2",), "r third-revid"), (doc_index, ("3",), "r fourth-revid"), ], nodes) self.assertEqual(set(["0", "1", "3"]), set(builder.posting_list(("term1",)))) self.assertEqual(set(["0", "2"]), set(builder.posting_list(("term2",)))) def test_2_term_posting_list(self): builder = index.ComponentIndexBuilder(self.format) # adding a term adds its documents builder.add_term(("term1", "term12"), [('r', '', 'revid'), ('r', '', 'other-revid')]) doc_index = builder.document_index nodes = sorted(list(doc_index.iter_all_entries())) self.assertEqual([(doc_index, ("0",), "r revid"), (doc_index, ("1",), "r other-revid")], nodes) # and the term refers to document ids self.assertEqual(set(["0", "1"]), set(builder.posting_list(("term1", "term12")))) # adding a new term adds unique documents builder.add_term(("term2", "term12"), [('r', '', 'revid'), ('r', '', 'third-revid')]) nodes = sorted(list(doc_index.iter_all_entries())) # and refers to the correct ids self.assertEqual([(doc_index, ("0",), "r revid"), (doc_index, ("1",), "r other-revid"), (doc_index, ("2",), "r third-revid")], nodes) self.assertEqual(set(["0", "1"]), set(builder.posting_list(("term1", "term12")))) self.assertEqual(set(["0", "2"]), set(builder.posting_list(("term2", "term12")))) # adding a term twice extends the posting list rather than replacing it # or erroring. builder.add_term(("term1", "term12"), [('r', '', 'revid'), ('r', '', 'fourth-revid')]) nodes = sorted(list(doc_index.iter_all_entries())) # and refers to the correct ids self.assertEqual([(doc_index, ("0",), "r revid"), (doc_index, ("1",), "r other-revid"), (doc_index, ("2",), "r third-revid"), (doc_index, ("3",), "r fourth-revid"), ], nodes) self.assertEqual(set(["0", "1", "3"]), set(builder.posting_list(("term1", "term12")))) self.assertEqual(set(["0", "2"]), set(builder.posting_list(("term2", "term12")))) # Single-element terms are not erroneously being used self.assertEqual(set(), set(builder.posting_list(("term1",)))) self.assertEqual(set(), set(builder.posting_list(("term2",)))) def test_add_revision(self): builder = index.ComponentIndexBuilder(self.format) # adding a revision lists the revision, does not alter document keys # etc. builder.add_revision('foo') nodes = sorted(list(builder.document_index.iter_all_entries())) self.assertEqual([], nodes) self.assertEqual({}, builder.terms) nodes = sorted(list(builder.revision_index.iter_all_entries())) self.assertEqual([(builder.revision_index, ("foo",), "")], nodes) class TestComponentCombiner(TestCaseWithTransport): def test_combine_two_components_overlapping_data(self): # create one component: transport = self.get_transport() components = [] builder = index.ComponentIndexBuilder(self.format) builder.add_revision('rev1') builder.add_revision('rev-common') builder.add_term(("term1",), [('r', '', 'rev1'), ('r', '', 'rev-common')]) builder.add_term(("term-common",), [('r', '', 'rev1'), ('r', '', 'rev-common')]) builder.add_term(("term", "complex"), [('f', 'foo', 'rev1')]) name, value, elements = builder.upload_index(transport) component1 = index.ComponentIndex(self.format, name, value, transport) components.append(component1) builder = index.ComponentIndexBuilder(self.format) builder.add_revision('rev-common') builder.add_revision('rev2') builder.add_term(("term-common",), [('r', '', 'rev2'), ('r', '', 'rev-common')]) builder.add_term(("term2",), [('r', '', 'rev2'), ('r', '', 'other-revid')]) name, value, elements = builder.upload_index(transport) component2 = index.ComponentIndex(self.format, name, value, transport) components.append(component2) combiner = index.ComponentCombiner(self.format, components, transport) name, value, elements = combiner.combine() combined = index.ComponentIndex(self.format, name, value, transport) terms = {} terms[('term-common',)] = set([('r', '', 'rev-common'), ('r', '', 'rev1'), ('r', '', 'rev2')]) terms[('term1',)] = set([('r', '', 'rev-common'), ('r', '', 'rev1')]) terms[('term2',)] = set([('r', '', 'other-revid'), ('r', '', 'rev2')]) terms[('term', 'complex')] = set([('f', 'foo', 'rev1')]) self.assertEqual(terms, combined.all_terms()) self.assertEqual(set([('rev1',), ('rev2',), ('rev-common',)]), set(combined.indexed_revisions())) def test_combine_two_components_path_spaces(self): # create one component: transport = self.get_transport() components = [] builder = index.ComponentIndexBuilder(self.format) builder.add_revision('revid') builder.add_term(("file", "revid"), [('p', '', 'file path')]) name, value, elements = builder.upload_index(transport) component1 = index.ComponentIndex(self.format, name, value, transport) components.append(component1) builder = index.ComponentIndexBuilder(self.format) builder.add_revision('revid1') name, value, elements = builder.upload_index(transport) component2 = index.ComponentIndex(self.format, name, value, transport) components.append(component2) combiner = index.ComponentCombiner(self.format, components, transport) name, value, elements = combiner.combine() combined = index.ComponentIndex(self.format, name, value, transport) terms = {('file', 'revid'): set([('p', '', 'file path')])} self.assertEqual(terms, combined.all_terms()) self.assertEqual(set([('revid',), ('revid1',)]), set(combined.indexed_revisions())) class TestAutoIndex(TestCaseWithTransport): def test_no_index_no_error(self): tree = self.make_branch_and_tree("foo") search._install_hooks() tree.commit('foo') def test_index_is_populated(self): search._install_hooks() tree = self.make_branch_and_tree("foo") search_index = index.init_index(tree.branch) revid1 = tree.commit('foo') self.assertEqual(set([(revid1,)]), set(search_index.indexed_revisions())) class TestGraphIndexSuggestions(TestCaseWithTransport): """Tests for the SuggestableGraphIndex subclass.""" def test_key_length_1_no_hits(self): builder = self.format[0](0, 1) # We want nodes before and after the suggestions, to check boundaries. builder.add_node(('pre',), '', ()) builder.add_node(('prep',), '', ()) transport = self.get_transport() length = transport.put_file('index', builder.finish()) query_index = self.format[1](transport, 'index', length) # Now, searching for suggestions for 'pref' should find nothing. self.assertEqual([], list(query_index.iter_entries_starts_with(('pref',)))) def test_key_length_1_iteration(self): builder = self.format[0](0, 1) # We want nodes before and after the suggestions, to check boundaries. builder.add_node(('pre',), '', ()) builder.add_node(('prep',), '', ()) # We want some entries to find. builder.add_node(('pref',), '', ()) builder.add_node(('preferential',), '', ()) transport = self.get_transport() length = transport.put_file('index', builder.finish()) query_index = self.format[1](transport, 'index', length) # Now, searching for suggestions for 'pref' should find 'pref' and # 'preferential'. self.assertEqual([ (query_index, ('pref',), ''), (query_index, ('preferential',), ''), ], list(query_index.iter_entries_starts_with(('pref',)))) def test_key_length_2_no_hits(self): builder = self.format[0](0, 2) # We want nodes before and after the suggestions, to check boundaries. # As there are two elements in each key, we want to check this for each # element, which implies 4 boundaries: builder.add_node(('pre', 'pref'), '', ()) builder.add_node(('pref', 'pre'), '', ()) builder.add_node(('pref', 'prep'), '', ()) builder.add_node(('prep', 'pref'), '', ()) transport = self.get_transport() length = transport.put_file('index', builder.finish()) query_index = self.format[1](transport, 'index', length) # Now, searching for suggestions for 'pref', 'pref' should find # nothing. self.assertEqual([], list(query_index.iter_entries_starts_with(('pref', 'pref')))) def test_key_length_2_iteration(self): builder = self.format[0](0, 2) # We want nodes before and after the suggestions, to check boundaries. # - the first element of the key must be an exact match, the second is # a startswith match, so provide non-match entries that match the second # in case of bugs there. builder.add_node(('pre', 'pref'), '', ()) builder.add_node(('pref', 'pre'), '', ()) builder.add_node(('pref', 'prep'), '', ()) builder.add_node(('prep', 'pref'), '', ()) # We want some entries to find. builder.add_node(('pref', 'pref'), '', ()) builder.add_node(('pref', 'preferential'), '', ()) transport = self.get_transport() length = transport.put_file('index', builder.finish()) query_index = self.format[1](transport, 'index', length) # Now, searching for suggestions for 'pref' should find 'pref' and # 'preferential'. self.assertEqual([ (query_index, ('pref', 'pref'), ''), (query_index, ('pref', 'preferential'), ''), ], sorted(query_index.iter_entries_starts_with(('pref', 'pref')))) class TestLogFilter(TestCaseWithTransport): def test_registered(self): self.assertTrue(index.make_disable_search_filter in log.log_adapters) self.assertTrue(index.make_log_search_filter in log.log_adapters) self.assertFalse(index._original_make_search_filter in log.log_adapters) def test_get_filter_no_index(self): tree = self.make_branch_and_tree('foo') base_iterator = 'base' # bzr-search won't kick in self.assertEqual(base_iterator, index.make_log_search_filter( tree.branch, False, {'': "\\bword\\b"}, base_iterator)) # so the disabling wrapper must. self.assertNotEqual(base_iterator, index.make_disable_search_filter( tree.branch, False, {'': "\\bword\\b"}, base_iterator)) def test_get_filter_too_complex(self): """A too complex regex becomes a baseline search.""" # We test this by searching for something that a index search would # miss but a regex search finds tree = self.make_branch_and_tree('foo') revid = tree.commit('first post') index.index_url(self.get_url('foo')) rev = tree.branch.repository.get_revision(revid) input_iterator = [[((revid, 1, 0), rev, None)]] if bzrlib_version >= (2, 5): match = {'': "st po"} else: match = "st po" rev_log_iterator = index.make_disable_search_filter( tree.branch, False, match, input_iterator) self.assertNotEqual(input_iterator, rev_log_iterator) # everything matches self.assertEqual(input_iterator, list(rev_log_iterator)) # bzr-search won't kick in self.assertEqual(input_iterator, index.make_log_search_filter( tree.branch, False, match, input_iterator)) def test_get_filter_searchable_regex(self): """A parsable regex becomes a index search.""" # We test this by searching for something that a index search would # miss hit, and crippling the baseline search reference. self.saved_orig = index._original_make_search_filter def restore(): index._original_make_search_filter = self.saved_orig self.addCleanup(restore) index._original_make_search_filter = None tree = self.make_branch_and_tree('foo') revid = tree.commit('first post') revid2 = tree.commit('second post') index.index_url(self.get_url('foo')) input_iterator = [ [((revid2, 2, 0), None, None), ((revid, 1, 0), None, None)]] # the disabled filter must not kick in self.assertEqual(input_iterator, index.make_disable_search_filter( tree.branch, False, {'': "\\bfirst\\b"}, input_iterator)) # we must get a functional search from the log search filter. rev_log_iterator = index.make_log_search_filter( tree.branch, False, {'': "\\bfirst\\b"}, input_iterator) self.assertNotEqual(input_iterator, rev_log_iterator) # rev id 2 should be filtered out. expected_result = [[((revid, 1, 0), None, None)]] self.assertEqual(expected_result, list(rev_log_iterator)) def test_query_from_regex(self): self.assertEqual(None, index.query_from_regex("foo")) self.assertEqual(None, index.query_from_regex("fo o")) self.assertEqual(None, index.query_from_regex("\\bfoo \\b")) self.assertEqual([("foo",)], index.query_from_regex("\\bfoo\\b")) bzr-search-1.7.0~bzr94/tests/test_inventory.py0000644000000000000000000001351111273673353017621 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for the inventory specific logic.""" from bzrlib.plugins.search import inventory from bzrlib.tests import TestCaseWithTransport class TestPathFromInventory(TestCaseWithTransport): def get_inventory(self, tree, revid): tree.lock_read() try: inventories = tree.branch.repository.inventories return inventories.get_record_stream([(revid,)], 'unordered', True).next().get_bytes_as('fulltext') finally: tree.unlock() def test_paths_from_ids_basic(self): tree = self.make_branch_and_tree('foo', format="1.14") t = self.get_transport('foo') t.mkdir('subdir') t.mkdir('subdir/nested-dir') t.put_bytes('subdir/nested-dir/the file', 'content') # The ids are in reverse sort order to try to exercise corner cases in # xml processing. tree.add(['subdir', 'subdir/nested-dir', 'subdir/nested-dir/the file'], ['3dir', '2dir', '1file']) revid = tree.commit('first post') xml = self.get_inventory(tree, revid) serializer = tree.branch.repository._serializer # All individually: self.assertEqual({'1file':'subdir/nested-dir/the file'}, inventory.paths_from_ids(xml, serializer, ['1file'])) self.assertEqual({'2dir':'subdir/nested-dir'}, inventory.paths_from_ids(xml, serializer, ['2dir'])) self.assertEqual({'3dir':'subdir'}, inventory.paths_from_ids(xml, serializer, ['3dir'])) # In twos: self.assertEqual({'1file':'subdir/nested-dir/the file', '2dir':'subdir/nested-dir'}, inventory.paths_from_ids(xml, serializer, ['1file', '2dir'])) self.assertEqual({'1file':'subdir/nested-dir/the file', '3dir':'subdir'}, inventory.paths_from_ids(xml, serializer, ['1file', '3dir'])) self.assertEqual({'2dir':'subdir/nested-dir', '3dir':'subdir'}, inventory.paths_from_ids(xml, serializer, ['3dir', '2dir'])) # All together self.assertEqual({'1file':'subdir/nested-dir/the file', '2dir':'subdir/nested-dir', '3dir':'subdir'}, inventory.paths_from_ids(xml, serializer, ['1file', '2dir', '3dir'])) def test_paths_from_ids_rich_root(self): tree = self.make_branch_and_tree('foo', format="rich-root-pack") tree.set_root_id('a-root') t = self.get_transport('foo') t.mkdir('subdir') tree.add(['subdir'], ['3dir']) revid = tree.commit('first post') xml = self.get_inventory(tree, revid) serializer = tree.branch.repository._serializer # All individually: self.assertEqual({'3dir':'subdir'}, inventory.paths_from_ids(xml, serializer, ['3dir'])) self.assertEqual({'a-root':''}, inventory.paths_from_ids(xml, serializer, ['a-root'])) # In twos: self.assertEqual({'3dir':'subdir', 'a-root':''}, inventory.paths_from_ids(xml, serializer, ['3dir', 'a-root'])) def test_format_5_unique_root(self): from bzrlib.xml5 import Serializer_v5 serializer = Serializer_v5() xml = [ '\n', '\n', '\n', '\n', '\n' ] # All individually is enough to test the lookup of the root id/path: self.assertEqual({'__init__':'__init__.py'}, inventory.paths_from_ids(xml, serializer, ['__init__'])) self.assertEqual({'tests':'tests'}, inventory.paths_from_ids(xml, serializer, ['tests'])) self.assertEqual({'test.py':'tests/test.py'}, inventory.paths_from_ids(xml, serializer, ['test.py'])) def test_escaped_chars(self): """Inventories with escaping attributes (&'"<>) are matched ok.""" from bzrlib.xml5 import Serializer_v5 serializer = Serializer_v5() xml = [ '\n', '\n', '\n', '\n', '\n' ] # Lookup an id that has every escape self.assertEqual({'&\'"<>':'__init__.py'}, inventory.paths_from_ids(xml, serializer, ['&\'"<>'])) # Get the path for a name which is escaped self.assertEqual({'test.py':'&\'"<>/><"\'&'}, inventory.paths_from_ids(xml, serializer, ['test.py'])) bzr-search-1.7.0~bzr94/tests/test_remote.py0000644000000000000000000001164711663256321017062 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2011 Jelmer Vernooij # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for the smart server verbs.""" from bzrlib import tests from bzrlib.branch import Branch from bzrlib.smart import ( request as smart_req, ) from bzrlib.plugins.search import ( errors, index, ) from bzrlib.plugins.search.remote import ( RemoteIndex, SmartServerBranchRequestOpenIndex, ) class TestSmartServerBranchRequestOpenIndex( tests.TestCaseWithMemoryTransport): def test_missing(self): """For an empty branch, the result is ('no', ).""" backing = self.get_transport() request = SmartServerBranchRequestOpenIndex(backing) self.make_branch('.') self.assertEqual(smart_req.SmartServerResponse(('no', )), request.execute('')) def test_present(self): """For a branch with an index, ('yes', ) is returned.""" backing = self.get_transport() request = SmartServerBranchRequestOpenIndex(backing) b = self.make_branch('.') index.init_index(b) self.assertEqual(smart_req.SmartServerResponse(('yes', )), request.execute('')) class TestRemoteIndex(tests.TestCaseWithTransport): def test_no_index(self): local_branch = self.make_branch('.') remote_transport = self.make_smart_server('.') remote_branch = Branch.open_from_transport(remote_transport) self.assertRaises(errors.NoSearchIndex, RemoteIndex.open, remote_branch) def test_open(self): local_branch = self.make_branch('.') index.init_index(local_branch) remote_transport = self.make_smart_server('.') remote_branch = Branch.open_from_transport(remote_transport) idx = RemoteIndex.open(remote_branch) self.assertIsInstance(idx, RemoteIndex) def test_init(self): local_branch = self.make_branch('.') remote_transport = self.make_smart_server('.') remote_branch = Branch.open_from_transport(remote_transport) idx = index.init_index(remote_branch) self.assertIsInstance(idx, RemoteIndex) def test_init_exists(self): local_branch = self.make_branch('.') index.init_index(local_branch) remote_transport = self.make_smart_server('.') remote_branch = Branch.open_from_transport(remote_transport) #self.assertRaises( index.init_index, remote_branch) class TestWithRemoteIndex(tests.TestCaseWithTransport): def make_remote_index(self): tree = self.make_branch_and_tree('.') local_branch = tree.branch index.init_index(local_branch) remote_transport = self.make_smart_server('.') remote_branch = Branch.open_from_transport(remote_transport) return tree, remote_branch, RemoteIndex.open(remote_branch) def test_index_revisions(self): tree, branch, index = self.make_remote_index() tree.commit(message="message", rev_id='revid1') index.index_revisions(branch, ['revid1']) self.assertEquals([('revid1',)], list(index.indexed_revisions())) def test_indexed_revisions(self): tree, branch, remote_index = self.make_remote_index() tree.commit(message="message", rev_id='revid1') self.assertEquals([], list(remote_index.indexed_revisions())) local_index = index.open_index_branch(tree.branch) local_index.index_revisions(tree.branch, ['revid1']) self.assertEquals([('revid1',)], list(remote_index.indexed_revisions())) def test_suggest(self): tree, branch, remote_index = self.make_remote_index() tree.commit(message="first", rev_id='revid1') local_index = index.open_index_branch(tree.branch) local_index.index_revisions(tree.branch, ['revid1']) self.assertEquals([(u'first',)], list(remote_index.suggest([(u'f',)]))) def test_search(self): tree, branch, remote_index = self.make_remote_index() # The double-space is a cheap smoke test for the tokeniser. revid = tree.commit('first post') remote_index.index_revisions(branch, [revid]) results = list(remote_index.search([('post',)])) self.assertEqual(1, len(results)) self.assertIsInstance(results[0], index.RevisionHit) self.assertEqual((revid,), results[0].revision_key) bzr-search-1.7.0~bzr94/tests/test_transport.py0000644000000000000000000000440211024076032017600 0ustar 00000000000000# search, a bzr plugin for searching within bzr branches/repositories. # Copyright (C) 2008 Robert Collins # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as published # by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA # """Tests for the transport layer.""" from bzrlib.plugins.search import transport from bzrlib.tests import TestCaseWithTransport class TestFileView(TestCaseWithTransport): def get_bulk_and_view_data(self): """Get sample data for a view on a file.""" bulk_data = [] for count in range(4096): bulk_data.append(str(count)) bulk_data = ":".join(bulk_data) view_data = bulk_data[400:1600] file_map = {"Foo.1": (400, 1600)} base_transport = self.get_transport(".") base_transport.put_bytes("foo.pack", bulk_data) return bulk_data, view_data, file_map def test_get(self): bulk_data, view_data, file_map = self.get_bulk_and_view_data() base_transport = self.get_transport(".") view = transport.FileView(base_transport, "foo.pack", file_map) # Doing a get() returns a file which only contains the view_data. visible_bytes = view.get("Foo.1").read() self.assertEqual(visible_bytes, view_data) def test_readv(self): bulk_data, view_data, file_map = self.get_bulk_and_view_data() base_transport = self.get_transport(".") view = transport.FileView(base_transport, "foo.pack", file_map) # Doing a readv for '' on view is trimmed to the data between 400 and # 1600. for offset, data in view.readv('Foo.1', [(0, 10), (700, 100)], True, 800): matching_data = view_data[offset:offset + len(data)] self.assertEqual(matching_data, data)