mechanize-0.2.5/setup.cfg

[egg_info]
tag_build = 
tag_date = 0
tag_svn_revision = 0

mechanize-0.2.5/COPYING.txt

All the code with the exception of _gzip.py is covered under either the
BSD-style license immediately below, or (at your choice) the ZPL 2.1. The
code in _gzip.py is taken from the effbot.org library, and falls under the
effbot.org license (also BSD-style) that appears at the end of this file.

Copyright (c) 2002-2010 John J. Lee
Copyright (c) 1997-1999 Gisle Aas
Copyright (c) 1997-1999 Johnny Lee
Copyright (c) 2003 Andy Lester

BSD-style License
==================

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

Neither the name of the contributors nor the names of their employers may be
used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ZPL 2.1
==================

Zope Public License (ZPL) Version 2.1

A copyright notice accompanies this license document that identifies the
copyright holders.

This license has been certified as open source. It has also been designated
as GPL compatible by the Free Software Foundation (FSF).

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions in source code must retain the accompanying copyright
   notice, this list of conditions, and the following disclaimer.

2. Redistributions in binary form must reproduce the accompanying copyright
   notice, this list of conditions, and the following disclaimer in the
   documentation and/or other materials provided with the distribution.

3. Names of the copyright holders must not be used to endorse or promote
   products derived from this software without prior written permission from
   the copyright holders.

4. The right to distribute this software or to use it for any purpose does
   not give you the right to use Servicemarks (sm) or Trademarks (tm) of the
   copyright holders. Use of them is covered by separate agreement with the
   copyright holders.

5. If any files are modified, you must cause the modified files to carry
   prominent notices stating that you changed the files and the date of any
   change.
Disclaimer

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------

The effbot.org Library is

Copyright (c) 1999-2004 by Secret Labs AB
Copyright (c) 1999-2004 by Fredrik Lundh

By obtaining, using, and/or copying this software and/or its associated
documentation, you agree that you have read, understood, and will comply
with the following terms and conditions:

Permission to use, copy, modify, and distribute this software and its
associated documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appears in all copies, and that
both that copyright notice and this permission notice appear in supporting
documentation, and that the name of Secret Labs AB or the author not be used
in advertising or publicity pertaining to distribution of the software
without specific, written prior permission.

SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS
SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.

--------------------------------------------------------------------

mechanize-0.2.5/README.txt

See INSTALL.txt for installation instructions.

See docs/html/index.html and docstrings for documentation.

If you have a git working tree rather than a release, you'll only have the
markdown source, e.g. mechanize/index.txt; release.py is used to build the
HTML docs.

mechanize-0.2.5/INSTALL.txt

To install mechanize:

See the web page for the version of Python required (included here as
docs/html/index.html).

To install the package, run the following command:

    python setup.py install

Alternatively, just copy the whole mechanize directory into a directory on
your Python path (e.g. unix: /usr/local/lib/python2.7/site-packages,
Windows: C:\Python27\Lib\site-packages). Only copy the mechanize directory
that's inside the distributed tarball / zip archive, not the entire
mechanize-x.x.x directory!

John J. Lee
July 2010

mechanize-0.2.5/docs/faq.txt

% mechanize -- FAQ
* Which version of Python do I need?

  Python 2.4, 2.5, 2.6, or 2.7. Python 3 is not yet supported.

* Does mechanize depend on BeautifulSoup?

  No. mechanize offers a few classes that make use of BeautifulSoup, but
  these classes are not required to use mechanize. mechanize bundles
  BeautifulSoup version 2, so that module is no longer required. A future
  version of mechanize will support BeautifulSoup version 3, at which point
  mechanize will likely no longer bundle the module.

* Does mechanize depend on ClientForm?

  No, ClientForm is now part of mechanize.

* Which license?

  mechanize is dual-licensed: you may pick either the
  [BSD license](http://www.opensource.org/licenses/bsd-license.php), or the
  [ZPL 2.1](http://www.zope.org/Resources/ZPL) (both are included in the
  distribution).

Usage
-----

* I'm not getting the HTML page I expected to see.

  [Debugging tips](hints.html)

* `Browser` doesn't have all of the forms/links I see in the HTML. Why not?

  Perhaps the default parser can't cope with invalid HTML. Try using the
  included BeautifulSoup 2 parser instead:

~~~~{.python}
import mechanize
browser = mechanize.Browser(factory=mechanize.RobustFactory())
browser.open("http://example.com/")
print browser.forms
~~~~

  Alternatively, you can process the HTML (and headers) arbitrarily:

~~~~{.python}
browser = mechanize.Browser()
browser.open("http://example.com/")
html = browser.response().get_data().replace("<br/>", "<br />")
response = mechanize.make_response(
    html, [("Content-Type", "text/html")],
    "http://example.com/", 200, "OK")
browser.set_response(response)
~~~~

* Is JavaScript supported?

  No, sorry. See [FAQs](#change-value) [below](#script).

* My HTTP response data is truncated.

  `mechanize.Browser`'s response objects support the `.seek()` method, and
  can still be used after `.close()` has been called. Response data is not
  fetched until it is needed, so navigation away from a URL before fetching
  all of the response will truncate it. Call `response.get_data()` before
  navigation if you don't want that to happen.

* I'm *sure* this page is HTML, why does `mechanize.Browser` think otherwise?

~~~~{.python}
b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off.  If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
)
~~~~

* Why don't timeouts work for me?

  Timeouts are ignored with versions of Python earlier than 2.6. Timeouts do
  not apply to DNS lookups.

* Is there any example code?

  Look in the `examples/` directory. Note that the examples on the
  [forms page](./forms.html) are executable as-is. Contributions of example
  code would be very welcome!

Cookies
-------

* Doesn't the standard Python library module, `Cookie`, do this?

  No: module `Cookie` does the server end of the job. It doesn't know when
  to accept cookies from a server or when to send them back.

  Part of mechanize has been contributed back to the standard library as
  module `cookielib` (there are a few differences, notably that `cookielib`
  contains thread synchronization code; mechanize does not use `cookielib`).

* Which HTTP cookie protocols does mechanize support?

  Netscape and [RFC 2965](http://www.ietf.org/rfc/rfc2965.txt). RFC 2965
  handling is switched off by default.

* What about RFC 2109?
  RFC 2109 cookies are currently parsed as Netscape cookies, and treated by
  default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled, or
  as Netscape cookies otherwise.

* Why don't I have any cookies?

  See [here](hints.html#cookies).

* My response claims to be empty, but I know it's not!

  Did you call `response.read()` (e.g., in a debug statement), then forget
  that all the data has already been read? In that case, you may want to use
  `mechanize.response_seek_wrapper`. `mechanize.Browser` always returns
  [seekable responses](doc.html#seekable-responses), so it's not necessary
  to use this explicitly in that case.

* What's the difference between the `.load()` and `.revert()` methods of
  `CookieJar`?

  `.load()` *appends* cookies from a file. `.revert()` discards all existing
  cookies held by the `CookieJar` first (but it won't lose any existing
  cookies if the loading fails).

* Is it threadsafe?

  No. As far as I know, you can use mechanize in threaded code, but it
  provides no synchronisation: you have to provide that yourself.

* How do I do X?

  Refer to the API documentation in docstrings.

Forms
-----

* Doesn't the standard Python library module, `cgi`, do this?

  No: the `cgi` module does the server end of the job. It doesn't know how
  to parse or fill in a form or how to send it back to the server.

* How do I figure out what control names and values to use?

  `print form` is usually all you need. In your code, things like the
  `HTMLForm.items` attribute of `HTMLForm` instances can be useful to
  inspect forms at runtime. Note that it's possible to use item labels
  instead of item names, which can be useful — use the `by_label` arguments
  to the various methods, and the `.get_value_by_label()` /
  `.set_value_by_label()` methods on `ListControl`.

* What do those `'*'` characters mean in the string representations of list
  controls?

  A `*` next to an item means that item is selected.

* What do those parentheses (round brackets) mean in the string
  representations of list controls?
  Parentheses `(foo)` around an item mean that item is disabled.

* Why doesn't a control turn up in the data returned by `.click*()` when
  that control has non-`None` value?

  Either the control is disabled, or it is not successful for some other
  reason. 'Successful' (see the [HTML 4
  specification](http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.2))
  means that the control will cause data to get sent to the server.

* Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for
  `RADIO` and multiple-selection `SELECT` controls?

  Because by default, it follows browser behaviour when setting the
  initially-selected items in list controls that have no items explicitly
  selected in the HTML. Use the `select_default` argument to `ParseResponse`
  if you want to follow the RFC 1866 rules instead. Note that browser
  behaviour violates the HTML 4.01 specification in the case of `RADIO`
  controls.

* Why does `.click()`ing on a button not work for me?

    * Clicking on a `RESET` button doesn't do anything, by design - this is
      a library for web automation, not an interactive browser. Even in an
      interactive browser, clicking on `RESET` sends nothing to the server,
      so there is little point in having `.click()` do anything special
      here.

    * Clicking on a `BUTTON TYPE=BUTTON` doesn't do anything either, also by
      design. This time, the reason is that `BUTTON` is only in the HTML
      standard so that one can attach JavaScript callbacks to its events.
      Their execution may result in information getting sent back to the
      server. mechanize, however, knows nothing about these callbacks, so it
      can't do anything useful with a click on a `BUTTON` whose type is
      `BUTTON`.

    * Generally, JavaScript may be messing things up in all kinds of ways.
      See the answer to the next question.

* How do I change `INPUT TYPE=HIDDEN` field values (for example, to emulate
  the effect of JavaScript code)?

  As with any control, set the control's `readonly` attribute false.
~~~~{.python}
form.find_control("foo").readonly = False # allow changing .value of control foo
form.set_all_readonly(False) # allow changing the .value of all controls
~~~~

* I'm having trouble debugging my code.

  See [here](hints.html) for a few relevant tips.

* I have a control containing a list of integers. How do I select the one
  whose value is nearest to the one I want?

~~~~{.python}
import bisect
def closest_int_value(form, ctrl_name, value):
    values = sorted(int(item.name) for item in
                    form.find_control(ctrl_name).items)
    # find the insertion point, then compare the neighbours on either side
    i = bisect.bisect_left(values, value)
    if i == 0:
        return str(values[0])
    if i == len(values):
        return str(values[-1])
    before, after = values[i - 1], values[i]
    return str(before if value - before <= after - value else after)

form["distance"] = [closest_int_value(form, "distance", 23)]
~~~~

General
-------

* I want to see what my web browser is doing, but standard network sniffers
  like [wireshark](http://www.wireshark.org/) or netcat (nc) don't work for
  HTTPS. How do I sniff HTTPS traffic?

  Three good options:

    * Mozilla plugin: [LiveHTTPHeaders](http://livehttpheaders.mozdev.org/).

    * [ieHTTPHeaders](http://www.blunck.info/iehttpheaders.html) does the
      same for MSIE.

    * Use [`lynx`](http://lynx.browser.org/) `-trace`, and filter out the
      junk with a script.

* JavaScript is messing up my web-scraping. What do I do?

  JavaScript is used in web pages for many purposes -- for example: creating
  content that was not present in the page at load time, submitting or
  filling in parts of forms in response to user actions, setting cookies,
  etc. mechanize does not provide any support for JavaScript.

  If you come across this in a page you want to automate, you have four
  options. Here they are, roughly in order of simplicity.

    * Figure out what the JavaScript is doing and emulate it in your Python
      code: for example, by manually adding cookies to your `CookieJar`
      instance, calling methods on `HTMLForm`s, calling `urlopen`, etc. See
      [above](#change-value) re forms.

    * Use Java's [HtmlUnit](http://htmlunit.sourceforge.net/) or
      [HttpUnit](http://httpunit.sourceforge.net) from Jython, since they
      know some JavaScript.
    * Instead of using mechanize, automate a browser instead. For example,
      use MS Internet Explorer via its COM automation interfaces, using the
      [Python for Windows
      extensions](http://starship.python.net/crew/mhammond/), aka pywin32,
      aka win32all (e.g. [simple
      function](http://vsbabu.org/mt/archives/2003/06/13/ie_automation.html),
      [pamie](http://pamie.sourceforge.net/); [pywin32 chapter from the
      O'Reilly
      book](http://www.oreilly.com/catalog/pythonwin32/chapter/ch12.html))
      or [ctypes](http://python.net/crew/theller/ctypes/)
      ([example](http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305273)).
      [This](http://www.brunningonline.net/simon/blog/archives/winGuiAuto.py.html)
      kind of thing may also come in useful on Windows for cases where the
      automation API is lacking. For Firefox, there is
      [PyXPCOM](https://developer.mozilla.org/en/PyXPCOM).

    * Get ambitious and automatically delegate the work to an appropriate
      interpreter (Mozilla's JavaScript interpreter, for instance). This is
      what HtmlUnit and HttpUnit do. I did a spike along these lines some
      years ago, but I think it would (still) be quite a lot of work to do
      well.

* Misc links

    * The following libraries can be useful for dealing with bad HTML:
      [lxml.html](http://codespeak.net/lxml/lxmlhtml.html),
      [html5lib](http://code.google.com/p/html5lib/),
      [BeautifulSoup 3](http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html),
      [mxTidy](http://www.egenix.com/files/python/mxTidy.html) and
      [mu-Tidylib](http://utidylib.berlios.de/).

    * [Selenium](http://www.openqa.org/selenium/): In-browser web functional
      testing. If you need to test websites against real browsers, this is a
      standard way to do it.

    * O'Reilly book: [Spidering
      Hacks](http://oreilly.com/catalog/9780596005771). Very Perl-oriented.
    * Standard extensions for web development with Firefox, which are also
      handy if you're scraping the web: [Web
      Developer](http://chrispederick.com/work/webdeveloper/) (amongst other
      things, this can display HTML form information), and
      [Firebug](http://getfirebug.com/).

    * Similar functionality for IE6 and IE7: [Internet Explorer Developer
      Toolbar](http://www.google.co.uk/search?q=internet+explorer+developer+toolbar&btnI=I'm+Feeling+Lucky)
      (IE8 comes with something equivalent built in, as does Google Chrome).

    * [Open source functional testing
      tools](http://www.opensourcetesting.org/functional.php).

    * [A HOWTO on web
      scraping](http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html)
      from Dave Kuhlman.

* Will any of this code make its way into the Python standard library?

  The request / response processing extensions to `urllib2` from mechanize
  have been merged into `urllib2` for Python 2.4. The cookie processing has
  been added, as module `cookielib`. There are other features that would be
  appropriate additions to `urllib2`, but since Python 2 is heading into
  bugfix-only mode, and I'm not using Python 3, they're unlikely to be
  added.

* Where can I find out about the relevant standards?

    * [HTML 4.01 Specification](http://www.w3.org/TR/html401/)
    * [Draft HTML 5 Specification](http://dev.w3.org/html5/spec/)
    * [RFC 1866](http://www.ietf.org/rfc/rfc1866.txt) - the HTML 2.0
      standard (you don't want to read this)
    * [RFC 1867](http://www.ietf.org/rfc/rfc1867.txt) - Form-based file
      upload
    * [RFC 2616](http://www.ietf.org/rfc/rfc2616.txt) - HTTP 1.1
      Specification
    * [RFC 3986](http://www.ietf.org/rfc/rfc3986.txt) - URIs
    * [RFC 3987](http://www.ietf.org/rfc/rfc3987.txt) - IRIs
mechanize-0.2.5/docs/html/doc.html

mechanize — Documentation

This documentation is in need of reorganisation!

This page is the old ClientCookie documentation. It deals with operation on the level of urllib2 Handler objects, and also with adding headers, debugging, and cookie handling. See the front page for more typical use.

Examples

import mechanize
response = mechanize.urlopen("http://example.com/")

This function behaves identically to urllib2.urlopen(), except that it deals with cookies automatically.

Here is a more complicated example, involving Request objects (useful if you want to pass Requests around, add headers to them, etc.):

import mechanize
request = mechanize.Request("http://example.com/")
# note we're using the urlopen from mechanize, not urllib2
response = mechanize.urlopen(request)
# let's say this next request requires a cookie that was set
# in response
request2 = mechanize.Request("http://example.com/spam.html")
response2 = mechanize.urlopen(request2)

print response2.geturl()
print response2.info() # headers
print response2.read() # body (readline and readlines work too)

In these examples, the workings are hidden inside the mechanize.urlopen() function, which is an extension of urllib2.urlopen(). Redirects, proxies and cookies are handled automatically by this function (note that you may need a bit of configuration to get your proxies correctly set up: see urllib2 documentation).

There is also a urlretrieve() function, which works like urllib.urlretrieve().

An example at a slightly lower level shows how the module processes cookies more clearly:

# Don't copy this blindly!  You probably want to follow the examples
# above, not this one.
import mechanize

# Build an opener that *doesn't* automatically call .add_cookie_header()
# and .extract_cookies(), so we can do it manually without interference.
class NullCookieProcessor(mechanize.HTTPCookieProcessor):
    def http_request(self, request): return request
    def http_response(self, request, response): return response
opener = mechanize.build_opener(NullCookieProcessor)

request = mechanize.Request("http://example.com/")
response = mechanize.urlopen(request)
cj = mechanize.CookieJar()
cj.extract_cookies(response, request)
# let's say this next request requires a cookie that was set in response
request2 = mechanize.Request("http://example.com/spam.html")
cj.add_cookie_header(request2)
response2 = mechanize.urlopen(request2)

The CookieJar class does all the work. There are essentially two operations: .extract_cookies() extracts HTTP cookies from Set-Cookie (the original Netscape cookie standard) and Set-Cookie2 (RFC 2965) headers from a response if and only if they should be set given the request, and .add_cookie_header() adds Cookie headers if and only if they are appropriate for a particular HTTP request. Incoming cookies are checked for acceptability based on the host name, etc. Cookies are only set on outgoing requests if they match the request’s host name, path, etc.

Note that if you’re using mechanize.urlopen() (or if you’re using mechanize.HTTPCookieProcessor by some other means), you don’t need to call .extract_cookies() or .add_cookie_header() yourself. If, on the other hand, you want to use mechanize to provide cookie handling for an HTTP client other than mechanize itself, you will need to use this pair of methods. You can make your own request and response objects, which must support the interfaces described in the docstrings of .extract_cookies() and .add_cookie_header().
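The same pair of operations can be driven by hand. The sketch below is written against Python 3's http.cookiejar (the standard-library descendant of this code, as noted above) rather than mechanize itself; FakeResponse, the URLs, and the cookie values are made up for illustration:

```python
# Sketch: manual use of the extract_cookies / add_cookie_header pair,
# using Python 3's http.cookiejar.  FakeResponse and the URLs are
# illustrative only.
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Provides just what .extract_cookies() documents: an info() method
    returning the response headers."""

    def __init__(self, headers):
        self._headers = Message()
        for name, value in headers:
            self._headers[name] = value

    def info(self):
        return self._headers


cj = CookieJar()
request = Request("http://example.com/")
response = FakeResponse([("Set-Cookie", "sessionid=abc123; Path=/")])
cj.extract_cookies(response, request)   # stored only if the policy allows it

request2 = Request("http://example.com/spam.html")
cj.add_cookie_header(request2)          # added only if host/path/etc. match
assert request2.get_header("Cookie") == "sessionid=abc123"
```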

There are also some CookieJar subclasses which can store cookies in files and databases. FileCookieJar is the abstract class for CookieJars that can store cookies in disk files. LWPCookieJar saves cookies in a format compatible with the libwww-perl library. This class is convenient if you want to store cookies in a human-readable file:

import mechanize
cj = mechanize.LWPCookieJar()
cj.revert("cookie3.txt")
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
r = opener.open("http://foobar.com/")
cj.save("cookie3.txt")

The .revert() method discards all existing cookies held by the CookieJar (it won’t lose any existing cookies if the load fails). The .load() method, on the other hand, adds the loaded cookies to existing cookies held in the CookieJar (old cookies are kept unless overwritten by newly loaded ones).
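The append-versus-replace difference is easy to demonstrate with the standard-library descendant of this code (Python 3's http.cookiejar); the file path and cookie names below are invented for the example:

```python
# Sketch of .load() (appends) vs .revert() (replaces), using Python 3's
# http.cookiejar.  Paths and cookie names are made up for illustration.
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value):
    # Minimal version-0 cookie; most constructor arguments are boilerplate.
    return Cookie(0, name, value, None, False, "example.com", True, False,
                  "/", True, False, None, False, None, None, {})


path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")
jar = LWPCookieJar()
jar.set_cookie(make_cookie("a", "1"))
jar.save(path, ignore_discard=True)      # file now holds cookie "a"

jar.set_cookie(make_cookie("b", "2"))
jar.load(path, ignore_discard=True)      # appends: "a" and "b" both held
assert sorted(c.name for c in jar) == ["a", "b"]

jar.revert(path, ignore_discard=True)    # replaces: only the saved "a" survives
assert [c.name for c in jar] == ["a"]
```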

MozillaCookieJar can load and save to the Mozilla/Netscape/lynx-compatible 'cookies.txt' format. This format loses some information (unusual and nonstandard cookie attributes such as comment, and also information specific to RFC 2965 cookies). The subclass MSIECookieJar can load (but not save) from Microsoft Internet Explorer’s cookie files on Windows.

Important note

Only use names you can import directly from the mechanize package, and that don’t start with a single underscore. Everything else is subject to change or disappearance without notice.

Cooperating with Browsers

Firefox since version 3 persists cookies in an sqlite database, which is not supported by MozillaCookieJar.

The subclass MozillaCookieJar differs from CookieJar only in storing cookies using a different, Firefox 2/Mozilla/Netscape-compatible, file format known as "cookies.txt". The lynx browser also uses this format. This file format can't store RFC 2965 cookies, so they are downgraded to Netscape cookies on saving. LWPCookieJar itself uses a libwww-perl specific format ('Set-Cookie3') — see the example above. Python and your browser should be able to share a cookies file (note that the file location here will differ on non-unix OSes):

WARNING: you may want to back up your browser’s cookies file if you use MozillaCookieJar to save cookies. I think it works, but there have been bugs in the past!

import os, mechanize
cookies = mechanize.MozillaCookieJar()
cookies.load(os.path.join(os.environ["HOME"], ".netscape/cookies.txt"))
# see also the save and revert methods

Note that cookies saved while Mozilla is running will get clobbered by Mozilla — see MozillaCookieJar.__doc__.
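For reference, the "cookies.txt" layout mentioned above is line-oriented: one cookie per line, seven tab-separated fields. The parser below is a rough, illustrative sketch (not mechanize code), and the sample cookie is invented:

```python
# Rough sketch of the Netscape "cookies.txt" layout.  Illustrative only;
# this is not mechanize's parser.
def parse_cookies_txt(text):
    cookies = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        domain, flag, path, secure, expires, name, value = line.split("\t")
        cookies.append({
            "domain": domain,
            "allow_subdomains": flag == "TRUE",
            "path": path,
            "secure": secure == "TRUE",
            "expires": int(expires),
            "name": name,
            "value": value,
        })
    return cookies


sample = ".example.com\tTRUE\t/\tFALSE\t2147483647\tsessionid\tabc123"
parsed = parse_cookies_txt(sample)
assert parsed[0]["name"] == "sessionid"
```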

MSIECookieJar does the same for Microsoft Internet Explorer (MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this format. In future, the Windows API calls might be used to load and save (though the index has to be read directly, since there is no API for that, AFAIK; there’s also an unfinished MSIEDBCookieJar, which uses (reads and writes) the Windows MSIE cookie database directly, rather than storing copies of cookies as MSIECookieJar does).

import mechanize
cj = mechanize.MSIECookieJar(delayload=True)
cj.load_from_registry() # finds cookie index file from registry

A true delayload argument speeds things up.

On Windows 9x (win 95, win 98, win ME), you need to supply a username to the .load_from_registry() method:

cj.load_from_registry(username="jbloggs")

Konqueror/Safari and Opera use different file formats, which aren’t yet supported.

Saving cookies in a file

If you have no need to co-operate with a browser, the most convenient way to save cookies on disk between sessions in human-readable form is to use LWPCookieJar. This class uses a libwww-perl specific format ('Set-Cookie3'). Unlike MozillaCookieJar, this file format doesn't lose information.

Supplying a CookieJar

You might want to do this to use your browser’s cookies, to customize CookieJar’s behaviour by passing constructor arguments, or to be able to get at the cookies it will hold (for example, for saving cookies between sessions and for debugging).

If you’re using the higher-level urllib2-like interface (urlopen(), etc), you’ll have to let it know what CookieJar it should use:

import mechanize
cookies = mechanize.CookieJar()
# build_opener() adds standard handlers (such as HTTPHandler and
# HTTPCookieProcessor) by default. The cookie processor we supply
# will replace the default one.
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

r = opener.open("http://example.com/") # GET
r = opener.open("http://example.com/", data) # POST (data is a urlencoded string)

The urlopen() function uses a global OpenerDirector instance to do its work, so if you want to use urlopen() with your own CookieJar, install the OpenerDirector you built with build_opener() using the mechanize.install_opener() function, then proceed as usual:

mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")

Of course, everyone using urlopen is using the same global CookieJar instance!

You can set a policy object (must satisfy the interface defined by mechanize.CookiePolicy), which determines which cookies are allowed to be set and returned. Use the policy argument to the CookieJar constructor, or use the .set_policy() method. The default implementation has some useful switches:

from mechanize import CookieJar, DefaultCookiePolicy as Policy
cookies = CookieJar()
# turn on RFC 2965 cookies, be more strict about domains when setting and
# returning Netscape cookies, and block some domains from setting cookies
# or having them returned (read the DefaultCookiePolicy docstring for the
# domain matching rules here)
policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict,
                blocked_domains=["ads.net", ".ads.net"])
cookies.set_policy(policy)
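Beyond the constructor switches, a policy can also be customised by subclassing. The sketch below is written against Python 3's http.cookiejar (the standard-library descendant of this code); the domain names and the NoAdsPolicy class are placeholders for illustration:

```python
# Sketch of a custom policy via subclassing, using Python 3's
# http.cookiejar.  NoAdsPolicy and the domains are illustrative only.
from http.cookiejar import Cookie, DefaultCookiePolicy
from urllib.request import Request


class NoAdsPolicy(DefaultCookiePolicy):
    """Refuse to store any cookie from *ads.net; defer to the default
    policy for everything else."""

    def set_ok(self, cookie, request):
        if cookie.domain.endswith("ads.net"):
            return False
        return DefaultCookiePolicy.set_ok(self, cookie, request)


def netscape_cookie(domain):
    # Minimal version-0 cookie; most constructor arguments are boilerplate.
    return Cookie(0, "n", "v", None, False, domain, False, False, "/",
                  False, False, None, False, None, None, {})


policy = NoAdsPolicy()
allowed = policy.set_ok(netscape_cookie("example.com"),
                        Request("http://example.com/"))
blocked = policy.set_ok(netscape_cookie("tracker.ads.net"),
                        Request("http://tracker.ads.net/"))
```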

Additional Handlers

The following handlers are provided in addition to those provided by urllib2:

HTTPRobotRulesProcessor

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. This kind of program can place significant loads on web servers, so there is a standard for a robots.txt file by which web site operators can request robots to keep out of their site, or out of particular areas of it. This handler uses the standard Python library’s robotparser module. It raises mechanize.RobotExclusionError (subclass of mechanize.HTTPError) if an attempt is made to open a URL prohibited by robots.txt.
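The underlying check can be exercised without a network by feeding robots.txt rules straight to the standard library's parser (urllib.robotparser in Python 3; robotparser in Python 2). The rules and URLs below are placeholders:

```python
# The robots.txt check itself, using the standard library's parser.
# Rules and URLs are illustrative.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("User-agent: *\nDisallow: /private/\n".splitlines())

assert rp.can_fetch("MyBot/1.0", "http://example.com/index.html")
assert not rp.can_fetch("MyBot/1.0", "http://example.com/private/x.html")
```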

HTTPEquivProcessor

The <META HTTP-EQUIV> tag is a way of including data in HTML to be treated as if it were part of the HTTP headers. mechanize can automatically read these tags and add the HTTP-EQUIV headers to the response object’s real HTTP headers. The HTML is left unchanged.
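Roughly, the handler scans the HTML for meta tags with an http-equiv attribute and collects them as extra headers. A sketch of that scan using Python's html.parser (this is not the actual mechanize implementation):

```python
# Sketch: collect <meta http-equiv=...> tags as (header, value) pairs.
# Not mechanize's actual implementation.
from html.parser import HTMLParser


class EquivExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "http-equiv" in d and "content" in d:
                self.headers.append((d["http-equiv"], d["content"]))


p = EquivExtractor()
p.feed('<html><head>'
       '<meta http-equiv="Refresh" content="5; url=/next">'
       '</head><body></body></html>')
assert p.headers == [("Refresh", "5; url=/next")]
```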

HTTPRefreshProcessor

The Refresh HTTP header is a non-standard header which is widely used. It requests that the user-agent follow a URL after a specified time delay. mechanize can treat these headers (which may have been set in <META HTTP-EQUIV> tags) as if they were 302 redirections. Exactly when and how Refresh headers are handled is configurable using the constructor arguments.
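A Refresh header value is essentially a delay, optionally followed by a URL. The function below is a simplified sketch of that parse; mechanize's own handling differs in details:

```python
# Sketch of parsing a Refresh header value into (delay, url).
# Simplified; mechanize's own parsing differs in details.
def parse_refresh(value):
    parts = value.split(";", 1)
    delay = float(parts[0].strip())
    url = None
    if len(parts) == 2:
        rest = parts[1].strip()
        if rest.lower().startswith("url="):
            url = rest[4:].strip()
    return delay, url


assert parse_refresh("5; url=/next") == (5.0, "/next")
assert parse_refresh("0") == (0.0, None)
```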

HTTPRefererProcessor

The Referer HTTP header lets the server know which URL you’ve just visited. Some servers use this header as state information, and don’t like it if this is not present. It’s a chore to add this header by hand every time you make a request. This adds it automatically. NOTE: this only makes sense if you use each handler for a single chain of HTTP requests (so, for example, if you use a single HTTPRefererProcessor to fetch a series of URLs extracted from a single page, this will break). mechanize.Browser does this properly.

Example:

import mechanize
cookies = mechanize.CookieJar()

opener = mechanize.build_opener(mechanize.HTTPRefererProcessor,
                                mechanize.HTTPEquivProcessor,
                                mechanize.HTTPRefreshProcessor,
                                )
opener.open("http://www.rhubarb.com/")

Seekable responses

Response objects returned from (or raised as exceptions by) mechanize.SeekableResponseOpener, mechanize.UserAgent (if .set_seekable_responses(True) has been called) and mechanize.Browser() have .seek(), .get_data() and .set_data() methods:

import mechanize
opener = mechanize.OpenerFactory(mechanize.SeekableResponseOpener).build_opener()
response = opener.open("http://example.com/")
# same return value as .read(), but without affecting seek position
total_nr_bytes = len(response.get_data())
assert len(response.read()) == total_nr_bytes
assert len(response.read()) == 0 # we've already read the data
response.seek(0)
assert len(response.read()) == total_nr_bytes
response.set_data("blah\n")
assert response.get_data() == "blah\n"
...

This caching behaviour can be avoided by using mechanize.OpenerDirector. It can also be avoided with mechanize.UserAgent. Note that HTTPEquivProcessor and HTTPResponseDebugProcessor require seekable responses and so are not compatible with mechanize.OpenerDirector and mechanize.UserAgent.

import mechanize
ua = mechanize.UserAgent()
ua.set_seekable_responses(False)
ua.set_handle_equiv(False)
ua.set_debug_responses(False)

Note that if you turn on features that use seekable responses (currently: HTTP-EQUIV handling and response body debug printing), returned responses may be seekable as a side-effect of these features. However, this is not guaranteed (currently, in these cases, returned response objects are seekable, but raised response objects, i.e. mechanize.HTTPError instances, are not seekable). This applies regardless of whether you use mechanize.UserAgent or mechanize.OpenerDirector. If you explicitly request seekable responses by calling .set_seekable_responses(True) on a mechanize.UserAgent instance, or by using mechanize.Browser or mechanize.SeekableResponseOpener (which always return seekable responses), then both returned and raised responses are guaranteed to be seekable.

Handlers should call response = mechanize.seek_wrapped_response(response) if they require the .seek(), .get_data() or .set_data() methods.
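
The behaviour described above can be illustrated with a toy stand-in. The class below is a hypothetical sketch, not mechanize's response_seek_wrapper (the real wrapper also defers fetching until the data is needed, and wraps genuine HTTP responses):

```python
import io

# Hypothetical toy model of a seekable response body supporting the
# .read()/.seek()/.get_data()/.set_data() semantics described above.
class SeekableData:
    def __init__(self, data):
        self._data = data
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

    def seek(self, pos):
        self._buf.seek(pos)

    def get_data(self):
        # same bytes as .read() from position 0, without moving the seek position
        return self._data

    def set_data(self, data):
        self._data = data
        self._buf = io.BytesIO(data)
```

After reading to the end, .read() returns an empty string until you .seek(0), exactly as in the example above; .get_data() is unaffected by the seek position throughout.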

Request object lifetime

Note that handlers may create new Request instances (for example when performing redirects) rather than adding headers to existing Request objects.

Adding headers

Adding headers is done like so:

import mechanize
req = mechanize.Request("http://foobar.com/")
req.add_header("Referer", "http://wwwsearch.sourceforge.net/mechanize/")
r = mechanize.urlopen(req)

You can also use the headers argument to the mechanize.Request constructor.

mechanize adds some headers to Request objects automatically — see the next section for details.

Automatically-added headers

OpenerDirector automatically adds a User-Agent header to every Request.

To change this and/or add similar headers, use your own OpenerDirector:

import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]

Again, to use urlopen(), install your OpenerDirector globally:

mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")

Also, a few standard headers (Content-Length, Content-Type and Host) are added when the Request is passed to urlopen() (or OpenerDirector.open()). You shouldn’t need to change these headers, but since this is done by AbstractHTTPHandler, you can change the way it works by passing a subclass of that handler to build_opener() (or, as always, by constructing an opener yourself and calling .add_handler()).

Initiating unverifiable transactions

This section is only of interest for correct handling of third-party HTTP cookies. See below for an explanation of ‘third-party’.

First, some terminology.

An unverifiable request (defined fully by RFC 2965) is one whose URL the user did not have the option to approve. For example, a transaction is unverifiable if the request is for an image in an HTML document, and the user had no option to approve the fetching of the image from a particular URL.

The request-host of the origin transaction (defined fully by RFC 2965) is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this is the request-host of the request for the page containing the image.

mechanize knows that redirected transactions are unverifiable, and will handle that on its own (ie. you don’t need to think about the origin request-host or verifiability yourself).

If you want to initiate an unverifiable transaction yourself (which you should if, for example, you’re downloading the images from a page, and ‘the user’ hasn’t explicitly OKed those URLs):

request = mechanize.Request("http://www.example.com/images/logo.gif",
                            origin_req_host="www.example.com",
                            unverifiable=True)

RFC 2965 support

Support for the RFC 2965 protocol is switched off by default, because few browsers implement it, so the RFC 2965 protocol is essentially never seen on the internet. To switch it on, see here.

Parsing HTTP dates

A function named str2time is provided by the package, which may be useful for parsing dates in HTTP headers. str2time is intended to be liberal, since HTTP date/time formats are poorly standardised in practice. There is no need to use this function in normal operations: CookieJar instances keep track of cookie lifetimes automatically. This function will stay around in some form, though the supported date/time formats may change.
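
For comparison, the standard library can handle the common well-formed case (RFC 1123 dates). Like str2time, the hypothetical helper below returns seconds since the epoch; it is an illustration only, since str2time accepts many more sloppy date variants:

```python
import calendar
from email.utils import parsedate

# Stdlib analogue (a sketch, not mechanize.str2time itself) for well-formed
# HTTP dates such as "Wed, 02 Oct 2002 13:00:00 GMT".
def http_date_to_epoch(text):
    parts = parsedate(text)  # time 9-tuple, or None if unparseable
    if parts is None:
        return None
    return calendar.timegm(parts)  # interpret the tuple as UTC
```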

Dealing with bad HTML

XXX Intro

XXX Test me

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, March 2011.


mechanize-0.2.5/docs/html/download.html

mechanize — Download

There is more than one way to obtain mechanize:

Note re Windows and Mac support: currently the tests are only routinely run on Ubuntu 9.10 (“karmic”). However, as far as I know, mechanize works fine on Windows and Mac platforms.

easy_install

  1. Install EasyInstall

  2. easy_install mechanize

easy_install will automatically download the latest source code release and install it.

Source code release

  1. Download the source from one of the links below

  2. Unpack the source distribution and change directory to the resulting top-level directory.

  3. python setup.py install

This is a stable release.

All the documentation (these web pages, docstrings, and the changelog) is included in the distribution.

git repository

The git repository is here. To check it out:

  1. git clone git://github.com/jjlee/mechanize.git


John J. Lee, March 2011.


mechanize-0.2.5/docs/html/development.html

mechanize — Development

git repository

The git repository is here. To check it out:

git clone git://github.com/jjlee/mechanize.git

There is also another repository, which is only useful for making mechanize releases:

git clone git://github.com/jjlee/mechanize-build-tools.git

Old repository

The old SVN repository may be useful for viewing ClientForm history. ClientForm used to be a dependency of mechanize, but has been merged into mechanize as of release 0.2.0; the history wasn’t imported. To check out:

svn co http://codespeak.net/svn/wwwsearch/

Bug tracker

The bug tracker is here on github. It’s equally acceptable to file bugs on the tracker or post about them to the mailing list. Feel free to send patches too!

Mailing list

There is a mailing list.


John J. Lee, April 2010.


mechanize-0.2.5/docs/html/forms.html

mechanize — Forms

This documentation is in need of reorganisation!

This page is the old ClientForm documentation. ClientForm is now part of mechanize, but the documentation hasn’t been fully updated to reflect that: what’s here is correct, but not well-integrated with the rest of the documentation. This page deals with HTML form handling: parsing HTML forms, filling them in and returning the completed forms to the server. See the front page for how to obtain form objects from a mechanize.Browser.

Simple working example (examples/forms/simple.py in the source distribution):

import sys

from mechanize import ParseResponse, urlopen, urljoin

if len(sys.argv) == 1:
    uri = "http://wwwsearch.sourceforge.net/"
else:
    uri = sys.argv[1]

response = urlopen(urljoin(uri, "mechanize/example.html"))
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]
print form
form["comments"] = "Thanks, Gisle"

# form.click() returns a mechanize.Request object
# (see HTMLForm.click.__doc__ if you want to use only the forms support, and
# not the rest of mechanize)
print urlopen(form.click()).read()

A more complicated working example (from examples/forms/example.py in the source distribution):

import sys

import mechanize

if len(sys.argv) == 1:
    uri = "http://wwwsearch.sourceforge.net/"
else:
    uri = sys.argv[1]

request = mechanize.Request(mechanize.urljoin(uri, "mechanize/example.html"))
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
response.close()
## f = open("example.html")
## forms = mechanize.ParseFile(f, "http://example.com/example.html",
##                             backwards_compat=False)
## f.close()
form = forms[0]
print form # very useful!

# A 'control' is a graphical HTML form widget: a text entry box, a
# dropdown 'select' list, a checkbox, etc.

# Indexing allows setting and retrieval of control values
original_text = form["comments"] # a string, NOT a Control instance
form["comments"] = "Blah."

# Controls that represent lists (checkbox, select and radio lists) are
# ListControl instances. Their values are sequences of list item names.
# They come in two flavours: single- and multiple-selection:
form["favorite_cheese"] = ["brie"] # single
form["cheeses"] = ["parmesan", "leicester", "cheddar"] # multi
# equivalent, but more flexible:
form.set_value(["parmesan", "leicester", "cheddar"], name="cheeses")

# Add files to FILE controls with .add_file(). Only call this multiple
# times if the server is expecting multiple files.
# add a file, default value for MIME type, no filename sent to server
form.add_file(open("data.dat"))
# add a second file, explicitly giving MIME type, and telling the server
# what the filename is
form.add_file(open("data.txt"), "text/plain", "data.txt")

# All Controls may be disabled (equivalent of greyed-out in browser)...
control = form.find_control("comments")
print control.disabled
# ...or readonly
print control.readonly
# readonly and disabled attributes can be assigned to
control.disabled = False
# convenience method, used here to make all controls writable (unless
# they're disabled):
form.set_all_readonly(False)

# A couple of notes about list controls and HTML:

# 1. List controls correspond to either a single SELECT element, or
# multiple INPUT elements. Items correspond to either OPTION or INPUT
# elements. For example, this is a SELECT control, named "control1":

# <select name="control1">
# <option>foo</option>
# <option value="1">bar</option>
# </select>

# and this is a CHECKBOX control, named "control2":

# <input type="checkbox" name="control2" value="foo" id="cbe1">
# <input type="checkbox" name="control2" value="bar" id="cbe2">

# You know the latter is a single control because all the name attributes
# are the same.

# 2. Item names are the strings that go to make up the value that should
# be returned to the server. These strings come from various different
# pieces of text in the HTML. The HTML standard and the mechanize
# docstrings explain in detail, but playing around with an HTML file,
# ParseFile() and 'print form' is very useful to understand this!

# You can get the Control instances from inside the form...
control = form.find_control("cheeses", type="select")
print control.name, control.value, control.type
control.value = ["mascarpone", "curd"]
# ...and the Item instances from inside the Control
item = control.get("curd")
print item.name, item.selected, item.id, item.attrs
item.selected = False

# Controls may be referred to by label:
# find control with label that has a *substring* "Cheeses"
# (e.g., a label "Please select a cheese" would match).
control = form.find_control(label="select a cheese")

# You can explicitly say that you're referring to a ListControl:
# set value of "cheeses" ListControl
form.set_value(["gouda"], name="cheeses", kind="list")
# equivalent:
form.find_control(name="cheeses", kind="list").value = ["gouda"]
# the first example is also almost equivalent to the following (but
# insists that the control be a ListControl -- so it will skip any
# non-list controls that come before the control we want)
form["cheeses"] = ["gouda"]
# The kind argument can also take values "multilist", "singlelist", "text",
# "clickable" and "file":
# find first control that will accept text, and scribble in it
form.set_value("rhubarb rhubarb", kind="text", nr=0)
# find, and set the value of, the first single-selection list control
form.set_value(["spam"], kind="singlelist", nr=0)

# You can find controls with a general predicate function:
def control_has_caerphilly(control):
    for item in control.items:
        if item.name == "caerphilly": return True
form.find_control(kind="list", predicate=control_has_caerphilly)

# HTMLForm.controls is a list of all controls in the form
for control in form.controls:
    if control.value == "inquisition": sys.exit()

# Control.items is a list of all Item instances in the control
for item in form.find_control("cheeses").items:
    print item.name

# To remove items from a list control, remove it from .items:
cheeses = form.find_control("cheeses")
curd = cheeses.get("curd")
del cheeses.items[cheeses.items.index(curd)]
# To add items to a list container, instantiate an Item with its control
# and attributes:
# Note that you are responsible for getting the attributes correct here,
# and these are not quite identical to the original HTML, due to
# defaulting rules and a few special attributes (e.g. Items that represent
# OPTIONs have a special "contents" key in their .attrs dict). In future
# there will be an explicitly supported way of using the parsing logic to
# add items and controls from HTML strings without knowing these details.
mechanize.Item(cheeses, {"contents": "mascarpone",
                         "value": "mascarpone"})

# You can specify list items by label using set/get_value_by_label() and
# the label argument of the .get() method. Sometimes labels are easier to
# maintain than names, sometimes the other way around.
form.set_value_by_label(["Mozzarella", "Caerphilly"], "cheeses")

# Which items are present, selected, and successful?
# is the "parmesan" item of the "cheeses" control successful (selected
# and not disabled)?
print "parmesan" in form["cheeses"]
# is the "parmesan" item of the "cheeses" control selected?
print "parmesan" in [
    item.name for item in form.find_control("cheeses").items if item.selected]
# does cheeses control have a "caerphilly" item?
print "caerphilly" in [item.name for item in form.find_control("cheeses").items]

# Sometimes one wants to set or clear individual items in a list, rather
# than setting the whole .value:
# select the item named "gorgonzola" in the first control named "cheeses"
form.find_control("cheeses").get("gorgonzola").selected = True
# You can be more specific:
# deselect "edam" in third CHECKBOX control
form.find_control(type="checkbox", nr=2).get("edam").selected = False
# deselect item labelled "Mozzarella" in control with id "chz"
form.find_control(id="chz").get(label="Mozzarella").selected = False

# Often, a single checkbox (a CHECKBOX control with a single item) is
# present. In that case, the name of the single item isn't of much
# interest, so it's a good idea to check and uncheck the box without
# using the item name:
form.find_control("smelly").items[0].selected = True # check
form.find_control("smelly").items[0].selected = False # uncheck

# Items may be disabled (selecting or de-selecting a disabled item is
# not allowed):
control = form.find_control("cheeses")
print control.get("emmenthal").disabled
control.get("emmenthal").disabled = True
# enable all items in control
control.set_all_items_disabled(False)

request2 = form.click() # mechanize.Request object
try:
    response2 = mechanize.urlopen(request2)
except mechanize.HTTPError, response2:
    pass

print response2.geturl()
# headers
for name, value in response2.info().items():
    if name != "date":
        print "%s: %s" % (name.title(), value)
print response2.read() # body
response2.close()

All of the standard control types are supported: TEXT, PASSWORD, HIDDEN, TEXTAREA, ISINDEX, RESET, BUTTON (INPUT TYPE=BUTTON and the various BUTTON types), SUBMIT, IMAGE, RADIO, CHECKBOX, SELECT/OPTION and FILE (for file upload). Both standard form encodings (application/x-www-form-urlencoded and multipart/form-data) are supported.
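
For reference, the application/x-www-form-urlencoded flavour is just the successful controls' name/value pairs, percent-escaped and joined with &. The sketch below uses the Python 3 stdlib spelling purely for illustration (mechanize, on Python 2, does the equivalent encoding for you when you call form.click()):

```python
from urllib.parse import urlencode  # Python 3; urllib.urlencode on Python 2

# A multi-valued control ("cheeses") contributes one pair per selected item.
pairs = [("comments", "Thanks, Gisle"),
         ("cheeses", "brie"),
         ("cheeses", "cheddar")]
body = urlencode(pairs)
# spaces become "+", reserved characters are percent-escaped
```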

The module is designed for testing and automation of web interfaces, not for implementing interactive user agents.

Security note: Remember that any passwords you store in HTMLForm instances will be saved to disk in the clear if, for example, you pickle them.

Parsers

There are two parsers.

TODO: more!

See also the FAQ entries on XHTML and parsing bad HTML.

Backwards-compatibility mode

mechanize (and ClientForm 0.2) includes three minor backwards-incompatible interface changes from ClientForm version 0.1.

To make upgrading from ClientForm 0.1 easier, and to allow me to stop supporting ClientForm 0.1 sooner, there is support for operating in a backwards-compatible mode, under which code written for ClientForm 0.1 should work without modification. This is done on a per-HTMLForm basis via the .backwards_compat attribute, but for convenience the ParseResponse() and ParseFile() factory functions accept backwards_compat arguments. These backwards-compatibility features will be removed soon. The default is to operate in backwards-compatible mode. To run with backwards-compatible mode turned OFF (strongly recommended):

from mechanize import ParseResponse, urlopen
forms = ParseResponse(urlopen("http://example.com/"), backwards_compat=False)
# ...

The backwards-incompatible changes are:

  • Ambiguous specification of controls or items now results in AmbiguityError. If you want the old behaviour, explicitly pass nr=0 to indicate you want the first matching control or item.

  • Item label matching is now done by substring, not by strict string-equality (but note leading and trailing space is always stripped). (Control label matching is always done by substring.)

  • Handling of disabled list items has changed. First, note that handling of disabled list items in ClientForm 0.1 (and in ClientForm 0.2’s backwards-compatibility mode!) is buggy: disabled items are successful (ie. disabled item names are sent back to the server). As a result, there was no distinction to be made between successful items and selected items. In ClientForm 0.2, the bug is fixed, so this is no longer the case, and it is important to note that list controls’ .value attribute contains only the successful item names; items that are selected but not successful (because disabled) are not included in .value. Second, disabled list items may no longer be deselected: AttributeError is raised in ClientForm 0.2, whereas deselection was allowed in ClientForm 0.1. The bug in ClientForm 0.1 and in ClientForm 0.2’s backwards-compatibility mode will not be fixed, to preserve compatibility and to encourage people to upgrade to the new ClientForm 0.2 backwards_compat=False behaviour.


John J. Lee, April 2010.


mechanize-0.2.5/docs/html/index.html

mechanize

Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize.

  • mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:

    • any URL can be opened, not just http:

    • mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().

  • Easy HTML form filling.

  • Convenient link parsing and following.

  • Browser history (.back() and .reload() methods).

  • The Referer HTTP header is added properly (optional).

  • Automatic observance of robots.txt.

  • Automatic handling of HTTP-Equiv and Refresh.

Examples

The examples below are written for a website that does not exist (example.com), so cannot be run. There are also some working examples that you can run.

import re
import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.
br["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__)
# Submit current form. Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()

# print currently selected form (don't call .submit() on this, use br.submit())
print br.form

response3 = br.back() # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though it was .close()d
response3.get_data() # like .seek(0) followed by .read()
response4 = br.reload() # fetches from server

for form in br.forms():
    print form
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()

You may control the browser’s policy by using the methods of mechanize.Browser’s base class, mechanize.UserAgent. For example:

br = mechanize.Browser()
# Explicitly configure proxies (Browser will attempt to set good defaults).
# Note the userinfo ("joe:password@") and port number (":3128") are optional.
br.set_proxies({"http": "joe:password@myproxy.example.com:3128",
                "ftp": "proxy.example.com",
                })
# Add HTTP Basic/Digest auth username and password for HTTP proxy access.
# (equivalent to using "joe:password@..." form above)
br.add_proxy_password("joe", "password")
# Add HTTP Basic/Digest auth username and password for website access.
br.add_password("http://example.com/protected/", "joe", "password")
# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)
# Don't add Referer (sic) header
br.set_handle_referer(False)
# Don't handle Refresh redirections
br.set_handle_refresh(False)
# Don't handle cookies
br.set_cookiejar(None)
# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
br.set_cookiejar(cj)
# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)

# To make sure you're seeing all debug output:
import logging
import sys
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)

mechanize exports the complete interface of urllib2:

import mechanize
response = mechanize.urlopen("http://www.example.com/")
print response.read()

When using mechanize, anything you would normally import from urllib2 should be imported from mechanize instead.

Credits

Much of the code was originally derived from the work of the following people:

  • Gisle Aas — libwww-perl

  • Jeremy Hylton (and many others) — urllib2

  • Andy Lester — WWW::Mechanize

  • Johnny Lee (coincidentally-named) — MSIE CookieJar Perl code from which mechanize’s support for that is derived.

Also:

  • Gary Poster and Benji York at Zope Corporation — contributed significant changes to the HTML forms code

  • Ronald Tschalar — provided help with Netscape cookies

Thanks also to the many people who have contributed bug reports and patches.

See also

There are several wrappers around mechanize designed for functional testing of web applications.

See the FAQ page for other links to related software.


John J. Lee, April 2010.


mechanize-0.2.5/docs/html/faq.html

mechanize — FAQ

  • Which version of Python do I need?

    Python 2.4, 2.5, 2.6, or 2.7. Python 3 is not yet supported.

  • Does mechanize depend on BeautifulSoup?

    No. mechanize offers a few classes that make use of BeautifulSoup, but these classes are not required to use mechanize. mechanize bundles BeautifulSoup version 2, so that module is no longer required. A future version of mechanize will support BeautifulSoup version 3, at which point mechanize will likely no longer bundle the module.

  • Does mechanize depend on ClientForm?

    No, ClientForm is now part of mechanize.

  • Which license?

    mechanize is dual-licensed: you may pick either the BSD license, or the ZPL 2.1 (both are included in the distribution).

Usage

  • I’m not getting the HTML page I expected to see.

    Debugging tips

  • Browser doesn’t have all of the forms/links I see in the HTML. Why not?

    Perhaps the default parser can’t cope with invalid HTML. Try using the included BeautifulSoup 2 parser instead:

import mechanize

browser = mechanize.Browser(factory=mechanize.RobustFactory())
browser.open("http://example.com/")
print list(browser.forms())

Alternatively, you can process the HTML (and headers) arbitrarily:

browser = mechanize.Browser()
browser.open("http://example.com/")
html = browser.response().get_data().replace("<br/>", "<br />")
response = mechanize.make_response(
    html, [("Content-Type", "text/html")],
    "http://example.com/", 200, "OK")
browser.set_response(response)
  • Is JavaScript supported?

    No, sorry. See FAQs below.

  • My HTTP response data is truncated.

    mechanize.Browser's response objects support the .seek() method, and can still be used after .close() has been called. Response data is not fetched until it is needed, so navigation away from a URL before fetching all of the response will truncate it. Call response.get_data() before navigation if you don’t want that to happen.

  • I’m sure this page is HTML, why does mechanize.Browser think otherwise?

b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off. If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
    )
  • Why don’t timeouts work for me?

    Timeouts are ignored with versions of Python earlier than 2.6. Timeouts do not apply to DNS lookups.

  • Is there any example code?

    Look in the examples/ directory. Note that the examples on the forms page are executable as-is. Contributions of example code would be very welcome!

Cookies

  • Doesn’t the standard Python library module, Cookie, do this?

    No: module Cookie does the server end of the job. It doesn’t know when to accept cookies from a server or when to send them back. Part of mechanize has been contributed back to the standard library as module cookielib (there are a few differences, notably that cookielib contains thread synchronization code; mechanize does not use cookielib).

  • Which HTTP cookie protocols does mechanize support?

    Netscape and RFC 2965. RFC 2965 handling is switched off by default.

  • What about RFC 2109?

    RFC 2109 cookies are currently parsed as Netscape cookies, and treated by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled, or as Netscape cookies otherwise.

  • Why don’t I have any cookies?

    See here.

  • My response claims to be empty, but I know it’s not!

    Did you call response.read() (e.g., in a debug statement), then forget that all the data has already been read? In that case, you may want to use mechanize.response_seek_wrapper. mechanize.Browser always returns seekable responses, so it’s not necessary to use this explicitly in that case.

  • What’s the difference between the .load() and .revert() methods of CookieJar?

    .load() appends cookies from a file. .revert() discards all existing cookies held by the CookieJar first (but it won’t lose any existing cookies if the loading fails).

  • Is it threadsafe?

    No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself.

  • How do I do <X>?

    Refer to the API documentation in docstrings.

Forms

  • Doesn’t the standard Python library module, cgi, do this?

    No: the cgi module does the server end of the job. It doesn’t know how to parse or fill in a form or how to send it back to the server.

  • How do I figure out what control names and values to use?

    print form is usually all you need. In your code, things like the HTMLForm.items attribute of HTMLForm instances can be useful to inspect forms at runtime. Note that it’s possible to use item labels instead of item names, which can be useful — use the by_label arguments to the various methods, and the .get_value_by_label() / .set_value_by_label() methods on ListControl.

  • What do those '*' characters mean in the string representations of list controls?

    A * next to an item means that item is selected.

  • What do those parentheses (round brackets) mean in the string representations of list controls?

    Parentheses (foo) around an item mean that item is disabled.

  • Why doesn’t <some control> turn up in the data returned by .click*() when that control has non-None value?

    Either the control is disabled, or it is not successful for some other reason. ‘Successful’ (see HTML 4 specification) means that the control will cause data to get sent to the server.

  • Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for RADIO and multiple-selection SELECT controls?

    Because by default, it follows browser behaviour when setting the initially-selected items in list controls that have no items explicitly selected in the HTML. Use the select_default argument to ParseResponse if you want to follow the RFC 1866 rules instead. Note that browser behaviour violates the HTML 4.01 specification in the case of RADIO controls.

  • Why does .click()ing on a button not work for me?

    • Clicking on a RESET button doesn’t do anything, by design - this is a library for web automation, not an interactive browser. Even in an interactive browser, clicking on RESET sends nothing to the server, so there is little point in having .click() do anything special here.

    • Clicking on a BUTTON TYPE=BUTTON doesn’t do anything either, also by design. This time, the reason is that BUTTON TYPE=BUTTON exists in the HTML standard only so that one can attach JavaScript callbacks to its events. Their execution may result in information getting sent back to the server. mechanize, however, knows nothing about these callbacks, so it can’t do anything useful with a click on a BUTTON whose type is BUTTON.

    • Generally, JavaScript may be messing things up in all kinds of ways. See the answer to the next question.

  • How do I change INPUT TYPE=HIDDEN field values (for example, to emulate the effect of JavaScript code)?

    As with any control, set the control’s readonly attribute to False.

form.find_control("foo").readonly = False # allow changing .value of control foo
form.set_all_readonly(False) # allow changing the .value of all controls
  • I’m having trouble debugging my code.

    See here for a few relevant tips.

  • I have a control containing a list of integers. How do I select the one whose value is nearest to the one I want?

import bisect

def closest_int_value(form, ctrl_name, value):
    # bisect requires a sorted sequence, so sort the item values first.
    values = sorted(int(item.name)
                    for item in form.find_control(ctrl_name).items)
    # Returns the largest value not exceeding the requested one.
    return str(values[bisect.bisect(values, value) - 1])

form["distance"] = [closest_int_value(form, "distance", 23)]
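The snippet above picks the largest value not exceeding the target (given sorted item values). If you want the value that is genuinely nearest in either direction, a stdlib-only helper along these lines works — closest_int and the sample values are illustrative, not part of mechanize:

```python
import bisect

def closest_int(sorted_values, target):
    """Return the member of sorted_values numerically nearest to target."""
    i = bisect.bisect_left(sorted_values, target)
    if i == 0:
        return sorted_values[0]        # target below the whole range
    if i == len(sorted_values):
        return sorted_values[-1]       # target above the whole range
    before, after = sorted_values[i - 1], sorted_values[i]
    # Ties go to the smaller value.
    return before if target - before <= after - target else after

print(closest_int([5, 10, 25, 50], 23))  # -> 25
```

You would then select the matching item with something like form["distance"] = [str(closest_int(values, 23))].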

General

  • I want to see what my web browser is doing, but standard network sniffers like wireshark or netcat (nc) don’t work for HTTPS. How do I sniff HTTPS traffic?

    Three good options:

  • JavaScript is messing up my web-scraping. What do I do?

    JavaScript is used in web pages for many purposes — for example: creating content that was not present in the page at load time, submitting or filling in parts of forms in response to user actions, setting cookies, etc. mechanize does not provide any support for JavaScript.

    If you come across this in a page you want to automate, you have four options. Here they are, roughly in order of simplicity.

    • Figure out what the JavaScript is doing and emulate it in your Python code: for example, by manually adding cookies to your CookieJar instance, calling methods on HTMLForms, calling urlopen, etc. See above re forms.

    • Use Java’s HtmlUnit or HttpUnit from Jython, since they know some JavaScript.

    • Instead of using mechanize, automate a full browser. For example, use MS Internet Explorer via its COM automation interfaces, using the Python for Windows extensions, aka pywin32, aka win32all (e.g. simple function, pamie; pywin32 chapter from the O’Reilly book) or ctypes (example). This kind of thing may also come in useful on Windows for cases where the automation API is lacking. For Firefox, there is PyXPCOM.

    • Get ambitious and automatically delegate the work to an appropriate interpreter (Mozilla’s JavaScript interpreter, for instance). This is what HtmlUnit and httpunit do. I did a spike along these lines some years ago, but I think it would (still) be quite a lot of work to do well.
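To make the first of those options concrete, here is a hypothetical sketch: suppose the page’s JavaScript fills a hidden token control with the hex MD5 digest of the username before submitting. The field name, helper function, and digest scheme are all invented for illustration — you would substitute whatever logic the real script performs:

```python
import hashlib

def token_for(username):
    # Hypothetical: replicate page JavaScript that sets a hidden
    # "token" field to the hex MD5 digest of the username.
    return hashlib.md5(username.encode("utf-8")).hexdigest()

# With a mechanize HTMLForm you would then write, for example:
#   form.find_control("token").readonly = False
#   form["token"] = token_for("alice")
print(token_for("alice"))
```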

  • Misc links

  • Will any of this code make its way into the Python standard library?

    The request / response processing extensions to urllib2 from mechanize have been merged into urllib2 for Python 2.4. The cookie processing has been added, as module cookielib. There are other features that would be appropriate additions to urllib2, but since Python 2 is heading into bugfix-only mode, and I’m not using Python 3, they’re unlikely to be added.

  • Where can I find out about the relevant standards?

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, October 2010.


mechanize-0.2.5/docs/html/support.html

mechanize — Support

Documentation

See links at right. Start here.

Bug tracker

The bug tracker is here on github. It’s equally acceptable to file bugs on the tracker or post about them to the mailing list.

Contact

There is a mailing list.

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, April 2010.


mechanize-0.2.5/docs/html/hints.html

mechanize — Hints

Hints for debugging programs that use mechanize.

Cookies

A common mistake is to use mechanize.urlopen() and then also call the .extract_cookies() and .add_cookie_header() methods on a cookie object yourself. If you use mechanize.urlopen() (or OpenerDirector.open()), the module handles extraction and adding of cookies by itself, so you should not call .extract_cookies() or .add_cookie_header().

Are you sure the server is sending you any cookies in the first place? Maybe the server is keeping track of state in some other way (HIDDEN HTML form entries (possibly in a separate page referenced by a frame), URL-encoded session keys, IP address, HTTP Referer headers)? Perhaps some embedded script in the HTML is setting cookies (see below)? Turn on logging.

When you .save() to or .load()/.revert() from a file, single-session cookies will expire unless you explicitly request otherwise with the ignore_discard argument. This may be your problem if you find cookies are going away after saving and loading.

import mechanize
cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)
r = mechanize.urlopen("http://foobar.com/")
cj.save("/some/file", ignore_discard=True, ignore_expires=True)
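Loading saved cookies back follows the same pattern. The sketch below uses the standard library’s LWPCookieJar (cookielib in Python 2, http.cookiejar in Python 3), which mechanize’s LWPCookieJar mirrors; the temp-file path is just for the example:

```python
import os
import tempfile

try:
    from http.cookiejar import LWPCookieJar  # Python 3
except ImportError:
    from cookielib import LWPCookieJar       # Python 2

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

cj = LWPCookieJar()
cj.save(path, ignore_discard=True, ignore_expires=True)

# Without ignore_discard=True, session-only cookies would be dropped here.
cj2 = LWPCookieJar()
cj2.load(path, ignore_discard=True, ignore_expires=True)
```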

JavaScript code can set cookies; mechanize does not support this. See the FAQ.

General

Enable logging.

Sometimes, a server wants particular HTTP headers set to the values it expects. For example, the User-Agent header may need to be set to a value like that of a popular browser.

Check that the browser is able to do manually what you’re trying to achieve programmatically. Make sure that what you do manually is exactly the same as what you’re trying to do from Python — you may simply be hitting a server bug that only gets revealed if you view pages in a particular order, for example.

Try comparing the headers and data that your program sends with those that a browser sends. Often this will give you the clue you need. There are browser addons available that allow you to see what the browser sends and receives even if HTTPS is in use.

If nothing is obviously wrong with the requests your program is sending and you’re out of ideas, you can reliably locate the problem by copying the headers that a browser sends, and then changing headers until your program stops working again. Temporarily switch to explicitly sending individual HTTP headers (by calling .add_header(), or by using httplib directly). Start by sending exactly the headers that Firefox or IE send. You may need to make sure that a valid session ID is sent — the one you got from your browser may no longer be valid. If that works, you can begin the tedious process of changing your headers and data until they match what your original code was sending. You should end up with a minimal set of changes. If you think that reveals a bug in mechanize, please report it.
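When bisecting headers like this, it helps to construct requests whose headers you control completely. Here is a minimal sketch using the standard library’s Request class (urllib2 in Python 2, urllib.request in Python 3), which mechanize’s Request is modelled on; the header values are placeholders, not a real browser fingerprint:

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request          # Python 2

req = Request("http://example.com/")
# Start from the exact headers your browser sent, then change or
# remove one header at a time until the request stops working.
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1) Firefox/3.6")
req.add_header("Accept-Language", "en-gb,en;q=0.5")

# Note: add_header() normalises the key, so look it up as "User-agent".
print(req.get_header("User-agent"))
```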

Logging

To enable logging to stdout:

import sys, logging
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

You can reduce the amount of information shown by setting the level to logging.INFO instead of logging.DEBUG, or by only enabling logging for one of the following logger names instead of "mechanize":

  • "mechanize": Everything.

  • "mechanize.cookies": Why particular cookies are accepted or rejected and why they are or are not returned. Requires logging enabled at the DEBUG level.

  • "mechanize.http_responses": HTTP response body data.

  • "mechanize.http_redirects": HTTP redirect information.
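Because these names form a standard logging hierarchy, you can attach a handler to just one child logger. For example, to see only the cookie accept/reject decisions:

```python
import logging
import sys

# Attach a handler to the "mechanize.cookies" logger only, so cookie
# decisions are shown without the rest of mechanize's DEBUG output.
logger = logging.getLogger("mechanize.cookies")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)
```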

HTTP headers

An example showing how to enable printing of HTTP headers to stdout, logging of HTTP response bodies, and logging of information about redirections:

import sys, logging
import mechanize

logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

browser = mechanize.Browser()
browser.set_debug_http(True)
browser.set_debug_responses(True)
browser.set_debug_redirects(True)
response = browser.open("http://python.org/")

Alternatively, you can examine request and response objects to see what’s going on. Note that requests may involve “sub-requests” in cases such as redirection, in which case you will not see everything that’s going on just by examining the original request and final response. It’s often useful to use the .get_data() method on responses during debugging.

Handlers

This section is not relevant if you use mechanize.Browser.

An example showing how to enable printing of HTTP headers to stdout, at the HTTPHandler level:

import mechanize
hh = mechanize.HTTPHandler() # you might want HTTPSHandler, too
hh.set_http_debuglevel(1)
opener = mechanize.build_opener(hh)
response = opener.open(url)

The following handlers are available:

NOTE: as well as having these handlers in your OpenerDirector (for example, by passing them to build_opener()) you have to turn on logging at the INFO level or lower in order to see any output.

HTTPRedirectDebugProcessor: logs information about redirections

HTTPResponseDebugProcessor: logs HTTP response bodies (including those that are read during redirections)

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, April 2010.


mechanize-0.2.5/docs/html/ChangeLog.txt

This isn't really in proper GNU ChangeLog format, it just happens to look that way.

2011-03-31 John J Lee
	* 0.2.5 release.
	* This is essentially a no-changes release to fix easy_install breakage caused by a SourceForge issue
	* Sourceforge is returning invalid HTTP responses, make download links point to PyPI instead
	* Include cookietest.cgi in source distribution
	* Note new IETF cookie standardisation effort

2010-10-28 John J Lee
	* 0.2.4 release.
	* Fix IndexError on empty Content-type header value. (GH-18)
	* Fall back to another encoding if an unknown one is declared. Fixes traceback on unknown encoding in Content-type header. (GH-30)

2010-10-16 John J Lee
	* 0.2.3 release.
	* Fix str(ParseError()) traceback. (GH-25)
	* Add equality methods to mechanize.Cookie . (GH-29)

2010-07-17 John J Lee
	* 0.2.2 release.
	* Officially support Python 2.7 (no changes were required)
	* Fix TypeError on .open()ing ftp: URL (only affects Python 2.4 and 2.5)
	* Don't include HTTPSHandler in __all__ if it's not available

2010-05-16 John J Lee
	* 0.2.1 release.
	* API change: Change argument order of HTTPRedirectHandler.redirect_request() to match urllib2.
	* Fix failure to use bundled BeautifulSoup for forms. (GH-15)
	* Fix default cookie path where request path has query containing / character. (http://bugs.python.org/issue3704)
	* Fix failure to raise on click for nonexistent label. (GH-16)
	* Documentation fixes.

2010-04-22 John J Lee
	* 0.2.0 release.
	* Behaviour change: merged upstream urllib2 change (allegedly a "bug fix") to return a response for all 2** HTTP responses (e.g. 206 Partial Content). Previously, only 200 caused a response object to be returned. All other HTTP response codes resulted in a response object being raised as an exception.
	* Behaviour change: Use of mechanize classes with `urllib2` (and vice-versa) is no longer supported.
	  However, existing classes implementing the urllib2 Handler interface are likely to work unchanged with mechanize. Removed RequestUpgradeProcessor, ResponseUpgradeProcessor, SeekableProcessor.
	* ClientForm has been merged into mechanize. This means that mechanize has no dependencies other than Python itself. The ClientForm API is still available -- to switch from ClientForm to mechanize, just s/ClientForm/mechanize in your source code, and ensure any use of the module logging logger named "ClientForm" is updated to use the new logger name "mechanize.forms". I probably won't do further standalone releases of ClientForm.
	* Stop monkey-patching Python stdlib.
	* Merge fixes from urllib2 trunk
	* Close file objects on .read() failure in .retrieve()
	* Fix a python 2.4 bug due to buggy urllib.splithost
	* Fix Python 2.4 syntax error in _firefox3cookiejar
	* Fix __init__.py typo that hid mechanize.seek_wrapped_response and mechanize.str2time. Fixes http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=465206
	* Fix an obvious bug with experimental firefox 3 cookiejar support. It's still experimental and not properly tested.
	* Change documentation to not require a .port attribute on request objects, since that's unused.
	* Doc fixes
	* Added mechanize.urljoin (RFC 3986 compliant function for joining a base URI with a URI reference)
	* Merge of ClientForm (see above).
	* Moved to git (from SVN) http://github.com/jjlee/mechanize
	* Created an issue tracker http://github.com/jjlee/mechanize/issues
	* Docs are now in markdown format (thanks John Gabriele).
	* Website rearranged. The old website has been archived at http://wwwsearch.sourceforge.net/old/ . The new website is essentially just the mechanize pages, rearranged and cleaned up a bit.
	* Source code rearranged for easier merging with upstream urllib2
	* Fully automated release process.
	* New test runner. Single test suite; tests create their own HTTP server fixtures (server fixtures are cached where possible for speed).
2009-02-07 John J Lee * 0.1.11 release. * Fix quadratic performance in number of .read() calls (and add an automated performance test). 2008-12-03 John J Lee * 0.1.10 release. * Add support for Python 2.6: Raise URLError on file: URL errors, not IOError (port of upstream urllib2 fix). Add support for Python 2.6's per-connection timeouts: Add timeout arguments to urlopen(), Request constructor, .open(), and .open_novisit(). * Drop support for Python 2.3 * Add Content-length header to Request object (httplib bug that prevented doing that was fixed in Python 2.4). There's no change is what is actually sent over the wire here, just in what headers get added to the Request object. * Fix AttributeError on .retrieve() with a Request (as opposed to URL string) argument * Don't change CookieJar state in .make_cookies(). * Fix AttributeError in case where .make_cookies() or .cookies_for_request() is called before other methods like .extract_cookies() or .make_cookie_header() * Fixes affecting version cookie-attribute (http://bugs.python.org/issue3924). * Silence module logging's "no handlers could be found for logger mechanize" warning in a way that doesn't clobber attempts to set log level sometimes * Don't use private attribute of request in request upgrade handler (what was I thinking??) * Don't call setup() on import of setup.py * Add new public function effective_request_host * Add .get_policy() method to CookieJar * Add method CookieJar.cookies_for_request() * Fix documented interface required of requests and responses (and add some tests for this!) * Allow either .is_unverifiable() or .unverifiable on request objects (preferring the former) * Note that there's a new functional test for digest auth, which fails when run against the sourceforge site (which is the default). It looks like this reflects the fact that digest auth has been fairly broken since it was introduced in urllib2. I don't plan to fix this myself. 2008-09-24 John J Lee * 0.1.9 release. 
* Fix ImportError if sqlite3 not available * Fix a couple of functional tests not to wait 5 seconds each 2008-09-13 John J Lee * 0.1.8 release. * Close sockets. This only affects Python 2.5 (and later) - earlier versions of Python were unaffected. See http://bugs.python.org/issue1627441 * Make title parsing follow Firefox behaviour wrt child elements (previously the behaviour differed between Factory and RobustFactory). * Fix BeautifulSoup RobustLinksFactory (hence RobustFactory) link text parsing for case of link text containing tags (Titus Brown) * Fix issue where more tags after caused default parser to raise an exception * Handle missing cookie max-age value. Previously, a warning was emitted in this case. * Fix thoroughly broken digest auth (still need functional test!) (trebor74hr@...) * Handle cookies containing embedded tabs in mozilla format files * Remove an assertion about mozilla format cookies file contents (raise LoadError instead) * Fix MechanizeRobotFileParser.set_opener() * Fix selection of global form using .select_form() (Titus Brown) * Log skipped Refreshes * Stop tests from clobbering files that happen to be lying around in cwd (!) * Use SO_REUSEADDR for local test server. * Raise exception if local test server fails to start. * Tests no longer (accidentally) depend on third-party coverage module * The usual docs and test fixes. * Add convenience method Browser.open_local_file(filename) * Add experimental support for Firefox 3 cookie jars ("cookies.sqlite"). Requires Python 2.5 * Fix a _gzip.py NameError (gzip support is experimental) 2007-05-31 John J Lee <jjl@pobox.com> * 0.1.7b release. * Sub-requests should not usually be visiting, so make it so. In fact the visible behaviour wasn't really broken here, since .back() skips over None responses (which is odd in itself, but won't be changed until after stable release is branched). However, this patch does change visible behaviour in that it creates new Request objects for sub-requests (e.g. 
basic auth retries) where previously we just mutated the existing Request object. * Changes to sort out abuse of by SeekableProcessor and ResponseUpgradeProcessor (latter shouldn't have been public in the first place) and resulting confusing / unclear / broken behaviour. Deprecate SeekableProcessor and ResponseUpgradeProcessor. Add SeekableResponseOpener. Remove SeekableProcessor and ResponseUpgradeProcessor from Browser. Move UserAgentBase.add_referer_header() to Browser (it was on by default, breaking UserAgent, and should never really have been there). * Fix HTTP proxy support: r29110 meant that Request.get_selector() didn't take into account the change to .__r_host (Thanks tgates@...). * Redirected robots.txt fetch no longer results in another attempted robots.txt fetch to check the redirection is allowed! * Fix exception raised by RFC 3986 implementation with urljoin(base, '/..') * Fix two multiple-response-wrapping bugs. * Add missing import in tests (caused failure on Windows). * Set svn:eol-style to native for all text files in SVN. * Add some tests for upgrade_response(). * Add a functional test for 302 + 404 case. * Add an -l option to run the functional tests against a local twisted.web2-based server (you need Twisted installed for this to work). This is much faster than running against wwwsearch.sourceforge.net * Add -u switch to skip unittests (and only run the doctests). 2007-01-07 John J Lee <jjl@pobox.com> * 0.1.6b release * Add mechanize.ParseError class, document it as part of the mechanize.Factory interface, and raise it from all Factory implementations. This is backwards-compatible, since the new exception derives from the old exceptions. * Bug fix: Truncation when there is no full .read() before navigating to the next page, and an old response is read after navigation. This happened e.g. with r = br.open(); r.readline(); br.open(url); r.read(); br.back() . 
* Bug fix: when .back() caused a reload, it was returning the old response, not the .reload()ed one. * Bug fix: .back() was not returning a copy of the response, which presumably would cause seek position problems. * Bug fix: base tag without href attribute would override document URL with a None value, causing a crash (thanks Nathan Eror). * Fix .set_response() to close current response first. * Fix non-idempotent behaviour of Factory.forms() / .links() . Previously, if for example you got a ParseError during execution of .forms(), you could call it again and have it not raise an exception, because it started out where it left off! * Add a missing copy.copy() to RobustFactory . * Fix redirection to 'URIs' that contain characters that are not allowed in URIs (thanks Riko Wichmann). Also, Request constructor now logs a module logging warning about any such bad URIs. * Add .global_form() method to Browser to support form controls whose HTML elements are not descendants of any FORM element. * Add a new method .visit_response() . This creates a new history entry from a response object, rather than just changing the current visited response. This is useful e.g. when you want to use Browser features in a handler. * Misc minor bug fixes. 2006-10-25 John J Lee <jjl@pobox.com> * 0.1.5b release: Update setuptools dependencies to depend on ClientForm>=0.2.5 (for an important bug fix affecting fragments in URLs). There are no other changes in this release -- this release was done purely so that people upgrading to the latest version of mechanize will get the latest ClientForm too. 2006-10-14 John J Lee <jjl@pobox.com> * 0.1.4b release: (skipped a version deliberately for obscure reasons) * Improved auth & proxies support. * Follow RFC 3986. * Add a .set_cookie() method to Browser . * Add Browser.open_novisit() and Request.visit to allow fetching files without affecting Browser state. * UserAgent and Browser are now subclasses of UserAgentBase. 
UserAgent's only role in life above what UserAgentBase does is to provide the .set_seekable_responses() method (it lives there because Browser depends on seekable responses, because that's how browser history is implemented). * Bundle BeautifulSoup 2.1.1. No more dependency pain! Note that BeautifulSoup is, and always was, optional, and that mechanize will eventually switch to BeautifulSoup version 3, at which point it may well stop bundling BeautifulSoup. Note also that the module is only used internally, and is not available as a public attribute of the package. If you dare, you can import it ("from mechanize import _beautifulsoup"), but beware that it will go away later, and that the API of BeautifulSoup will change when the upgrade to 3 happens. Also, BeautifulSoup support (mainly RobustFactory) is still a little experimental and buggy. * Fix HTTP-EQUIV with no content attribute case (thanks Pratik Dam). * Fix bug with quoted META Refresh URL (thanks Nilton Volpato). * Fix crash with </base> tag (yajdbgr02@...). * Somebody found a server that (incorrectly) depends on HTTP header case, so follow the Title-Case convention. Note that the Request headers interface(s), which were (somewhat oddly -- this is an inheritance from urllib2 that should really be fixed in a better way than it is currently) always case-sensitive still are; the only thing that changed is what actually eventually gets sent over the wire. * Use mechanize (not urllib) to open robots.txt. Don't consult RobotFileParser instance about non-HTTP URLs. * Fix OpenerDirector.retrieve(), which was very broken (thanks Duncan Booth). * Crash in a much more obvious way if trying to use OpenerDirector after .close() . 
* .reload() on .back() if necessary (necessary iff response was not fully .read() on first .open()ing ) * Strip fragments before retrieving URLs (fixed Request.get_selector() to strip fragment) * Fix catching HTTPError subclasses while still preserving all their response behaviour * Correct over-enthusiastic documented guarantees of closeable_response . * Fix assumption that httplib.HTTPMessage treats dict-style __setitem__ as append rather than set (where on earth did I get that from?). * Expose History in mechanize/__init__.py (though interface is still experimental). * Lots of other "internals" bugs fixed (thanks to reports / patches from Benji York especially, also Titus Brown, Duncan Booth, and me ;-), where I'm not 100% sure exactly when they were introduced, so not listing them here in detail. * Numerous other minor fixes. * Some code cleanup. 2006-05-21 John J Lee <jjl@pobox.com> * 0.1.2b release: * mechanize now exports the whole urllib2 interface. * Pull in bugfixed auth/proxy support code from Python 2.5. * Bugfix: strip leading and trailing whitespace from link URLs * Fix .any_response() / .any_request() methods to have ordering. consistent with rest of handlers rather than coming before all of them. * Tell cookie-handling code about new TLDs. * Remove Browser.set_seekable_responses() (they always are anyway). * Show in web page examples how to munge responses and how to do proxy/auth. * Rename 0.1.* changes document 0.1.0-changes.txt --> 0.1-changes.txt. * In 0.1 changes document, note change of logger name from "ClientCookie" to "mechanize" * Add something about response objects to changes document * Improve Browser.__str__ * Accept regexp strings as well as regexp objects when finding links. * Add crappy gzip transfer encoding support. This is off by default and warns if you turn it on (hopefully will get better later :-). * A bit of internal cleanup following merge with pullparser / ClientCookie. 
2006-05-06 John J Lee <jjl@pobox.com> * 0.1.1a release: * Merge ClientCookie and pullparser with mechanize. * Response object fixes. * Remove accidental dependency on BeautifulSoup introduced in 0.1.0a (the BeautifulSoup support is still here, but BeautifulSoup is not required to use mechanize). 2006-05-03 John J Lee <jjl@pobox.com> * 0.1.0a release: * Stop trying to record precise dates in changelog, since that's silly ;-) * A fair number of interface changes: see 0.1.0-changes.txt. * Depend on recent ClientCookie with copy.copy()able response objects. * Don't do broken XHTML handling by default (need to review code before switching this back on, e.g. should use a real XML parser for first-try at parsing). To get the old behaviour, pass i_want_broken_xhtml_support=True to mechanize.DefaultFactory / .RobustFactory constructor. * Numerous small bug fixes. * Documentation & setup.py fixes. * Don't use cookielib, to avoid having to work around Python 2.4 RFC 2109 bug, and to avoid my braindead thread synchronisation code in cookielib :-((((( (I haven't encountered specific breakage due to latter, but since it's braindead I may as well avoid it). 2005-11-30 John J Lee <jjl@pobox.com> * Fixed setuptools support. * Release 0.0.11a. 2005-11-19 John J Lee <jjl@pobox.com> * Release 0.0.10a. 2005-11-17 John J Lee <jjl@pobox.com> * Fix set_handle_referer. 2005-11-12 John J Lee <jjl@pobox.com> * Fix history (Gary Poster). * Close responses on reload (Gary Poster). * Don't depend on SSL support (Gary Poster). 2005-10-31 John J Lee <jjl@pobox.com> * Add setuptools support. 2005-10-30 John J Lee <jjl@pobox.com> * Don't mask AttributeError exception messages from ClientForm. * Document intent of .links() vs. .get_links_iter(); Rename LinksFactory method. * Remove pullparser import dependency. * Remove Browser.urltags (now an argument to LinksFactory). * Document Browser constructor as taking keyword args only (and change positional arg spec). 
* Cleanup of lazy parsing (may fix bugs, not sure...). 2005-10-28 John J Lee <jjl@pobox.com> * Support ClientForm backwards_compat switch. 2005-08-28 John J Lee <jjl@pobox.com> * Apply optimisation patch (Stephan Richter). 2005-08-15 John J Lee <jjl@pobox.com> * Close responses (ie. close the file handles but leave response still .read()able &c., thanks to the response objects we're using) (aurel@nexedi.com). 2005-08-14 John J Lee <jjl@pobox.com> * Add missing argument to UserAgent's _add_referer_header stub. * Doc and comment improvements. 2005-06-28 John J Lee <jjl@pobox.com> * Allow specifying parser class for equiv handling. * Ensure correct default constructor args are passed to HTTPRefererProcessor. * Allow configuring details of Refresh handling. * Switch to tolerant parser. 2005-06-11 John J Lee <jjl@pobox.com> * Do .seek(0) after link parsing in a finally block. * Regard text/xhtml as HTML. * Fix 2.4-compatibility bugs. * Fix spelling of '_equiv' feature string. 2005-05-30 John J Lee <jjl@pobox.com> * Turn on Referer, Refresh and HTTP-Equiv handling by default. 2005-05-08 John J Lee <jjl@pobox.com> * Fix .reload() to not update history (thanks to Titus Brown). * Use cookielib where available 2005-03-01 John J Lee <jjl@pobox.com> * Fix referer bugs: Don't send URL fragments; Don't add in Referer header in redirected request unless original request had a Referer header. 2005-02-19 John J Lee <jjl@pobox.com> * Allow supplying own mechanize.FormsFactory, so eg. can use ClientForm.XHTMLFormParser. Also allow supplying own Request class, and use sensible defaults for this. Now depends on ClientForm 0.1.17. Side effect is that, since we use the correct Request class by default, there's (I hope) no need for using RequestUpgradeProcessor in Browser._add_referer_header() :-) 2005-01-30 John J Lee <jjl@pobox.com> * Released 0.0.9a. 2005-01-05 John J Lee <jjl@pobox.com> * Fix examples (scraped sites have changed). * Fix .set_*() method boolean arguments. 
* The .response attribute is now a method, .response() * Don't depend on BaseProcessor (no longer exists). 2004-05-18 John J Lee <jjl@pobox.com> * Released 0.0.8a: * Added robots.txt observance, controlled by * BASE element has attribute 'href', not 'uri'! (patch from Jochen Knuth) * Fixed several bugs in handling of Referer header. * Link.__eq__ now returns False instead of raising AttributeError on comparison with non-Link (patch from Jim Jewett) * Removed dependencies on HTTPS support in Python and on ClientCookie.HTTPRobotRulesProcessor 2004-01-18 John J Lee <jjl@pobox.com> * Added robots.txt observance, controlled by UserAgent.set_handle_robots(). This is now on by default. * Removed set_persistent_headers() method -- just use .addheaders, as in base class. 2004-01-09 John J Lee <jjl@pobox.com> * Removed unnecessary dependence on SSL support in Python. Thanks to Krzysztof Kowalczyk for bug report. * Released 0.0.7a. 2004-01-06 John J Lee <jjl@pobox.com> * Link instances may now be passed to .click_link() and .follow_link(). * Added a new example program, pypi.py. 2004-01-05 John J Lee <jjl@pobox.com> * Released 0.0.5a. * If <title> tag was missing, links and forms would not be parsed. Also, base element (giving base URI) was ignored. Now parse title lazily, and get base URI while parsing links. Also, fixed ClientForm to take note of base element. Thanks to Phillip J. Eby for bug report. * Released 0.0.6a. 2004-01-04 John J Lee <jjl@pobox.com> * Fixed _useragent._replace_handler() to update self.handlers correctly. * Updated required pullparser version check. * Visiting a URL now deselects form (sets self.form to None). * Only first Content-Type header is now checked by ._viewing_html(), if there are more than one. * Stopped using getheaders from ClientCookie -- don't need it, since depend on Python 2.2, which has .getheaders() method on responses. Improved comments. * .open() now resets .response to None. 
Also rearranged .open() a bit so instance remains in consistent state on failure. * .geturl() now checks for non-None .response, and raises Browser. * .back() now checks for non-None .response, and doesn't attempt to parse if it's None. * .reload() no longer adds new history item. * Documented tag argument to .find_link(). * Fixed a few places where non-keyword arguments for .find_link() were silently ignored. Now raises ValueError. 2004-01-02 John J Lee <jjl@pobox.com> * Use response_seek_wrapper instead of seek_wrapper, which broke use of reponses after they're closed. * (Fixed response_seek_wrapper in ClientCookie.) * Fixed adding of Referer header. Thanks to Per Cederqvist for bug report. * Released 0.0.4a. * Updated required ClientCookie version check. 2003-12-30 John J Lee <jjl@pobox.com> * Added support for character encodings (for matching link text). * Released 0.0.3a. 2003-12-28 John J Lee <jjl@pobox.com> * Attribute lookups are no longer forwarded to .response -- you have to do it explicitly. * Added .geturl() method, which just delegates to .response. * Big rehash of UserAgent, which was broken. Added a test. * Discovered that zip() doesn't raise an exception when its arguments are of different length, so several tests could pass when they should have failed. Fixed. * Fixed <A/> case in ._parse_html(). * Released 0.0.2a. 2003-12-27 John J Lee <jjl@pobox.com> * Added and improved docstrings. * Browser.form is now a public attribute. Also documented Browser's public attributes. * Added base_url and absolute_url attributes to Link. * Tidied up .open(). Relative URL Request objects are no longer converted to absolute URLs -- they should probably be absolute in the first place anyway. * Added proper Referer handling (the handler in ClientCookie is a hack that only covers a restricted case). * Added click_link method, for symmetry with .click() / .submit() methods (which latter apply to forms). 
  Of these methods, .click/.click_link() returns a request, and .submit/.follow_link() actually .open()s the request.
* Updated broken example code.

2003-12-24 John J Lee <jjl@pobox.com>

* Modified setup.py so can easily register with PyPI.

2003-12-22 John J Lee <jjl@pobox.com>

* Released 0.0.1a.

mechanize-0.2.5/docs/html/documentation.html0000644 0001750 0001750 00000017741 11545150741 017607 0 ustar john john

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <!--This file was generated by pandoc: do not edit--><head> <meta name="author" content="John J.
Lee <jjl@pobox.com>"> <meta name="date" content="2010-07-17"> <meta name="keywords" content="Python,HTML,HTTP,browser,stateful,web,client,client-side,mechanize,cookie,form,META,HTTP-EQUIV,Refresh,ClientForm,ClientCookie,pullparser,WWW::Mechanize"> <meta name="keywords" content="cookie,HTTP,Python,web,client,client-side,HTML,META,HTTP-EQUIV,Refresh"> <style type="text/css" media="screen">@import "../styles/style.css";</style> <!--breaks resizing text in IE6,7,8 (the lack of it also breaks baseline grid a bit in IE8 - can't win)--><!--[if !IE]>--><style type="text/css" media="screen">body{font-size:14px;}</style> <!--<![endif]--><!--max-width--><!--[if IE 6]><script type="text/javascript" src="../styles/ie6.js"></script><![endif]--><title>mechanize — Documentation

mechanize — Documentation

Full API documentation is in the docstrings and the documentation of urllib2. The documentation in these web pages is in need of reorganisation at the moment, after the merge of ClientCookie and ClientForm into mechanize.

Tests and examples

Examples

The front page has some introductory examples.

The examples directory in the source packages contains a couple of silly, but working, scripts to demonstrate basic use of the module.

See also the forms examples (these examples use the forms API independently of mechanize.Browser).

Tests

To run the tests:

python test.py

There are some tests that try to fetch URLs from the internet. To include those in the test run:

python test.py discover --tag internet

The urllib2 interface

mechanize exports the complete interface of urllib2. See the urllib2 documentation. For example:

import mechanize
response = mechanize.urlopen("http://www.example.com/")
print response.read()
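Because the complete urllib2-style interface is exported, any URL scheme with a registered handler can be opened, not just http:. A runnable sketch of the same pattern using the Python 3 stdlib analogue (urllib.request), opening a file: URL so that no network access is needed:

```python
import os
import tempfile
import urllib.request
from urllib.request import pathname2url

# Write a small local file, then fetch it back through the
# urllib2-style urlopen() interface via a file: URL.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w") as f:
    f.write("<html>hello</html>")

response = urllib.request.urlopen("file:" + pathname2url(path))
body = response.read()
```

The response object supports the usual .read(), .geturl() and .info() methods regardless of scheme.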

Compatibility

These notes explain the relationship between mechanize, ClientCookie, ClientForm, cookielib and urllib2, and which to use when. If you’re just using mechanize, and not any of those other libraries, you can ignore this section.

  1. mechanize works with Python 2.4, Python 2.5, Python 2.6, and Python 2.7.

  2. When using mechanize, anything you would normally import from urllib2 should be imported from mechanize instead.

  3. Use of mechanize classes with urllib2 (and vice-versa) is no longer supported. However, existing classes implementing the urllib2 Handler interface are likely to work unchanged with mechanize.

  4. mechanize now only imports urllib2.URLError and urllib2.HTTPError from urllib2. The rest is forked. I intend to merge fixes from Python trunk frequently.

  5. ClientForm is no longer maintained as a separate package. The code is now part of mechanize, and its interface is now exported through module mechanize (since mechanize 0.2.0). Old code can simply be changed to import mechanize as ClientForm and should continue to work.

  6. ClientCookie is no longer maintained as a separate package. The code is now part of mechanize, and its interface is now exported through module mechanize (since mechanize 0.1.0). Old code can simply be changed to import mechanize as ClientCookie and should continue to work.

  7. The cookie handling parts of mechanize are in Python 2.4 standard library as module cookielib and extensions to module urllib2. mechanize does not currently use cookielib, due to the presence of thread synchronisation code in cookielib that is not present in the mechanize fork of cookielib.

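The last point can be illustrated with the stdlib module itself (`cookielib`, renamed `http.cookiejar` in Python 3): a cookie jar extracts cookies from responses and adds them back to later requests. The `FakeResponse` class below is a minimal made-up stand-in for a real HTTP response, used only so the sketch runs offline:

```python
import email.message
import http.cookiejar   # "cookielib" in Python 2
import urllib.request

class FakeResponse:
    """Minimal stand-in for an HTTP response carrying a Set-Cookie header."""
    def info(self):
        msg = email.message.Message()
        msg["Set-Cookie"] = "session=abc123; Path=/"
        return msg

jar = http.cookiejar.CookieJar()
request = urllib.request.Request("http://example.com/")
jar.extract_cookies(FakeResponse(), request)   # store cookies from the response

followup = urllib.request.Request("http://example.com/page")
jar.add_cookie_header(followup)                # send them back on a later request
```

This extract/add cycle is what mechanize performs automatically on every navigation.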
API differences between mechanize and urllib2:

  1. mechanize provides additional features.

  2. mechanize.urlopen differs in its behaviour: it handles cookies, whereas urllib2.urlopen does not. To make a urlopen function with the urllib2 behaviour:

import mechanize
handler_classes = [mechanize.ProxyHandler,
                   mechanize.UnknownHandler,
                   mechanize.HTTPHandler,
                   mechanize.HTTPDefaultErrorHandler,
                   mechanize.HTTPRedirectHandler,
                   mechanize.FTPHandler,
                   mechanize.FileHandler,
                   mechanize.HTTPErrorProcessor]
opener = mechanize.OpenerDirector()
for handler_class in handler_classes:
    opener.add_handler(handler_class())
urlopen = opener.open
  3. Since Python 2.6, urllib2 uses a .timeout attribute on Request objects internally. However, urllib2.Request has no timeout constructor argument, and urllib2.urlopen() ignores this parameter. mechanize.Request has a timeout constructor argument which is used to set the attribute of the same name, and mechanize.urlopen() does not ignore the timeout attribute.
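For reference, the Handler interface referred to in the compatibility notes is duck-typed: an OpenerDirector calls optional <protocol>_request / <protocol>_response / <protocol>_open methods on each handler it holds, which is why third-party handlers usually port without change. A sketch using the Python 3 stdlib analogue (urllib.request) of the interface mechanize forked; the AddHeaderHandler name and X-example header are made up for illustration:

```python
import urllib.request

class AddHeaderHandler(urllib.request.BaseHandler):
    # http_request is a pre-processing hook: the opener calls it on every
    # HTTP request before the request is sent.
    def http_request(self, request):
        request.add_header("X-example", "1")
        return request

# The handler plugs into an opener alongside the default handlers.
opener = urllib.request.build_opener(AddHeaderHandler())

# Exercise the hook directly (no network needed):
request = AddHeaderHandler().http_request(
    urllib.request.Request("http://example.com/"))
```

The same duck-typed methods are what mechanize's OpenerDirector calls on its handlers.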

UserAgent vs UserAgentBase

mechanize.UserAgent is a trivial subclass of mechanize.UserAgentBase, adding just one method, .set_seekable_responses() (see the documentation on seekable responses).

The reason for the extra class is that mechanize.Browser depends on seekable response objects (because response objects are used to implement the browser history).
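The idea behind seekable responses can be sketched in a few lines. This is an illustration only, not mechanize's actual response_seek_wrapper: buffer a forward-only file-like object so that it can be re-read later, which is what a browser history requires.

```python
import io

class SeekableResponse:
    """Toy sketch: make a forward-only file-like object re-readable."""

    def __init__(self, fileobj):
        self._buf = io.BytesIO(fileobj.read())

    def read(self, size=-1):
        return self._buf.read(size)

    def seek(self, pos, whence=0):
        return self._buf.seek(pos, whence)

    def get_data(self):
        # like .seek(0) followed by .read(), but preserves the position
        pos = self._buf.tell()
        try:
            self._buf.seek(0)
            return self._buf.read()
        finally:
            self._buf.seek(pos)

response = SeekableResponse(io.BytesIO(b"<html>hello</html>"))
first = response.read()
again = response.get_data()  # still available after the body was consumed
```

A history mechanism can hand back such an object long after the underlying socket is closed.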

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, July 2010.


mechanize-0.2.5/docs/hints.txt0000644000175000017500000001302111545150644014757 0ustar johnjohn

% mechanize -- Hints

Hints for debugging programs that use mechanize.

Cookies
-------

A common mistake is to use `mechanize.urlopen()`, *and* the `.extract_cookies()` and `.add_cookie_header()` methods on a cookie object themselves. If you use `mechanize.urlopen()` (or `OpenerDirector.open()`), the module handles extraction and adding of cookies by itself, so you should not call `.extract_cookies()` or `.add_cookie_header()`.

Are you sure the server is sending you any cookies in the first place? Maybe the server is keeping track of state in some other way (`HIDDEN` HTML form entries (possibly in a separate page referenced by a frame), URL-encoded session keys, IP address, HTTP `Referer` headers)? Perhaps some embedded script in the HTML is setting cookies (see below)? Turn on [logging](#logging).

When you `.save()` to or `.load()`/`.revert()` from a file, single-session cookies will expire unless you explicitly request otherwise with the `ignore_discard` argument. This may be your problem if you find cookies are going away after saving and loading.

~~~~{.python}
import mechanize
cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)
r = mechanize.urlopen("http://foobar.com/")
cj.save("/some/file", ignore_discard=True, ignore_expires=True)
~~~~

JavaScript code can set cookies; mechanize does not support this. See [the FAQ](faq.html#script).

General
-------

Enable [logging](#logging).

Sometimes, a server wants particular HTTP headers set to the values it expects. For example, the `User-Agent` header may need to be [set](./doc.html#headers) to a value like that of a popular browser.

Check that the browser is able to do manually what you're trying to achieve programmatically.
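The single-session-cookie pitfall described in the Cookies section can be demonstrated offline with the standard library's `LWPCookieJar` (module `cookielib` in Python 2, `http.cookiejar` in Python 3); mechanize's jar accepts the same `ignore_discard` argument. A sketch, constructing a session cookie by hand rather than fetching one from a server:

```python
import os
import tempfile
import http.cookiejar  # "cookielib" in Python 2

# A session cookie: discard=True and no expiry time.
session_cookie = http.cookiejar.Cookie(
    version=0, name="session", value="abc", port=None, port_specified=False,
    domain="example.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={})

jar = http.cookiejar.LWPCookieJar()
jar.set_cookie(session_cookie)

path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")
jar.save(path)                      # default: session cookies are dropped
reloaded = http.cookiejar.LWPCookieJar()
reloaded.load(path)
assert len(reloaded) == 0           # the cookie "went away"

jar.save(path, ignore_discard=True)
reloaded.load(path, ignore_discard=True)
assert len(reloaded) == 1           # kept when explicitly requested
```

This is exactly why the example above passes ignore_discard=True to .save().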
Make sure that what you do manually is *exactly* the same as what you're trying to do from Python -- you may simply be hitting a server bug that only gets revealed if you view pages in a particular order, for example.

Try comparing the headers and data that your program sends with those that a browser sends. Often this will give you the clue you need. There are [browser addons](faq.html#sniffing) available that allow you to see what the browser sends and receives even if HTTPS is in use.

If nothing is obviously wrong with the requests your program is sending and you're out of ideas, you can reliably locate the problem by copying the headers that a browser sends, and then changing headers until your program stops working again. Temporarily switch to explicitly sending individual HTTP headers (by calling `.add_header()`, or by using `httplib` directly). Start by sending exactly the headers that Firefox or IE send. You may need to make sure that a valid session ID is sent -- the one you got from your browser may no longer be valid. If that works, you can begin the tedious process of changing your headers and data until they match what your original code was sending. You should end up with a minimal set of changes. If you think that reveals a bug in mechanize, please [report it](support.html).

Logging
-------

To enable logging to stdout:

~~~~{.python}
import sys, logging
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)
~~~~

You can reduce the amount of information shown by setting the level to `logging.INFO` instead of `logging.DEBUG`, or by only enabling logging for one of the following logger names instead of `"mechanize"`:

* `"mechanize"`: Everything.
* `"mechanize.cookies"`: Why particular cookies are accepted or rejected and why they are or are not returned. Requires logging enabled at the `DEBUG` level.
* `"mechanize.http_responses"`: HTTP response body data.
* `"mechanize.http_redirects"`: HTTP redirect information.

HTTP headers
------------

An example showing how to enable printing of HTTP headers to stdout, logging of HTTP response bodies, and logging of information about redirections:

~~~~{.python}
import sys, logging
import mechanize

logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)

browser = mechanize.Browser()
browser.set_debug_http(True)
browser.set_debug_responses(True)
browser.set_debug_redirects(True)
response = browser.open("http://python.org/")
~~~~

Alternatively, you can examine request and response objects to see what's going on. Note that requests may involve "sub-requests" in cases such as redirection, in which case you will not see everything that's going on just by examining the original request and final response. It's often useful to [use the `.get_data()` method](./doc.html#seekable-responses) on responses during debugging.

### Handlers ###

**This section is not relevant if you use `mechanize.Browser`.**

An example showing how to enable printing of HTTP headers to stdout, at the `HTTPHandler` level:

~~~~{.python}
import mechanize
hh = mechanize.HTTPHandler()  # you might want HTTPSHandler, too
hh.set_http_debuglevel(1)
opener = mechanize.build_opener(hh)
response = opener.open(url)
~~~~

The following handlers are available:

**NOTE**: as well as having these handlers in your `OpenerDirector` (for example, by passing them to `build_opener()`) you have to [turn on logging](#logging) at the `INFO` level or lower in order to see any output.
`HTTPRedirectDebugProcessor`: logs information about redirections `HTTPResponseDebugProcessor`: logs HTTP response bodies (including those that are read during redirections) mechanize-0.2.5/docs/styles/0000755000175000017500000000000011545173600014414 5ustar johnjohnmechanize-0.2.5/docs/styles/maxwidth.css0000644000175000017500000000051111356346234016755 0ustar johnjohn/* min-/max-width work-alike */ #content { padding:10px; width: expression(document.documentElement.clientWidth < 398 ? "400px" : document.documentElement.clientWidth > 752 ? "750px" : "auto");/*novalidate*/ margin: 0 50px; padding-left: 40px; background-color:#FFF; /* background: #fff url('/images/gridbg.gif');*/ } mechanize-0.2.5/docs/styles/ie6.js0000644000175000017500000000053411356346234015444 0ustar johnjohnfunction add_style_element(relative_ref) { var head = document.getElementsByTagName("head")[0]; var css = document.createElement("link"); css.type = "text/css"; css.rel = "stylesheet"; css.href = relative_ref; css.media = "screen"; head.appendChild(css); } /* enable max-width workaround */ add_style_element("/styles/maxwidth.css"); mechanize-0.2.5/docs/styles/style.css0000644000175000017500000000757211362420717016302 0ustar johnjohnbody { /* for IE6; text size for non-IE browsers is in .html files */ font-size:87.5%; } body,div,dl,dt,dd,ul,ol,li,h1,h2,h3,h4,h5,h6,pre,form,fieldset,p,blockquote,th,td { margin:0; padding:0; } #sf a img { float:right; border:none; } #content { padding-top:18px; padding-bottom:18px; padding-left:10px; padding-right:10px; max-width:750px; min-width:400px; margin:0 50px; padding-left:40px; /* there's a half-abandoned attempt to stick to baseline grid here: seems too easy for that to get messed up by font-weight variations &c.; due to choice of 14/21 px grid? 
*/ /* background:#fff url('../images/gridbg.gif'); */ } #main { clear:both; } #nav { float:right; display:inline; padding-top:1.429em; padding-right:1.5em; list-style-type:none; padding-bottom:0px; border-bottom:1px solid black; } #nav li { float:left; /* display:inline; */ _height:0;/*novalidate*/ margin-left:0.6em; padding-bottom:0; padding-top:0; } #nav a, #nav .thispage { padding:0 11px 0 11px; /* display:inline; */ font:bold 1.143em verdana, arial, helvetica, sans-serif; line-height:1.3125em; margin:1.3125em 0; overflow:auto; } #subnav li, #TOC li { list-style-type:none; margin-left:10px; } ol { font-weight:bold; font-style:italic; font-family:verdana, arial, helvetica, sans-serif; } ol p { font-weight:normal; font-style:normal; font-family:times, serif; } li p { margin:0; } p.q { font-weight:bold; font-style:italic; font-family:verdana, arial, helvetica, sans-serif; margin-top:1.5em; } #subnav a, #subnav .thispage, #TOC a, #TOC .thispage { font-family:lucida, sans-serif; font-weight:600; } #subnav, #TOC { float:right; clear:right; width:162px; padding:10px 20px; background-color:#eee; margin-left:3em; margin-bottom:3em; border-left:thick solid #2d4706; } #TOC { border-left:none; } h1 { padding-left:10px; padding-top:20px; font:2em verdana, arial, helvetica, sans-serif; line-height:1.5em; margin:1.5em 0; color:#c4c496; } h2 { font:bold 1.143em verdana, arial, helvetica, sans-serif; line-height:1.313em; margin:1.313em 0; } h3 { font:bold 1em verdana, arial, helvetica, sans-serif; line-height:1.5em; margin:1.5em 0; font-style:italic; } p { font:normal 1em times, serif; line-height:1.5em; margin:1.5em 0; } pre { font-family:"Courier New", Courier, monospace; line-height:1.5em; margin-top:1.5em; margin-bottom:1.5em; margin-left:10px; } code { font-family:"Courier New", Courier, monospace; line-height:1.3em; /* Avoid breaking baseline grid in firefox :-( */ margin:1.5em 0; } ul { border-bottom:1.5em; } li { margin-left:2em; } dt { line-height:1.5em; margin:1.5em 0; 
} dd { line-height:1.5em; margin-top:1.5em; margin-bottom:1.5em; margin-left:2em; } .expanded li { margin-left:2em; margin-top:1.5em; margin-bottom:1.5em; } .expanded li li { margin-left:2em; margin-top:0; margin-bottom:0; } a { color:#2d4706; } a, .thispage { text-decoration:none; font-weight:bold; } a:link { color:#2d4706; } a:visited { color:#41680d; } a:hover { color:#000; } .warning { background-color:#ffffaa; } .docwarning { background-color:#f3ecd2; } pre.sourceCode { } pre.sourceCode span.Normal { } pre.sourceCode span.Keyword { color: #007020; font-weight: bold; } pre.sourceCode span.DataType { color: #902000; } pre.sourceCode span.DecVal { color: #40a070; } pre.sourceCode span.BaseN { color: #40a070; } pre.sourceCode span.Float { color: #40a070; } pre.sourceCode span.Char { color: #4070a0; } pre.sourceCode span.String { color: #4070a0; } pre.sourceCode span.Comment { color: #60a0b0; font-style: italic; } pre.sourceCode span.Others { color: #007020; } pre.sourceCode span.Alert { color: red; font-weight: bold; } pre.sourceCode span.Function { color: #06287e; } pre.sourceCode span.RegionMarker { } pre.sourceCode span.Error { color: red; font-weight: bold; } mechanize-0.2.5/docs/index.txt0000644000175000017500000001346511545150644014755 0ustar johnjohn% mechanize Stateful programmatic web browsing in Python, after Andy Lester's Perl module [`WWW::Mechanize`](http://search.cpan.org/dist/WWW-Mechanize/). * `mechanize.Browser` and `mechanize.UserAgentBase` implement the interface of `urllib2.OpenerDirector`, so: * any URL can be opened, not just `http:` * `mechanize.UserAgentBase` offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and `robots.txt` handling, without having to make a new `OpenerDirector` each time, e.g. by calling `build_opener()`. * Easy HTML form filling. * Convenient link parsing and following. * Browser history (`.back()` and `.reload()` methods). 
* The `Referer` HTTP header is added properly (optional).
* Automatic observance of [`robots.txt`](http://www.robotstxt.org/wc/norobots.html).
* Automatic handling of HTTP-Equiv and Refresh.

Examples
--------

The examples below are written for a website that does not exist (`example.com`), so cannot be run. There are also some [working examples](documentation.html#examples) that you can run.

~~~~{.python}
import re
import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info()  # headers
print response1.read()  # body

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
# Submit current form.  Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()

# print currently selected form (don't call .submit() on this, use br.submit())
print br.form

response3 = br.back()  # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though it was .close()d
response3.get_data()  # like .seek(0) followed by .read()
response4 = br.reload()  # fetches from server

for form in br.forms():
    print form
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
~~~~

You may control the browser's policy by using the methods of `mechanize.Browser`'s base class, `mechanize.UserAgentBase`. For example:

~~~~{.python}
br = mechanize.Browser()
# Explicitly configure proxies (Browser will attempt to set good defaults).
# Note the userinfo ("joe:password@") and port number (":3128") are optional.
br.set_proxies({"http": "joe:password@myproxy.example.com:3128",
                "ftp": "proxy.example.com",
                })
# Add HTTP Basic/Digest auth username and password for HTTP proxy access.
# (equivalent to using "joe:password@..." form above)
br.add_proxy_password("joe", "password")
# Add HTTP Basic/Digest auth username and password for website access.
br.add_password("http://example.com/protected/", "joe", "password")

# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)

# Ignore robots.txt.  Do not do this without thought and consideration.
br.set_handle_robots(False)

# Don't add Referer (sic) header
br.set_handle_referer(False)

# Don't handle Refresh redirections
br.set_handle_refresh(False)

# Don't handle cookies
br.set_cookiejar()

# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
br.set_cookiejar(cj)

# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)

# To make sure you're seeing all debug output:
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

# Sometimes it's useful to process bad headers or bad HTML:
response = br.response()  # this is a copy of response
headers = response.info()  # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("
~~~~

mechanize-0.2.5/docs/support.txt0000644000175000017500000000074111545150644015353 0ustar johnjohn

% mechanize -- Support

Documentation
-------------

See links at right. [Start here](documentation.html).
Bug tracker
-----------

The bug tracker is [here on github](http://github.com/jjlee/mechanize/issues). It's equally acceptable to file bugs on the tracker or post about them to the mailing list.

Contact
-------

There is a [mailing list](http://lists.sourceforge.net/lists/listinfo/wwwsearch-general).

mechanize-0.2.5/docs/documentation.txt0000644000175000017500000001111011545150644016500 0ustar johnjohn

% mechanize -- Documentation

Full API documentation is in the docstrings and the documentation of [`urllib2`](http://docs.python.org/release/2.6/library/urllib2.html). The documentation in these web pages is in need of reorganisation at the moment, after the merge of ClientCookie and ClientForm into mechanize.

Tests and examples
------------------

### Examples ###

The [front page](./) has some introductory examples.

The `examples` directory in the source packages contains a couple of silly, but working, scripts to demonstrate basic use of the module.

See also the [forms examples](./forms.html) (these examples use the forms API independently of `mechanize.Browser`).

### Tests ###

To run the tests:

    python test.py

There are some tests that try to fetch URLs from the internet. To include those in the test run:

    python test.py discover --tag internet

The `urllib2` interface
-----------------------

mechanize exports the complete interface of `urllib2`. See the [`urllib2` documentation](http://docs.python.org/release/2.6/library/urllib2.html). For example:

~~~~{.python}
import mechanize
response = mechanize.urlopen("http://www.example.com/")
print response.read()
~~~~

Compatibility
-------------

These notes explain the relationship between mechanize, ClientCookie, ClientForm, `cookielib` and `urllib2`, and which to use when. If you're just using mechanize, and not any of those other libraries, you can ignore this section.

#. mechanize works with Python 2.4, Python 2.5, Python 2.6, and Python 2.7.

#. When using mechanize, anything you would normally import from `urllib2` should be imported from `mechanize` instead.

#. Use of mechanize classes with `urllib2` (and vice-versa) is no longer supported. However, existing classes implementing the `urllib2 Handler` interface are likely to work unchanged with mechanize.

#. mechanize now only imports `urllib2.URLError` and `urllib2.HTTPError` from `urllib2`. The rest is forked. I intend to merge fixes from Python trunk frequently.

#. ClientForm is no longer maintained as a separate package. The code is now part of mechanize, and its interface is now exported through module mechanize (since mechanize 0.2.0). Old code can simply be changed to `import mechanize as ClientForm` and should continue to work.

#. ClientCookie is no longer maintained as a separate package. The code is now part of mechanize, and its interface is now exported through module mechanize (since mechanize 0.1.0). Old code can simply be changed to `import mechanize as ClientCookie` and should continue to work.

#. The cookie handling parts of mechanize are in Python 2.4 standard library as module `cookielib` and extensions to module `urllib2`. mechanize does not currently use `cookielib`, due to the presence of thread synchronisation code in `cookielib` that is not present in the mechanize fork of `cookielib`.

API differences between mechanize and `urllib2`:

#. mechanize provides additional features.

#. `mechanize.urlopen` differs in its behaviour: it handles cookies, whereas `urllib2.urlopen` does not. To make a `urlopen` function with the `urllib2` behaviour:

~~~~{.python}
import mechanize
handler_classes = [mechanize.ProxyHandler,
                   mechanize.UnknownHandler,
                   mechanize.HTTPHandler,
                   mechanize.HTTPDefaultErrorHandler,
                   mechanize.HTTPRedirectHandler,
                   mechanize.FTPHandler,
                   mechanize.FileHandler,
                   mechanize.HTTPErrorProcessor]
opener = mechanize.OpenerDirector()
for handler_class in handler_classes:
    opener.add_handler(handler_class())
urlopen = opener.open
~~~~

#. Since Python 2.6, `urllib2` uses a `.timeout` attribute on `Request` objects internally. However, `urllib2.Request` has no timeout constructor argument, and `urllib2.urlopen()` ignores this parameter. `mechanize.Request` has a `timeout` constructor argument which is used to set the attribute of the same name, and `mechanize.urlopen()` does not ignore the timeout attribute.

UserAgent vs UserAgentBase
--------------------------

`mechanize.UserAgent` is a trivial subclass of `mechanize.UserAgentBase`, adding just one method, `.set_seekable_responses()` (see the [documentation on seekable responses](./doc.html#seekable-responses)).

The reason for the extra class is that `mechanize.Browser` depends on seekable response objects (because response objects are used to implement the browser history).

mechanize-0.2.5/docs/forms.txt0000644000175000017500000003051111545150741014761 0ustar johnjohn

% mechanize -- Forms

This documentation is in need of reorganisation! This page is the old ClientForm documentation. ClientForm is now part of mechanize, but the documentation hasn't been fully updated to reflect that: what's here is correct, but not well-integrated with the rest of the documentation.

This page deals with HTML form handling: parsing HTML forms, filling them in and returning the completed forms to the server. See the [front page](./) for how to obtain form objects from a `mechanize.Browser`.
Simple working example (`examples/forms/simple.py` in the source distribution): ~~~~{.python} import sys from mechanize import ParseResponse, urlopen, urljoin if len(sys.argv) == 1: uri = "http://wwwsearch.sourceforge.net/" else: uri = sys.argv[1] response = urlopen(urljoin(uri, "mechanize/example.html")) forms = ParseResponse(response, backwards_compat=False) form = forms[0] print form form["comments"] = "Thanks, Gisle" # form.click() returns a mechanize.Request object # (see HTMLForm.click.__doc__ if you want to use only the forms support, and # not the rest of mechanize) print urlopen(form.click()).read() ~~~~ A more complicated working example (from `examples/forms/example.py` in the source distribution): ~~~~{.python} import sys import mechanize if len(sys.argv) == 1: uri = "http://wwwsearch.sourceforge.net/" else: uri = sys.argv[1] request = mechanize.Request(mechanize.urljoin(uri, "mechanize/example.html")) response = mechanize.urlopen(request) forms = mechanize.ParseResponse(response, backwards_compat=False) response.close() ## f = open("example.html") ## forms = mechanize.ParseFile(f, "http://example.com/example.html", ## backwards_compat=False) ## f.close() form = forms[0] print form # very useful! # A 'control' is a graphical HTML form widget: a text entry box, a # dropdown 'select' list, a checkbox, etc. # Indexing allows setting and retrieval of control values original_text = form["comments"] # a string, NOT a Control instance form["comments"] = "Blah." # Controls that represent lists (checkbox, select and radio lists) are # ListControl instances. Their values are sequences of list item names. # They come in two flavours: single- and multiple-selection: form["favorite_cheese"] = ["brie"] # single form["cheeses"] = ["parmesan", "leicester", "cheddar"] # multi # equivalent, but more flexible: form.set_value(["parmesan", "leicester", "cheddar"], name="cheeses") # Add files to FILE controls with .add_file(). 
Only call this multiple # times if the server is expecting multiple files. # add a file, default value for MIME type, no filename sent to server form.add_file(open("data.dat")) # add a second file, explicitly giving MIME type, and telling the server # what the filename is form.add_file(open("data.txt"), "text/plain", "data.txt") # All Controls may be disabled (equivalent of greyed-out in browser)... control = form.find_control("comments") print control.disabled # ...or readonly print control.readonly # readonly and disabled attributes can be assigned to control.disabled = False # convenience method, used here to make all controls writable (unless # they're disabled): form.set_all_readonly(False) # A couple of notes about list controls and HTML: # 1. List controls correspond to either a single SELECT element, or # multiple INPUT elements. Items correspond to either OPTION or INPUT # elements. For example, this is a SELECT control, named "control1": # # and this is a CHECKBOX control, named "control2": # # # You know the latter is a single control because all the name attributes # are the same. # 2. Item names are the strings that go to make up the value that should # be returned to the server. These strings come from various different # pieces of text in the HTML. The HTML standard and the mechanize # docstrings explain in detail, but playing around with an HTML file, # ParseFile() and 'print form' is very useful to understand this! # You can get the Control instances from inside the form... control = form.find_control("cheeses", type="select") print control.name, control.value, control.type control.value = ["mascarpone", "curd"] # ...and the Item instances from inside the Control item = control.get("curd") print item.name, item.selected, item.id, item.attrs item.selected = False # Controls may be referred to by label: # find control with label that has a *substring* "Cheeses" # (e.g., a label "Please select a cheese" would match). 
control = form.find_control(label="select a cheese") # You can explicitly say that you're referring to a ListControl: # set value of "cheeses" ListControl form.set_value(["gouda"], name="cheeses", kind="list") # equivalent: form.find_control(name="cheeses", kind="list").value = ["gouda"] # the first example is also almost equivalent to the following (but # insists that the control be a ListControl -- so it will skip any # non-list controls that come before the control we want) form["cheeses"] = ["gouda"] # The kind argument can also take values "multilist", "singlelist", "text", # "clickable" and "file": # find first control that will accept text, and scribble in it form.set_value("rhubarb rhubarb", kind="text", nr=0) # find, and set the value of, the first single-selection list control form.set_value(["spam"], kind="singlelist", nr=0) # You can find controls with a general predicate function: def control_has_caerphilly(control): for item in control.items: if item.name == "caerphilly": return True form.find_control(kind="list", predicate=control_has_caerphilly) # HTMLForm.controls is a list of all controls in the form for control in form.controls: if control.value == "inquisition": sys.exit() # Control.items is a list of all Item instances in the control for item in form.find_control("cheeses").items: print item.name # To remove items from a list control, remove it from .items: cheeses = form.find_control("cheeses") curd = cheeses.get("curd") del cheeses.items[cheeses.items.index(curd)] # To add items to a list container, instantiate an Item with its control # and attributes: # Note that you are responsible for getting the attributes correct here, # and these are not quite identical to the original HTML, due to # defaulting rules and a few special attributes (e.g. Items that represent # OPTIONs have a special "contents" key in their .attrs dict). 
In future # there will be an explicitly supported way of using the parsing logic to # add items and controls from HTML strings without knowing these details. mechanize.Item(cheeses, {"contents": "mascarpone", "value": "mascarpone"}) # You can specify list items by label using set/get_value_by_label() and # the label argument of the .get() method. Sometimes labels are easier to # maintain than names, sometimes the other way around. form.set_value_by_label(["Mozzarella", "Caerphilly"], "cheeses") # Which items are present, selected, and successful? # is the "parmesan" item of the "cheeses" control successful (selected # and not disabled)? print "parmesan" in form["cheeses"] # is the "parmesan" item of the "cheeses" control selected? print "parmesan" in [ item.name for item in form.find_control("cheeses").items if item.selected] # does cheeses control have a "caerphilly" item? print "caerphilly" in [item.name for item in form.find_control("cheeses").items] # Sometimes one wants to set or clear individual items in a list, rather # than setting the whole .value: # select the item named "gorgonzola" in the first control named "cheeses" form.find_control("cheeses").get("gorgonzola").selected = True # You can be more specific: # deselect "edam" in third CHECKBOX control form.find_control(type="checkbox", nr=2).get("edam").selected = False # deselect item labelled "Mozzarella" in control with id "chz" form.find_control(id="chz").get(label="Mozzarella").selected = False # Often, a single checkbox (a CHECKBOX control with a single item) is # present. 
In that case, the name of the single item isn't of much # interest, so it's a good idea to check and uncheck the box without # using the item name: form.find_control("smelly").items[0].selected = True # check form.find_control("smelly").items[0].selected = False # uncheck # Items may be disabled (selecting or de-selecting a disabled item is # not allowed): control = form.find_control("cheeses") print control.get("emmenthal").disabled control.get("emmenthal").disabled = True # enable all items in control control.set_all_items_disabled(False) request2 = form.click() # mechanize.Request object try: response2 = mechanize.urlopen(request2) except mechanize.HTTPError, response2: pass print response2.geturl() # headers for name, value in response2.info().items(): if name != "date": print "%s: %s" % (name.title(), value) print response2.read() # body response2.close() ~~~~ All of the standard control types are supported: `TEXT`, `PASSWORD`, `HIDDEN`, `TEXTAREA`, `ISINDEX`, `RESET`, `BUTTON` (`INPUT TYPE=BUTTON` and the various `BUTTON` types), `SUBMIT`, `IMAGE`, `RADIO`, `CHECKBOX`, `SELECT`/`OPTION` and `FILE` (for file upload). Both standard form encodings (`application/x-www-form-urlencoded` and `multipart/form-data`) are supported. The module is designed for testing and automation of web interfaces, not for implementing interactive user agents. ***Security note*: Remember that any passwords you store in `HTMLForm` instances will be saved to disk in the clear if, for example, you [pickle](http://docs.python.org/library/pickle.html) them.** Parsers ------- There are two parsers. TODO: more! See also the FAQ entries on [XHTML](faq.html#xhtml) and [parsing bad HTML](./faq.html#parsing). Backwards-compatibility mode ---------------------------- mechanize (and ClientForm 0.2) includes three minor backwards-incompatible interface changes from ClientForm version 0.1. 
To make upgrading from ClientForm 0.1 easier, and to allow me to stop
supporting ClientForm 0.1 sooner, there is support for operating in a
backwards-compatible mode, under which code written for ClientForm 0.1 should
work without modification.  This is done on a per-`HTMLForm` basis via the
`.backwards_compat` attribute, but for convenience the `ParseResponse()` and
`ParseFile()` factory functions accept `backwards_compat` arguments.  These
backwards-compatibility features will be removed soon.  The default is to
operate in backwards-compatible mode.  To run with backwards-compatible mode
turned ***OFF*** (**strongly recommended**):

~~~~{.python}
from mechanize import ParseResponse, urlopen
forms = ParseResponse(urlopen("http://example.com/"), backwards_compat=False)
# ...
~~~~

The backwards-incompatible changes are:

* Ambiguous specification of controls or items now results in AmbiguityError.
  If you want the old behaviour, explicitly pass `nr=0` to indicate you want
  the first matching control or item.

* Item label matching is now done by substring, not by strict string-equality
  (but note leading and trailing space is always stripped).  (Control label
  matching is always done by substring.)

* Handling of disabled list items has changed.  First, note that handling of
  disabled list items in ClientForm 0.1 (and in ClientForm 0.2's
  backwards-compatibility mode!) is buggy: disabled items are successful
  (ie. disabled item names are sent back to the server).  As a result, there
  was no distinction to be made between successful items and selected items.
  In ClientForm 0.2, the bug is fixed, so this is no longer the case, and it
  is important to note that list controls' `.value` attribute contains only
  the *successful* item names; items that are *selected* but not successful
  (because disabled) are not included in `.value`.  Second, disabled list
  items may no longer be deselected: AttributeError is raised in ClientForm
  0.2, whereas deselection was allowed in ClientForm 0.1.
The bug in ClientForm 0.1 and in ClientForm 0.2's backwards-compatibility mode will not be fixed, to preserve compatibility and to encourage people to upgrade to the new ClientForm 0.2 `backwards_compat=False` behaviour. mechanize-0.2.5/docs/doc.txt0000644000175000017500000005140511545150644014407 0ustar johnjohn% mechanize -- Documentation This documentation is in need of reorganisation! This page is the old ClientCookie documentation. It deals with operation on the level of `urllib2 Handler` objects, and also with adding headers, debugging, and cookie handling. See the [front page](./) for more typical use. Examples -------- ~~~~{.python} import mechanize response = mechanize.urlopen("http://example.com/") ~~~~ This function behaves identically to `urllib2.urlopen()`, except that it deals with cookies automatically. Here is a more complicated example, involving `Request` objects (useful if you want to pass `Request`s around, add headers to them, etc.): ~~~~{.python} import mechanize request = mechanize.Request("http://example.com/") # note we're using the urlopen from mechanize, not urllib2 response = mechanize.urlopen(request) # let's say this next request requires a cookie that was set # in response request2 = mechanize.Request("http://example.com/spam.html") response2 = mechanize.urlopen(request2) print response2.geturl() print response2.info() # headers print response2.read() # body (readline and readlines work too) ~~~~ In these examples, the workings are hidden inside the `mechanize.urlopen()` function, which is an extension of `urllib2.urlopen()`. Redirects, proxies and cookies are handled automatically by this function (note that you may need a bit of configuration to get your proxies correctly set up: see `urllib2` documentation). There is also a `urlretrieve()` function, which works like `urllib.urlretrieve()`. An example at a slightly lower level shows how the module processes cookies more clearly: ~~~~{.python} # Don't copy this blindly! 
# You probably want to follow the examples above, not this one.
import mechanize

# Build an opener that *doesn't* automatically call .add_cookie_header()
# and .extract_cookies(), so we can do it manually without interference.
class NullCookieProcessor(mechanize.HTTPCookieProcessor):
    def http_request(self, request):
        return request
    def http_response(self, request, response):
        return response
opener = mechanize.build_opener(NullCookieProcessor)

request = mechanize.Request("http://example.com/")
response = mechanize.urlopen(request)
cj = mechanize.CookieJar()
cj.extract_cookies(response, request)
# let's say this next request requires a cookie that was set in response
request2 = mechanize.Request("http://example.com/spam.html")
cj.add_cookie_header(request2)
response2 = mechanize.urlopen(request2)
~~~~

The `CookieJar` class does all the work.  There are essentially two
operations: `.extract_cookies()` extracts HTTP cookies from `Set-Cookie` (the
original [Netscape cookie
standard](http://curl.haxx.se/rfc/cookie_spec.html)) and `Set-Cookie2` ([RFC
2965](http://www.ietf.org/rfc/rfc2965.txt)) headers from a response if and
only if they should be set given the request, and `.add_cookie_header()` adds
`Cookie` headers if and only if they are appropriate for a particular HTTP
request.

Incoming cookies are checked for acceptability based on the host name, etc.
Cookies are only set on outgoing requests if they match the request's host
name, path, etc.

**Note that if you're using `mechanize.urlopen()` (or if you're using
`mechanize.HTTPCookieProcessor` by some other means), you don't need to call
`.extract_cookies()` or `.add_cookie_header()` yourself**.  If, on the other
hand, you want to use mechanize to provide cookie handling for an HTTP client
other than mechanize itself, you will need to use this pair of methods.
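mechanize's `CookieJar` shares its design (and originally its code) with the standard library's cookie jar module -- `cookielib` on the Python 2 that this document targets, `http.cookiejar` on Python 3.  As a rough sketch of the two-operation dance just described, here is the same extract/add pair driven with the standard module (Python 3 names) and a hand-rolled response object; the `FakeResponse` class, cookie name and URLs are illustrative assumptions, not part of mechanize:

```python
import email.message
import http.cookiejar
import urllib.request

class FakeResponse:
    """Duck-typed response: extract_cookies() only needs .info() here."""
    def __init__(self, set_cookie_headers):
        self._msg = email.message.Message()
        for value in set_cookie_headers:
            self._msg["Set-Cookie"] = value  # duplicate headers are allowed
    def info(self):
        return self._msg

# pretend the server answered our first request with a Set-Cookie header
request = urllib.request.Request("http://example.com/")
response = FakeResponse(["sessionid=abc123; Path=/"])

cj = http.cookiejar.CookieJar()
cj.extract_cookies(response, request)   # store cookies the response set

# a second request to the same host gets a Cookie header added
request2 = urllib.request.Request("http://example.com/spam.html")
cj.add_cookie_header(request2)
print(request2.get_header("Cookie"))    # -> sessionid=abc123
```

The same duck typing works in the other direction: any request/response pair supporting these small interfaces can be fed to the jar, which is exactly how mechanize's version is documented to behave.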
You can make your own `request` and `response` objects, which must support
the interfaces described in the docstrings of `.extract_cookies()` and
`.add_cookie_header()`.

There are also some `CookieJar` subclasses which can store cookies in files
and databases.  `FileCookieJar` is the abstract class for `CookieJar`s that
can store cookies in disk files.  `LWPCookieJar` saves cookies in a format
compatible with the libwww-perl library.  This class is convenient if you
want to store cookies in a human-readable file:

~~~~{.python}
import mechanize
cj = mechanize.LWPCookieJar()
cj.revert("cookie3.txt")
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
r = opener.open("http://foobar.com/")
cj.save("cookie3.txt")
~~~~

The `.revert()` method discards all existing cookies held by the `CookieJar`
(it won't lose any existing cookies if the load fails).  The `.load()`
method, on the other hand, adds the loaded cookies to existing cookies held
in the `CookieJar` (old cookies are kept unless overwritten by newly loaded
ones).

`MozillaCookieJar` can load and save to the Mozilla/Netscape/lynx-compatible
`'cookies.txt'` format.  This format loses some information (unusual and
nonstandard cookie attributes such as comment, and also information specific
to RFC 2965 cookies).  The subclass `MSIECookieJar` can load (but not save)
from Microsoft Internet Explorer's cookie files on Windows.


Important note
--------------

Only use names you can import directly from the `mechanize` package, and that
don't start with a single underscore.  Everything else is subject to change
or disappearance without notice.


Cooperating with Browsers
-------------------------

**Firefox since version 3 persists cookies in an sqlite database, which is
not supported by MozillaCookieJar.**

The subclass `MozillaCookieJar` differs from `CookieJar` only in storing
cookies using a different, Firefox 2/Mozilla/Netscape-compatible, file format
known as "cookies.txt".  The lynx browser also uses this format.
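For orientation, the "cookies.txt" format is a plain-text file of one cookie per line, with tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry as a Unix timestamp, name, and value.  The entry below is a made-up example for illustration, not a file you should copy:

```
# Netscape HTTP Cookie File
.example.com	TRUE	/	FALSE	2145916800	sessionid	abc123
```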
This file format can't store RFC 2965 cookies, so they are downgraded to
Netscape cookies on saving.  `LWPCookieJar` itself uses a libwww-perl
specific format (\`Set-Cookie3') -- see the example above.  Python and your
browser should be able to share a cookies file (note that the file location
here will differ on non-unix OSes):

**WARNING:** you may want to back up your browser's cookies file if you use
`MozillaCookieJar` to save cookies.  I *think* it works, but there have been
bugs in the past!

~~~~{.python}
import os, mechanize
cookies = mechanize.MozillaCookieJar()
# note: no leading slash on the second argument -- os.path.join() discards
# everything before an absolute path component
cookies.load(os.path.join(os.environ["HOME"], ".netscape/cookies.txt"))
# see also the save and revert methods
~~~~

Note that cookies saved while Mozilla is running will get clobbered by
Mozilla -- see `MozillaCookieJar.__doc__`.

`MSIECookieJar` does the same for Microsoft Internet Explorer (MSIE) 5.x and
6.x on Windows, but does not allow saving cookies in this format.  In future,
the Windows API calls might be used to load and save (though the index has to
be read directly, since there is no API for that, AFAIK; there's also an
unfinished `MSIEDBCookieJar`, which uses (reads and writes) the Windows MSIE
cookie database directly, rather than storing copies of cookies as
`MSIECookieJar` does).

~~~~{.python}
import mechanize
cj = mechanize.MSIECookieJar(delayload=True)
cj.load_from_registry()  # finds cookie index file from registry
~~~~

A true `delayload` argument speeds things up.

On Windows 9x (win 95, win 98, win ME), you need to supply a username to the
`.load_from_registry()` method:

~~~~{.python}
cj.load_from_registry(username="jbloggs")
~~~~

Konqueror/Safari and Opera use different file formats, which aren't yet
supported.


Saving cookies in a file
------------------------

If you have no need to co-operate with a browser, the most convenient way to
save cookies on disk between sessions in human-readable form is to use
`LWPCookieJar`.  This class uses a libwww-perl specific format
(\`Set-Cookie3').
Unlike `MozillaCookieJar`, this file format doesn't lose information.


Supplying a CookieJar
---------------------

You might want to do this to [use your browser's
cookies](#cooperating-with-browsers), to customize `CookieJar`'s behaviour by
passing constructor arguments, or to be able to get at the cookies it will
hold (for example, for saving cookies between sessions and for debugging).

If you're using the higher-level `urllib2`-like interface (`urlopen()`, etc),
you'll have to let it know what `CookieJar` it should use:

~~~~{.python}
import mechanize
cookies = mechanize.CookieJar()
# build_opener() adds standard handlers (such as HTTPHandler and
# HTTPCookieProcessor) by default.  The cookie processor we supply
# will replace the default one.
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

r = opener.open("http://example.com/")  # GET
r = opener.open("http://example.com/", data)  # POST
~~~~

The `urlopen()` function uses a global `OpenerDirector` instance to do its
work, so if you want to use `urlopen()` with your own `CookieJar`, install
the `OpenerDirector` you built with `build_opener()` using the
`mechanize.install_opener()` function, then proceed as usual:

~~~~{.python}
mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")
~~~~

Of course, everyone using `urlopen` is using the same global `CookieJar`
instance!

You can set a policy object (must satisfy the interface defined by
`mechanize.CookiePolicy`), which determines which cookies are allowed to be
set and returned.  Use the `policy` argument to the `CookieJar` constructor,
or use the `.set_policy()` method.
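mechanize's `CookiePolicy` mirrors the policy interface of the standard library's cookie jar module (`cookielib` on Python 2, `http.cookiejar` on Python 3), so the shape of a custom policy can be sketched with the standard module.  The policy class, cookie names and `FakeResponse` stub below are invented for illustration; only the `set_ok(cookie, request)` hook is the point:

```python
import email.message
import http.cookiejar
import urllib.request

class BlockTrackersPolicy(http.cookiejar.DefaultCookiePolicy):
    """Refuse to store cookies whose names look like tracking IDs."""
    def set_ok(self, cookie, request):
        if cookie.name.startswith("track"):
            return False  # veto this cookie
        # defer everything else to the default acceptability checks
        return http.cookiejar.DefaultCookiePolicy.set_ok(self, cookie, request)

class FakeResponse:
    """Duck-typed response: just enough for extract_cookies()."""
    def __init__(self, set_cookie_headers):
        self._msg = email.message.Message()
        for value in set_cookie_headers:
            self._msg["Set-Cookie"] = value
    def info(self):
        return self._msg

jar = http.cookiejar.CookieJar(policy=BlockTrackersPolicy())
request = urllib.request.Request("http://example.com/")
response = FakeResponse(["sessionid=abc123; Path=/",
                         "tracker42=xyz; Path=/"])
jar.extract_cookies(response, request)
print(sorted(c.name for c in jar))  # only sessionid survives
```

A `return_ok()` override works the same way for the outgoing direction (deciding which stored cookies get sent back).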
The default implementation has some useful switches:

~~~~{.python}
from mechanize import CookieJar, DefaultCookiePolicy as Policy
cookies = CookieJar()
# turn on RFC 2965 cookies, be more strict about domains when setting and
# returning Netscape cookies, and block some domains from setting cookies
# or having them returned (read the DefaultCookiePolicy docstring for the
# domain matching rules here)
policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict,
                blocked_domains=["ads.net", ".ads.net"])
cookies.set_policy(policy)
~~~~


Additional Handlers
-------------------

The following handlers are provided in addition to those provided by
`urllib2`:

`HTTPRobotRulesProcessor`

:   WWW Robots (also called wanderers or spiders) are programs that traverse
    many pages in the World Wide Web by recursively retrieving linked pages.
    This kind of program can place significant loads on web servers, so there
    is a [standard](http://www.robotstxt.org/wc/norobots.html) for a
    `robots.txt` file by which web site operators can request robots to keep
    out of their site, or out of particular areas of it.  This handler uses
    the standard Python library's `robotparser` module.  It raises
    `mechanize.RobotExclusionError` (subclass of `mechanize.HTTPError`) if an
    attempt is made to open a URL prohibited by `robots.txt`.

`HTTPEquivProcessor`

:   The `<meta http-equiv>` tag is a way of including data in HTML to be
    treated as if it were part of the HTTP headers.  mechanize can
    automatically read these tags and add the `HTTP-EQUIV` headers to the
    response object's real HTTP headers.  The HTML is left unchanged.

`HTTPRefreshProcessor`

:   The `Refresh` HTTP header is a non-standard header which is widely used.
    It requests that the user-agent follow a URL after a specified time
    delay.  mechanize can treat these headers (which may have been set in
    `<meta http-equiv>` tags) as if they were 302 redirections.  Exactly when
    and how `Refresh` headers are handled is configurable using the
    constructor arguments.
`HTTPRefererProcessor`

:   The `Referer` HTTP header lets the server know which URL you've just
    visited.  Some servers use this header as state information, and don't
    like it if this is not present.  It's a chore to add this header by hand
    every time you make a request.  This adds it automatically.  **NOTE**:
    this only makes sense if you use each handler for a single chain of HTTP
    requests (so, for example, if you use a single HTTPRefererProcessor to
    fetch a series of URLs extracted from a single page, **this will
    break**).  [mechanize.Browser](../mechanize/) does this properly.

Example:

~~~~{.python}
import mechanize
cookies = mechanize.CookieJar()

opener = mechanize.build_opener(mechanize.HTTPRefererProcessor,
                                mechanize.HTTPEquivProcessor,
                                mechanize.HTTPRefreshProcessor,
                                )
opener.open("http://www.rhubarb.com/")
~~~~


Seekable responses
------------------

Response objects returned from (or raised as exceptions by)
`mechanize.SeekableResponseOpener`, `mechanize.UserAgent` (if
`.set_seekable_responses(True)` has been called) and `mechanize.Browser()`
have `.seek()`, `.get_data()` and `.set_data()` methods:

~~~~{.python}
import mechanize
opener = mechanize.OpenerFactory(
    mechanize.SeekableResponseOpener).build_opener()
response = opener.open("http://example.com/")
# same return value as .read(), but without affecting seek position
total_nr_bytes = len(response.get_data())
assert len(response.read()) == total_nr_bytes
assert len(response.read()) == 0  # we've already read the data
response.seek(0)
assert len(response.read()) == total_nr_bytes
response.set_data("blah\n")
assert response.get_data() == "blah\n"
...
~~~~

This caching behaviour can be avoided by using `mechanize.OpenerDirector`.
It can also be avoided with `mechanize.UserAgent`.  Note that
`HTTPEquivProcessor` and `HTTPResponseDebugProcessor` require seekable
responses and so are not compatible with `mechanize.OpenerDirector` and
`mechanize.UserAgent`.
~~~~{.python}
import mechanize
ua = mechanize.UserAgent()
ua.set_seekable_responses(False)
ua.set_handle_equiv(False)
ua.set_debug_responses(False)
~~~~

Note that if you turn on features that use seekable responses (currently:
HTTP-EQUIV handling and response body debug printing), returned responses
*may* be seekable as a side-effect of these features.  However, this is not
guaranteed (currently, in these cases, returned response objects are
seekable, but raised response objects -- `mechanize.HTTPError` instances --
are not seekable).  This applies regardless of whether you use
`mechanize.UserAgent` or `mechanize.OpenerDirector`.  If you explicitly
request seekable responses by calling `.set_seekable_responses(True)` on a
`mechanize.UserAgent` instance, or by using `mechanize.Browser` or
`mechanize.SeekableResponseOpener`, which always return seekable responses,
then both returned and raised responses are guaranteed to be seekable.

Handlers should call `response = mechanize.seek_wrapped_response(response)`
if they require the `.seek()`, `.get_data()` or `.set_data()` methods.


Request object lifetime
-----------------------

Note that handlers may create new `Request` instances (for example when
performing redirects) rather than adding headers to existing `Request`
objects.


Adding headers
--------------

Adding headers is done like so:

~~~~{.python}
import mechanize
req = mechanize.Request("http://foobar.com/")
req.add_header("Referer", "http://wwwsearch.sourceforge.net/mechanize/")
r = mechanize.urlopen(req)
~~~~

You can also use the `headers` argument to the `mechanize.Request`
constructor.

mechanize adds some headers to `Request` objects automatically -- see the
next section for details.


Automatically-added headers
---------------------------

`OpenerDirector` automatically adds a `User-Agent` header to every `Request`.
To change this and/or add similar headers, use your own `OpenerDirector`:

~~~~{.python}
import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]
~~~~

Again, to use `urlopen()`, install your `OpenerDirector` globally:

~~~~{.python}
mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")
~~~~

Also, a few standard headers (`Content-Length`, `Content-Type` and `Host`)
are added when the `Request` is passed to `urlopen()` (or
`OpenerDirector.open()`).  You shouldn't need to change these headers, but
since this is done by `AbstractHTTPHandler`, you can change the way it works
by passing a subclass of that handler to `build_opener()` (or, as always, by
constructing an opener yourself and calling `.add_handler()`).


Initiating unverifiable transactions
------------------------------------

This section is only of interest for correct handling of third-party HTTP
cookies.  See [below](#note-about-cookie-standards) for an explanation of
'third-party'.

First, some terminology.

An *unverifiable request* (defined fully by [RFC
2965](http://www.ietf.org/rfc/rfc2965.txt)) is one whose URL the user did not
have the option to approve.  For example, a transaction is unverifiable if
the request is for an image in an HTML document, and the user had no option
to approve the fetching of the image from a particular URL.

The *request-host of the origin transaction* (defined fully by RFC 2965) is
the host name or IP address of the original request that was initiated by the
user.  For example, if the request is for an image in an HTML document, this
is the request-host of the request for the page containing the image.

**mechanize knows that redirected transactions are unverifiable, and will
handle that on its own (ie.
you don't need to think about the origin request-host or verifiability
yourself).**

If you want to initiate an unverifiable transaction yourself (which you
should if, for example, you're downloading the images from a page, and 'the
user' hasn't explicitly OKed those URLs):

~~~~{.python}
# the URL here stands for an image embedded in a page the user requested
request = mechanize.Request("http://www.example.com/logo.gif",
                            origin_req_host="www.example.com",
                            unverifiable=True)
~~~~


RFC 2965 support
----------------

Support for the RFC 2965 protocol is switched off by default, because few
browsers implement it, so the RFC 2965 protocol is essentially never seen on
the internet.  To switch it on, see [here](#policy).


Parsing HTTP dates
------------------

A function named `str2time` is provided by the package, which may be useful
for parsing dates in HTTP headers.  `str2time` is intended to be liberal,
since HTTP date/time formats are poorly standardised in practice.  There is
no need to use this function in normal operations: `CookieJar` instances keep
track of cookie lifetimes automatically.  This function will stay around in
some form, though the supported date/time formats may change.


Dealing with bad HTML
---------------------

XXX Intro

XXX Test me


Note about cookie standards
---------------------------

There are several standards relevant to HTTP cookies.

The Netscape protocol is the only standard supported by most web browsers
(including Internet Explorer and Firefox).  This is a *de facto* standard
defined by the behaviour of popular browsers, and neither the
[cookie\_spec.html](http://curl.haxx.se/rfc/cookie_spec.html) document that
was published by Netscape, nor the RFCs that were published later, describe
the Netscape protocol accurately or completely.  Netscape protocol cookies
are also known as V0 cookies, to distinguish them from RFC 2109 or RFC 2965
cookies, which have a version cookie-attribute with a value of 1.
[RFC 2109](http://www.ietf.org/rfc/rfc2109.txt) was introduced to fix some problems identified with the Netscape protocol, while still keeping the same HTTP headers (`Cookie` and `Set-Cookie`). The most prominent of these problems is the 'third-party' cookie issue, which was an accidental feature of the Netscape protocol. Some features defined by RFC2109 (such as the port and max-age cookie attributes) are now part of the de facto Netscape protocol, but the RFC was never implemented fully by browsers, because of differences in behaviour between the Netscape and Internet Explorer browsers of the time. [RFC 2965](http://www.ietf.org/rfc/rfc2965.txt) attempted to fix the compatibility problem by introducing two new headers, `Set-Cookie2` and `Cookie2`. Unlike the `Cookie` header, `Cookie2` does *not* carry cookies to the server -- rather, it simply advertises to the server that RFC 2965 is understood. `Set-Cookie2` *does* carry cookies, from server to client: the new header means that both IE and Netscape ignore these cookies. This preserves backwards compatibility, but popular browsers did not implement the RFC, so it was never widely adopted. One confusing point to note about RFC 2965 is that it uses the same value (1) of the Version attribute in HTTP headers as does RFC 2109. See also [RFC 2964](http://www.ietf.org/rfc/rfc2964.txt), which discusses use of the protocol. Because Netscape cookies are so poorly specified, the general philosophy of the module's Netscape protocol implementation is to start with RFC 2965 and open holes where required for Netscape protocol-compatibility. RFC 2965 cookies are *always* treated as RFC 2965 requires, of course. There is more information about the history of HTTP cookies in [this paper by David Kristol](http://arxiv.org/abs/cs.SE/0105018). 
Recently (2011), [an IETF effort has started](http://tools.ietf.org/html/draft-ietf-httpstate-cookie) to specify the syntax and semantics of the `Cookie` and `Set-Cookie` headers as they are actually used on the internet. mechanize-0.2.5/docs/download.txt0000644000175000017500000000305611545150741015446 0ustar johnjohn% mechanize -- Download There is more than one way to obtain mechanize: _Note re Windows and Mac support: currently the tests are only routinely run on [Ubuntu](http://www.ubuntu.com/) 9.10 ("karmic"). However, as far as I know, mechanize works fine on Windows and Mac platforms._ easy_install ------------ #. Install [EasyInstall](http://peak.telecommunity.com/DevCenter/EasyInstall) #. `easy_install mechanize` Easy install will automatically download the latest source code release and install it. Source code release ------------------- #. Download the source from one of the links below #. Unpack the source distribution and change directory to the resulting top-level directory. #. `python setup.py install` This is a stable release. * [`mechanize-0.2.5.tar.gz`](http://pypi.python.org/packages/source/m/mechanize/mechanize-0.2.5.tar.gz) * [`mechanize-0.2.5.zip`](http://pypi.python.org/packages/source/m/mechanize/mechanize-0.2.5.zip) * [Older versions.](./src/) Note: these are hosted on sourceforge, which at the time of writing (2011-03-31) is returning invalid HTTP responses -- you can also find old releases on [PyPI](http://pypi.python.org/)) All the documentation (these web pages, docstrings, and [the changelog](./ChangeLog.txt)) is included in the distribution. git repository -------------- The [git](http://git-scm.com/) repository is [here](http://github.com/jjlee/mechanize). To check it out: #.

`git clone git://github.com/jjlee/mechanize.git`

mechanize-0.2.5/docs/development.txt0000644000175000017500000000232311545150644016157 0ustar johnjohn% mechanize -- Development git repository -------------- The [git](http://git-scm.com/) repository is [here](http://github.com/jjlee/mechanize). To check it out: `git clone git://github.com/jjlee/mechanize.git` There is also [another repository](http://github.com/jjlee/mechanize-build-tools), which is only useful for making mechanize releases: `git clone git://github.com/jjlee/mechanize-build-tools.git` Old repository -------------- The [old SVN repository](http://codespeak.net/svn/wwwsearch/) may be useful for viewing ClientForm history. ClientForm used to be a dependency of mechanize, but has been merged into mechanize as of release 0.2.0; the history wasn't imported. To check out: `svn co http://codespeak.net/svn/wwwsearch/` Bug tracker ----------- The bug tracker is [here on github](http://github.com/jjlee/mechanize/issues). It's equally acceptable to file bugs on the tracker or post about them to the [mailing list](http://lists.sourceforge.net/lists/listinfo/wwwsearch-general). Feel free to send patches too! Mailing list ------------ There is a [mailing list](http://lists.sourceforge.net/lists/listinfo/wwwsearch-general). 
mechanize-0.2.5/MANIFEST.in0000644000175000017500000000065511545150644013710 0ustar johnjohninclude COPYING.txt include INSTALL.txt include MANIFEST.in include README.txt include *.py recursive-include examples *.py recursive-include examples/forms *.dat *.txt *.html *.cgi *.py recursive-include test/functional_tests_golden output recursive-include test/test_form_data *.html recursive-include test *.py *.doctest *.special_doctest recursive-include test-tools *.py *.cgi recursive-include docs *.txt *.html *.css *.js mechanize-0.2.5/examples/0000755000175000017500000000000011545173600013757 5ustar johnjohnmechanize-0.2.5/examples/forms/0000755000175000017500000000000011545173600015105 5ustar johnjohnmechanize-0.2.5/examples/forms/example.html0000644000175000017500000000261111545150644017431 0ustar johnjohn Example
%s """ % extra_content if title is not None: html = re.sub("(.*)", "%s" % title, html) return html MECHANIZE_HTML = html() ROOT_HTML = html("mechanize") RELOAD_TEST_HTML = """\ Title near the start

Now some data to prevent HEAD parsing from reading the link near the end.

%s
near the end """ % (("0123456789ABCDEF"*4+"\n")*61) REFERER_TEST_HTML = """\ mechanize Referer (sic) test page

This page exists to test the Referer functionality of mechanize.

Here is a link to a page that displays the Referer header. """ BASIC_AUTH_PAGE = """ Basic Auth Protected Area

Hello, basic auth world.

""" DIGEST_AUTH_PAGE = """ Digest Auth Protected Area

Hello, digest auth world.

""" class TestHTTPUser(object): """ Test avatar implementation for http auth with cred """ implements(IHTTPUser) username = None def __init__(self, username): """ @param username: The str username sent as part of the HTTP auth response. """ self.username = username class TestAuthRealm(object): """ Test realm that supports the IHTTPUser interface """ implements(portal.IRealm) def requestAvatar(self, avatarId, mind, *interfaces): if IHTTPUser in interfaces: if avatarId == checkers.ANONYMOUS: return IHTTPUser, TestHTTPUser('anonymous') return IHTTPUser, TestHTTPUser(avatarId) raise NotImplementedError("Only IHTTPUser interface is supported") class Page(resource.Resource): addSlash = True content_type = http_headers.MimeType("text", "html") def render(self, ctx): return http.Response( responsecode.OK, {"content-type": self.content_type}, self.text) class Dir(resource.Resource): addSlash = True def locateChild(self, request, segments): #import pdb; pdb.set_trace() return resource.Resource.locateChild(self, request, segments) def render(self, ctx): print "render" return http.Response(responsecode.FORBIDDEN) def make_dir(parent, name): dir_ = Dir() parent.putChild(name, dir_) return dir_ def _make_page(parent, name, text, content_type, wrapper, leaf=False): page = Page() page.text = text base_type, specific_type = content_type.split("/") page.content_type = http_headers.MimeType(base_type, specific_type) page.addSlash = not leaf parent.putChild(name, wrapper(page)) return page def make_page(parent, name, text, content_type="text/html", wrapper=lambda page: page): return _make_page(parent, name, text, content_type, wrapper, leaf=False) def make_leaf_page(parent, name, text, content_type="text/html", wrapper=lambda page: page): return _make_page(parent, name, text, content_type, wrapper, leaf=True) def make_redirect(parent, name, location_relative_ref): redirect = resource.RedirectResource(path=location_relative_ref) setattr(parent, "child_"+name, redirect) return redirect 
def make_cgi_bin(parent, name, dir_name): cgi_bin = twcgi.CGIDirectory(dir_name) setattr(parent, "child_"+name, cgi_bin) return cgi_bin def make_cgi_script(parent, name, path): cgi_script = twcgi.CGIScript(path) setattr(parent, "child_"+name, cgi_script) return cgi_script def require_basic_auth(resource): p = portal.Portal(TestAuthRealm()) c = checkers.InMemoryUsernamePasswordDatabaseDontUse() c.addUser("john", "john") p.registerChecker(c) cred_factory = basic.BasicCredentialFactory("Basic Auth protected area") return wrapper.HTTPAuthResource(resource, [cred_factory], p, interfaces=(IHTTPUser,)) class DigestCredFactory(digest.DigestCredentialFactory): def generateOpaque(self, nonce, clientip): # http://twistedmatrix.com/trac/ticket/3693 key = "%s,%s,%s" % (nonce, clientip, str(int(self._getTime()))) digest = md5(key + self.privateKey).hexdigest() ekey = key.encode('base64') return "%s-%s" % (digest, ekey.replace('\n', '')) def require_digest_auth(resource): p = portal.Portal(TestAuthRealm()) c = checkers.InMemoryUsernamePasswordDatabaseDontUse() c.addUser("digestuser", "digestuser") p.registerChecker(c) cred_factory = DigestCredFactory("MD5", "Digest Auth protected area") return wrapper.HTTPAuthResource(resource, [cred_factory], p, interfaces=(IHTTPUser,)) def parse_options(args): parser = optparse.OptionParser() parser.add_option("--log", action="store_true") options, remaining_args = parser.parse_args(args) options.port = int(remaining_args[0]) return options def main(argv): options = parse_options(argv[1:]) if options.log: log.startLogging(sys.stdout) # This is supposed to match the SF site so it's easy to run a functional # test over the internet and against Apache. # TODO: Remove bizarre structure and strings expected by functional tests. 
root = Page() root.text = ROOT_HTML mechanize = make_page(root, "mechanize", MECHANIZE_HTML) make_leaf_page(root, "robots.txt", "User-Agent: *\nDisallow: /norobots", "text/plain") make_leaf_page(root, "robots", "Hello, robots.", "text/plain") make_leaf_page(root, "norobots", "Hello, non-robots.", "text/plain") test_fixtures = make_page(root, "test_fixtures", # satisfy stupid assertions in functional tests html("Python bits", extra_content="GeneralFAQ.html")) make_leaf_page(test_fixtures, "cctest2.txt", "Hello ClientCookie functional test suite.", "text/plain") make_leaf_page(test_fixtures, "referertest.html", REFERER_TEST_HTML) make_leaf_page(test_fixtures, "mechanize_reload_test.html", RELOAD_TEST_HTML) make_redirect(root, "redirected", "/doesnotexist") cgi_bin = make_dir(root, "cgi-bin") project_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) make_cgi_script(cgi_bin, "cookietest.cgi", os.path.join(project_dir, "test-tools", "cookietest.cgi")) example_html = open(os.path.join("examples", "forms", "example.html")).read() make_leaf_page(mechanize, "example.html", example_html) make_cgi_script(cgi_bin, "echo.cgi", os.path.join(project_dir, "examples", "forms", "echo.cgi")) make_page(root, "basic_auth", BASIC_AUTH_PAGE, wrapper=require_basic_auth) make_page(root, "digest_auth", DIGEST_AUTH_PAGE, wrapper=require_digest_auth) site = server.Site(root) reactor.listenTCP(options.port, channel.HTTPFactory(site)) reactor.run() if __name__ == "__main__": main(sys.argv) mechanize-0.2.5/test-tools/linecache_copy.py0000644000175000017500000000743011545150644017604 0ustar johnjohn"""Cache lines from files. This is intended to read lines from modules imported -- hence if a filename is not found, it will look down the module search path for a file by that name. 
""" import sys import os __all__ = ["getline", "clearcache", "checkcache"] def getline(filename, lineno, module_globals=None): lines = getlines(filename, module_globals) if 1 <= lineno <= len(lines): return lines[lineno-1] else: return '' # The cache cache = {} # The cache def clearcache(): """Clear the cache entirely.""" global cache cache = {} def getlines(filename, module_globals=None): """Get the lines for a file from the cache. Update the cache if it doesn't contain an entry for this file already.""" if filename in cache: return cache[filename][2] else: return updatecache(filename, module_globals) def checkcache(filename=None): """Discard cache entries that are out of date. (This is not checked upon each call!)""" if filename is None: filenames = cache.keys() else: if filename in cache: filenames = [filename] else: return for filename in filenames: size, mtime, lines, fullname = cache[filename] if mtime is None: continue # no-op for files loaded via a __loader__ try: stat = os.stat(fullname) except os.error: del cache[filename] continue if size != stat.st_size or mtime != stat.st_mtime: del cache[filename] def updatecache(filename, module_globals=None): """Update a cache entry and return its list of lines. 
If something's wrong, print a message, discard the cache entry, and return an empty list.""" if filename in cache: del cache[filename] if not filename or filename[0] + filename[-1] == '<>': return [] fullname = filename try: stat = os.stat(fullname) except os.error, msg: basename = os.path.split(filename)[1] # Try for a __loader__, if available if module_globals and '__loader__' in module_globals: name = module_globals.get('__name__') loader = module_globals['__loader__'] get_source = getattr(loader, 'get_source', None) if name and get_source: if basename.startswith(name.split('.')[-1]+'.'): try: data = get_source(name) except (ImportError, IOError): pass else: cache[filename] = ( len(data), None, [line+'\n' for line in data.splitlines()], fullname ) return cache[filename][2] # Try looking through the module search path. for dirname in sys.path: # When using imputil, sys.path may contain things other than # strings; ignore them when it happens. try: fullname = os.path.join(dirname, basename) except (TypeError, AttributeError): # Not sufficiently string-like to do anything useful with. 
                pass
            else:
                try:
                    stat = os.stat(fullname)
                    break
                except os.error:
                    pass
        else:
            # No luck
##            print '*** Cannot stat', filename, ':', msg
            return []
    try:
        fp = open(fullname, 'rU')
        lines = fp.readlines()
        fp.close()
    except IOError, msg:
##        print '*** Cannot open', fullname, ':', msg
        return []
    size, mtime = stat.st_size, stat.st_mtime
    cache[filename] = size, mtime, lines, fullname
    return lines

mechanize-0.2.5/test-tools/twisted-ftpserver.py

import optparse
import sys

import twisted.cred.checkers
import twisted.cred.credentials
import twisted.cred.portal
import twisted.internet
import twisted.protocols.ftp
from twisted.python import filepath, log
from zope.interface import implements


def make_ftp_shell(avatar_id, root_path):
    if avatar_id is twisted.cred.checkers.ANONYMOUS:
        return twisted.protocols.ftp.FTPAnonymousShell(root_path)
    else:
        return twisted.protocols.ftp.FTPShell(root_path)


class FTPRealm(object):

    implements(twisted.cred.portal.IRealm)

    def __init__(self, root_path):
        self._root_path = filepath.FilePath(root_path)

    def requestAvatar(self, avatarId, mind, *interfaces):
        for iface in interfaces:
            if iface is twisted.protocols.ftp.IFTPShell:
                avatar = make_ftp_shell(avatarId, self._root_path)
                return (twisted.protocols.ftp.IFTPShell,
                        avatar,
                        getattr(avatar, "logout", lambda: None))
        raise NotImplementedError()


class FtpServerFactory(object):

    """
    port = FtpServerFactory("/tmp", 2121).makeListener()
    self.addCleanup(port.stopListening)
    """

    def __init__(self, root_path, port):
        factory = twisted.protocols.ftp.FTPFactory()
        realm = FTPRealm(root_path)
        portal = twisted.cred.portal.Portal(realm)
        portal.registerChecker(twisted.cred.checkers.AllowAnonymousAccess(),
                               twisted.cred.credentials.IAnonymous)
        checker = twisted.cred.checkers.\
            InMemoryUsernamePasswordDatabaseDontUse()
        checker.addUser("john", "john")
        portal.registerChecker(checker)
        factory.tld = root_path
        factory.userAnonymous = "anon"
        factory.portal = portal
factory.protocol = twisted.protocols.ftp.FTP self._factory = factory self._port = port def makeListener(self): # XXX use 0 instead of self._port? return twisted.internet.reactor.listenTCP( self._port, self._factory, interface="127.0.0.1") def parse_options(args): parser = optparse.OptionParser() parser.add_option("--log", action="store_true") parser.add_option("--port", type="int", default=2121) options, remaining_args = parser.parse_args(args) options.root_path = remaining_args[0] return options def main(argv): options = parse_options(argv[1:]) if options.log: log.startLogging(sys.stdout) factory = FtpServerFactory(options.root_path, options.port) factory.makeListener() twisted.internet.reactor.run() if __name__ == "__main__": main(sys.argv) mechanize-0.2.5/test-tools/doctest.py0000644000175000017500000030424411545150644016307 0ustar johnjohn# Module doctest. # Released to the public domain 16-Jan-2001, by Tim Peters (tim@python.org). # Major enhancements and refactoring by: # Jim Fulton # Edward Loper # Provided as-is; use at your own risk; no warranty; no promises; enjoy! r"""Module doctest -- a framework for running examples in docstrings. In simplest use, end each module M to be tested with: def _test(): import doctest doctest.testmod() if __name__ == "__main__": _test() Then running the module as a script will cause the examples in the docstrings to get executed and verified: python M.py This won't display anything unless an example fails, in which case the failing example(s) and the cause(s) of the failure(s) are printed to stdout (why not stderr? because stderr is a lame hack <0.2 wink>), and the final line of output is "Test failed.". Run it with the -v switch instead: python M.py -v and a detailed report of all examples tried is printed to stdout, along with assorted summaries at the end. You can force verbose mode by passing "verbose=True" to testmod, or prohibit it by passing "verbose=False". 
In either of those cases, sys.argv is not examined by testmod. There are a variety of other ways to run doctests, including integration with the unittest framework, and support for running non-Python text files containing doctests. There are also many ways to override parts of doctest's default behaviors. See the Library Reference Manual for details. """ __docformat__ = 'reStructuredText en' __all__ = [ # 0, Option Flags 'register_optionflag', 'DONT_ACCEPT_TRUE_FOR_1', 'DONT_ACCEPT_BLANKLINE', 'NORMALIZE_WHITESPACE', 'ELLIPSIS', 'SKIP', 'IGNORE_EXCEPTION_DETAIL', 'COMPARISON_FLAGS', 'REPORT_UDIFF', 'REPORT_CDIFF', 'REPORT_NDIFF', 'REPORT_ONLY_FIRST_FAILURE', 'REPORTING_FLAGS', # 1. Utility Functions 'is_private', # 2. Example & DocTest 'Example', 'DocTest', # 3. Doctest Parser 'DocTestParser', # 4. Doctest Finder 'DocTestFinder', # 5. Doctest Runner 'DocTestRunner', 'OutputChecker', 'DocTestFailure', 'UnexpectedException', 'DebugRunner', # 6. Test Functions 'testmod', 'testfile', 'run_docstring_examples', # 7. Tester 'Tester', # 8. Unittest Support 'DocTestSuite', 'DocFileSuite', 'set_unittest_reportflags', # 9. Debugging Support 'script_from_examples', 'testsource', 'debug_src', 'debug', ] import __future__ import sys, traceback, inspect, linecache_copy, os, re, types import unittest, difflib, pdb, tempfile import warnings from StringIO import StringIO # Don't whine about the deprecated is_private function in this # module's tests. warnings.filterwarnings("ignore", "is_private", DeprecationWarning, __name__, 0) # There are 4 basic classes: # - Example: a pair, plus an intra-docstring line number. # - DocTest: a collection of examples, parsed from a docstring, plus # info about where the docstring came from (name, filename, lineno). # - DocTestFinder: extracts DocTests from a given object's docstring and # its contained objects' docstrings. # - DocTestRunner: runs DocTest cases, and accumulates statistics. 
#
# So the basic picture is:
#
#                             list of:
# +------+                   +---------+                   +-------+
# |object| --DocTestFinder-> | DocTest | --DocTestRunner-> |results|
# +------+                   +---------+                   +-------+
#                            | Example |
#                            |   ...   |
#                            | Example |
#                            +---------+

# Option constants.

OPTIONFLAGS_BY_NAME = {}
def register_optionflag(name):
    flag = 1 << len(OPTIONFLAGS_BY_NAME)
    OPTIONFLAGS_BY_NAME[name] = flag
    return flag

DONT_ACCEPT_TRUE_FOR_1 = register_optionflag('DONT_ACCEPT_TRUE_FOR_1')
DONT_ACCEPT_BLANKLINE = register_optionflag('DONT_ACCEPT_BLANKLINE')
NORMALIZE_WHITESPACE = register_optionflag('NORMALIZE_WHITESPACE')
ELLIPSIS = register_optionflag('ELLIPSIS')
SKIP = register_optionflag('SKIP')
IGNORE_EXCEPTION_DETAIL = register_optionflag('IGNORE_EXCEPTION_DETAIL')

COMPARISON_FLAGS = (DONT_ACCEPT_TRUE_FOR_1 |
                    DONT_ACCEPT_BLANKLINE |
                    NORMALIZE_WHITESPACE |
                    ELLIPSIS |
                    SKIP |
                    IGNORE_EXCEPTION_DETAIL)

REPORT_UDIFF = register_optionflag('REPORT_UDIFF')
REPORT_CDIFF = register_optionflag('REPORT_CDIFF')
REPORT_NDIFF = register_optionflag('REPORT_NDIFF')
REPORT_ONLY_FIRST_FAILURE = register_optionflag('REPORT_ONLY_FIRST_FAILURE')

REPORTING_FLAGS = (REPORT_UDIFF |
                   REPORT_CDIFF |
                   REPORT_NDIFF |
                   REPORT_ONLY_FIRST_FAILURE)

# Special string markers for use in `want` strings:
BLANKLINE_MARKER = '<BLANKLINE>'
ELLIPSIS_MARKER = '...'

######################################################################
## Table of Contents
######################################################################
#  1. Utility Functions
#  2. Example & DocTest -- store test cases
#  3. DocTest Parser -- extracts examples from strings
#  4. DocTest Finder -- extracts test cases from objects
#  5. DocTest Runner -- runs test cases
#  6. Test Functions -- convenient wrappers for testing
#  7. Tester Class -- for backwards compatibility
#  8. Unittest Support
#  9. Debugging Support
# 10. Example Usage

######################################################################
## 1.
Utility Functions ###################################################################### def is_private(prefix, base): """prefix, base -> true iff name prefix + "." + base is "private". Prefix may be an empty string, and base does not contain a period. Prefix is ignored (although functions you write conforming to this protocol may make use of it). Return true iff base begins with an (at least one) underscore, but does not both begin and end with (at least) two underscores. >>> is_private("a.b", "my_func") False >>> is_private("____", "_my_func") True >>> is_private("someclass", "__init__") False >>> is_private("sometypo", "__init_") True >>> is_private("x.y.z", "_") True >>> is_private("_x.y.z", "__") False >>> is_private("", "") # senseless but consistent False """ warnings.warn("is_private is deprecated; it wasn't useful; " "examine DocTestFinder.find() lists instead", DeprecationWarning, stacklevel=2) return base[:1] == "_" and not base[:2] == "__" == base[-2:] def _extract_future_flags(globs): """ Return the compiler-flags associated with the future features that have been imported into the given namespace (globs). """ flags = 0 for fname in __future__.all_feature_names: feature = globs.get(fname, None) if feature is getattr(__future__, fname): flags |= feature.compiler_flag return flags def _normalize_module(module, depth=2): """ Return the module specified by `module`. In particular: - If `module` is a module, then return module. - If `module` is a string, then import and return the module with that name. - If `module` is None, then return the calling module. The calling module is assumed to be the module of the stack frame at the given depth in the call stack. 
""" if inspect.ismodule(module): return module elif isinstance(module, (str, unicode)): return __import__(module, globals(), locals(), ["*"]) elif module is None: return sys.modules[sys._getframe(depth).f_globals['__name__']] else: raise TypeError("Expected a module, string, or None") def _load_testfile(filename, package, module_relative): if module_relative: package = _normalize_module(package, 3) filename = _module_relative_path(package, filename) if hasattr(package, '__loader__'): if hasattr(package.__loader__, 'get_data'): return package.__loader__.get_data(filename), filename return open(filename).read(), filename def _indent(s, indent=4): """ Add the given number of space characters to the beginning every non-blank line in `s`, and return the result. """ # This regexp matches the start of non-blank lines: return re.sub('(?m)^(?!$)', indent*' ', s) def _exception_traceback(exc_info): """ Return a string containing a traceback message for the given exc_info tuple (as returned by sys.exc_info()). """ # Get a traceback message. excout = StringIO() exc_type, exc_val, exc_tb = exc_info traceback.print_exception(exc_type, exc_val, exc_tb, file=excout) return excout.getvalue() # Override some StringIO methods. class _SpoofOut(StringIO): def getvalue(self): result = StringIO.getvalue(self) # If anything at all was written, make sure there's a trailing # newline. There's no way for the expected output to indicate # that a trailing newline is missing. if result and not result.endswith("\n"): result += "\n" # Prevent softspace from screwing up the next test case, in # case they used print with a trailing comma in an example. if hasattr(self, "softspace"): del self.softspace return result def truncate(self, size=None): StringIO.truncate(self, size) if hasattr(self, "softspace"): del self.softspace # Worst-case linear-time ellipsis matching. 
def _ellipsis_match(want, got): """ Essentially the only subtle case: >>> _ellipsis_match('aa...aa', 'aaa') False """ if ELLIPSIS_MARKER not in want: return want == got # Find "the real" strings. ws = want.split(ELLIPSIS_MARKER) assert len(ws) >= 2 # Deal with exact matches possibly needed at one or both ends. startpos, endpos = 0, len(got) w = ws[0] if w: # starts with exact match if got.startswith(w): startpos = len(w) del ws[0] else: return False w = ws[-1] if w: # ends with exact match if got.endswith(w): endpos -= len(w) del ws[-1] else: return False if startpos > endpos: # Exact end matches required more characters than we have, as in # _ellipsis_match('aa...aa', 'aaa') return False # For the rest, we only need to find the leftmost non-overlapping # match for each piece. If there's no overall match that way alone, # there's no overall match period. for w in ws: # w may be '' at times, if there are consecutive ellipses, or # due to an ellipsis at the start or end of `want`. That's OK. # Search for an empty string succeeds, and doesn't change startpos. startpos = got.find(w, startpos, endpos) if startpos < 0: return False startpos += len(w) return True def _comment_line(line): "Return a commented form of the given line" line = line.rstrip() if line: return '# '+line else: return '#' class _OutputRedirectingPdb(pdb.Pdb): """ A specialized version of the python debugger that redirects stdout to a given stream when interacting with the user. Stdout is *not* redirected when traced code is executed. """ def __init__(self, out): self.__out = out self.__debugger_used = False pdb.Pdb.__init__(self) def set_trace(self): self.__debugger_used = True pdb.Pdb.set_trace(self) def set_continue(self): # Calling set_continue unconditionally would break unit test coverage # reporting, as Bdb.set_continue calls sys.settrace(None). if self.__debugger_used: pdb.Pdb.set_continue(self) def trace_dispatch(self, *args): # Redirect stdout to the given stream. 
save_stdout = sys.stdout sys.stdout = self.__out # Call Pdb's trace dispatch method. try: return pdb.Pdb.trace_dispatch(self, *args) finally: sys.stdout = save_stdout # [XX] Normalize with respect to os.path.pardir? def _module_relative_path(module, path): if not inspect.ismodule(module): raise TypeError, 'Expected a module: %r' % module if path.startswith('/'): raise ValueError, 'Module-relative files may not have absolute paths' # Find the base directory for the path. if hasattr(module, '__file__'): # A normal module/package basedir = os.path.split(module.__file__)[0] elif module.__name__ == '__main__': # An interactive session. if len(sys.argv)>0 and sys.argv[0] != '': basedir = os.path.split(sys.argv[0])[0] else: basedir = os.curdir else: # A module w/o __file__ (this includes builtins) raise ValueError("Can't resolve paths relative to the module " + module + " (it has no __file__)") # Combine the base directory and the path. return os.path.join(basedir, *(path.split('/'))) ###################################################################### ## 2. Example & DocTest ###################################################################### ## - An "example" is a pair, where "source" is a ## fragment of source code, and "want" is the expected output for ## "source." The Example class also includes information about ## where the example was extracted from. ## ## - A "doctest" is a collection of examples, typically extracted from ## a string (such as an object's docstring). The DocTest class also ## includes information about where the string was extracted from. class Example: """ A single doctest example, consisting of source code and expected output. `Example` defines the following attributes: - source: A single Python statement, always ending with a newline. The constructor adds a newline if needed. - want: The expected output from running the source code (either from stdout, or a traceback in case of exception). 
`want` ends with a newline unless it's empty, in which case it's an empty string. The constructor adds a newline if needed. - exc_msg: The exception message generated by the example, if the example is expected to generate an exception; or `None` if it is not expected to generate an exception. This exception message is compared against the return value of `traceback.format_exception_only()`. `exc_msg` ends with a newline unless it's `None`. The constructor adds a newline if needed. - lineno: The line number within the DocTest string containing this Example where the Example begins. This line number is zero-based, with respect to the beginning of the DocTest. - indent: The example's indentation in the DocTest string. I.e., the number of space characters that preceed the example's first prompt. - options: A dictionary mapping from option flags to True or False, which is used to override default options for this example. Any option flags not contained in this dictionary are left at their default value (as specified by the DocTestRunner's optionflags). By default, no options are set. """ def __init__(self, source, want, exc_msg=None, lineno=0, indent=0, options=None): # Normalize inputs. if not source.endswith('\n'): source += '\n' if want and not want.endswith('\n'): want += '\n' if exc_msg is not None and not exc_msg.endswith('\n'): exc_msg += '\n' # Store properties. self.source = source self.want = want self.lineno = lineno self.indent = indent if options is None: options = {} self.options = options self.exc_msg = exc_msg class DocTest: """ A collection of doctest examples that should be run in a single namespace. Each `DocTest` defines the following attributes: - examples: the list of examples. - globs: The namespace (aka globals) that the examples should be run in. - name: A name identifying the DocTest (typically, the name of the object whose docstring this DocTest was extracted from). 
    - filename: The name of the file that this DocTest was extracted
      from, or `None` if the filename is unknown.

    - lineno: The line number within filename where this DocTest
      begins, or `None` if the line number is unavailable.  This
      line number is zero-based, with respect to the beginning of
      the file.

    - docstring: The string that the examples were extracted from,
      or `None` if the string is unavailable.
    """
    def __init__(self, examples, globs, name, filename, lineno,
                 docstring):
        """
        Create a new DocTest containing the given examples.  The
        DocTest's globals are initialized with a copy of `globs`.
        """
        assert not isinstance(examples, basestring), \
               "DocTest no longer accepts str; use DocTestParser instead"
        self.examples = examples
        self.docstring = docstring
        self.globs = globs.copy()
        self.name = name
        self.filename = filename
        self.lineno = lineno

    def __repr__(self):
        if len(self.examples) == 0:
            examples = 'no examples'
        elif len(self.examples) == 1:
            examples = '1 example'
        else:
            examples = '%d examples' % len(self.examples)
        return ('<DocTest %s from %s:%s (%s)>' %
                (self.name, self.filename, self.lineno, examples))

    # This lets us sort tests by name:
    def __cmp__(self, other):
        if not isinstance(other, DocTest):
            return -1
        return cmp((self.name, self.filename, self.lineno, id(self)),
                   (other.name, other.filename, other.lineno, id(other)))

######################################################################
## 3. DocTestParser
######################################################################

class DocTestParser:
    """
    A class used to parse strings containing doctest examples.
    """
    # This regular expression is used to find doctest examples in a
    # string.  It defines three groups: `source` is the source code
    # (including leading indentation and prompts); `indent` is the
    # indentation of the first (PS1) line of the source code; and
    # `want` is the expected output (including leading indentation).
    _EXAMPLE_RE = re.compile(r'''
        # Source consists of a PS1 line followed by zero or more PS2 lines.
        (?P<source>
            (?:^(?P<indent> [ ]*) >>>    .*)    # PS1 line
            (?:\n           [ ]*  \.\.\. .*)*)  # PS2 lines
        \n?
        # Want consists of any non-blank lines that do not start with PS1.
        (?P<want> (?:(?![ ]*$)    # Not a blank line
                     (?![ ]*>>>)  # Not a line starting with PS1
                     .*$\n?       # But any other line
                  )*)
        ''', re.MULTILINE | re.VERBOSE)

    # A regular expression for handling `want` strings that contain
    # expected exceptions.  It divides `want` into three pieces:
    #    - the traceback header line (`hdr`)
    #    - the traceback stack (`stack`)
    #    - the exception message (`msg`), as generated by
    #      traceback.format_exception_only()
    # `msg` may have multiple lines.  We assume/require that the
    # exception message is the first non-indented line starting with a word
    # character following the traceback header line.
    _EXCEPTION_RE = re.compile(r"""
        # Grab the traceback header.  Different versions of Python have
        # said different things on the first traceback line.
        ^(?P<hdr> Traceback\ \(
            (?: most\ recent\ call\ last
            |   innermost\ last
            ) \) :
        )
        \s* $                # toss trailing whitespace on the header.
        (?P<stack> .*?)      # don't blink: absorb stuff until...
        ^ (?P<msg> \w+ .*)   #     a line *starts* with alphanum.
        """, re.VERBOSE | re.MULTILINE | re.DOTALL)

    # A callable returning a true value iff its argument is a blank line
    # or contains a single comment.
    _IS_BLANK_OR_COMMENT = re.compile(r'^[ ]*(#.*)?$').match

    def parse(self, string, name='<string>'):
        """
        Divide the given string into examples and intervening text,
        and return them as a list of alternating Examples and strings.
        Line numbers for the Examples are 0-based.  The optional
        argument `name` is a name identifying this string, and is only
        used for error messages.
        """
        string = string.expandtabs()
        # If all lines begin with the same indentation, then strip it.
        min_indent = self._min_indent(string)
        if min_indent > 0:
            string = '\n'.join([l[min_indent:] for l in string.split('\n')])

        output = []
        charno, lineno = 0, 0
        # Find all doctest examples in the string:
        for m in self._EXAMPLE_RE.finditer(string):
            # Add the pre-example text to `output`.
            output.append(string[charno:m.start()])
            # Update lineno (lines before this example)
            lineno += string.count('\n', charno, m.start())
            # Extract info from the regexp match.
            (source, options, want, exc_msg) = \
                     self._parse_example(m, name, lineno)
            # Create an Example, and add it to the list.
            if not self._IS_BLANK_OR_COMMENT(source):
                output.append( Example(source, want, exc_msg,
                                    lineno=lineno,
                                    indent=min_indent+len(m.group('indent')),
                                    options=options) )
            # Update lineno (lines inside this example)
            lineno += string.count('\n', m.start(), m.end())
            # Update charno.
            charno = m.end()
        # Add any remaining post-example text to `output`.
        output.append(string[charno:])
        return output

    def get_doctest(self, string, globs, name, filename, lineno):
        """
        Extract all doctest examples from the given string, and
        collect them into a `DocTest` object.

        `globs`, `name`, `filename`, and `lineno` are attributes for
        the new `DocTest` object.  See the documentation for `DocTest`
        for more information.
        """
        return DocTest(self.get_examples(string, name), globs,
                       name, filename, lineno, string)

    def get_examples(self, string, name='<string>'):
        """
        Extract all doctest examples from the given string, and return
        them as a list of `Example` objects.  Line numbers are
        0-based, because it's most common in doctests that nothing
        interesting appears on the same line as opening triple-quote,
        and so the first interesting line is called \"line 1\" then.

        The optional argument `name` is a name identifying this
        string, and is only used for error messages.
""" return [x for x in self.parse(string, name) if isinstance(x, Example)] def _parse_example(self, m, name, lineno): """ Given a regular expression match from `_EXAMPLE_RE` (`m`), return a pair `(source, want)`, where `source` is the matched example's source code (with prompts and indentation stripped); and `want` is the example's expected output (with indentation stripped). `name` is the string's name, and `lineno` is the line number where the example starts; both are used for error messages. """ # Get the example's indentation level. indent = len(m.group('indent')) # Divide source into lines; check that they're properly # indented; and then strip their indentation & prompts. source_lines = m.group('source').split('\n') self._check_prompt_blank(source_lines, indent, name, lineno) self._check_prefix(source_lines[1:], ' '*indent + '.', name, lineno) source = '\n'.join([sl[indent+4:] for sl in source_lines]) # Divide want into lines; check that it's properly indented; and # then strip the indentation. Spaces before the last newline should # be preserved, so plain rstrip() isn't good enough. want = m.group('want') want_lines = want.split('\n') if len(want_lines) > 1 and re.match(r' *$', want_lines[-1]): del want_lines[-1] # forget final newline & spaces after it self._check_prefix(want_lines, ' '*indent, name, lineno + len(source_lines)) want = '\n'.join([wl[indent:] for wl in want_lines]) # If `want` contains a traceback message, then extract it. m = self._EXCEPTION_RE.match(want) if m: exc_msg = m.group('msg') else: exc_msg = None # Extract options from the source. options = self._find_options(source, name, lineno) return source, options, want, exc_msg # This regular expression looks for option directives in the # source code of an example. Option directives are comments # starting with "doctest:". Warning: this may give false # positives for string-literals that contain the string # "#doctest:". 
Eliminating these false positives would require # actually parsing the string; but we limit them by ignoring any # line containing "#doctest:" that is *followed* by a quote mark. _OPTION_DIRECTIVE_RE = re.compile(r'#\s*doctest:\s*([^\n\'"]*)$', re.MULTILINE) def _find_options(self, source, name, lineno): """ Return a dictionary containing option overrides extracted from option directives in the given source string. `name` is the string's name, and `lineno` is the line number where the example starts; both are used for error messages. """ options = {} # (note: with the current regexp, this will match at most once:) for m in self._OPTION_DIRECTIVE_RE.finditer(source): option_strings = m.group(1).replace(',', ' ').split() for option in option_strings: if (option[0] not in '+-' or option[1:] not in OPTIONFLAGS_BY_NAME): raise ValueError('line %r of the doctest for %s ' 'has an invalid option: %r' % (lineno+1, name, option)) flag = OPTIONFLAGS_BY_NAME[option[1:]] options[flag] = (option[0] == '+') if options and self._IS_BLANK_OR_COMMENT(source): raise ValueError('line %r of the doctest for %s has an option ' 'directive on a line with no example: %r' % (lineno, name, source)) return options # This regular expression finds the indentation of every non-blank # line in a string. _INDENT_RE = re.compile('^([ ]*)(?=\S)', re.MULTILINE) def _min_indent(self, s): "Return the minimum indentation of any non-blank line in `s`" indents = [len(indent) for indent in self._INDENT_RE.findall(s)] if len(indents) > 0: return min(indents) else: return 0 def _check_prompt_blank(self, lines, indent, name, lineno): """ Given the lines of a source string (including prompts and leading indentation), check to make sure that every prompt is followed by a space character. If any line is not followed by a space character, then raise ValueError. 
""" for i, line in enumerate(lines): if len(line) >= indent+4 and line[indent+3] != ' ': raise ValueError('line %r of the docstring for %s ' 'lacks blank after %s: %r' % (lineno+i+1, name, line[indent:indent+3], line)) def _check_prefix(self, lines, prefix, name, lineno): """ Check that every line in the given list starts with the given prefix; if any line does not, then raise a ValueError. """ for i, line in enumerate(lines): if line and not line.startswith(prefix): raise ValueError('line %r of the docstring for %s has ' 'inconsistent leading whitespace: %r' % (lineno+i+1, name, line)) ###################################################################### ## 4. DocTest Finder ###################################################################### class DocTestFinder: """ A class used to extract the DocTests that are relevant to a given object, from its docstring and the docstrings of its contained objects. Doctests can currently be extracted from the following object types: modules, functions, classes, methods, staticmethods, classmethods, and properties. """ def __init__(self, verbose=False, parser=DocTestParser(), recurse=True, _namefilter=None, exclude_empty=True): """ Create a new doctest finder. The optional argument `parser` specifies a class or function that should be used to create new DocTest objects (or objects that implement the same interface as DocTest). The signature for this factory function should match the signature of the DocTest constructor. If the optional argument `recurse` is false, then `find` will only examine the given object, and not any contained objects. If the optional argument `exclude_empty` is false, then `find` will include tests for objects with empty docstrings. """ self._parser = parser self._verbose = verbose self._recurse = recurse self._exclude_empty = exclude_empty # _namefilter is undocumented, and exists only for temporary backward- # compatibility support of testmod's deprecated isprivate mess. 
self._namefilter = _namefilter def find(self, obj, name=None, module=None, globs=None, extraglobs=None): """ Return a list of the DocTests that are defined by the given object's docstring, or by any of its contained objects' docstrings. The optional parameter `module` is the module that contains the given object. If the module is not specified or is None, then the test finder will attempt to automatically determine the correct module. The object's module is used: - As a default namespace, if `globs` is not specified. - To prevent the DocTestFinder from extracting DocTests from objects that are imported from other modules. - To find the name of the file containing the object. - To help find the line number of the object within its file. Contained objects whose module does not match `module` are ignored. If `module` is False, no attempt to find the module will be made. This is obscure, of use mostly in tests: if `module` is False, or is None but cannot be found automatically, then all objects are considered to belong to the (non-existent) module, so all contained objects will (recursively) be searched for doctests. The globals for each DocTest is formed by combining `globs` and `extraglobs` (bindings in `extraglobs` override bindings in `globs`). A new copy of the globals dictionary is created for each DocTest. If `globs` is not specified, then it defaults to the module's `__dict__`, if specified, or {} otherwise. If `extraglobs` is not specified, then it defaults to {}. """ # If name was not specified, then extract it from the object. if name is None: name = getattr(obj, '__name__', None) if name is None: raise ValueError("DocTestFinder.find: name must be given " "when obj.__name__ doesn't exist: %r" % (type(obj),)) # Find the module that contains the given object (if obj is # a module, then module=obj.). Note: this may fail, in which # case module will be None. 
        if module is False:
            module = None
        elif module is None:
            module = inspect.getmodule(obj)

        # Read the module's source code.  This is used by
        # DocTestFinder._find_lineno to find the line number for a
        # given object's docstring.
        try:
            file = inspect.getsourcefile(obj) or inspect.getfile(obj)
            source_lines = linecache_copy.getlines(file)
            if not source_lines:
                source_lines = None
        except TypeError:
            source_lines = None

        # Initialize globals, and merge in extraglobs.
        if globs is None:
            if module is None:
                globs = {}
            else:
                globs = module.__dict__.copy()
        else:
            globs = globs.copy()
        if extraglobs is not None:
            globs.update(extraglobs)

        # Recursively explore `obj`, extracting DocTests.
        tests = []
        self._find(tests, obj, name, module, source_lines, globs, {})
        return tests

    def _filter(self, obj, prefix, base):
        """
        Return true if the given object should not be examined.
        """
        return (self._namefilter is not None and
                self._namefilter(prefix, base))

    def _from_module(self, module, object):
        """
        Return true if the given object is defined in the given
        module.
        """
        if module is None:
            return True
        elif inspect.isfunction(object):
            return module.__dict__ is object.func_globals
        elif inspect.isclass(object):
            return module.__name__ == object.__module__
        elif inspect.getmodule(object) is not None:
            return module is inspect.getmodule(object)
        elif hasattr(object, '__module__'):
            return module.__name__ == object.__module__
        elif isinstance(object, property):
            return True # [XX] no way to be sure.
        else:
            raise ValueError("object must be a class or function")

    def _find(self, tests, obj, name, module, source_lines, globs, seen):
        """
        Find tests for the given object and any contained objects, and
        add them to `tests`.
        """
        if self._verbose:
            print 'Finding tests in %s' % name

        # If we've already processed this object, then ignore it.
        if id(obj) in seen:
            return
        seen[id(obj)] = 1

        # Find a test for this object, and add it to the list of tests.
test = self._get_test(obj, name, module, globs, source_lines) if test is not None: tests.append(test) # Look for tests in a module's contained objects. if inspect.ismodule(obj) and self._recurse: for valname, val in obj.__dict__.items(): # Check if this contained object should be ignored. if self._filter(val, name, valname): continue valname = '%s.%s' % (name, valname) # Recurse to functions & classes. if ((inspect.isfunction(val) or inspect.isclass(val)) and self._from_module(module, val)): self._find(tests, val, valname, module, source_lines, globs, seen) # Look for tests in a module's __test__ dictionary. if inspect.ismodule(obj) and self._recurse: for valname, val in getattr(obj, '__test__', {}).items(): if not isinstance(valname, basestring): raise ValueError("DocTestFinder.find: __test__ keys " "must be strings: %r" % (type(valname),)) if not (inspect.isfunction(val) or inspect.isclass(val) or inspect.ismethod(val) or inspect.ismodule(val) or isinstance(val, basestring)): raise ValueError("DocTestFinder.find: __test__ values " "must be strings, functions, methods, " "classes, or modules: %r" % (type(val),)) valname = '%s.__test__.%s' % (name, valname) self._find(tests, val, valname, module, source_lines, globs, seen) # Look for tests in a class's contained objects. if inspect.isclass(obj) and self._recurse: for valname, val in obj.__dict__.items(): # Check if this contained object should be ignored. if self._filter(val, name, valname): continue # Special handling for staticmethod/classmethod. if isinstance(val, staticmethod): val = getattr(obj, valname) if isinstance(val, classmethod): val = getattr(obj, valname).im_func # Recurse to methods, properties, and nested classes. 
if ((inspect.isfunction(val) or inspect.isclass(val) or isinstance(val, property)) and self._from_module(module, val)): valname = '%s.%s' % (name, valname) self._find(tests, val, valname, module, source_lines, globs, seen) def _get_test(self, obj, name, module, globs, source_lines): """ Return a DocTest for the given object, if it defines a docstring; otherwise, return None. """ # Extract the object's docstring. If it doesn't have one, # then return None (no test for this object). if isinstance(obj, basestring): docstring = obj else: try: if obj.__doc__ is None: docstring = '' else: docstring = obj.__doc__ if not isinstance(docstring, basestring): docstring = str(docstring) except (TypeError, AttributeError): docstring = '' # Find the docstring's location in the file. lineno = self._find_lineno(obj, source_lines) # Don't bother if the docstring is empty. if self._exclude_empty and not docstring: return None # Return a DocTest for this object. if module is None: filename = None else: filename = getattr(module, '__file__', module.__name__) if filename[-4:] in (".pyc", ".pyo"): filename = filename[:-1] return self._parser.get_doctest(docstring, globs, name, filename, lineno) def _find_lineno(self, obj, source_lines): """ Return a line number of the given object's docstring. Note: this method assumes that the object has a docstring. """ lineno = None # Find the line number for modules. if inspect.ismodule(obj): lineno = 0 # Find the line number for classes. # Note: this could be fooled if a class is defined multiple # times in a single file. if inspect.isclass(obj): if source_lines is None: return None pat = re.compile(r'^\s*class\s*%s\b' % getattr(obj, '__name__', '-')) for i, line in enumerate(source_lines): if pat.match(line): lineno = i break # Find the line number for functions & methods. 
if inspect.ismethod(obj): obj = obj.im_func if inspect.isfunction(obj): obj = obj.func_code if inspect.istraceback(obj): obj = obj.tb_frame if inspect.isframe(obj): obj = obj.f_code if inspect.iscode(obj): lineno = getattr(obj, 'co_firstlineno', None)-1 # Find the line number where the docstring starts. Assume # that it's the first line that begins with a quote mark. # Note: this could be fooled by a multiline function # signature, where a continuation line begins with a quote # mark. if lineno is not None: if source_lines is None: return lineno+1 pat = re.compile('(^|.*:)\s*\w*("|\')') for lineno in range(lineno, len(source_lines)): if pat.match(source_lines[lineno]): return lineno # We couldn't find the line number. return None ###################################################################### ## 5. DocTest Runner ###################################################################### class DocTestRunner: """ A class used to run DocTest test cases, and accumulate statistics. The `run` method is used to process a single DocTest case. It returns a tuple `(f, t)`, where `t` is the number of test cases tried, and `f` is the number of test cases that failed. >>> tests = DocTestFinder().find(_TestClass) >>> runner = DocTestRunner(verbose=False) >>> for test in tests: ... print runner.run(test) (0, 2) (0, 1) (0, 2) (0, 2) The `summarize` method prints a summary of all the test cases that have been run by the runner, and returns an aggregated `(f, t)` tuple: >>> runner.summarize(verbose=1) 4 items passed all tests: 2 tests in _TestClass 2 tests in _TestClass.__init__ 2 tests in _TestClass.get 1 tests in _TestClass.square 7 tests in 4 items. 7 passed and 0 failed. Test passed. (0, 7) The aggregated number of tried examples and failed examples is also available via the `tries` and `failures` attributes: >>> runner.tries 7 >>> runner.failures 0 The comparison between expected outputs and actual outputs is done by an `OutputChecker`. 
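    The finder/runner workflow sketched above can be exercised by hand.
    A minimal illustration, written against the standard-library
    `doctest` module (which this file tracks); `square` is a throwaway
    function invented purely for the example:

```python
# Minimal sketch of the DocTestFinder/DocTestRunner workflow described
# above.  `square` is a hypothetical example function, not part of this
# module; the stdlib doctest module stands in for this bundled copy.
import doctest

def square(x):
    """
    >>> square(3)
    9
    """
    return x * x

finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(square):
    runner.run(test)      # runs the one example in square's docstring

# summarize() returns the aggregated (failures, tries) pair.
failures, tries = runner.summarize(verbose=False)
```

    With one passing example, `failures, tries` comes back as `(0, 1)`.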
    This comparison may be customized with a number of option flags;
    see the documentation for `testmod` for more information.  If the
    option flags are insufficient, then the comparison may also be
    customized by passing a subclass of `OutputChecker` to the
    constructor.

    The test runner's display output can be controlled in two ways.
    First, an output function (`out`) can be passed to
    `TestRunner.run`; this function will be called with strings that
    should be displayed.  It defaults to `sys.stdout.write`.  If
    capturing the output is not sufficient, then the display output
    can also be customized by subclassing DocTestRunner, and
    overriding the methods `report_start`, `report_success`,
    `report_unexpected_exception`, and `report_failure`.
    """
    # This divider string is used to separate failure messages, and to
    # separate sections of the summary.
    DIVIDER = "*" * 70

    def __init__(self, checker=None, verbose=None, optionflags=0):
        """
        Create a new test runner.

        Optional keyword arg `checker` is the `OutputChecker` that
        should be used to compare the expected outputs and actual
        outputs of doctest examples.

        Optional keyword arg 'verbose' prints lots of stuff if true,
        only failures if false; by default, it's true iff '-v' is in
        sys.argv.

        Optional argument `optionflags` can be used to control how the
        test runner compares expected output to actual output, and how
        it displays failures.  See the documentation for `testmod` for
        more information.
        """
        self._checker = checker or OutputChecker()
        if verbose is None:
            verbose = '-v' in sys.argv
        self._verbose = verbose
        self.optionflags = optionflags
        self.original_optionflags = optionflags

        # Keep track of the examples we've run.
        self.tries = 0
        self.failures = 0
        self._name2ft = {}

        # Create a fake output target for capturing doctest output.
self._fakeout = _SpoofOut() #///////////////////////////////////////////////////////////////// # Reporting methods #///////////////////////////////////////////////////////////////// def report_start(self, out, test, example): """ Report that the test runner is about to process the given example. (Only displays a message if verbose=True) """ if self._verbose: if example.want: out('Trying:\n' + _indent(example.source) + 'Expecting:\n' + _indent(example.want)) else: out('Trying:\n' + _indent(example.source) + 'Expecting nothing\n') def report_success(self, out, test, example, got): """ Report that the given example ran successfully. (Only displays a message if verbose=True) """ if self._verbose: out("ok\n") def report_failure(self, out, test, example, got): """ Report that the given example failed. """ out(self._failure_header(test, example) + self._checker.output_difference(example, got, self.optionflags)) def report_unexpected_exception(self, out, test, example, exc_info): """ Report that the given example raised an unexpected exception. """ out(self._failure_header(test, example) + 'Exception raised:\n' + _indent(_exception_traceback(exc_info))) def _failure_header(self, test, example): out = [self.DIVIDER] if test.filename: if test.lineno is not None and example.lineno is not None: lineno = test.lineno + example.lineno + 1 else: lineno = '?' out.append('File "%s", line %s, in %s' % (test.filename, lineno, test.name)) else: out.append('Line %s, in %s' % (example.lineno+1, test.name)) out.append('Failed example:') source = example.source out.append(_indent(source)) return '\n'.join(out) #///////////////////////////////////////////////////////////////// # DocTest Running #///////////////////////////////////////////////////////////////// def __run(self, test, compileflags, out): """ Run the examples in `test`. Write the outcome of each example with one of the `DocTestRunner.report_*` methods, using the writer function `out`. 
        `compileflags` is the set of compiler flags that should be
        used to execute examples.

        Return a tuple `(f, t)`, where `t` is the number of examples
        tried, and `f` is the number of examples that failed.  The
        examples are run in the namespace `test.globs`.
        """
        # Keep track of the number of failures and tries.
        failures = tries = 0

        # Save the option flags (since option directives can be used
        # to modify them).
        original_optionflags = self.optionflags

        SUCCESS, FAILURE, BOOM = range(3) # `outcome` state

        check = self._checker.check_output

        # Process each example.
        for examplenum, example in enumerate(test.examples):

            # If REPORT_ONLY_FIRST_FAILURE is set, then suppress
            # reporting after the first failure.
            quiet = (self.optionflags & REPORT_ONLY_FIRST_FAILURE and
                     failures > 0)

            # Merge in the example's options.
            self.optionflags = original_optionflags
            if example.options:
                for (optionflag, val) in example.options.items():
                    if val:
                        self.optionflags |= optionflag
                    else:
                        self.optionflags &= ~optionflag

            # If 'SKIP' is set, then skip this example.
            if self.optionflags & SKIP:
                continue

            # Record that we started this example.
            tries += 1
            if not quiet:
                self.report_start(out, test, example)

            # Use a special filename for compile(), so we can retrieve
            # the source code during interactive debugging (see
            # __patched_linecache_getlines).
            filename = '<doctest %s[%d]>' % (test.name, examplenum)

            # Run the example in the given context (globs), and record
            # any exception that gets raised.  (But don't intercept
            # keyboard interrupts.)
            try:
                # Don't blink!  This is where the user's code gets run.
exec compile(example.source, filename, "single", compileflags, 1) in test.globs self.debugger.set_continue() # ==== Example Finished ==== exception = None except KeyboardInterrupt: raise except: exception = sys.exc_info() self.debugger.set_continue() # ==== Example Finished ==== got = self._fakeout.getvalue() # the actual output self._fakeout.truncate(0) outcome = FAILURE # guilty until proved innocent or insane # If the example executed without raising any exceptions, # verify its output. if exception is None: if check(example.want, got, self.optionflags): outcome = SUCCESS # The example raised an exception: check if it was expected. else: exc_info = sys.exc_info() exc_msg = traceback.format_exception_only(*exc_info[:2])[-1] if not quiet: got += _exception_traceback(exc_info) # If `example.exc_msg` is None, then we weren't expecting # an exception. if example.exc_msg is None: outcome = BOOM # We expected an exception: see whether it matches. elif check(example.exc_msg, exc_msg, self.optionflags): outcome = SUCCESS # Another chance if they didn't care about the detail. elif self.optionflags & IGNORE_EXCEPTION_DETAIL: m1 = re.match(r'[^:]*:', example.exc_msg) m2 = re.match(r'[^:]*:', exc_msg) if m1 and m2 and check(m1.group(0), m2.group(0), self.optionflags): outcome = SUCCESS # Report the outcome. if outcome is SUCCESS: if not quiet: self.report_success(out, test, example, got) elif outcome is FAILURE: if not quiet: self.report_failure(out, test, example, got) failures += 1 elif outcome is BOOM: if not quiet: self.report_unexpected_exception(out, test, example, exc_info) failures += 1 else: assert False, ("unknown outcome", outcome) # Restore the option flags (in case they were modified) self.optionflags = original_optionflags # Record and return the number of failures and tries. 
self.__record_outcome(test, failures, tries) return failures, tries def __record_outcome(self, test, f, t): """ Record the fact that the given DocTest (`test`) generated `f` failures out of `t` tried examples. """ f2, t2 = self._name2ft.get(test.name, (0,0)) self._name2ft[test.name] = (f+f2, t+t2) self.failures += f self.tries += t __LINECACHE_FILENAME_RE = re.compile(r'[\w\.]+)' r'\[(?P\d+)\]>$') def __patched_linecache_getlines(self, filename, module_globals=None): m = self.__LINECACHE_FILENAME_RE.match(filename) if m and m.group('name') == self.test.name: example = self.test.examples[int(m.group('examplenum'))] return example.source.splitlines(True) else: return self.save_linecache_getlines(filename, module_globals) def run(self, test, compileflags=None, out=None, clear_globs=True): """ Run the examples in `test`, and display the results using the writer function `out`. The examples are run in the namespace `test.globs`. If `clear_globs` is true (the default), then this namespace will be cleared after the test runs, to help with garbage collection. If you would like to examine the namespace after the test completes, then use `clear_globs=False`. `compileflags` gives the set of flags that should be used by the Python compiler when running the examples. If not specified, then it will default to the set of future-import flags that apply to `globs`. The output of each example is checked using `DocTestRunner.check_output`, and the results are formatted by the `DocTestRunner.report_*` methods. """ self.test = test if compileflags is None: compileflags = _extract_future_flags(test.globs) save_stdout = sys.stdout if out is None: out = save_stdout.write sys.stdout = self._fakeout # Patch pdb.set_trace to restore sys.stdout during interactive # debugging (so it's not still redirected to self._fakeout). # Note that the interactive output will go to *our* # save_stdout, even if that's not the real sys.stdout; this # allows us to write test cases for the set_trace behavior. 
save_set_trace = pdb.set_trace self.debugger = _OutputRedirectingPdb(save_stdout) self.debugger.reset() pdb.set_trace = self.debugger.set_trace # Patch linecache_copy.getlines, so we can see the example's source # when we're inside the debugger. self.save_linecache_getlines = linecache_copy.getlines linecache_copy.getlines = self.__patched_linecache_getlines try: return self.__run(test, compileflags, out) finally: sys.stdout = save_stdout pdb.set_trace = save_set_trace linecache_copy.getlines = self.save_linecache_getlines if clear_globs: test.globs.clear() #///////////////////////////////////////////////////////////////// # Summarization #///////////////////////////////////////////////////////////////// def summarize(self, verbose=None): """ Print a summary of all the test cases that have been run by this DocTestRunner, and return a tuple `(f, t)`, where `f` is the total number of failed examples, and `t` is the total number of tried examples. The optional `verbose` argument controls how detailed the summary is. If the verbosity is not specified, then the DocTestRunner's verbosity is used. """ if verbose is None: verbose = self._verbose notests = [] passed = [] failed = [] totalt = totalf = 0 for x in self._name2ft.items(): name, (f, t) = x assert f <= t totalt += t totalf += f if t == 0: notests.append(name) elif f == 0: passed.append( (name, t) ) else: failed.append(x) if verbose: if notests: print len(notests), "items had no tests:" notests.sort() for thing in notests: print " ", thing if passed: print len(passed), "items passed all tests:" passed.sort() for thing, count in passed: print " %3d tests in %s" % (count, thing) if failed: print self.DIVIDER print len(failed), "items had failures:" failed.sort() for thing, (f, t) in failed: print " %3d of %3d in %s" % (f, t, thing) if verbose: print totalt, "tests in", len(self._name2ft), "items." print totalt - totalf, "passed and", totalf, "failed." if totalf: print "***Test Failed***", totalf, "failures." 
        elif verbose:
            print "Test passed."
        return totalf, totalt

    #/////////////////////////////////////////////////////////////////
    # Backward compatibility cruft to maintain doctest.master.
    #/////////////////////////////////////////////////////////////////
    def merge(self, other):
        d = self._name2ft
        for name, (f, t) in other._name2ft.items():
            if name in d:
                print "*** DocTestRunner.merge: '" + name + "' in both" \
                    " testers; summing outcomes."
                f2, t2 = d[name]
                f = f + f2
                t = t + t2
            d[name] = f, t

class OutputChecker:
    """
    A class used to check whether the actual output from a doctest
    example matches the expected output.  `OutputChecker` defines two
    methods: `check_output`, which compares a given pair of outputs,
    and returns true if they match; and `output_difference`, which
    returns a string describing the differences between two outputs.
    """
    def check_output(self, want, got, optionflags):
        """
        Return True iff the actual output from an example (`got`)
        matches the expected output (`want`).  These strings are
        always considered to match if they are identical; but
        depending on what option flags the test runner is using,
        several non-exact match types are also possible.  See the
        documentation for `TestRunner` for more information about
        option flags.
        """
        # Handle the common case first, for efficiency:
        # if they're string-identical, always return true.
        if got == want:
            return True

        # The values True and False replaced 1 and 0 as the return
        # value for boolean comparisons in Python 2.3.
        if not (optionflags & DONT_ACCEPT_TRUE_FOR_1):
            if (got,want) == ("True\n", "1\n"):
                return True
            if (got,want) == ("False\n", "0\n"):
                return True

        # <BLANKLINE> can be used as a special sequence to signify a
        # blank line, unless the DONT_ACCEPT_BLANKLINE flag is used.
        if not (optionflags & DONT_ACCEPT_BLANKLINE):
            # Replace <BLANKLINE> in want with a blank line.
            want = re.sub('(?m)^%s\s*?$' % re.escape(BLANKLINE_MARKER),
                          '', want)
            # If a line in got contains only spaces, then remove the
            # spaces.
            got = re.sub('(?m)^\s*?$', '', got)
            if got == want:
                return True

        # This flag causes doctest to ignore any differences in the
        # contents of whitespace strings.  Note that this can be used
        # in conjunction with the ELLIPSIS flag.
        if optionflags & NORMALIZE_WHITESPACE:
            got = ' '.join(got.split())
            want = ' '.join(want.split())
            if got == want:
                return True

        # The ELLIPSIS flag says to let the sequence "..." in `want`
        # match any substring in `got`.
        if optionflags & ELLIPSIS:
            if _ellipsis_match(want, got):
                return True

        # We didn't find any match; return false.
        return False

    # Should we do a fancy diff?
    def _do_a_fancy_diff(self, want, got, optionflags):
        # Not unless they asked for a fancy diff.
        if not optionflags & (REPORT_UDIFF |
                              REPORT_CDIFF |
                              REPORT_NDIFF):
            return False

        # If expected output uses ellipsis, a meaningful fancy diff is
        # too hard ... or maybe not.  In two real-life failures Tim saw,
        # a diff was a major help anyway, so this is commented out.
        # [todo] _ellipsis_match() knows which pieces do and don't match,
        # and could be the basis for a kick-ass diff in this case.
        ##if optionflags & ELLIPSIS and ELLIPSIS_MARKER in want:
        ##    return False

        # ndiff does intraline difference marking, so can be useful even
        # for 1-line differences.
        if optionflags & REPORT_NDIFF:
            return True

        # The other diff types need at least a few lines to be helpful.
        return want.count('\n') > 2 and got.count('\n') > 2

    def output_difference(self, example, got, optionflags):
        """
        Return a string describing the differences between the
        expected output for a given example (`example`) and the actual
        output (`got`).  `optionflags` is the set of option flags used
        to compare `want` and `got`.
        """
        want = example.want
        # If <BLANKLINE>s are being used, then replace blank lines
        # with <BLANKLINE> in the actual output string.
        if not (optionflags & DONT_ACCEPT_BLANKLINE):
            got = re.sub('(?m)^[ ]*(?=\n)', BLANKLINE_MARKER, got)

        # Check if we should use diff.
        if self._do_a_fancy_diff(want, got, optionflags):
            # Split want & got into lines.
want_lines = want.splitlines(True) # True == keep line ends got_lines = got.splitlines(True) # Use difflib to find their differences. if optionflags & REPORT_UDIFF: diff = difflib.unified_diff(want_lines, got_lines, n=2) diff = list(diff)[2:] # strip the diff header kind = 'unified diff with -expected +actual' elif optionflags & REPORT_CDIFF: diff = difflib.context_diff(want_lines, got_lines, n=2) diff = list(diff)[2:] # strip the diff header kind = 'context diff with expected followed by actual' elif optionflags & REPORT_NDIFF: engine = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK) diff = list(engine.compare(want_lines, got_lines)) kind = 'ndiff with -expected +actual' else: assert 0, 'Bad diff option' # Remove trailing whitespace on diff output. diff = [line.rstrip() + '\n' for line in diff] return 'Differences (%s):\n' % kind + _indent(''.join(diff)) # If we're not using diff, then simply list the expected # output followed by the actual output. if want and got: return 'Expected:\n%sGot:\n%s' % (_indent(want), _indent(got)) elif want: return 'Expected:\n%sGot nothing\n' % _indent(want) elif got: return 'Expected nothing\nGot:\n%s' % _indent(got) else: return 'Expected nothing\nGot nothing\n' class DocTestFailure(Exception): """A DocTest example has failed in debugging mode. 
    The exception instance has variables:

    - test: the DocTest object being run

    - example: the Example object that failed

    - got: the actual output
    """
    def __init__(self, test, example, got):
        self.test = test
        self.example = example
        self.got = got

    def __str__(self):
        return str(self.test)

class UnexpectedException(Exception):
    """A DocTest example has encountered an unexpected exception

    The exception instance has variables:

    - test: the DocTest object being run

    - example: the Example object that failed

    - exc_info: the exception info
    """
    def __init__(self, test, example, exc_info):
        self.test = test
        self.example = example
        self.exc_info = exc_info

    def __str__(self):
        return str(self.test)

class DebugRunner(DocTestRunner):
    r"""Run doc tests but raise an exception as soon as there is a failure.

       If an unexpected exception occurs, an UnexpectedException is raised.
       It contains the test, the example, and the original exception:

         >>> runner = DebugRunner(verbose=False)
         >>> test = DocTestParser().get_doctest('>>> raise KeyError\n42',
         ...                                    {}, 'foo', 'foo.py', 0)
         >>> try:
         ...     runner.run(test)
         ... except UnexpectedException, failure:
         ...     pass

         >>> failure.test is test
         True

         >>> failure.example.want
         '42\n'

         >>> exc_info = failure.exc_info
         >>> raise exc_info[0], exc_info[1], exc_info[2]
         Traceback (most recent call last):
         ...
         KeyError

       We wrap the original exception to give the calling application
       access to the test and example information.

       If the output doesn't match, then a DocTestFailure is raised:

         >>> test = DocTestParser().get_doctest('''
         ...      >>> x = 1
         ...      >>> x
         ...      2
         ...      ''', {}, 'foo', 'foo.py', 0)

         >>> try:
         ...     runner.run(test)
         ... except DocTestFailure, failure:
         ...
         ...     pass

       DocTestFailure objects provide access to the test:

         >>> failure.test is test
         True

       As well as to the example:

         >>> failure.example.want
         '2\n'

       and the actual output:

         >>> failure.got
         '1\n'

       If a failure or error occurs, the globals are left intact:

         >>> del test.globs['__builtins__']
         >>> test.globs
         {'x': 1}

         >>> test = DocTestParser().get_doctest('''
         ...      >>> x = 2
         ...      >>> raise KeyError
         ...      ''', {}, 'foo', 'foo.py', 0)

         >>> runner.run(test)
         Traceback (most recent call last):
         ...
         UnexpectedException: <DocTest foo from foo.py:0 (2 examples)>

         >>> del test.globs['__builtins__']
         >>> test.globs
         {'x': 2}

       But the globals are cleared if there is no error:

         >>> test = DocTestParser().get_doctest('''
         ...      >>> x = 2
         ...      ''', {}, 'foo', 'foo.py', 0)

         >>> runner.run(test)
         (0, 1)

         >>> test.globs
         {}

       """

    def run(self, test, compileflags=None, out=None, clear_globs=True):
        r = DocTestRunner.run(self, test, compileflags, out, False)
        if clear_globs:
            test.globs.clear()
        return r

    def report_unexpected_exception(self, out, test, example, exc_info):
        raise UnexpectedException(test, example, exc_info)

    def report_failure(self, out, test, example, got):
        raise DocTestFailure(test, example, got)

######################################################################
## 6. Test Functions
######################################################################
# These should be backwards compatible.

# For backward compatibility, a global instance of a DocTestRunner
# class, updated by testmod.
master = None

def testmod(m=None, name=None, globs=None, verbose=None, isprivate=None,
            report=True, optionflags=0, extraglobs=None,
            raise_on_error=False, exclude_empty=False):
    """m=None, name=None, globs=None, verbose=None, isprivate=None,
       report=True, optionflags=0, extraglobs=None, raise_on_error=False,
       exclude_empty=False

    Test examples in docstrings in functions and classes reachable
    from module m (or the current module if m is not supplied), starting
    with m.__doc__.  Unless isprivate is specified, private names
    are not skipped.
    Also test examples reachable from dict m.__test__ if it exists and is
    not None.  m.__test__ maps names to functions, classes and strings;
    function and class docstrings are tested even if the name is private;
    strings are tested directly, as if they were docstrings.

    Return (#failures, #tests).

    See doctest.__doc__ for an overview.

    Optional keyword arg "name" gives the name of the module; by default
    use m.__name__.

    Optional keyword arg "globs" gives a dict to be used as the globals
    when executing examples; by default, use m.__dict__.  A copy of this
    dict is actually used for each docstring, so that each docstring's
    examples start with a clean slate.

    Optional keyword arg "extraglobs" gives a dictionary that should be
    merged into the globals that are used to execute examples.  By
    default, no extra globals are used.  This is new in 2.4.

    Optional keyword arg "verbose" prints lots of stuff if true, prints
    only failures if false; by default, it's true iff "-v" is in sys.argv.

    Optional keyword arg "report" prints a summary at the end when true,
    else prints nothing at the end.  In verbose mode, the summary is
    detailed, else very brief (in fact, empty if all tests passed).

    Optional keyword arg "optionflags" or's together module constants,
    and defaults to 0.  This is new in 2.3.  Possible values (see the
    docs for details):

        DONT_ACCEPT_TRUE_FOR_1
        DONT_ACCEPT_BLANKLINE
        NORMALIZE_WHITESPACE
        ELLIPSIS
        SKIP
        IGNORE_EXCEPTION_DETAIL
        REPORT_UDIFF
        REPORT_CDIFF
        REPORT_NDIFF
        REPORT_ONLY_FIRST_FAILURE

    Optional keyword arg "raise_on_error" raises an exception on the
    first unexpected exception or failure.  This allows failures to be
    post-mortem debugged.

    Deprecated in Python 2.4:
    Optional keyword arg "isprivate" specifies a function used to
    determine whether a name is private.  The default function treats
    all functions as public.  Optionally, "isprivate" can be set to
    doctest.is_private to skip over functions marked as private using
    the underscore naming convention; see its docs for details.
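    As a quick illustration of the keyword arguments above, a sketch
    using the stdlib doctest module; the module `m` is synthesized with
    types.ModuleType purely for the example:

```python
# Hedged sketch of a typical testmod() call.  The module is synthesized
# here only for illustration; normally m defaults to __main__.
import doctest
import types

m = types.ModuleType("example_mod")
m.__doc__ = """
>>> 2 + 2
4
"""

# report=False suppresses the end-of-run summary; optionflags shows how
# the documented flag constants are or'd together.
failed, attempted = doctest.testmod(m, verbose=False, report=False,
                                    optionflags=doctest.ELLIPSIS)
```

    The single example in `m.__doc__` passes, so the call returns
    `(0, 1)`.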
Advanced tomfoolery: testmod runs methods of a local instance of class doctest.Tester, then merges the results into (or creates) global Tester instance doctest.master. Methods of doctest.master can be called directly too, if you want to do something unusual. Passing report=0 to testmod is especially useful then, to delay displaying a summary. Invoke doctest.master.summarize(verbose) when you're done fiddling. """ global master if isprivate is not None: warnings.warn("the isprivate argument is deprecated; " "examine DocTestFinder.find() lists instead", DeprecationWarning) # If no module was given, then use __main__. if m is None: # DWA - m will still be None if this wasn't invoked from the command # line, in which case the following TypeError is about as good an error # as we should expect m = sys.modules.get('__main__') # Check that we were actually given a module. if not inspect.ismodule(m): raise TypeError("testmod: module required; %r" % (m,)) # If no name was given, then use the module's name. if name is None: name = m.__name__ # Find, parse, and run all tests in the given module. finder = DocTestFinder(_namefilter=isprivate, exclude_empty=exclude_empty) if raise_on_error: runner = DebugRunner(verbose=verbose, optionflags=optionflags) else: runner = DocTestRunner(verbose=verbose, optionflags=optionflags) for test in finder.find(m, name, globs=globs, extraglobs=extraglobs): runner.run(test) if report: runner.summarize() if master is None: master = runner else: master.merge(runner) return runner.failures, runner.tries def testfile(filename, module_relative=True, name=None, package=None, globs=None, verbose=None, report=True, optionflags=0, extraglobs=None, raise_on_error=False, parser=DocTestParser()): """ Test examples in the given file. Return (#failures, #tests). Optional keyword arg "module_relative" specifies how filenames should be interpreted: - If "module_relative" is True (the default), then "filename" specifies a module-relative path. 
By default, this path is relative to the calling module's directory; but if the "package" argument is specified, then it is relative to that package. To ensure os-independence, "filename" should use "/" characters to separate path segments, and should not be an absolute path (i.e., it may not begin with "/"). - If "module_relative" is False, then "filename" specifies an os-specific path. The path may be absolute or relative (to the current working directory). Optional keyword arg "name" gives the name of the test; by default use the file's basename. Optional keyword argument "package" is a Python package or the name of a Python package whose directory should be used as the base directory for a module relative filename. If no package is specified, then the calling module's directory is used as the base directory for module relative filenames. It is an error to specify "package" if "module_relative" is False. Optional keyword arg "globs" gives a dict to be used as the globals when executing examples; by default, use {}. A copy of this dict is actually used for each docstring, so that each docstring's examples start with a clean slate. Optional keyword arg "extraglobs" gives a dictionary that should be merged into the globals that are used to execute examples. By default, no extra globals are used. Optional keyword arg "verbose" prints lots of stuff if true, prints only failures if false; by default, it's true iff "-v" is in sys.argv. Optional keyword arg "report" prints a summary at the end when true, else prints nothing at the end. In verbose mode, the summary is detailed, else very brief (in fact, empty if all tests passed). Optional keyword arg "optionflags" or's together module constants, and defaults to 0. 
Possible values (see the docs for details): DONT_ACCEPT_TRUE_FOR_1 DONT_ACCEPT_BLANKLINE NORMALIZE_WHITESPACE ELLIPSIS SKIP IGNORE_EXCEPTION_DETAIL REPORT_UDIFF REPORT_CDIFF REPORT_NDIFF REPORT_ONLY_FIRST_FAILURE Optional keyword arg "raise_on_error" raises an exception on the first unexpected exception or failure. This allows failures to be post-mortem debugged. Optional keyword arg "parser" specifies a DocTestParser (or subclass) that should be used to extract tests from the files. Advanced tomfoolery: testmod runs methods of a local instance of class doctest.Tester, then merges the results into (or creates) global Tester instance doctest.master. Methods of doctest.master can be called directly too, if you want to do something unusual. Passing report=0 to testmod is especially useful then, to delay displaying a summary. Invoke doctest.master.summarize(verbose) when you're done fiddling. """ global master if package and not module_relative: raise ValueError("Package may only be specified for module-" "relative paths.") # Relativize the path text, filename = _load_testfile(filename, package, module_relative) # If no name was given, then use the file's name. if name is None: name = os.path.basename(filename) # Assemble the globals. if globs is None: globs = {} else: globs = globs.copy() if extraglobs is not None: globs.update(extraglobs) if raise_on_error: runner = DebugRunner(verbose=verbose, optionflags=optionflags) else: runner = DocTestRunner(verbose=verbose, optionflags=optionflags) # Read the file, convert it to a test, and run it. test = parser.get_doctest(text, globs, name, filename, 0) runner.run(test) if report: runner.summarize() if master is None: master = runner else: master.merge(runner) return runner.failures, runner.tries def run_docstring_examples(f, globs, verbose=False, name="NoName", compileflags=None, optionflags=0): """ Test examples in the given object's docstring (`f`), using `globs` as globals. 
Optional argument `name` is used in failure messages. If the optional argument `verbose` is true, then generate output even if there are no failures. `compileflags` gives the set of flags that should be used by the Python compiler when running the examples. If not specified, then it will default to the set of future-import flags that apply to `globs`. Optional keyword arg `optionflags` specifies options for the testing and output. See the documentation for `testmod` for more information. """ # Find, parse, and run all tests in the given module. finder = DocTestFinder(verbose=verbose, recurse=False) runner = DocTestRunner(verbose=verbose, optionflags=optionflags) for test in finder.find(f, name, globs=globs): runner.run(test, compileflags=compileflags) ###################################################################### ## 7. Tester ###################################################################### # This is provided only for backwards compatibility. It's not # actually used in any way. 
class Tester: def __init__(self, mod=None, globs=None, verbose=None, isprivate=None, optionflags=0): warnings.warn("class Tester is deprecated; " "use class doctest.DocTestRunner instead", DeprecationWarning, stacklevel=2) if mod is None and globs is None: raise TypeError("Tester.__init__: must specify mod or globs") if mod is not None and not inspect.ismodule(mod): raise TypeError("Tester.__init__: mod must be a module; %r" % (mod,)) if globs is None: globs = mod.__dict__ self.globs = globs self.verbose = verbose self.isprivate = isprivate self.optionflags = optionflags self.testfinder = DocTestFinder(_namefilter=isprivate) self.testrunner = DocTestRunner(verbose=verbose, optionflags=optionflags) def runstring(self, s, name): test = DocTestParser().get_doctest(s, self.globs, name, None, None) if self.verbose: print "Running string", name (f,t) = self.testrunner.run(test) if self.verbose: print f, "of", t, "examples failed in string", name return (f,t) def rundoc(self, object, name=None, module=None): f = t = 0 tests = self.testfinder.find(object, name, module=module, globs=self.globs) for test in tests: (f2, t2) = self.testrunner.run(test) (f,t) = (f+f2, t+t2) return (f,t) def rundict(self, d, name, module=None): import new m = new.module(name) m.__dict__.update(d) if module is None: module = False return self.rundoc(m, name, module) def run__test__(self, d, name): import new m = new.module(name) m.__test__ = d return self.rundoc(m, name) def summarize(self, verbose=None): return self.testrunner.summarize(verbose) def merge(self, other): self.testrunner.merge(other.testrunner) ###################################################################### ## 8. Unittest Support ###################################################################### _unittest_reportflags = 0 def set_unittest_reportflags(flags): """Sets the unittest option flags. 
The old flag is returned so that a runner could restore the old value if it wished to: >>> import doctest >>> old = doctest._unittest_reportflags >>> doctest.set_unittest_reportflags(REPORT_NDIFF | ... REPORT_ONLY_FIRST_FAILURE) == old True >>> doctest._unittest_reportflags == (REPORT_NDIFF | ... REPORT_ONLY_FIRST_FAILURE) True Only reporting flags can be set: >>> doctest.set_unittest_reportflags(ELLIPSIS) Traceback (most recent call last): ... ValueError: ('Only reporting flags allowed', 8) >>> doctest.set_unittest_reportflags(old) == (REPORT_NDIFF | ... REPORT_ONLY_FIRST_FAILURE) True """ global _unittest_reportflags if (flags & REPORTING_FLAGS) != flags: raise ValueError("Only reporting flags allowed", flags) old = _unittest_reportflags _unittest_reportflags = flags return old class DocTestCase(unittest.TestCase): def __init__(self, test, optionflags=0, setUp=None, tearDown=None, checker=None): unittest.TestCase.__init__(self) self._dt_optionflags = optionflags self._dt_checker = checker self._dt_test = test self._dt_setUp = setUp self._dt_tearDown = tearDown def setUp(self): test = self._dt_test if self._dt_setUp is not None: self._dt_setUp(test) def tearDown(self): test = self._dt_test if self._dt_tearDown is not None: self._dt_tearDown(test) test.globs.clear() def runTest(self): test = self._dt_test old = sys.stdout new = StringIO() optionflags = self._dt_optionflags if not (optionflags & REPORTING_FLAGS): # The option flags don't include any reporting flags, # so add the default reporting flags optionflags |= _unittest_reportflags runner = DocTestRunner(optionflags=optionflags, checker=self._dt_checker, verbose=False) try: runner.DIVIDER = "-"*70 failures, tries = runner.run( test, out=new.write, clear_globs=False) finally: sys.stdout = old if failures: raise self.failureException(self.format_failure(new.getvalue())) def format_failure(self, err): test = self._dt_test if test.lineno is None: lineno = 'unknown line number' else: lineno = '%s' % test.lineno 
        lname = '.'.join(test.name.split('.')[-1:])
        return ('Failed doctest test for %s\n'
                '  File "%s", line %s, in %s\n\n%s'
                % (test.name, test.filename, lineno, lname, err)
                )

    def debug(self):
        r"""Run the test case without results and without catching exceptions

           The unit test framework includes a debug method on test cases
           and test suites to support post-mortem debugging.  The test
           code is run in such a way that errors are not caught.  This
           way a caller can catch the errors and initiate post-mortem
           debugging.

           The DocTestCase provides a debug method that raises
           UnexpectedException errors if there is an unexpected
           exception:

             >>> test = DocTestParser().get_doctest('>>> raise KeyError\n42',
             ...                {}, 'foo', 'foo.py', 0)
             >>> case = DocTestCase(test)
             >>> try:
             ...     case.debug()
             ... except UnexpectedException, failure:
             ...     pass

           The UnexpectedException contains the test, the example, and
           the original exception:

             >>> failure.test is test
             True

             >>> failure.example.want
             '42\n'

             >>> exc_info = failure.exc_info
             >>> raise exc_info[0], exc_info[1], exc_info[2]
             Traceback (most recent call last):
             ...
             KeyError

           If the output doesn't match, then a DocTestFailure is raised:

             >>> test = DocTestParser().get_doctest('''
             ...      >>> x = 1
             ...      >>> x
             ...      2
             ...      ''', {}, 'foo', 'foo.py', 0)
             >>> case = DocTestCase(test)
             >>> try:
             ...     case.debug()
             ... except DocTestFailure, failure:
             ...
pass DocTestFailure objects provide access to the test: >>> failure.test is test True As well as to the example: >>> failure.example.want '2\n' and the actual output: >>> failure.got '1\n' """ self.setUp() runner = DebugRunner(optionflags=self._dt_optionflags, checker=self._dt_checker, verbose=False) runner.run(self._dt_test) self.tearDown() def id(self): return self._dt_test.name def __repr__(self): name = self._dt_test.name.split('.') return "%s (%s)" % (name[-1], '.'.join(name[:-1])) __str__ = __repr__ def shortDescription(self): return "Doctest: " + self._dt_test.name def DocTestSuite(module=None, globs=None, extraglobs=None, test_finder=None, **options): """ Convert doctest tests for a module to a unittest test suite. This converts each documentation string in a module that contains doctest tests to a unittest test case. If any of the tests in a doc string fail, then the test case fails. An exception is raised showing the name of the file containing the test and a (sometimes approximate) line number. The `module` argument provides the module to be tested. The argument can be either a module or a module name. If no argument is given, the calling module is used. A number of options may be provided as keyword arguments: setUp A set-up function. This is called before running the tests in each file. The setUp function will be passed a DocTest object. The setUp function can access the test globals as the globs attribute of the test passed. tearDown A tear-down function. This is called after running the tests in each file. The tearDown function will be passed a DocTest object. The tearDown function can access the test globals as the globs attribute of the test passed. globs A dictionary containing initial global variables for the tests. optionflags A set of doctest option flags expressed as an integer. 
""" if test_finder is None: test_finder = DocTestFinder() module = _normalize_module(module) tests = test_finder.find(module, globs=globs, extraglobs=extraglobs) if globs is None: globs = module.__dict__ if not tests: # Why do we want to do this? Because it reveals a bug that might # otherwise be hidden. raise ValueError(module, "has no tests") tests.sort() suite = unittest.TestSuite() for test in tests: if len(test.examples) == 0: continue if not test.filename: filename = module.__file__ if filename[-4:] in (".pyc", ".pyo"): filename = filename[:-1] test.filename = filename suite.addTest(DocTestCase(test, **options)) return suite class DocFileCase(DocTestCase): def id(self): return '_'.join(self._dt_test.name.split('.')) def __repr__(self): return self._dt_test.filename __str__ = __repr__ def format_failure(self, err): return ('Failed doctest test for %s\n File "%s", line 0\n\n%s' % (self._dt_test.name, self._dt_test.filename, err) ) def DocFileTest(path, module_relative=True, package=None, globs=None, parser=DocTestParser(), **options): if globs is None: globs = {} else: globs = globs.copy() if package and not module_relative: raise ValueError("Package may only be specified for module-" "relative paths.") # Relativize the path. doc, path = _load_testfile(path, package, module_relative) if "__file__" not in globs: globs["__file__"] = path # Find the file and read it. name = os.path.basename(path) # Convert it to a test, and wrap it in a DocFileCase. test = parser.get_doctest(doc, globs, name, path, 0) return DocFileCase(test, **options) def DocFileSuite(*paths, **kw): """A unittest suite for one or more doctest files. The path to each doctest file is given as a string; the interpretation of that string depends on the keyword argument "module_relative". A number of options may be provided as keyword arguments: module_relative If "module_relative" is True, then the given file paths are interpreted as os-independent module-relative paths. 
By default, these paths are relative to the calling module's directory; but if the "package" argument is specified, then they are relative to that package. To ensure os-independence, "filename" should use "/" characters to separate path segments, and may not be an absolute path (i.e., it may not begin with "/"). If "module_relative" is False, then the given file paths are interpreted as os-specific paths. These paths may be absolute or relative (to the current working directory). package A Python package or the name of a Python package whose directory should be used as the base directory for module relative paths. If "package" is not specified, then the calling module's directory is used as the base directory for module relative filenames. It is an error to specify "package" if "module_relative" is False. setUp A set-up function. This is called before running the tests in each file. The setUp function will be passed a DocTest object. The setUp function can access the test globals as the globs attribute of the test passed. tearDown A tear-down function. This is called after running the tests in each file. The tearDown function will be passed a DocTest object. The tearDown function can access the test globals as the globs attribute of the test passed. globs A dictionary containing initial global variables for the tests. optionflags A set of doctest option flags expressed as an integer. parser A DocTestParser (or subclass) that should be used to extract tests from the files. """ suite = unittest.TestSuite() # We do this here so that _normalize_module is called at the right # level. If it were called in DocFileTest, then this function # would be the caller and we might guess the package incorrectly. if kw.get('module_relative', True): kw['package'] = _normalize_module(kw.get('package')) for path in paths: suite.addTest(DocFileTest(path, **kw)) return suite ###################################################################### ## 9. 
Debugging Support ###################################################################### def script_from_examples(s): r"""Extract script from text with examples. Converts text with examples to a Python script. Example input is converted to regular code. Example output and all other words are converted to comments: >>> text = ''' ... Here are examples of simple math. ... ... Python has super accurate integer addition ... ... >>> 2 + 2 ... 5 ... ... And very friendly error messages: ... ... >>> 1/0 ... To Infinity ... And ... Beyond ... ... You can use logic if you want: ... ... >>> if 0: ... ... blah ... ... blah ... ... ... ... Ho hum ... ''' >>> print script_from_examples(text) # Here are examples of simple math. # # Python has super accurate integer addition # 2 + 2 # Expected: ## 5 # # And very friendly error messages: # 1/0 # Expected: ## To Infinity ## And ## Beyond # # You can use logic if you want: # if 0: blah blah # # Ho hum """ output = [] for piece in DocTestParser().parse(s): if isinstance(piece, Example): # Add the example's source code (strip trailing NL) output.append(piece.source[:-1]) # Add the expected output: want = piece.want if want: output.append('# Expected:') output += ['## '+l for l in want.split('\n')[:-1]] else: # Add non-example text. output += [_comment_line(l) for l in piece.split('\n')[:-1]] # Trim junk on both ends. while output and output[-1] == '#': output.pop() while output and output[0] == '#': output.pop(0) # Combine the output, and return it. # Add a courtesy newline to prevent exec from choking (see bug #1172785) return '\n'.join(output) + '\n' def testsource(module, name): """Extract the test sources from a doctest docstring as a script. Provide the module (or dotted name of the module) containing the test to be debugged and the name (within the module) of the object with the doc string with tests to be debugged. 
""" module = _normalize_module(module) tests = DocTestFinder().find(module) test = [t for t in tests if t.name == name] if not test: raise ValueError(name, "not found in tests") test = test[0] testsrc = script_from_examples(test.docstring) return testsrc def debug_src(src, pm=False, globs=None): """Debug a single doctest docstring, in argument `src`'""" testsrc = script_from_examples(src) debug_script(testsrc, pm, globs) def debug_script(src, pm=False, globs=None): "Debug a test script. `src` is the script, as a string." import pdb # Note that tempfile.NameTemporaryFile() cannot be used. As the # docs say, a file so created cannot be opened by name a second time # on modern Windows boxes, and execfile() needs to open it. srcfilename = tempfile.mktemp(".py", "doctestdebug") f = open(srcfilename, 'w') f.write(src) f.close() try: if globs: globs = globs.copy() else: globs = {} if pm: try: execfile(srcfilename, globs, globs) except: print sys.exc_info()[1] pdb.post_mortem(sys.exc_info()[2]) else: # Note that %r is vital here. '%s' instead can, e.g., cause # backslashes to get treated as metacharacters on Windows. pdb.run("execfile(%r)" % srcfilename, globs, globs) finally: os.remove(srcfilename) def debug(module, name, pm=False): """Debug a single doctest docstring. Provide the module (or dotted name of the module) containing the test to be debugged and the name (within the module) of the object with the docstring with tests to be debugged. """ module = _normalize_module(module) testsrc = testsource(module, name) debug_script(testsrc, pm, module.__dict__) ###################################################################### ## 10. Example Usage ###################################################################### class _TestClass: """ A pointless class, for sanity-checking of docstring testing. 
    Methods:
        square()
        get()

    >>> _TestClass(13).get() + _TestClass(-12).get()
    1
    >>> hex(_TestClass(13).square().get())
    '0xa9'
    """

    def __init__(self, val):
        """val -> _TestClass object with associated value val.

        >>> t = _TestClass(123)
        >>> print t.get()
        123
        """
        self.val = val

    def square(self):
        """square() -> square TestClass's associated value

        >>> _TestClass(13).square().get()
        169
        """
        self.val = self.val ** 2
        return self

    def get(self):
        """get() -> return TestClass's associated value.

        >>> x = _TestClass(-42)
        >>> print x.get()
        -42
        """
        return self.val

__test__ = {"_TestClass": _TestClass,
            "string": r"""
                      Example of a string object, searched as-is.
                      >>> x = 1; y = 2
                      >>> x + y, x * y
                      (3, 2)
                      """,
            "bool-int equivalence": r"""
                                    In 2.2, boolean expressions displayed
                                    0 or 1.  By default, we still accept
                                    them.  This can be disabled by passing
                                    DONT_ACCEPT_TRUE_FOR_1 to the new
                                    optionflags argument.
                                    >>> 4 == 4
                                    1
                                    >>> 4 == 4
                                    True
                                    >>> 4 > 4
                                    0
                                    >>> 4 > 4
                                    False
                                    """,
            "blank lines": r"""
                Blank lines can be marked with <BLANKLINE>:
                    >>> print 'foo\n\nbar\n'
                    foo
                    <BLANKLINE>
                    bar
                    <BLANKLINE>
                """,
            "ellipsis": r"""
                If the ellipsis flag is used, then '...' can be used to
                elide substrings in the desired output:
                    >>> print range(1000) #doctest: +ELLIPSIS
                    [0, 1, 2, ..., 999]
                """,
            "whitespace normalization": r"""
                If the whitespace normalization flag is used, then
                differences in whitespace are ignored.
                    >>> print range(30) #doctest: +NORMALIZE_WHITESPACE
                    [0,  1,  2,  3,  4,  5,  6,  7,  8,  9,
                    10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                    20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
                """,
           }

def _test():
    r = unittest.TextTestRunner()
    r.run(DocTestSuite())

if __name__ == "__main__":
    _test()

mechanize-0.2.5/test-tools/cookietest.cgi0000755000175000017500000000334311545150644017124 0ustar  johnjohn#!/usr/bin/python
# -*-python-*-

# This is used by functional_tests.py

#import cgitb; cgitb.enable()

import time

print "Content-Type: text/html"
year_plus_one = time.localtime(time.time())[0] + 1
expires = "expires=09-Nov-%d 23:12:40 GMT" % (year_plus_one,)
print "Set-Cookie: foo=bar; %s" % expires
print "Set-Cookie: sessioncookie=spam\n"
import sys, os, string, cgi, Cookie, urllib
from xml.sax import saxutils
from types import ListType

print "Cookies and form submission parameters"
cookie = Cookie.SimpleCookie()
cookieHdr = os.environ.get("HTTP_COOKIE", "")
cookie.load(cookieHdr)
form = cgi.FieldStorage()
refresh_value = None
if form.has_key("refresh"):
    refresh = form["refresh"]
    if not isinstance(refresh, ListType):
        refresh_value = refresh.value
if refresh_value is not None:
    print '' % (
        saxutils.quoteattr(urllib.unquote_plus(refresh_value)))
elif not cookie.has_key("foo"):
    print ''
print ""
print "Received cookies:"
print ""
print cgi.escape(os.environ.get("HTTP_COOKIE", ""))
print ""
if cookie.has_key("foo"):
    print "Your browser supports cookies!"
if cookie.has_key("sessioncookie"):
    print "Received session cookie"
print "Referer:"
print ""
print cgi.escape(os.environ.get("HTTP_REFERER", ""))
print ""
print "Received parameters:"
print ""
for k in form.keys():
    v = form[k]
    if isinstance(v, ListType):
        vs = []
        for item in v:
            vs.append(item.value)
        text = string.join(vs, ", ")
    else:
        text = v.value
    print "%s: %s" % (cgi.escape(k), cgi.escape(text))
print "
" mechanize-0.2.5/mechanize.egg-info/0000755000175000017500000000000011545173600015576 5ustar johnjohnmechanize-0.2.5/mechanize.egg-info/top_level.txt0000644000175000017500000000001211545173600020321 0ustar johnjohnmechanize mechanize-0.2.5/mechanize.egg-info/dependency_links.txt0000644000175000017500000000000111545173600021644 0ustar johnjohn mechanize-0.2.5/mechanize.egg-info/SOURCES.txt0000644000175000017500000000612611545173600017467 0ustar johnjohnCOPYING.txt INSTALL.txt MANIFEST.in README.txt ez_setup.py release.py setup.cfg setup.py test.py docs/development.txt docs/doc.txt docs/documentation.txt docs/download.txt docs/faq.txt docs/forms.txt docs/hints.txt docs/index.txt docs/support.txt docs/html/ChangeLog.txt docs/html/development.html docs/html/doc.html docs/html/documentation.html docs/html/download.html docs/html/faq.html docs/html/forms.html docs/html/hints.html docs/html/index.html docs/html/support.html docs/styles/ie6.js docs/styles/maxwidth.css docs/styles/style.css examples/hack21.py examples/pypi.py examples/forms/data.dat examples/forms/data.txt examples/forms/echo.cgi examples/forms/example.html examples/forms/example.py examples/forms/simple.py mechanize/__init__.py mechanize/_auth.py mechanize/_beautifulsoup.py mechanize/_clientcookie.py mechanize/_debug.py mechanize/_firefox3cookiejar.py mechanize/_form.py mechanize/_gzip.py mechanize/_headersutil.py mechanize/_html.py mechanize/_http.py mechanize/_lwpcookiejar.py mechanize/_markupbase.py mechanize/_mechanize.py mechanize/_mozillacookiejar.py mechanize/_msiecookiejar.py mechanize/_opener.py mechanize/_pullparser.py mechanize/_request.py mechanize/_response.py mechanize/_rfc3986.py mechanize/_sgmllib_copy.py mechanize/_sockettimeout.py mechanize/_testcase.py mechanize/_urllib2.py mechanize/_urllib2_fork.py mechanize/_useragent.py mechanize/_util.py mechanize/_version.py mechanize.egg-info/PKG-INFO mechanize.egg-info/SOURCES.txt mechanize.egg-info/dependency_links.txt 
mechanize.egg-info/top_level.txt mechanize.egg-info/zip-safe test/__init__.py test/test_api.py test/test_browser.doctest test/test_browser.py test/test_cookie.py test/test_cookies.py test/test_date.py test/test_form.py test/test_form_mutation.py test/test_forms.doctest test/test_functional.py test/test_headers.py test/test_history.doctest test/test_html.doctest test/test_html.py test/test_import.py test/test_opener.doctest test/test_opener.py test/test_password_manager.special_doctest test/test_performance.py test/test_pickle.py test/test_pullparser.py test/test_request.doctest test/test_response.doctest test/test_response.py test/test_rfc3986.doctest test/test_robotfileparser.doctest test/test_unittest.py test/test_urllib2.py test/test_urllib2_localnet.py test/test_useragent.py test-tools/cookietest.cgi test-tools/doctest.py test-tools/functools_copy.py test-tools/linecache_copy.py test-tools/testprogram.py test-tools/twisted-ftpserver.py test-tools/twisted-localserver.py test-tools/unittest/__init__.py test-tools/unittest/__main__.py test-tools/unittest/case.py test-tools/unittest/loader.py test-tools/unittest/main.py test-tools/unittest/result.py test-tools/unittest/runner.py test-tools/unittest/suite.py test-tools/unittest/util.py test/functional_tests_golden/FormsExamplesTests.test_example/output test/functional_tests_golden/FormsExamplesTests.test_simple/output test/test_form_data/Auth.html test/test_form_data/FullSearch.html test/test_form_data/GeneralSearch.html test/test_form_data/MarkedRecords.html test/test_form_data/MarkedResults.html test/test_form_data/Results.html test/test_form_data/SearchType.htmlmechanize-0.2.5/mechanize.egg-info/PKG-INFO0000644000175000017500000000577411545173600016710 0ustar johnjohnMetadata-Version: 1.0 Name: mechanize Version: 0.2.5 Summary: Stateful programmatic web browsing. Home-page: http://wwwsearch.sourceforge.net/mechanize/ Author: John J. 
Lee Author-email: jjl@pobox.com License: BSD Download-URL: http://pypi.python.org/packages/source/m/mechanize/mechanize-0.2.5.tar.gz Description: Stateful programmatic web browsing, after Andy Lester's Perl module WWW::Mechanize. mechanize.Browser implements the urllib2.OpenerDirector interface. Browser objects have state, including navigation history, HTML form state, cookies, etc. The set of features and URL schemes handled by Browser objects is configurable. The library also provides an API that is mostly compatible with urllib2: your urllib2 program will likely still work if you replace "urllib2" with "mechanize" everywhere. Features include: ftp:, http: and file: URL schemes, browser history, hyperlink and HTML form support, HTTP cookies, HTTP-EQUIV and Refresh, Referer [sic] header, robots.txt, redirections, proxies, and Basic and Digest HTTP authentication. Much of the code originally derived from Perl code by Gisle Aas (libwww-perl), Johnny Lee (MSIE Cookie support) and last but not least Andy Lester (WWW::Mechanize). urllib2 was written by Jeremy Hylton. 
Platform: any Classifier: Development Status :: 5 - Production/Stable Classifier: Intended Audience :: Developers Classifier: Intended Audience :: System Administrators Classifier: License :: OSI Approved :: BSD License Classifier: License :: OSI Approved :: Zope Public License Classifier: Natural Language :: English Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 2 Classifier: Programming Language :: Python :: 2.4 Classifier: Programming Language :: Python :: 2.5 Classifier: Programming Language :: Python :: 2.6 Classifier: Programming Language :: Python :: 2.7 Classifier: Topic :: Internet Classifier: Topic :: Internet :: File Transfer Protocol (FTP) Classifier: Topic :: Internet :: WWW/HTTP Classifier: Topic :: Internet :: WWW/HTTP :: Browsers Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search Classifier: Topic :: Internet :: WWW/HTTP :: Site Management Classifier: Topic :: Internet :: WWW/HTTP :: Site Management :: Link Checking Classifier: Topic :: Software Development :: Libraries Classifier: Topic :: Software Development :: Libraries :: Python Modules Classifier: Topic :: Software Development :: Testing Classifier: Topic :: Software Development :: Testing :: Traffic Generation Classifier: Topic :: System :: Archiving :: Mirroring Classifier: Topic :: System :: Networking :: Monitoring Classifier: Topic :: System :: Systems Administration Classifier: Topic :: Text Processing Classifier: Topic :: Text Processing :: Markup Classifier: Topic :: Text Processing :: Markup :: HTML Classifier: Topic :: Text Processing :: Markup :: XML mechanize-0.2.5/mechanize.egg-info/zip-safe0000644000175000017500000000000111545150657017235 0ustar johnjohn mechanize-0.2.5/test/0000755000175000017500000000000011545173600013120 5ustar johnjohnmechanize-0.2.5/test/test_pickle.py0000644000175000017500000000202211545150644015777 0ustar johnjohnimport cPickle import cStringIO as StringIO 
import pickle

import mechanize
import mechanize._response
import mechanize._testcase


def pickle_and_unpickle(obj, implementation):
    return implementation.loads(implementation.dumps(obj))


def test_pickling(obj, check=lambda unpickled: None):
    check(pickle_and_unpickle(obj, cPickle))
    check(pickle_and_unpickle(obj, pickle))


class PickleTest(mechanize._testcase.TestCase):

    def test_pickle_cookie(self):
        cookiejar = mechanize.CookieJar()
        url = "http://example.com/"
        request = mechanize.Request(url)
        response = mechanize._response.test_response(
            headers=[("Set-Cookie", "spam=eggs")], url=url)
        [cookie] = cookiejar.make_cookies(response, request)
        check_equality = lambda unpickled: self.assertEqual(unpickled, cookie)
        test_pickling(cookie, check_equality)

    def test_pickle_cookiejar(self):
        test_pickling(mechanize.CookieJar())


if __name__ == "__main__":
    mechanize._testcase.main()

mechanize-0.2.5/test/test_useragent.py

#!/usr/bin/env python

from unittest import TestCase

import mechanize
from test_browser import make_mock_handler


class UserAgentTests(TestCase):

    def _get_handler_from_ua(self, ua, name):
        handler = ua._ua_handlers.get(name)
        self.assertTrue(handler in ua.handlers)
        return handler

    def test_set_proxies(self):
        ua = mechanize.UserAgentBase()
        def proxy_bypass(hostname):
            return False
        proxies = {"http": "http://spam"}
        ua.set_proxies(proxies, proxy_bypass)
        proxy_handler = self._get_handler_from_ua(ua, "_proxy")
        self.assertTrue(proxy_handler._proxy_bypass is proxy_bypass)
        # assertTrue(a, b) treats b as a failure message, so the original
        # assertion here could never fail; assertEqual is what was intended
        self.assertEqual(proxy_handler.proxies, proxies)

    def test_set_handled_schemes(self):
        class MockHandlerClass(make_mock_handler()):
            def __call__(self):
                return self
        class BlahHandlerClass(MockHandlerClass):
            pass
        class BlahProcessorClass(MockHandlerClass):
            pass
        BlahHandler = BlahHandlerClass([("blah_open", None)])
        BlahProcessor = BlahProcessorClass([("blah_request", None)])
        class TestUserAgent(mechanize.UserAgent):
            default_schemes = ["http"]
            default_others = []
            default_features = []
            handler_classes = mechanize.UserAgent.handler_classes.copy()
            handler_classes.update(
                {"blah": BlahHandler, "_blah": BlahProcessor})
        ua = TestUserAgent()

        self.assertEqual(list(h.__class__.__name__ for h in ua.handlers),
                         ["HTTPHandler"])
        ua.set_handled_schemes(["http", "file"])
        self.assertEqual(sorted(h.__class__.__name__ for h in ua.handlers),
                         ["FileHandler", "HTTPHandler"])
        self.assertRaises(ValueError, ua.set_handled_schemes,
                          ["blah", "non-existent"])
        self.assertRaises(ValueError, ua.set_handled_schemes,
                          ["blah", "_blah"])
        ua.set_handled_schemes(["blah"])

        req = mechanize.Request("blah://example.com/")
        r = ua.open(req)
        exp_calls = [("blah_open", (req,), {})]
        assert len(ua.calls) == len(exp_calls)
        for got, expect in zip(ua.calls, exp_calls):
            self.assertEqual(expect, got[1:])

        ua.calls = []
        req = mechanize.Request("blah://example.com/")
        ua._set_handler("_blah", True)
        r = ua.open(req)
        exp_calls = [
            ("blah_request", (req,), {}),
            ("blah_open", (req,), {})]
        assert len(ua.calls) == len(exp_calls)
        for got, expect in zip(ua.calls, exp_calls):
            self.assertEqual(expect, got[1:])
        ua._set_handler("_blah", True)


if __name__ == "__main__":
    import unittest
    unittest.main()

mechanize-0.2.5/test/test_import.py

import unittest

import mechanize
from mechanize._testcase import TestCase


class ImportTests(TestCase):

    def test_import_all(self):
        for name in mechanize.__all__:
            exec "from mechanize import %s" % name


if __name__ == "__main__":
    unittest.main()

mechanize-0.2.5/test/test_urllib2.py

"""Tests for urllib2-level functionality.

This is urllib2's tests (most of which came from mechanize originally), plus
some extra tests added, and modifications from bug fixes and feature additions
to mechanize.
""" # TODO: # Request # CacheFTPHandler (hard to write) # parse_keqv_list, parse_http_list import StringIO import httplib import os import sys import unittest import mechanize from mechanize._http import parse_head from mechanize._response import test_response from mechanize import HTTPRedirectHandler, \ HTTPEquivProcessor, HTTPRefreshProcessor, \ HTTPCookieProcessor, HTTPRefererProcessor, \ HTTPErrorProcessor, HTTPHandler from mechanize import OpenerDirector, build_opener, Request from mechanize._urllib2_fork import AbstractHTTPHandler from mechanize._util import write_file import mechanize._response import mechanize._sockettimeout as _sockettimeout import mechanize._testcase import mechanize._urllib2_fork ## from logging import getLogger, DEBUG ## l = getLogger("mechanize") ## l.setLevel(DEBUG) class AlwaysEqual: def __cmp__(self, other): return 0 class TrivialTests(mechanize._testcase.TestCase): def test_trivial(self): # A couple trivial tests self.assertRaises(ValueError, mechanize.urlopen, 'bogus url') fname = os.path.join(self.make_temp_dir(), "test.txt") write_file(fname, "data") if fname[1:2] == ":": fname = fname[2:] # And more hacking to get it to work on MacOS. This assumes # urllib.pathname2url works, unfortunately... 
if os.name == 'mac': fname = '/' + fname.replace(':', '/') elif os.name == 'riscos': import string fname = os.expand(fname) fname = fname.translate(string.maketrans("/.", "./")) file_url = "file://%s" % fname f = mechanize.urlopen(file_url) buf = f.read() f.close() def test_parse_http_list(self): tests = [('a,b,c', ['a', 'b', 'c']), ('path"o,l"og"i"cal, example', ['path"o,l"og"i"cal', 'example']), ('a, b, "c", "d", "e,f", g, h', ['a', 'b', '"c"', '"d"', '"e,f"', 'g', 'h']), ('a="b\\"c", d="e\\,f", g="h\\\\i"', ['a="b"c"', 'd="e,f"', 'g="h\\i"'])] for string, list in tests: self.assertEquals(mechanize._urllib2_fork.parse_http_list(string), list) def test_request_headers_dict(): """ The Request.headers dictionary is not a documented interface. It should stay that way, because the complete set of headers are only accessible through the .get_header(), .has_header(), .header_items() interface. However, .headers pre-dates those methods, and so real code will be using the dictionary. The introduction in 2.4 of those methods was a mistake for the same reason: code that previously saw all (urllib2 user)-provided headers in .headers now sees only a subset (and the function interface is ugly and incomplete). A better change would have been to replace .headers dict with a dict subclass (or UserDict.DictMixin instance?) that preserved the .headers interface and also provided access to the "unredirected" headers. It's probably too late to fix that, though. Check .capitalize() case normalization: >>> url = "http://example.com" >>> Request(url, headers={"Spam-eggs": "blah"}).headers["Spam-eggs"] 'blah' >>> Request(url, headers={"spam-EggS": "blah"}).headers["Spam-eggs"] 'blah' Currently, Request(url, "Spam-eggs").headers["Spam-Eggs"] raises KeyError, but that could be changed in future. """ def test_request_headers_methods(): """ Note the case normalization of header names here, to .capitalize()-case. This should be preserved for backwards-compatibility. 
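As a side note, the `.capitalize()`-case normalization described above can be shown with plain `str` methods, independent of mechanize (`normalize_header` is a hypothetical helper used only for illustration):

```python
# Hypothetical helper mirroring the .capitalize()-case normalization of
# header names: first character upper-cased, the rest lower-cased.
def normalize_header(name):
    return name.capitalize()

# "spam-EggS" and "Spam-eggs" collapse onto the same dictionary key,
# whereas .title()-case (applied by urllib2 just before handing headers
# to httplib) capitalizes each hyphen-separated word.
print(normalize_header("spam-EggS"))   # Spam-eggs
print("spam-eggs".title())             # Spam-Eggs
```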
(In the HTTP case, normalization to .title()-case is done by urllib2 before sending headers to httplib). >>> url = "http://example.com" >>> r = Request(url, headers={"Spam-eggs": "blah"}) >>> r.has_header("Spam-eggs") True >>> r.header_items() [('Spam-eggs', 'blah')] >>> r.add_header("Foo-Bar", "baz") >>> items = r.header_items() >>> items.sort() >>> items [('Foo-bar', 'baz'), ('Spam-eggs', 'blah')] Note that e.g. r.has_header("spam-EggS") is currently False, and r.get_header("spam-EggS") returns None, but that could be changed in future. >>> r.has_header("Not-there") False >>> print r.get_header("Not-there") None >>> r.get_header("Not-there", "default") 'default' """ def test_password_manager(self): """ >>> mgr = mechanize.HTTPPasswordMgr() >>> add = mgr.add_password >>> add("Some Realm", "http://example.com/", "joe", "password") >>> add("Some Realm", "http://example.com/ni", "ni", "ni") >>> add("c", "http://example.com/foo", "foo", "ni") >>> add("c", "http://example.com/bar", "bar", "nini") >>> add("b", "http://example.com/", "first", "blah") >>> add("b", "http://example.com/", "second", "spam") >>> add("a", "http://example.com", "1", "a") >>> add("Some Realm", "http://c.example.com:3128", "3", "c") >>> add("Some Realm", "d.example.com", "4", "d") >>> add("Some Realm", "e.example.com:3128", "5", "e") >>> mgr.find_user_password("Some Realm", "example.com") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com/") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com/spam") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com/spam/spam") ('joe', 'password') >>> mgr.find_user_password("c", "http://example.com/foo") ('foo', 'ni') >>> mgr.find_user_password("c", "http://example.com/bar") ('bar', 'nini') Actually, this is really undefined ATM ## Currently, we use the highest-level path where more than one 
match: ## >>> mgr.find_user_password("Some Realm", "http://example.com/ni") ## ('joe', 'password') Use latest add_password() in case of conflict: >>> mgr.find_user_password("b", "http://example.com/") ('second', 'spam') No special relationship between a.example.com and example.com: >>> mgr.find_user_password("a", "http://example.com/") ('1', 'a') >>> mgr.find_user_password("a", "http://a.example.com/") (None, None) Ports: >>> mgr.find_user_password("Some Realm", "c.example.com") (None, None) >>> mgr.find_user_password("Some Realm", "c.example.com:3128") ('3', 'c') >>> mgr.find_user_password("Some Realm", "http://c.example.com:3128") ('3', 'c') >>> mgr.find_user_password("Some Realm", "d.example.com") ('4', 'd') >>> mgr.find_user_password("Some Realm", "e.example.com:3128") ('5', 'e') """ pass def test_password_manager_default_port(self): """ >>> mgr = mechanize.HTTPPasswordMgr() >>> add = mgr.add_password The point to note here is that we can't guess the default port if there's no scheme. This applies to both add_password and find_user_password. 
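The no-scheme rule above can be sketched independently of mechanize; this is a minimal illustration (the names `effective_host_port` and `DEFAULT_PORTS` are made up for the example, not part of the library):

```python
# Minimal sketch of why a default port can only be assumed when a
# scheme is present: with a bare authority string there is no scheme
# from which to look one up.
try:
    from urllib.parse import urlsplit   # Python 3
except ImportError:
    from urlparse import urlsplit       # Python 2

DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_host_port(authority_or_url):
    """Return (host, port-or-None) for a URL or bare authority string."""
    if "://" in authority_or_url:
        parts = urlsplit(authority_or_url)
        scheme, netloc = parts.scheme, parts.netloc
    else:
        scheme, netloc = None, authority_or_url
    host, _, port = netloc.partition(":")
    if port:
        return host, int(port)
    # No explicit port: only a known scheme lets us fill in a default.
    return host, DEFAULT_PORTS.get(scheme)

print(effective_host_port("http://g.example.com"))  # ('g.example.com', 80)
print(effective_host_port("j.example.com"))         # ('j.example.com', None)
```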
>>> add("f", "http://g.example.com:80", "10", "j") >>> add("g", "http://h.example.com", "11", "k") >>> add("h", "i.example.com:80", "12", "l") >>> add("i", "j.example.com", "13", "m") >>> mgr.find_user_password("f", "g.example.com:100") (None, None) >>> mgr.find_user_password("f", "g.example.com:80") ('10', 'j') >>> mgr.find_user_password("f", "g.example.com") (None, None) >>> mgr.find_user_password("f", "http://g.example.com:100") (None, None) >>> mgr.find_user_password("f", "http://g.example.com:80") ('10', 'j') >>> mgr.find_user_password("f", "http://g.example.com") ('10', 'j') >>> mgr.find_user_password("g", "h.example.com") ('11', 'k') >>> mgr.find_user_password("g", "h.example.com:80") ('11', 'k') >>> mgr.find_user_password("g", "http://h.example.com:80") ('11', 'k') >>> mgr.find_user_password("h", "i.example.com") (None, None) >>> mgr.find_user_password("h", "i.example.com:80") ('12', 'l') >>> mgr.find_user_password("h", "http://i.example.com:80") ('12', 'l') >>> mgr.find_user_password("i", "j.example.com") ('13', 'm') >>> mgr.find_user_password("i", "j.example.com:80") (None, None) >>> mgr.find_user_password("i", "http://j.example.com") ('13', 'm') >>> mgr.find_user_password("i", "http://j.example.com:80") (None, None) """ class MockOpener: addheaders = [] def open(self, req, data=None, timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT): self.req, self.data, self.timeout = req, data, timeout def error(self, proto, *args): self.proto, self.args = proto, args class MockFile: def read(self, count=None): pass def readline(self, count=None): pass def close(self): pass def http_message(mapping): """ >>> http_message({"Content-Type": "text/html"}).items() [('content-type', 'text/html')] """ f = [] for kv in mapping.items(): f.append("%s: %s" % kv) f.append("") msg = httplib.HTTPMessage(StringIO.StringIO("\r\n".join(f))) return msg class MockResponse(StringIO.StringIO): def __init__(self, code, msg, headers, data, url=None): StringIO.StringIO.__init__(self, data) 
self.code, self.msg, self.headers, self.url = code, msg, headers, url def info(self): return self.headers def geturl(self): return self.url class MockCookieJar: def add_cookie_header(self, request, unverifiable=False): self.ach_req, self.ach_u = request, unverifiable def extract_cookies(self, response, request, unverifiable=False): self.ec_req, self.ec_r, self.ec_u = request, response, unverifiable class FakeMethod: def __init__(self, meth_name, action, handle): self.meth_name = meth_name self.handle = handle self.action = action def __call__(self, *args): return self.handle(self.meth_name, self.action, *args) class MockHandler: # useful for testing handler machinery # see add_ordered_mock_handlers() docstring handler_order = 500 def __init__(self, methods): self._define_methods(methods) def _define_methods(self, methods): for spec in methods: if len(spec) == 2: name, action = spec else: name, action = spec, None meth = FakeMethod(name, action, self.handle) setattr(self.__class__, name, meth) def handle(self, fn_name, action, *args, **kwds): self.parent.calls.append((self, fn_name, args, kwds)) if action is None: return None elif action == "return self": return self elif action == "return response": res = MockResponse(200, "OK", {}, "") return res elif action == "return request": return Request("http://blah/") elif action.startswith("error"): code = action[action.rfind(" ")+1:] try: code = int(code) except ValueError: pass res = MockResponse(200, "OK", {}, "") return self.parent.error("http", args[0], res, code, "", {}) elif action == "raise": raise mechanize.URLError("blah") assert False def close(self): pass def add_parent(self, parent): self.parent = parent self.parent.calls = [] def __lt__(self, other): if not hasattr(other, "handler_order"): # Try to preserve the old behavior of having custom classes # inserted after default ones (works only for custom user # classes which are not aware of handler_order). 
return True return self.handler_order < other.handler_order def add_ordered_mock_handlers(opener, meth_spec): """Create MockHandlers and add them to an OpenerDirector. meth_spec: list of lists of tuples and strings defining methods to define on handlers. eg: [["http_error", "ftp_open"], ["http_open"]] defines methods .http_error() and .ftp_open() on one handler, and .http_open() on another. These methods just record their arguments and return None. Using a tuple instead of a string causes the method to perform some action (see MockHandler.handle()), eg: [["http_error"], [("http_open", "return request")]] defines .http_error() on one handler (which simply returns None), and .http_open() on another handler, which returns a Request object. """ handlers = [] count = 0 for meths in meth_spec: class MockHandlerSubclass(MockHandler): pass h = MockHandlerSubclass(meths) h.handler_order += count h.add_parent(opener) count = count + 1 handlers.append(h) opener.add_handler(h) return handlers def build_test_opener(*handler_instances): opener = OpenerDirector() for h in handler_instances: opener.add_handler(h) return opener class MockHTTPHandler(mechanize.BaseHandler): # useful for testing redirections and auth # sends supplied headers and code as first response # sends 200 OK as second response def __init__(self, code, headers): self.code = code self.headers = headers self.reset() def reset(self): self._count = 0 self.requests = [] def http_open(self, req): import mimetools, copy from StringIO import StringIO self.requests.append(copy.deepcopy(req)) if self._count == 0: self._count = self._count + 1 name = "Not important" msg = mimetools.Message(StringIO(self.headers)) return self.parent.error( "http", req, test_response(), self.code, name, msg) else: self.req = req return test_response("", [], req.get_full_url()) class MockPasswordManager: def add_password(self, realm, uri, user, password): self.realm = realm self.url = uri self.user = user self.password = password def 
find_user_password(self, realm, authuri): self.target_realm = realm self.target_url = authuri return self.user, self.password class OpenerDirectorTests(unittest.TestCase): def test_add_non_handler(self): class NonHandler(object): pass self.assertRaises(TypeError, OpenerDirector().add_handler, NonHandler()) def test_badly_named_methods(self): # test work-around for three methods that accidentally follow the # naming conventions for handler methods # (*_open() / *_request() / *_response()) # These used to call the accidentally-named methods, causing a # TypeError in real code; here, returning self from these mock # methods would either cause no exception, or AttributeError. from mechanize import URLError o = OpenerDirector() meth_spec = [ [("do_open", "return self"), ("proxy_open", "return self")], [("redirect_request", "return self")], ] handlers = add_ordered_mock_handlers(o, meth_spec) o.add_handler(mechanize.UnknownHandler()) for scheme in "do", "proxy", "redirect": self.assertRaises(URLError, o.open, scheme+"://example.com/") def test_handled(self): # handler returning non-None means no more handlers will be called o = OpenerDirector() meth_spec = [ ["http_open", "ftp_open", "http_error_302"], ["ftp_open"], [("http_open", "return self")], [("http_open", "return self")], ] handlers = add_ordered_mock_handlers(o, meth_spec) req = Request("http://example.com/") r = o.open(req) # Second .http_open() gets called, third doesn't, since second returned # non-None. Handlers without .http_open() never get any methods called # on them. # In fact, second mock handler defining .http_open() returns self # (instead of response), which becomes the OpenerDirector's return # value. 
self.assertEqual(r, handlers[2]) calls = [(handlers[0], "http_open"), (handlers[2], "http_open")] for expected, got in zip(calls, o.calls): handler, name, args, kwds = got self.assertEqual((handler, name), expected) self.assertEqual(args, (req,)) def test_reindex_handlers(self): o = OpenerDirector() class MockHandler: def add_parent(self, parent): pass def close(self):pass def __lt__(self, other): return self.handler_order < other.handler_order # this first class is here as an obscure regression test for bug # encountered during development: if something manages to get through # to _maybe_reindex_handlers, make sure it's properly removed and # doesn't affect adding of subsequent handlers class NonHandler(MockHandler): handler_order = 1 class Handler(MockHandler): handler_order = 2 def http_open(self): pass class Processor(MockHandler): handler_order = 3 def any_response(self): pass def http_response(self): pass o.add_handler(NonHandler()) h = Handler() o.add_handler(h) p = Processor() o.add_handler(p) o._maybe_reindex_handlers() self.assertEqual(o.handle_open, {"http": [h]}) self.assertEqual(len(o.process_response.keys()), 1) self.assertEqual(list(o.process_response["http"]), [p]) self.assertEqual(list(o._any_response), [p]) self.assertEqual(o.handlers, [h, p]) def test_handler_order(self): o = OpenerDirector() handlers = [] for meths, handler_order in [ ([("http_open", "return self")], 500), (["http_open"], 0), ]: class MockHandlerSubclass(MockHandler): pass h = MockHandlerSubclass(meths) h.handler_order = handler_order handlers.append(h) o.add_handler(h) r = o.open("http://example.com/") # handlers called in reverse order, thanks to their sort order self.assertEqual(o.calls[0][0], handlers[1]) self.assertEqual(o.calls[1][0], handlers[0]) def test_raise(self): # raising URLError stops processing of request o = OpenerDirector() meth_spec = [ [("http_open", "raise")], [("http_open", "return self")], ] handlers = add_ordered_mock_handlers(o, meth_spec) req = 
Request("http://example.com/") self.assertRaises(mechanize.URLError, o.open, req) self.assertEqual(o.calls, [(handlers[0], "http_open", (req,), {})]) ## def test_error(self): ## # XXX this doesn't actually seem to be used in standard library, ## # but should really be tested anyway... def test_http_error(self): # XXX http_error_default # http errors are a special case o = OpenerDirector() meth_spec = [ [("http_open", "error 302")], [("http_error_400", "raise"), "http_open"], [("http_error_302", "return response"), "http_error_303", "http_error"], [("http_error_302")], ] handlers = add_ordered_mock_handlers(o, meth_spec) req = Request("http://example.com/") r = o.open(req) assert len(o.calls) == 2 calls = [(handlers[0], "http_open", (req,)), (handlers[2], "http_error_302", (req, AlwaysEqual(), 302, "", {}))] for expected, got in zip(calls, o.calls): handler, method_name, args = expected self.assertEqual((handler, method_name), got[:2]) self.assertEqual(args, got[2]) def test_http_error_raised(self): # should get an HTTPError if an HTTP handler raises a non-200 response # XXX it worries me that this is the only test that excercises the else # branch in HTTPDefaultErrorHandler from mechanize import _response o = mechanize.OpenerDirector() o.add_handler(mechanize.HTTPErrorProcessor()) o.add_handler(mechanize.HTTPDefaultErrorHandler()) class HTTPHandler(AbstractHTTPHandler): def http_open(self, req): return _response.test_response(code=302) o.add_handler(HTTPHandler()) self.assertRaises(mechanize.HTTPError, o.open, "http://example.com/") def test_processors(self): # *_request / *_response methods get called appropriately o = OpenerDirector() meth_spec = [ [("http_request", "return request"), ("http_response", "return response")], [("http_request", "return request"), ("http_response", "return response")], ] handlers = add_ordered_mock_handlers(o, meth_spec) req = Request("http://example.com/") r = o.open(req) # processor methods are called on *all* handlers that define 
them, # not just the first handler that handles the request calls = [ (handlers[0], "http_request"), (handlers[1], "http_request"), (handlers[0], "http_response"), (handlers[1], "http_response")] self.assertEqual(len(o.calls), len(calls)) for i, (handler, name, args, kwds) in enumerate(o.calls): if i < 2: # *_request self.assertEqual((handler, name), calls[i]) self.assertEqual(len(args), 1) self.assertTrue(isinstance(args[0], Request)) else: # *_response self.assertEqual((handler, name), calls[i]) self.assertEqual(len(args), 2) self.assertTrue(isinstance(args[0], Request)) # response from opener.open is None, because there's no # handler that defines http_open to handle it self.assertTrue(args[1] is None or isinstance(args[1], MockResponse)) def test_any(self): # XXXXX two handlers case: ordering o = OpenerDirector() meth_spec = [[ ("http_request", "return request"), ("http_response", "return response"), ("ftp_request", "return request"), ("ftp_response", "return response"), ("any_request", "return request"), ("any_response", "return response"), ]] handlers = add_ordered_mock_handlers(o, meth_spec) handler = handlers[0] for scheme in ["http", "ftp"]: o.calls = [] req = Request("%s://example.com/" % scheme) r = o.open(req) calls = [(handler, "any_request"), (handler, ("%s_request" % scheme)), (handler, "any_response"), (handler, ("%s_response" % scheme)), ] self.assertEqual(len(o.calls), len(calls)) for i, ((handler, name, args, kwds), calls) in ( enumerate(zip(o.calls, calls))): if i < 2: # *_request self.assert_((handler, name) == calls) self.assert_(len(args) == 1) self.assert_(isinstance(args[0], Request)) else: # *_response self.assert_((handler, name) == calls) self.assert_(len(args) == 2) self.assert_(isinstance(args[0], Request)) # response from opener.open is None, because there's no # handler that defines http_open to handle it self.assert_(args[1] is None or isinstance(args[1], MockResponse)) def sanepathname2url(path): import urllib urlpath = 
urllib.pathname2url(path) if os.name == "nt" and urlpath.startswith("///"): urlpath = urlpath[2:] # XXX don't ask me about the mac... return urlpath class MockRobotFileParserClass: def __init__(self): self.calls = [] self._can_fetch = True def clear(self): self.calls = [] def __call__(self): self.calls.append("__call__") return self def set_url(self, url): self.calls.append(("set_url", url)) def set_timeout(self, timeout): self.calls.append(("set_timeout", timeout)) def set_opener(self, opener): self.calls.append(("set_opener", opener)) def read(self): self.calls.append("read") def can_fetch(self, ua, url): self.calls.append(("can_fetch", ua, url)) return self._can_fetch class MockPasswordManager: def add_password(self, realm, uri, user, password): self.realm = realm self.url = uri self.user = user self.password = password def find_user_password(self, realm, authuri): self.target_realm = realm self.target_url = authuri return self.user, self.password class HandlerTests(mechanize._testcase.TestCase): def test_ftp(self): class MockFTPWrapper: def __init__(self, data): self.data = data def retrfile(self, filename, filetype): self.filename, self.filetype = filename, filetype return StringIO.StringIO(self.data), len(self.data) class NullFTPHandler(mechanize.FTPHandler): def __init__(self, data): self.data = data def connect_ftp(self, user, passwd, host, port, dirs, timeout): self.user, self.passwd = user, passwd self.host, self.port = host, port self.dirs = dirs self.timeout = timeout self.ftpwrapper = MockFTPWrapper(self.data) return self.ftpwrapper import ftplib, socket data = "rheum rhaponicum" h = NullFTPHandler(data) o = h.parent = MockOpener() for url, host, port, type_, dirs, timeout, filename, mimetype in [ ("ftp://localhost/foo/bar/baz.html", "localhost", ftplib.FTP_PORT, "I", ["foo", "bar"], _sockettimeout._GLOBAL_DEFAULT_TIMEOUT, "baz.html", "text/html"), ("ftp://localhost:80/foo/bar/", "localhost", 80, "D", ["foo", "bar"], 
_sockettimeout._GLOBAL_DEFAULT_TIMEOUT, "", None), ("ftp://localhost/baz.gif;type=a", "localhost", ftplib.FTP_PORT, "A", [], _sockettimeout._GLOBAL_DEFAULT_TIMEOUT, "baz.gif", None), # TODO: really this should guess image/gif ]: req = Request(url, timeout=timeout) r = h.ftp_open(req) # ftp authentication not yet implemented by FTPHandler self.assertTrue(h.user == h.passwd == "") self.assertEqual(h.host, socket.gethostbyname(host)) self.assertEqual(h.port, port) self.assertEqual(h.dirs, dirs) if sys.version_info >= (2, 6): self.assertEquals(h.timeout, timeout) self.assertEqual(h.ftpwrapper.filename, filename) self.assertEqual(h.ftpwrapper.filetype, type_) headers = r.info() self.assertEqual(headers.get("Content-type"), mimetype) self.assertEqual(int(headers["Content-length"]), len(data)) def test_file(self): import rfc822, socket h = mechanize.FileHandler() o = h.parent = MockOpener() temp_file = os.path.join(self.make_temp_dir(), "test.txt") urlpath = sanepathname2url(os.path.abspath(temp_file)) towrite = "hello, world\n" try: fqdn = socket.gethostbyname(socket.gethostname()) except socket.gaierror: fqdn = "localhost" for url in [ "file://localhost%s" % urlpath, "file://%s" % urlpath, "file://%s%s" % (socket.gethostbyname('localhost'), urlpath), "file://%s%s" % (fqdn, urlpath) ]: write_file(temp_file, towrite) r = h.file_open(Request(url)) try: data = r.read() headers = r.info() newurl = r.geturl() finally: r.close() stats = os.stat(temp_file) modified = rfc822.formatdate(stats.st_mtime) self.assertEqual(data, towrite) self.assertEqual(headers["Content-type"], "text/plain") self.assertEqual(headers["Content-length"], "13") self.assertEqual(headers["Last-modified"], modified) for url in [ "file://localhost:80%s" % urlpath, "file:///file_does_not_exist.txt", "file://%s:80%s/%s" % (socket.gethostbyname('localhost'), os.getcwd(), temp_file), "file://somerandomhost.ontheinternet.com%s/%s" % (os.getcwd(), temp_file), ]: write_file(temp_file, towrite) 
self.assertRaises(mechanize.URLError, h.file_open, Request(url)) h = mechanize.FileHandler() o = h.parent = MockOpener() # XXXX why does // mean ftp (and /// mean not ftp!), and where # is file: scheme specified? I think this is really a bug, and # what was intended was to distinguish between URLs like: # file:/blah.txt (a file) # file://localhost/blah.txt (a file) # file:///blah.txt (a file) # file://ftp.example.com/blah.txt (an ftp URL) for url, ftp in [ ("file://ftp.example.com//foo.txt", True), ("file://ftp.example.com///foo.txt", False), # XXXX bug: fails with OSError, should be URLError ("file://ftp.example.com/foo.txt", False), ]: req = Request(url) try: h.file_open(req) # XXXX remove OSError when bug fixed except (mechanize.URLError, OSError): self.assertFalse(ftp) else: self.assertTrue(o.req is req) self.assertEqual(req.type, "ftp") def test_http(self): class MockHTTPResponse: def __init__(self, fp, msg, status, reason): self.fp = fp self.msg = msg self.status = status self.reason = reason def read(self): return '' class MockHTTPClass: def __init__(self): self.req_headers = [] self.data = None self.raise_on_endheaders = False def __call__(self, host, timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT): self.host = host self.timeout = timeout return self def set_debuglevel(self, level): self.level = level def request(self, method, url, body=None, headers={}): self.method = method self.selector = url self.req_headers += headers.items() self.req_headers.sort() if body: self.data = body if self.raise_on_endheaders: import socket raise socket.error() def getresponse(self): return MockHTTPResponse(MockFile(), {}, 200, "OK") h = AbstractHTTPHandler() o = h.parent = MockOpener() url = "http://example.com/" for method, data in [("GET", None), ("POST", "blah")]: req = Request(url, data, {"Foo": "bar"}) req.add_unredirected_header("Spam", "eggs") http = MockHTTPClass() r = h.do_open(http, req) # result attributes r.read; r.readline # wrapped MockFile methods r.info; 
r.geturl # addinfourl methods r.code, r.msg == 200, "OK" # added from MockHTTPClass.getreply() hdrs = r.info() hdrs.get; hdrs.has_key # r.info() gives dict from .getreply() self.assertEqual(r.geturl(), url) self.assertEqual(http.host, "example.com") self.assertEqual(http.level, 0) self.assertEqual(http.method, method) self.assertEqual(http.selector, "/") self.assertEqual(http.req_headers, [("Connection", "close"), ("Foo", "bar"), ("Spam", "eggs")]) self.assertEqual(http.data, data) # check socket.error converted to URLError http.raise_on_endheaders = True self.assertRaises(mechanize.URLError, h.do_open, http, req) # check adding of standard headers o.addheaders = [("Spam", "eggs")] for data in "", None: # POST, GET req = Request("http://example.com/", data) r = MockResponse(200, "OK", {}, "") newreq = h.do_request_(req) if data is None: # GET self.assertTrue("Content-length" not in req.unredirected_hdrs) self.assertTrue("Content-type" not in req.unredirected_hdrs) else: # POST self.assertEqual(req.unredirected_hdrs["Content-length"], "0") self.assertEqual(req.unredirected_hdrs["Content-type"], "application/x-www-form-urlencoded") # XXX the details of Host could be better tested self.assertEqual(req.unredirected_hdrs["Host"], "example.com") self.assertEqual(req.unredirected_hdrs["Spam"], "eggs") # don't clobber existing headers req.add_unredirected_header("Content-length", "foo") req.add_unredirected_header("Content-type", "bar") req.add_unredirected_header("Host", "baz") req.add_unredirected_header("Spam", "foo") newreq = h.do_request_(req) self.assertEqual(req.unredirected_hdrs["Content-length"], "foo") self.assertEqual(req.unredirected_hdrs["Content-type"], "bar") self.assertEqual(req.unredirected_hdrs["Host"], "baz") self.assertEqual(req.unredirected_hdrs["Spam"], "foo") def test_http_double_slash(self): # Checks that the presence of an unnecessary double slash in a url # doesn't break anything Previously, a double slash directly after the # host could cause 
        # incorrect parsing of the url
        h = AbstractHTTPHandler()
        o = h.parent = MockOpener()
        data = ""
        ds_urls = [
            "http://example.com/foo/bar/baz.html",
            "http://example.com//foo/bar/baz.html",
            "http://example.com/foo//bar/baz.html",
            "http://example.com/foo/bar//baz.html",
            ]
        for ds_url in ds_urls:
            ds_req = Request(ds_url, data)
            # Check whether host is determined correctly if there is no proxy
            np_ds_req = h.do_request_(ds_req)
            self.assertEqual(np_ds_req.unredirected_hdrs["Host"],
                             "example.com")
            # Check whether host is determined correctly if there is a proxy
            ds_req.set_proxy("someproxy:3128", None)
            p_ds_req = h.do_request_(ds_req)
            self.assertEqual(p_ds_req.unredirected_hdrs["Host"],
                             "example.com")

    def test_errors(self):
        h = HTTPErrorProcessor()
        o = h.parent = MockOpener()
        req = Request("http://example.com")
        # all 2xx are passed through
        r = mechanize._response.test_response()
        newr = h.http_response(req, r)
        self.assertTrue(r is newr)
        self.assertTrue(not hasattr(o, "proto"))  # o.error not called
        r = mechanize._response.test_response(code=202, msg="Accepted")
        newr = h.http_response(req, r)
        self.assertTrue(r is newr)
        self.assertTrue(not hasattr(o, "proto"))  # o.error not called
        r = mechanize._response.test_response(code=206, msg="Partial content")
        newr = h.http_response(req, r)
        self.assertTrue(r is newr)
        self.assertTrue(not hasattr(o, "proto"))  # o.error not called
        # anything else calls o.error (and MockOpener returns None, here)
        r = mechanize._response.test_response(code=502, msg="Bad gateway")
        self.assertTrue(h.http_response(req, r) is None)
        self.assertEqual(o.proto, "http")  # o.error called
        self.assertEqual(o.args, (req, r, 502, "Bad gateway", AlwaysEqual()))

    def test_referer(self):
        h = HTTPRefererProcessor()
        o = h.parent = MockOpener()

        # normal case
        url = "http://example.com/"
        req = Request(url)
        r = MockResponse(200, "OK", {}, "", url)
        newr = h.http_response(req, r)
        self.assert_(r is newr)
        self.assert_(h.referer == url)
        newreq = h.http_request(req)
        self.assert_(req is newreq)
        self.assert_(req.unredirected_hdrs["Referer"] == url)

        # don't clobber existing Referer
        ref = "http://set.by.user.com/"
        req.add_unredirected_header("Referer", ref)
        newreq = h.http_request(req)
        self.assert_(req is newreq)
        self.assert_(req.unredirected_hdrs["Referer"] == ref)

    def test_raise_http_errors(self):
        # HTTPDefaultErrorHandler should raise HTTPError if no error handler
        # handled the error response
        from mechanize import _response
        h = mechanize.HTTPDefaultErrorHandler()
        url = "http://example.com"; code = 500; msg = "Error"
        request = mechanize.Request(url)
        response = _response.test_response(url=url, code=code, msg=msg)

        # case 1. it's not an HTTPError
        try:
            h.http_error_default(
                request, response, code, msg, response.info())
        except mechanize.HTTPError, exc:
            self.assert_(exc is not response)
            self.assert_(exc.fp is response)
        else:
            self.assert_(False)

        # case 2. response object is already an HTTPError, so just re-raise it
        error = mechanize.HTTPError(
            url, code, msg, "fake headers", response)
        try:
            h.http_error_default(
                request, error, code, msg, error.info())
        except mechanize.HTTPError, exc:
            self.assert_(exc is error)
        else:
            self.assert_(False)

    def test_robots(self):
        # XXX useragent
        from mechanize import HTTPRobotRulesProcessor
        opener = OpenerDirector()
        rfpc = MockRobotFileParserClass()
        h = HTTPRobotRulesProcessor(rfpc)
        opener.add_handler(h)

        url = "http://example.com:80/foo/bar.html"
        req = Request(url)
        # first time: initialise and set up robots.txt parser before checking
        # whether OK to fetch URL
        h.http_request(req)
        self.assertEquals(rfpc.calls, [
            "__call__",
            ("set_opener", opener),
            ("set_url", "http://example.com:80/robots.txt"),
            ("set_timeout", _sockettimeout._GLOBAL_DEFAULT_TIMEOUT),
            "read",
            ("can_fetch", "", url),
            ])
        # second time: just use existing parser
        rfpc.clear()
        req = Request(url)
        h.http_request(req)
        self.assert_(rfpc.calls == [
            ("can_fetch", "", url),
            ])
        # different URL on same server: same again
        rfpc.clear()
        url = "http://example.com:80/blah.html"
        req = Request(url)
        h.http_request(req)
        self.assert_(rfpc.calls == [
            ("can_fetch", "", url),
            ])
        # disallowed URL
        rfpc.clear()
        rfpc._can_fetch = False
        url = "http://example.com:80/rhubarb.html"
        req = Request(url)
        try:
            h.http_request(req)
        except mechanize.HTTPError, e:
            self.assert_(e.request == req)
            self.assert_(e.code == 403)
        # new host: reload robots.txt (even though the host and port are
        # unchanged, we treat this as a new host because
        # "example.com" != "example.com:80")
        rfpc.clear()
        rfpc._can_fetch = True
        url = "http://example.com/rhubarb.html"
        req = Request(url)
        h.http_request(req)
        self.assertEquals(rfpc.calls, [
            "__call__",
            ("set_opener", opener),
            ("set_url", "http://example.com/robots.txt"),
            ("set_timeout", _sockettimeout._GLOBAL_DEFAULT_TIMEOUT),
            "read",
            ("can_fetch", "", url),
            ])
        # https url -> should fetch robots.txt from https url too
        rfpc.clear()
        url = "https://example.org/rhubarb.html"
        req = Request(url)
        h.http_request(req)
        self.assertEquals(rfpc.calls, [
            "__call__",
            ("set_opener", opener),
            ("set_url", "https://example.org/robots.txt"),
            ("set_timeout", _sockettimeout._GLOBAL_DEFAULT_TIMEOUT),
            "read",
            ("can_fetch", "", url),
            ])
        # non-HTTP URL -> ignore robots.txt
        rfpc.clear()
        url = "ftp://example.com/"
        req = Request(url)
        h.http_request(req)
        self.assert_(rfpc.calls == [])

    def test_redirected_robots_txt(self):
        # redirected robots.txt fetch shouldn't result in another attempted
        # robots.txt fetch to check the redirection is allowed!
        import mechanize
        from mechanize import build_opener, HTTPHandler, \
             HTTPDefaultErrorHandler, HTTPRedirectHandler, \
             HTTPRobotRulesProcessor

        class MockHTTPHandler(mechanize.BaseHandler):
            def __init__(self):
                self.requests = []
            def http_open(self, req):
                import mimetools, httplib, copy
                from StringIO import StringIO
                self.requests.append(copy.deepcopy(req))
                if req.get_full_url() == "http://example.com/robots.txt":
                    hdr = "Location: http://example.com/en/robots.txt\r\n\r\n"
                    msg = mimetools.Message(StringIO(hdr))
                    return self.parent.error(
                        "http", req, test_response(), 302, "Blah", msg)
                else:
                    return test_response("Allow: *", [], req.get_full_url())

        hh = MockHTTPHandler()
        hdeh = HTTPDefaultErrorHandler()
        hrh = HTTPRedirectHandler()
        rh = HTTPRobotRulesProcessor()
        o = build_test_opener(hh, hdeh, hrh, rh)
        o.open("http://example.com/")
        self.assertEqual([req.get_full_url() for req in hh.requests],
                         ["http://example.com/robots.txt",
                          "http://example.com/en/robots.txt",
                          "http://example.com/",
                          ])

    def test_cookies(self):
        cj = MockCookieJar()
        h = HTTPCookieProcessor(cj)
        o = h.parent = MockOpener()

        req = Request("http://example.com/")
        r = MockResponse(200, "OK", {}, "")
        newreq = h.http_request(req)
        self.assertTrue(cj.ach_req is req is newreq)
        self.assertEquals(req.get_origin_req_host(), "example.com")
        self.assertFalse(cj.ach_u)
        newr = h.http_response(req, r)
        self.assertTrue(cj.ec_req is req)
        self.assertTrue(cj.ec_r is r is newr)
        self.assertFalse(cj.ec_u)

    def test_http_equiv(self):
        h = HTTPEquivProcessor()
        o = h.parent = MockOpener()

        data = ('<html><head>'
                '<meta http-equiv="Refresh" content="spam&eggs">'
                '</head></html>'
                )
        headers = [("Foo", "Bar"),
                   ("Content-type", "text/html"),
                   ("Refresh", "blah"),
                   ]
        url = "http://example.com/"
        req = Request(url)
        r = mechanize._response.make_response(data, headers, url, 200, "OK")
        newr = h.http_response(req, r)

        new_headers = newr.info()
        self.assertEqual(new_headers["Foo"], "Bar")
        self.assertEqual(new_headers["Refresh"], "spam&eggs")
        self.assertEqual(new_headers.getheaders("Refresh"),
                         ["blah", "spam&eggs"])

    def test_refresh(self):
        # XXX test processor constructor optional args
        h = HTTPRefreshProcessor(max_time=None, honor_time=False)

        for val, valid in [
            ('0; url="http://example.com/foo/"', True),
            ("2", True),  # in the past, this failed with UnboundLocalError
            ('0; "http://example.com/foo/"', False),
            ]:
            o = h.parent = MockOpener()
            req = Request("http://example.com/")
            headers = http_message({"refresh": val})
            r = MockResponse(200, "OK", headers, "", "http://example.com/")
            newr = h.http_response(req, r)
            if valid:
                self.assertEqual(o.proto, "http")
                self.assertEqual(o.args, (req, r, "refresh", "OK", headers))

    def test_refresh_honor_time(self):
        class SleepTester:
            def __init__(self, test, seconds):
                self._test = test
                if seconds is 0:
                    seconds = None  # don't expect a sleep for 0 seconds
                self._expected = seconds
                self._got = None
            def sleep(self, seconds):
                self._got = seconds
            def verify(self):
                self._test.assertEqual(self._expected, self._got)

        class Opener:
            called = False
            def error(self, *args, **kwds):
                self.called = True

        def test(rp, header, refresh_after):
            expect_refresh = refresh_after is not None
            opener = Opener()
            rp.parent = opener
            st = SleepTester(self, refresh_after)
            rp._sleep = st.sleep
            rp.http_response(Request("http://example.com"),
                             test_response(headers=[("Refresh", header)]),
                             )
            self.assertEqual(expect_refresh, opener.called)
            st.verify()

        # by default, only zero-time refreshes are honoured
        test(HTTPRefreshProcessor(), "0", 0)
        test(HTTPRefreshProcessor(), "2", None)
        # if requested, more than zero seconds are allowed
        test(HTTPRefreshProcessor(max_time=None), "2", 2)
        test(HTTPRefreshProcessor(max_time=30), "2", 2)
        # no sleep if we don't "honor_time"
        test(HTTPRefreshProcessor(max_time=30, honor_time=False), "2", 0)
        # request for too-long wait before refreshing --> no refresh occurs
        test(HTTPRefreshProcessor(max_time=30), "60", None)

    def test_redirect(self):
        from_url = "http://example.com/a.html"
        to_url = "http://example.com/b.html"
        h = HTTPRedirectHandler()
        o = h.parent = MockOpener()

        # ordinary redirect behaviour
        for code in 301, 302, 303, 307, "refresh":
            for data in None, "blah\nblah\n":
                method = getattr(h, "http_error_%s" % code)
                req = Request(from_url, data)
                req.add_header("Nonsense", "viking=withhold")
                req.add_unredirected_header("Spam", "spam")
                req.origin_req_host = "example.com"  # XXX
                try:
                    method(req, MockFile(), code, "Blah",
                           http_message({"location": to_url}))
                except mechanize.HTTPError:
                    # 307 in response to POST requires user OK
                    self.assertEqual(code, 307)
                    self.assertTrue(data is not None)
                self.assertEqual(o.req.get_full_url(), to_url)
                try:
                    self.assertEqual(o.req.get_method(), "GET")
                except AttributeError:
                    self.assertFalse(o.req.has_data())
                # now it's a GET, there should not be headers regarding content
                # (possibly dragged from before being a POST)
                headers = [x.lower() for x in o.req.headers]
                self.assertTrue("content-length" not in headers)
                self.assertTrue("content-type" not in headers)
                self.assertEqual(o.req.headers["Nonsense"], "viking=withhold")
                self.assertTrue("Spam" not in o.req.headers)
                self.assertTrue("Spam" not in o.req.unredirected_hdrs)

        # loop detection
        req = Request(from_url)
        def redirect(h, req, url=to_url):
            h.http_error_302(req, MockFile(), 302, "Blah",
                             http_message({"location": url}))
        # Note that the *original* request shares the same record of
        # redirections with the sub-requests caused by the redirections.

        # detect infinite loop redirect of a URL to itself
        req = Request(from_url, origin_req_host="example.com")
        count = 0
        try:
            while 1:
                redirect(h, req, "http://example.com/")
                count = count + 1
        except mechanize.HTTPError:
            # don't stop until max_repeats, because cookies may introduce state
            self.assertEqual(count, HTTPRedirectHandler.max_repeats)

        # detect endless non-repeating chain of redirects
        req = Request(from_url, origin_req_host="example.com")
        count = 0
        try:
            while 1:
                redirect(h, req, "http://example.com/%d" % count)
                count = count + 1
        except mechanize.HTTPError:
            self.assertEqual(count, HTTPRedirectHandler.max_redirections)

    def test_redirect_bad_uri(self):
        # bad URIs should be cleaned up before redirection
        from mechanize._response import test_html_response
        from_url = "http://example.com/a.html"
        bad_to_url = "http://example.com/b. |html"
        good_to_url = "http://example.com/b.%20%7Chtml"

        h = HTTPRedirectHandler()
        o = h.parent = MockOpener()

        req = Request(from_url)
        h.http_error_302(req, test_html_response(), 302, "Blah",
                         http_message({"location": bad_to_url}),
                         )
        self.assertEqual(o.req.get_full_url(), good_to_url)

    def test_refresh_bad_uri(self):
        # bad URIs should be cleaned up before redirection
        from mechanize._response import test_html_response
        from_url = "http://example.com/a.html"
        bad_to_url = "http://example.com/b. |html"
        good_to_url = "http://example.com/b.%20%7Chtml"

        h = HTTPRefreshProcessor(max_time=None, honor_time=False)
        o = h.parent = MockOpener()

        req = Request("http://example.com/")
        r = test_html_response(
            headers=[("refresh", '0; url="%s"' % bad_to_url)])
        newr = h.http_response(req, r)
        headers = o.args[-1]
        self.assertEqual(headers["Location"], good_to_url)

    def test_cookie_redirect(self):
        # cookies shouldn't leak into redirected requests
        import mechanize
        from mechanize import CookieJar, build_opener, HTTPHandler, \
             HTTPCookieProcessor, HTTPError, HTTPDefaultErrorHandler, \
             HTTPRedirectHandler

        from test_cookies import interact_netscape

        cj = CookieJar()
        interact_netscape(cj, "http://www.example.com/", "spam=eggs")
        hh = MockHTTPHandler(302, "Location: http://www.cracker.com/\r\n\r\n")
        hdeh = HTTPDefaultErrorHandler()
        hrh = HTTPRedirectHandler()
        cp = HTTPCookieProcessor(cj)
        o = build_test_opener(hh, hdeh, hrh, cp)
        o.open("http://www.example.com/")
        self.assertFalse(hh.req.has_header("Cookie"))

    def test_proxy(self):
        o = OpenerDirector()
        ph = mechanize.ProxyHandler(dict(http="proxy.example.com:3128"))
        o.add_handler(ph)
        meth_spec = [
            [("http_open", "return response")]
            ]
        handlers = add_ordered_mock_handlers(o, meth_spec)
        o._maybe_reindex_handlers()

        req = Request("http://acme.example.com/")
        self.assertEqual(req.get_host(), "acme.example.com")
        r = o.open(req)
        self.assertEqual(req.get_host(), "proxy.example.com:3128")

        self.assertEqual([(handlers[0], "http_open")],
                         [tup[0:2] for tup in o.calls])

    def test_proxy_no_proxy(self):
        self.monkey_patch_environ("no_proxy", "python.org")
        o = OpenerDirector()
        ph = mechanize.ProxyHandler(dict(http="proxy.example.com"))
        o.add_handler(ph)
        req = Request("http://www.perl.org/")
        self.assertEqual(req.get_host(), "www.perl.org")
        r = o.open(req)
        self.assertEqual(req.get_host(), "proxy.example.com")
        req = Request("http://www.python.org")
        self.assertEqual(req.get_host(), "www.python.org")
        r = o.open(req)
        if sys.version_info >= (2, 6):
            # no_proxy environment variable not supported in python 2.5
            self.assertEqual(req.get_host(), "www.python.org")

    def test_proxy_custom_proxy_bypass(self):
        self.monkey_patch_environ("no_proxy",
                                  mechanize._testcase.MonkeyPatcher.Unset)
        def proxy_bypass(hostname):
            return hostname == "noproxy.com"
        o = OpenerDirector()
        ph = mechanize.ProxyHandler(dict(http="proxy.example.com"),
                                    proxy_bypass=proxy_bypass)
        def is_proxied(url):
            o.add_handler(ph)
            req = Request(url)
            o.open(req)
            return req.has_proxy()
        self.assertTrue(is_proxied("http://example.com"))
        self.assertFalse(is_proxied("http://noproxy.com"))

    def test_proxy_https(self):
        o = OpenerDirector()
        ph = mechanize.ProxyHandler(dict(https='proxy.example.com:3128'))
        o.add_handler(ph)
        meth_spec = [
            [("https_open", "return response")]
            ]
        handlers = add_ordered_mock_handlers(o, meth_spec)
        req = Request("https://www.example.com/")
        self.assertEqual(req.get_host(), "www.example.com")
        r = o.open(req)
        self.assertEqual(req.get_host(), "proxy.example.com:3128")
        self.assertEqual([(handlers[0], "https_open")],
                         [tup[0:2] for tup in o.calls])

    def test_basic_auth(self, quote_char='"'):
        opener = OpenerDirector()
        password_manager = MockPasswordManager()
        auth_handler = mechanize.HTTPBasicAuthHandler(password_manager)
        realm = "ACME Widget Store"
        http_handler = MockHTTPHandler(
            401, 'WWW-Authenticate: Basic realm=%s%s%s\r\n\r\n' %
            (quote_char, realm, quote_char))
        opener.add_handler(auth_handler)
        opener.add_handler(http_handler)
        self._test_basic_auth(opener, auth_handler, "Authorization",
                              realm, http_handler, password_manager,
                              "http://acme.example.com/protected",
                              "http://acme.example.com/protected",
                              )

    def test_basic_auth_with_single_quoted_realm(self):
        self.test_basic_auth(quote_char="'")

    def test_proxy_basic_auth(self):
        opener = OpenerDirector()
        ph = mechanize.ProxyHandler(dict(http="proxy.example.com:3128"))
        opener.add_handler(ph)
        password_manager = MockPasswordManager()
        auth_handler = mechanize.ProxyBasicAuthHandler(password_manager)
        realm = "ACME Networks"
        http_handler = MockHTTPHandler(
            407, 'Proxy-Authenticate: Basic realm="%s"\r\n\r\n' % realm)
        opener.add_handler(auth_handler)
        opener.add_handler(http_handler)
        self._test_basic_auth(opener, auth_handler, "Proxy-authorization",
                              realm, http_handler, password_manager,
                              "http://acme.example.com:3128/protected",
                              "proxy.example.com:3128",
                              )

    def test_basic_and_digest_auth_handlers(self):
        # HTTPDigestAuthHandler threw an exception if it couldn't handle a 40*
        # response (http://python.org/sf/1479302), where it should instead
        # return None to allow another handler (especially
        # HTTPBasicAuthHandler) to handle the response.

        # Also (http://python.org/sf/1479302, RFC 2617 section 1.2), we must
        # try digest first (since it's the strongest auth scheme), so we record
        # order of calls here to check digest comes first:
        class RecordingOpenerDirector(OpenerDirector):
            def __init__(self):
                OpenerDirector.__init__(self)
                self.recorded = []
            def record(self, info):
                self.recorded.append(info)

        class TestDigestAuthHandler(mechanize.HTTPDigestAuthHandler):
            def http_error_401(self, *args, **kwds):
                self.parent.record("digest")
                mechanize.HTTPDigestAuthHandler.http_error_401(self,
                                                               *args, **kwds)

        class TestBasicAuthHandler(mechanize.HTTPBasicAuthHandler):
            def http_error_401(self, *args, **kwds):
                self.parent.record("basic")
                mechanize.HTTPBasicAuthHandler.http_error_401(self,
                                                              *args, **kwds)

        opener = RecordingOpenerDirector()
        password_manager = MockPasswordManager()
        digest_handler = TestDigestAuthHandler(password_manager)
        basic_handler = TestBasicAuthHandler(password_manager)
        realm = "ACME Networks"
        http_handler = MockHTTPHandler(
            401, 'WWW-Authenticate: Basic realm="%s"\r\n\r\n' % realm)
        opener.add_handler(digest_handler)
        opener.add_handler(basic_handler)
        opener.add_handler(http_handler)
        opener._maybe_reindex_handlers()

        # check basic auth isn't blocked by digest handler failing
        self._test_basic_auth(opener, basic_handler, "Authorization",
                              realm, http_handler, password_manager,
                              "http://acme.example.com/protected",
                              "http://acme.example.com/protected",
                              )
        # check digest was tried before basic (twice, because
        # _test_basic_auth called .open() twice)
        self.assertEqual(opener.recorded, ["digest", "basic"]*2)

    def _test_basic_auth(self, opener, auth_handler, auth_header,
                         realm, http_handler, password_manager,
                         request_url, protected_url):
        import base64
        user, password = "wile", "coyote"

        # .add_password() fed through to password manager
        auth_handler.add_password(realm, request_url, user, password)
        self.assertEqual(realm, password_manager.realm)
        self.assertEqual(request_url, password_manager.url)
        self.assertEqual(user, password_manager.user)
        self.assertEqual(password, password_manager.password)

        r = opener.open(request_url)

        # should have asked the password manager for the username/password
        self.assertEqual(password_manager.target_realm, realm)
        self.assertEqual(password_manager.target_url, protected_url)

        # expect one request without authorization, then one with
        self.assertEqual(len(http_handler.requests), 2)
        self.assertFalse(http_handler.requests[0].has_header(auth_header))
        userpass = '%s:%s' % (user, password)
        auth_hdr_value = 'Basic ' + base64.encodestring(userpass).strip()
        self.assertEqual(http_handler.requests[1].get_header(auth_header),
                         auth_hdr_value)

        # if the password manager can't find a password, the handler won't
        # handle the HTTP auth error
        password_manager.user = password_manager.password = None
        http_handler.reset()
        r = opener.open(request_url)
        self.assertEqual(len(http_handler.requests), 1)
        self.assertFalse(http_handler.requests[0].has_header(auth_header))


class HeadParserTests(unittest.TestCase):

    def test(self):
        # XXX XHTML
        # Note: the HTML fragments below were stripped from this copy of the
        # file; they are reconstructed here from the expected results.
        from mechanize import HeadParser
        htmls = [
            ("""<html><head>
<meta http-equiv="refresh" content="1; http://example.com/">
</head></html>""",
             [("refresh", "1; http://example.com/")]
             ),
            ("""<html><head>
<meta http-equiv="refresh" content="1; http://example.com/">
<meta http-equiv="foo" content="bar">
</head></html>""",
             [("refresh", "1; http://example.com/"),
              ("foo", "bar")]),
            ("""<html><body>
<meta http-equiv="refresh" content="1; http://example.com/">
</body></html>""",
             [])
            ]
        for html, result in htmls:
            self.assertEqual(parse_head(StringIO.StringIO(html),
                                        HeadParser()), result)


class A:
    def a(self): pass
class B(A):
    def a(self): pass
    def b(self): pass
class C(A):
    def c(self): pass
class D(C, B):
    def a(self): pass
    def d(self): pass


class FunctionTests(unittest.TestCase):

    def test_build_opener(self):
        class MyHTTPHandler(HTTPHandler): pass
        class FooHandler(mechanize.BaseHandler):
            def foo_open(self): pass
        class BarHandler(mechanize.BaseHandler):
            def bar_open(self): pass

        o = build_opener(FooHandler, BarHandler)
        self.opener_has_handler(o, FooHandler)
        self.opener_has_handler(o, BarHandler)

        # can take a mix of classes and instances
        o = build_opener(FooHandler, BarHandler())
        self.opener_has_handler(o, FooHandler)
        self.opener_has_handler(o, BarHandler)

        # subclasses of default handlers override default handlers
        o = build_opener(MyHTTPHandler)
        self.opener_has_handler(o, MyHTTPHandler)

        # a particular case of overriding: default handlers can be passed
        # in explicitly
        o = build_opener()
        self.opener_has_handler(o, HTTPHandler)
        o = build_opener(HTTPHandler)
        self.opener_has_handler(o, HTTPHandler)
        o = build_opener(HTTPHandler())
        self.opener_has_handler(o, HTTPHandler)

        # Issue2670: multiple handlers sharing the same base class
        class MyOtherHTTPHandler(HTTPHandler): pass
        o = build_opener(MyHTTPHandler, MyOtherHTTPHandler)
        self.opener_has_handler(o, MyHTTPHandler)
        self.opener_has_handler(o, MyOtherHTTPHandler)

    def opener_has_handler(self, opener, handler_class):
        for h in opener.handlers:
            if h.__class__ == handler_class:
                break
        else:
            self.assertTrue(False)


class RequestTests(unittest.TestCase):

    def setUp(self):
        self.get = Request("http://www.python.org/~jeremy/")
        self.post = Request("http://www.python.org/~jeremy/",
                            "data",
                            headers={"X-Test": "test"})

    def test_method(self):
        self.assertEqual("POST", self.post.get_method())
        self.assertEqual("GET", self.get.get_method())

    def test_add_data(self):
        self.assertTrue(not self.get.has_data())
        self.assertEqual("GET", self.get.get_method())
        self.get.add_data("spam")
        self.assertTrue(self.get.has_data())
        self.assertEqual("POST", self.get.get_method())

    def test_get_full_url(self):
        self.assertEqual("http://www.python.org/~jeremy/",
                         self.get.get_full_url())

    def test_selector(self):
        self.assertEqual("/~jeremy/", self.get.get_selector())
        req = Request("http://www.python.org/")
        self.assertEqual("/", req.get_selector())

    def test_get_type(self):
        self.assertEqual("http", self.get.get_type())

    def test_get_host(self):
        self.assertEqual("www.python.org", self.get.get_host())

    def test_get_host_unquote(self):
        req = Request("http://www.%70ython.org/")
        self.assertEqual("www.python.org", req.get_host())

    def test_proxy(self):
        self.assertTrue(not self.get.has_proxy())
        self.get.set_proxy("www.perl.org", "http")
        self.assertTrue(self.get.has_proxy())
        self.assertEqual("www.python.org", self.get.get_origin_req_host())
        self.assertEqual("www.perl.org", self.get.get_host())


if __name__ == "__main__":
    import doctest
    doctest.testmod()
    unittest.main()

mechanize-0.2.5/test/test_urllib2_localnet.py

#!/usr/bin/env python

"""Functional tests from the Python standard library test suite."""

import mimetools
import threading
import urlparse
import mechanize
import BaseHTTPServer
import unittest

from mechanize._testcase import TestCase
from mechanize._urllib2_fork import md5_digest

import testprogram


# Loopback http server infrastructure

class LoopbackHttpServer(BaseHTTPServer.HTTPServer):
    """HTTP server w/ a few modifications that make it useful for
    loopback testing purposes.
    """

    def __init__(self, server_address, RequestHandlerClass):
        BaseHTTPServer.HTTPServer.__init__(self,
                                           server_address,
                                           RequestHandlerClass)

        # Set the timeout of our listening socket really low so
        # that we can stop the server easily.
        self.socket.settimeout(1.0)

    def get_request(self):
        """BaseHTTPServer method, overridden."""
        request, client_address = self.socket.accept()

        # It's a loopback connection, so setting the timeout
        # really low shouldn't affect anything, but should make
        # deadlocks less likely to occur.
        request.settimeout(10.0)

        return (request, client_address)


class LoopbackHttpServerThread(threading.Thread):
    """Stoppable thread that runs a loopback http server."""

    def __init__(self, handle_request=None):
        threading.Thread.__init__(self)
        self._stop = False
        self.ready = threading.Event()
        self._request_handler = None
        if handle_request is None:
            handle_request = self._handle_request
        self.httpd = LoopbackHttpServer(('127.0.0.1', 0), handle_request)
        #print "Serving HTTP on %s port %s" % (self.httpd.server_name,
        #                                      self.httpd.server_port)
        self.port = self.httpd.server_port

    def set_request_handler(self, request_handler):
        self._request_handler = request_handler

    def _handle_request(self, *args, **kwds):
        self._request_handler.handle_request(*args, **kwds)
        return self._request_handler

    def stop(self):
        """Stops the webserver if it's currently running."""
        # Set the stop flag.
        self._stop = True
        self.join()

    def run(self):
        self.ready.set()
        while not self._stop:
            self.httpd.handle_request()


# Authentication infrastructure

class DigestAuthHandler:
    """Handler for performing digest authentication."""

    def __init__(self):
        self._request_num = 0
        self._nonces = []
        self._users = {}
        self._realm_name = "Test Realm"
        self._qop = "auth"

    def set_qop(self, qop):
        self._qop = qop

    def set_users(self, users):
        assert isinstance(users, dict)
        self._users = users

    def set_realm(self, realm):
        self._realm_name = realm

    def _generate_nonce(self):
        self._request_num += 1
        nonce = md5_digest(str(self._request_num))
        self._nonces.append(nonce)
        return nonce

    def _create_auth_dict(self, auth_str):
        first_space_index = auth_str.find(" ")
        auth_str = auth_str[first_space_index+1:]

        parts = auth_str.split(",")

        auth_dict = {}
        for part in parts:
            name, value = part.split("=")
            name = name.strip()
            if value[0] == '"' and value[-1] == '"':
                value = value[1:-1]
            else:
                value = value.strip()
            auth_dict[name] = value
        return auth_dict

    def _validate_auth(self, auth_dict, password, method, uri):
        final_dict = {}
        final_dict.update(auth_dict)
        final_dict["password"] = password
        final_dict["method"] = method
        final_dict["uri"] = uri
        HA1_str = "%(username)s:%(realm)s:%(password)s" % final_dict
        HA1 = md5_digest(HA1_str)
        HA2_str = "%(method)s:%(uri)s" % final_dict
        HA2 = md5_digest(HA2_str)
        final_dict["HA1"] = HA1
        final_dict["HA2"] = HA2
        response_str = "%(HA1)s:%(nonce)s:%(nc)s:" \
                       "%(cnonce)s:%(qop)s:%(HA2)s" % final_dict
        response = md5_digest(response_str)

        return response == auth_dict["response"]

    def _return_auth_challenge(self, request_handler):
        request_handler.send_response(407, "Proxy Authentication Required")
        request_handler.send_header("Content-Type", "text/html")
        request_handler.send_header(
            'Proxy-Authenticate', 'Digest realm="%s", '
            'qop="%s",'
            'nonce="%s", ' % \
            (self._realm_name, self._qop, self._generate_nonce()))
        # XXX: Not sure if we're supposed to add this next header or
        # not.
        #request_handler.send_header('Connection', 'close')
        request_handler.end_headers()
        request_handler.wfile.write("Proxy Authentication Required.")
        return False

    def handle_request(self, request_handler):
        """Performs digest authentication on the given HTTP request
        handler.  Returns True if authentication was successful, False
        otherwise.

        If no users have been set, then digest auth is effectively
        disabled and this method will always return True.
        """
        if len(self._users) == 0:
            return True

        if not request_handler.headers.has_key('Proxy-Authorization'):
            return self._return_auth_challenge(request_handler)
        else:
            auth_dict = self._create_auth_dict(
                request_handler.headers['Proxy-Authorization']
                )
            if self._users.has_key(auth_dict["username"]):
                password = self._users[auth_dict["username"]]
            else:
                return self._return_auth_challenge(request_handler)
            if not auth_dict.get("nonce") in self._nonces:
                return self._return_auth_challenge(request_handler)
            else:
                self._nonces.remove(auth_dict["nonce"])

            auth_validated = False

            # MSIE uses short_path in its validation, but mechanize uses the
            # full path, so we're going to see if either of them works here.
            for path in [request_handler.path, request_handler.short_path]:
                if self._validate_auth(auth_dict,
                                       password,
                                       request_handler.command,
                                       path):
                    auth_validated = True

            if not auth_validated:
                return self._return_auth_challenge(request_handler)
            return True


# Proxy test infrastructure

class FakeProxyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    """This is a 'fake proxy' that makes it look like the entire
    internet has gone down due to a sudden zombie invasion.  Its main
    utility is in providing us with authentication support for
    testing.
    """

    protocol_version = "HTTP/1.0"

    def __init__(self, digest_auth_handler, *args, **kwargs):
        # This has to be set before calling our parent's __init__(), which will
        # try to call do_GET().
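# Aside: DigestAuthHandler._validate_auth above implements the standard
# RFC 2617 qop="auth" digest computation.  The following is an illustrative,
# self-contained sketch of that same computation (helper names here are
# hypothetical, not part of mechanize):

```python
import hashlib

def _md5_hex(s):
    # md5_digest in mechanize._urllib2_fork does the same job
    return hashlib.md5(s.encode("ascii")).hexdigest()

def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    ha1 = _md5_hex("%s:%s:%s" % (username, realm, password))  # credentials
    ha2 = _md5_hex("%s:%s" % (method, uri))                   # request line
    # With qop="auth", the response chains both hashes with the server
    # nonce, the nonce count, and the client nonce.
    return _md5_hex("%s:%s:%s:%s:%s:%s" %
                    (ha1, nonce, nc, cnonce, qop, ha2))

# The worked example from RFC 2617 section 3.5:
resp = digest_response("Mufasa", "testrealm@host.com", "Circle Of Life",
                       "GET", "/dir/index.html",
                       "dcd98b7102dd2f0e8b11d0f600bfb0c093",
                       "00000001", "0a4f113b")
```

# The server compares this value against the "response" field the client
# sent in its Proxy-Authorization header; a match proves knowledge of the
# password without transmitting it.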
self.digest_auth_handler = digest_auth_handler BaseHTTPServer.BaseHTTPRequestHandler.__init__(self, *args, **kwargs) def log_message(self, format, *args): # Uncomment the next line for debugging. #sys.stderr.write(format % args) pass def do_GET(self): (scm, netloc, path, params, query, fragment) = urlparse.urlparse( self.path, 'http') self.short_path = path if self.digest_auth_handler.handle_request(self): self.send_response(200, "OK") self.send_header("Content-Type", "text/html") self.end_headers() self.wfile.write("You've reached %s!
" % self.path) self.wfile.write("Our apologies, but our server is down due to " "a sudden zombie invasion.") def make_started_server(make_request_handler=None): server = LoopbackHttpServerThread(make_request_handler) server.start() server.ready.wait() return server # Test cases class ProxyAuthTests(TestCase): URL = "http://localhost" USER = "tester" PASSWD = "test123" REALM = "TestRealm" def _make_server(self, qop="auth"): digest_auth_handler = DigestAuthHandler() digest_auth_handler.set_users({self.USER: self.PASSWD}) digest_auth_handler.set_realm(self.REALM) digest_auth_handler.set_qop(qop) def create_fake_proxy_handler(*args, **kwargs): return FakeProxyHandler(digest_auth_handler, *args, **kwargs) return make_started_server(create_fake_proxy_handler) def setUp(self): TestCase.setUp(self) fixture_name = "test_urllib2_localnet_ProxyAuthTests_server" self.register_context_manager(fixture_name, testprogram.ServerCM(self._make_server)) server = self.get_cached_fixture(fixture_name) proxy_url = "http://127.0.0.1:%d" % server.port handler = mechanize.ProxyHandler({"http" : proxy_url}) self.proxy_digest_handler = mechanize.ProxyDigestAuthHandler() self.opener = mechanize.build_opener(handler, self.proxy_digest_handler) def test_proxy_with_bad_password_raises_httperror(self): self.proxy_digest_handler.add_password(self.REALM, self.URL, self.USER, self.PASSWD+"bad") self.assertRaises(mechanize.HTTPError, self.opener.open, self.URL) def test_proxy_with_no_password_raises_httperror(self): self.assertRaises(mechanize.HTTPError, self.opener.open, self.URL) def test_proxy_qop_auth_works(self): self.proxy_digest_handler.add_password(self.REALM, self.URL, self.USER, self.PASSWD) result = self.opener.open(self.URL) while result.read(): pass result.close() def test_proxy_qop_auth_int_works_or_throws_urlerror(self): server = self._make_server("auth-int") self.add_teardown(lambda: server.stop()) self.proxy_digest_handler.add_password(self.REALM, self.URL, self.USER, self.PASSWD) 
try: result = self.opener.open(self.URL) except mechanize.URLError: # It's okay if we don't support auth-int, but we certainly # shouldn't receive any kind of exception here other than # a URLError. result = None if result: while result.read(): pass result.close() class RecordingHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler): server_version = "TestHTTP/" protocol_version = "HTTP/1.0" def __init__(self, port, get_next_response, record_request, record_received_headers, *args, **kwds): self._port = port self._get_next_response = get_next_response self._record_request = record_request self._record_received_headers = record_received_headers BaseHTTPServer.BaseHTTPRequestHandler.__init__(self, *args, **kwds) def do_GET(self): body = self.send_head() if body: self.wfile.write(body) def do_POST(self): content_length = self.headers['Content-Length'] post_data = self.rfile.read(int(content_length)) self.do_GET() self._record_request(post_data) def send_head(self): self._record_received_headers(self.headers) self._record_request(self.path) response_code, headers, body = self._get_next_response() self.send_response(response_code) for (header, value) in headers: self.send_header(header, value % self._port) if body: self.send_header('Content-type', 'text/plain') self.end_headers() return body self.end_headers() def log_message(self, *args): pass class FakeHTTPRequestHandler(object): def __init__(self, port, responses): self.port = port self._responses = responses self.requests = [] self.received_headers = None def _get_next_response(self): return self._responses.pop(0) def _record_request(self, request): self.requests.append(request) def _record_received_headers(self, headers): self.received_headers = headers def handle_request(self, *args, **kwds): RecordingHTTPRequestHandler( self.port, self._get_next_response, self._record_request, self._record_received_headers, *args, **kwds) class TestUrlopen(TestCase): """Tests mechanize.urlopen using the network. 
These tests are not exhaustive. Assuming that testing using files does a good job overall of some of the basic interface features. There are no tests exercising the optional 'data' and 'proxies' arguments. No tests for transparent redirection have been written. """ fixture_name = "test_urllib2_localnet_TestUrlopen_server" def setUp(self): TestCase.setUp(self) self.register_context_manager( self.fixture_name, testprogram.ServerCM(make_started_server)) def get_server(self): return self.get_cached_fixture(self.fixture_name) def _make_request_handler(self, responses): server = self.get_server() handler = FakeHTTPRequestHandler(server.port, responses) server.set_request_handler(handler) return handler def test_redirection(self): expected_response = 'We got here...' responses = [ (302, [('Location', 'http://localhost:%s/somewhere_else')], ''), (200, [], expected_response) ] handler = self._make_request_handler(responses) f = mechanize.urlopen('http://localhost:%s/' % handler.port) data = f.read() f.close() self.assertEquals(data, expected_response) self.assertEquals(handler.requests, ['/', '/somewhere_else']) def test_404(self): expected_response = 'Bad bad bad...' handler = self._make_request_handler([(404, [], expected_response)]) try: mechanize.urlopen('http://localhost:%s/weeble' % handler.port) except mechanize.URLError, f: pass else: self.fail('404 should raise URLError') data = f.read() f.close() self.assertEquals(data, expected_response) self.assertEquals(handler.requests, ['/weeble']) def test_200(self): expected_response = 'pycon 2008...' handler = self._make_request_handler([(200, [], expected_response)]) f = mechanize.urlopen('http://localhost:%s/bizarre' % handler.port) data = f.read() f.close() self.assertEquals(data, expected_response) self.assertEquals(handler.requests, ['/bizarre']) def test_200_with_parameters(self): expected_response = 'pycon 2008...' 
handler = self._make_request_handler([(200, [], expected_response)]) f = mechanize.urlopen('http://localhost:%s/bizarre' % handler.port, 'get=with_feeling') data = f.read() f.close() self.assertEquals(data, expected_response) self.assertEquals(handler.requests, ['/bizarre', 'get=with_feeling']) def test_sending_headers(self): handler = self._make_request_handler([(200, [], "we don't care")]) req = mechanize.Request("http://localhost:%s/" % handler.port, headers={'Range': 'bytes=20-39'}) mechanize.urlopen(req) self.assertEqual(handler.received_headers['Range'], 'bytes=20-39') def test_basic(self): handler = self._make_request_handler([(200, [], "we don't care")]) open_url = mechanize.urlopen("http://localhost:%s" % handler.port) for attr in ("read", "close", "info", "geturl"): self.assertTrue(hasattr(open_url, attr), "object returned from " "urlopen lacks the %s attribute" % attr) try: self.assertTrue(open_url.read(), "calling 'read' failed") finally: open_url.close() def test_info(self): handler = self._make_request_handler([(200, [], "we don't care")]) open_url = mechanize.urlopen("http://localhost:%s" % handler.port) info_obj = open_url.info() self.assertTrue(isinstance(info_obj, mimetools.Message), "object returned by 'info' is not an instance of " "mimetools.Message") self.assertEqual(info_obj.getsubtype(), "plain") def test_geturl(self): # Make sure same URL as opened is returned by geturl. handler = self._make_request_handler([(200, [], "we don't care")]) open_url = mechanize.urlopen("http://localhost:%s" % handler.port) url = open_url.geturl() self.assertEqual(url, "http://localhost:%s" % handler.port) def test_bad_address(self): # Make sure proper exception is raised when connecting to a bogus # address. 
        self.assertRaises(IOError,
                          # Given that both VeriSign and various ISPs have in
                          # the past or are presently hijacking various invalid
                          # domain name requests in an attempt to boost traffic
                          # to their own sites, finding a domain name to use
                          # for this test is difficult.  RFC2606 leads one to
                          # believe that '.invalid' should work, but experience
                          # seemed to indicate otherwise.  Single character
                          # TLDs are likely to remain invalid, so this seems to
                          # be the best choice.  The trailing '.' prevents a
                          # related problem: the normal DNS resolver appends
                          # the domain names from the search path if there is
                          # no '.' at the end, and if one of those domains
                          # implements a '*' rule a result is returned.
                          # However, none of this will prevent the test from
                          # failing if the ISP hijacks all invalid domain
                          # requests.  The real solution would be to be able to
                          # parameterize the framework with a mock resolver.
                          mechanize.urlopen, "http://sadflkjsasf.i.nvali.d./")


if __name__ == "__main__":
    unittest.main()
mechanize-0.2.5/test/test_response.doctest

The read_complete flag lets us know if all of the wrapped file's data has
been read.  We want to know this because Browser.back() must .reload() the
response if not.  I've noted here the various cases where .read_complete may
be set.

>>> import mechanize
>>> text = "To err is human, to moo, bovine.\n"*10

>>> def get_wrapper():
...     import cStringIO
...     from mechanize._response import seek_wrapper
...     f = cStringIO.StringIO(text)
...     wr = seek_wrapper(f)
...     return wr

.read() case #1

>>> wr = get_wrapper()
>>> wr.read_complete
False
>>> junk = wr.read()
>>> wr.read_complete
True
>>> wr.seek(0)
>>> wr.read_complete
True

Exercise partial .read() and .readline(), and .seek() case #1

>>> wr = get_wrapper()
>>> junk = wr.read(10)
>>> wr.read_complete
False
>>> junk = wr.readline()
>>> wr.read_complete
False
>>> wr.seek(0, 2)
>>> wr.read_complete
True
>>> wr.seek(0)
>>> wr.read_complete
True

.readlines() case #1

>>> wr = get_wrapper()
>>> junk = wr.readlines()
>>> wr.read_complete
True
>>> wr.seek(0)
>>> wr.read_complete
True

.seek() case #2

>>> wr = get_wrapper()
>>> wr.seek(10)
>>> wr.read_complete
False
>>> wr.seek(1000000)

.read() case #2

>>> wr = get_wrapper()
>>> junk = wr.read(1000000)
>>> wr.read_complete  # we read to the end, but don't know it yet
False
>>> junk = wr.read(10)
>>> wr.read_complete
True

.readline() case #1

>>> wr = get_wrapper()
>>> junk = wr.read(len(text)-10)
>>> wr.read_complete
False
>>> junk = wr.readline()
>>> wr.read_complete  # we read to the end, but don't know it yet
False
>>> junk = wr.readline()
>>> wr.read_complete
True

Test copying and sharing of .read_complete state

>>> import copy
>>> wr = get_wrapper()
>>> wr2 = copy.copy(wr)
>>> wr.read_complete
False
>>> wr2.read_complete
False
>>> junk = wr2.read()
>>> wr.read_complete
True
>>> wr2.read_complete
True

Fix from -r36082: .read() after .close() used to break .read_complete state

>>> from mechanize._response import test_response
>>> r = test_response(text)
>>> junk = r.read(64)
>>> r.close()
>>> r.read_complete
False
>>> r.read()
''
>>> r.read_complete
False

Tests for the truly horrendous upgrade_response()

>>> def is_response(r):
...     names = "get_data read readline readlines close seek code msg".split()
...     for name in names:
...         if not hasattr(r, name):
...             return False
...     return r.get_data() == "test data"

>>> from cStringIO import StringIO
>>> from mechanize._response import upgrade_response, make_headers, \
...     make_response, closeable_response, seek_wrapper
>>> data = "test data"; url = "http://example.com/"; code = 200; msg = "OK"

Normal response (closeable_response wrapped with seek_wrapper): return a copy

>>> r1 = make_response(data, [], url, code, msg)
>>> r2 = upgrade_response(r1)
>>> is_response(r2)
True
>>> r1 is not r2
True
>>> r1.wrapped is r2.wrapped
True

closeable_response with no seek_wrapper: wrap with seek_wrapper

>>> r1 = closeable_response(StringIO(data), make_headers([]), url, code, msg)
>>> is_response(r1)
False
>>> r2 = upgrade_response(r1)
>>> is_response(r2)
True
>>> r1 is not r2
True
>>> r1 is r2.wrapped
True

addinfourl: extract .fp and wrap it with closeable_response and seek_wrapper

>>> from mechanize._urllib2_fork import addinfourl
>>> r1 = addinfourl(StringIO(data), make_headers([]), url)
>>> is_response(r1)
False
>>> r2 = upgrade_response(r1)
>>> is_response(r2)
True
>>> r1 is not r2
True
>>> r1 is not r2.wrapped
True
>>> r1.fp is r2.wrapped.fp
True

addinfourl with code, msg

>>> r1 = addinfourl(StringIO(data), make_headers([]), url)
>>> r1.code = 206
>>> r1.msg = "cool"
>>> r2 = upgrade_response(r1)
>>> is_response(r2)
True
>>> r2.code == r1.code
True
>>> r2.msg == r1.msg
True

addinfourl with seek wrapper: cached data is not lost

>>> r1 = addinfourl(StringIO(data), make_headers([]), url)
>>> r1 = seek_wrapper(r1)
>>> r1.read(4)
'test'
>>> r2 = upgrade_response(r1)
>>> is_response(r2)
True

addinfourl wrapped with HTTPError -- remains an HTTPError of the same
subclass (through horrible trickery)

>>> hdrs = make_headers([])
>>> r1 = addinfourl(StringIO(data), hdrs, url)
>>> class MyHTTPError(mechanize.HTTPError): pass
>>> r1 = MyHTTPError(url, code, msg, hdrs, r1)
>>> is_response(r1)
False
>>> r2 = upgrade_response(r1)
>>> is_response(r2)
True
>>> isinstance(r2, MyHTTPError)
True
>>> r2  # doctest: +ELLIPSIS

>>> r3 = upgrade_response(r2)
>>> is_response(r3)
True
>>> r3 is not r2
True
>>> r3.wrapped is r2.wrapped
True

Test dynamically-created class __repr__ for
case where we have the module name >>> r4 = addinfourl(StringIO(data), hdrs, url) >>> r4 = mechanize.HTTPError(url, code, msg, hdrs, r4) >>> upgrade_response(r4) # doctest: +ELLIPSIS # Copyright 2005 Gary Poster # Copyright 2005 Zope Corporation # Copyright 1998-2000 Gisle Aas. from cStringIO import StringIO import os import string import unittest import mechanize import mechanize._form as _form from mechanize import ControlNotFoundError, ItemNotFoundError, \ ItemCountError, AmbiguityError import mechanize._testcase as _testcase from mechanize._util import get1 # XXX # HTMLForm.set/get_value_by_label() # Base control tests on ParseFile, so can use same tests for different form # implementations. # HTMLForm.enctype # XHTML try: True except NameError: True = 1 False = 0 try: bool except NameError: def bool(expr): if expr: return True else: return False try: import warnings except ImportError: warnings_imported = False def hide_deprecations(): pass def reset_deprecations(): pass def raise_deprecations(): pass else: warnings_imported = True def hide_deprecations(): warnings.filterwarnings('ignore', category=DeprecationWarning) def reset_deprecations(): warnings.filterwarnings('default', category=DeprecationWarning) #warnings.resetwarnings() # XXX probably safer def raise_deprecations(): try: registry = _form.__warningregistry__ except AttributeError: pass else: registry.clear() warnings.filterwarnings('error', category=DeprecationWarning) class DummyForm: def __init__(self): self._forms = [] self._labels = [] self._id_to_labels = {} self.backwards_compat = False self.controls = [] def find_control(self, name, type): raise mechanize.ControlNotFoundError class UnescapeTests(unittest.TestCase): def test_unescape_charref(self): unescape_charref = _form.unescape_charref mdash_utf8 = u"\u2014".encode("utf-8") for ref, codepoint, utf8, latin1 in [ ("38", 38, u"&".encode("utf-8"), "&"), ("x2014", 0x2014, mdash_utf8, "—"), ("8212", 8212, mdash_utf8, "—"), ]: 
self.assertEqual(unescape_charref(ref, None), unichr(codepoint)) self.assertEqual(unescape_charref(ref, 'latin-1'), latin1) self.assertEqual(unescape_charref(ref, 'utf-8'), utf8) def test_get_entitydefs(self): get_entitydefs = _form.get_entitydefs ed = get_entitydefs() for name, char in [ ("&", u"&"), ("<", u"<"), (">", u">"), ("—", u"\u2014"), ("♠", u"\u2660"), ]: self.assertEqual(ed[name], char) def test_unescape1(self): unescape = _form.unescape get_entitydefs = _form.get_entitydefs data = "& < — — —" mdash_utf8 = u"\u2014".encode("utf-8") ue = unescape(data, get_entitydefs(), "utf-8") self.assertEqual("& < %s %s %s" % ((mdash_utf8,)*3), ue) for text, expect in [ ("&a&", "&a&"), ("a&", "a&"), ]: got = unescape(text, get_entitydefs(), "latin-1") self.assertEqual(got, expect) def test_unescape2(self): unescape = _form.unescape get_entitydefs = _form.get_entitydefs self.assertEqual(unescape("Donald Duck & Co", {"&": "&"}), "Donald Duck & Co") self.assertEqual( unescape("<Donald Duck & Co>", {"&": "&", "<": "<", ">": ">"}), "") self.assertEqual(unescape("Hei på deg", {"å" : "å"}), "Hei på deg") self.assertEqual( unescape("&foo;", {"&": "&", "&foo;": "splat"}), "&foo;") self.assertEqual(unescape("&", {}), "&") for encoding, expected in [ ("utf-8", u"&\u06aa\u2014\u2014".encode("utf-8")), ("latin-1", "&ڪ——")]: self.assertEqual( expected, unescape("&ڪ——", get_entitydefs(), encoding)) def test_unescape_parsing(self): file = StringIO( """

""") #" forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False, encoding="utf-8") form = forms[0] test_string = "&"+(u"\u2014".encode('utf8')*3) self.assertEqual(form.action, "http://localhost/"+test_string) control = form.find_control(type="textarea", nr=0) self.assertEqual(control.value, "val"+test_string) self.assertEqual(control.name, "name"+test_string) def test_unescape_parsing_select(self): f = StringIO("""\
""") #" forms = mechanize.ParseFileEx(f, "http://localhost/", encoding="utf-8") form = forms[1] test_string = "&"+(u"\u2014".encode('utf8')*3) control = form.find_control(nr=0) for ii in range(len(control.items)): item = control.items[ii] self.assertEqual(item.name, str(ii+1)+test_string) # XXX label def test_unescape_parsing_data(self): file = StringIO( """\
""") #" # don't crash if we can't encode -- rather, leave entity ref intact forms = mechanize.ParseFile( file, "http://localhost/", backwards_compat=False, encoding="latin-1") label = forms[0].find_control(nr=0).get_labels()[0] self.assertEqual(label.text, "Blah ” ” blah") class LWPFormTests(unittest.TestCase): """The original tests from libwww-perl 5.64.""" def testEmptyParse(self): forms = mechanize.ParseFile(StringIO(""), "http://localhost", backwards_compat=False) self.assert_(len(forms) == 0) def _forms(self): file = StringIO("""
""") return mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) def testParse(self): forms = self._forms() self.assert_(len(forms) == 1) self.assert_(forms[0]["firstname"] == "Gisle") def testFillForm(self): forms = self._forms() form = forms[0] form["firstname"] = "Gisle Aas" req = form.click() def request_method(req): if req.has_data(): return "POST" else: return "GET" self.assert_(request_method(req) == "GET") self.assert_(req.get_full_url() == "http://localhost/abc?firstname=Gisle+Aas") def get_header(req, name): try: return req.get_header(name) except AttributeError: return req.headers[name] def header_items(req): try: return req.header_items() except AttributeError: return req.headers.items() class MockResponse: def __init__(self, f, url): self._file = f self._url = url def geturl(self): return self._url def __getattr__(self, name): return getattr(self._file, name) class ParseErrorTests(_testcase.TestCase): def test_parseerror_str(self): e = mechanize.ParseError("spam") self.assertEqual(str(e), "spam") class ParseTests(unittest.TestCase): def test_failing_parse(self): # XXX couldn't provoke an error from BeautifulSoup (!), so this has not # been tested with RobustFormParser import sgmllib # Python 2.0 sgmllib raises RuntimeError rather than SGMLParseError, # but seems never to even raise that except as an assertion, from # reading the code... if hasattr(sgmllib, "SGMLParseError"): f = StringIO("") base_uri = "http://localhost/" self.assertRaises( mechanize.ParseError, mechanize.ParseFile, f, base_uri, backwards_compat=False, ) self.assert_(issubclass(mechanize.ParseError, sgmllib.SGMLParseError)) def test_unknown_control(self): f = StringIO( """
""") base_uri = "http://localhost/" forms = mechanize.ParseFile(f, base_uri, backwards_compat=False) form = forms[0] for ctl in form.controls: self.assert_(isinstance(ctl, _form.TextControl)) def test_ParseFileEx(self): # empty "outer form" (where the "outer form" is the form consisting of # all controls outside of any form) f = StringIO( """
""") base_uri = "http://localhost/" forms = mechanize.ParseFileEx(f, base_uri) outer = forms[0] self.assertEqual(len(forms), 2) self.assertEqual(outer.controls, []) self.assertEqual(outer.name, None) self.assertEqual(outer.action, base_uri) self.assertEqual(outer.method, "GET") self.assertEqual(outer.enctype, "application/x-www-form-urlencoded") self.assertEqual(outer.attrs, {}) # non-empty outer form f = StringIO( """
""") base_uri = "http://localhost/" forms = mechanize.ParseFileEx(f, base_uri) outer = forms[0] self.assertEqual(len(forms), 3) self.assertEqual([c.name for c in outer.controls], ["a", "c", "e"]) self.assertEqual(outer.name, None) self.assertEqual(outer.action, base_uri) self.assertEqual(outer.method, "GET") self.assertEqual(outer.enctype, "application/x-www-form-urlencoded") self.assertEqual(outer.attrs, {}) def test_ParseResponse(self): url = "http://example.com/" r = MockResponse( StringIO("""\
"""), url, ) hide_deprecations() forms = mechanize.ParseResponse(r) reset_deprecations() self.assertEqual(len(forms), 1) form = forms[0] self.assertEqual(form.action, url+"abc") self.assertEqual(form.controls[0].name, "inner") def test_ParseResponseEx(self): url = "http://example.com/" r = MockResponse( StringIO("""\
"""), url, ) forms = mechanize.ParseResponseEx(r) self.assertEqual(len(forms), 2) outer = forms[0] inner = forms[1] self.assertEqual(inner.action, url+"abc") self.assertEqual(outer.action, url) self.assertEqual(outer.controls[0].name, "outer") self.assertEqual(inner.controls[0].name, "inner") def test_ParseString(self): class DerivedRequest(mechanize.Request): pass forms = mechanize.ParseString('', "http://example.com/", request_class=DerivedRequest) self.assertEqual(len(forms), 1) self.assertEqual(forms[0].controls[0].name, "a") # arguments were passed through self.assertTrue(isinstance(forms[0].click(), DerivedRequest)) def test_parse_error(self): f = StringIO( """
""") base_uri = "http://localhost/" try: mechanize.ParseFile(f, base_uri, backwards_compat=False) except mechanize.ParseError, e: self.assert_(e.base_uri == base_uri) else: self.assert_(0) def test_base_uri(self): # BASE element takes priority over document URI file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] self.assert_(form.action == "http://example.com/abc") file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] self.assert_(form.action == "http://localhost/abc") def testTextarea(self): file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False, encoding="utf-8") self.assert_(len(forms) == 1) form = forms[0] self.assert_(form.name is None) self.assertEqual( form.action, "http://localhost/abc&"+u"\u2014".encode('utf8')+"d") control = form.find_control(type="textarea", nr=0) self.assert_(control.name is None) self.assert_(control.value == "blah, blah,\r\nRhubarb.\r\n\r\n") empty_control = form.find_control(type="textarea", nr=1) self.assert_(str(empty_control) == "=)>") self.assert_(empty_control.value == "") entity_ctl = form.find_control(type="textarea", nr=2) self.assertEqual(entity_ctl.name, '"ta"') self.assertEqual(entity_ctl.attrs["id"], "foo&bar") self.assertEqual(entity_ctl.value, "Hello testers & users!") def testSelect(self): file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) self.assert_(len(forms) == 1) form = forms[0] entity_ctl = form.find_control(type="select") self.assert_(entity_ctl.name == "foo") self.assertEqual(entity_ctl.value[0], "Hello testers & &blah; users!") hide_deprecations() opt = entity_ctl.get_item_attrs("Hello testers & &blah; users!") reset_deprecations() self.assertEqual(opt["value"], "Hello testers & &blah; users!") self.assertEqual(opt["label"], "Hello testers & &blah; users!") self.assertEqual(opt["contents"], "Hello testers & &blah; users!") def testButton(self): file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] self.assert_(form.name == "myform") control = form.find_control(name="b") self.assert_(control.type == "submitbutton") self.assert_(control.value == "") self.assert_(form.find_control("b2").type == "resetbutton") self.assert_(form.find_control("b3").type == "buttonbutton") pairs = form.click_pairs() self.assert_(pairs == [("moo", "cow"), ("b", "")]) def testIsindex(self): file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] control = form.find_control(type="isindex") self.assert_(control.type == "isindex") self.assert_(control.name is None) self.assert_(control.value == "") control.value = "some stuff" self.assert_(form.click_pairs() == []) self.assert_(form.click_request_data() == ("http://localhost/abc?some+stuff", None, [])) self.assert_(form.click().get_full_url() == "http://localhost/abc?some+stuff") def testEmptySelect(self): file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] control0 = form.find_control(type="select", nr=0) control1 = form.find_control(type="select", nr=1) self.assert_(str(control0) == "") self.assert_(str(control1) == "") form.set_value([], "foo") self.assertRaises(ItemNotFoundError, form.set_value, ["oops"], "foo") self.assert_(form.click_pairs() == []) # XXX figure out what to do in these sorts of cases ## def badSelect(self): ## # what objects should these generate, if any? ## # what should happen on submission of these? ## # what about similar checkboxes and radios? ## """
## ## ##
## """ ## """
## ##
## """ ## ## ## """ ## """
## ## ##
## """ ## def testBadCheckbox(self): ## # see comments above ## # split checkbox -- is it one control, or two? ## """ ## ## ## ## ## ## ## ## """ def testUnnamedControl(self): file = StringIO("""
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] self.assert_(form.controls[0].name is None) def testNamelessListItems(self): # XXX SELECT # these controls have no item names file = StringIO("""
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] hide_deprecations() self.assert_(form.possible_items("foo") == ["on"]) self.assert_(form.possible_items("bar") == ["on"]) reset_deprecations() #self.assert_(form.possible_items("baz") == []) self.assert_(form["foo"] == []) self.assert_(form["bar"] == []) #self.assert_(form["baz"] == []) form["foo"] = ["on"] form["bar"] = ["on"] pairs = form.click_pairs() self.assert_(pairs == [("foo", "on"), ("bar", "on"), ("submit", "")]) def testSingleSelectFixup(self): # HTML 4.01 section 17.6.1: single selection SELECT controls shouldn't # have > 1 item selected, but if they do, not more than one should end # up selected. # In fact, testing really obscure stuff here, which follows Firefox # 1.0.7 -- IE doesn't even support disabled OPTIONs. file = StringIO("""
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] # deselect all but last item if more than one were selected... spam = form.find_control("spam") self.assertEqual([ii.name for ii in spam.items if ii.selected], ["2"]) # ...even if it's disabled cow = form.find_control("cow") self.assertEqual([ii.name for ii in cow.items if ii.selected], ["2"]) # exactly one selected item is OK even if it's disabled moo = form.find_control("moo") self.assertEqual([ii.name for ii in moo.items if ii.selected], ["1"]) # if nothing was selected choose the first non-disabled item moo = form.find_control("nnn") self.assertEqual([ii.name for ii in moo.items if ii.selected], ["2"]) def testSelectDefault(self): file = StringIO( """
""") forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] control = form.find_control("a") self.assert_(control.value == []) single_control = form.find_control("b") self.assert_(single_control.value == ["1"]) file.seek(0) forms = mechanize.ParseFile(file, "http://localhost/", select_default=1, backwards_compat=False) form = forms[0] # select_default only affects *multiple* selection select controls control = form.find_control(type="select", nr=0) self.assert_(control.value == ["1"]) single_control = form.find_control(type="select", nr=1) self.assert_(single_control.value == ["1"]) def test_close_base_tag(self): # Benji York: a single newline immediately after a start tag is # stripped by browsers, but not one immediately before an end tag. # TEXTAREA content is converted to the DOS newline convention. forms = mechanize.ParseFile( StringIO("
"), "http://example.com/", backwards_compat=False, ) ctl = forms[0].find_control(type="textarea") self.assertEqual(ctl.value, "\r\nblah\r\n") def test_embedded_newlines(self): # newlines that happen to be at the start of strings passed to the # parser's .handle_data() method must not be trimmed unless they also # follow immediately after a start tag forms = mechanize.ParseFile( StringIO("
"), "http://example.com/", backwards_compat=False, ) ctl = forms[0].find_control(type="textarea") self.assertEqual(ctl.value, "\r\nspam&\r\neggs\r\n") def test_double_select(self): # More than one SELECT control of the same name in a form never # represent a single control (unlike RADIO and CHECKBOX elements), so # don't merge them. forms = mechanize.ParseFile( StringIO("""\
"""), "http://example.com/", backwards_compat=False, ) form = forms[0] self.assertEquals(len(form.controls), 2) ctl = form.find_control(name="a", nr=0) self.assertEqual([item.name for item in ctl.items], ["b", "c"]) ctl = form.find_control(name="a", nr=1) self.assertEqual([item.name for item in ctl.items], ["d", "e"]) def test_global_select(self): # regression test: closing select and textarea tags should not be # ignored, causing a ParseError due to incorrect tag nesting mechanize.ParseFileEx( StringIO("""\ """), "http://example.com/", ) mechanize.ParseFile( StringIO("""\ """), "http://example.com/", backwards_compat=False, ) def test_empty_document(self): forms = mechanize.ParseFileEx(StringIO(""), "http://example.com/") self.assertEquals(len(forms), 1) # just the "global form" def test_missing_closing_body_tag(self): # Even if there is no closing form or body tag, the last form on the # page should be returned. forms = mechanize.ParseFileEx( StringIO('
'), "http://example.com/", ) self.assertEquals(len(forms), 2) self.assertEquals(forms[1].name, "spam") class DisabledTests(unittest.TestCase): def testOptgroup(self): for compat in [False, True]: self._testOptgroup(compat) def _testOptgroup(self, compat): file = StringIO( """
""") def get_control(name, file=file, compat=compat): file.seek(0) forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=False) form = forms[0] form.backwards_compat = compat return form.find_control(name) # can't call item_disabled with no args control = get_control("foo") self.assertRaises(TypeError, control.get_item_disabled) hide_deprecations() control.set_item_disabled(True, "2") reset_deprecations() self.assertEqual( str(control), "") # list controls only allow assignment to .value if no attempt is # made to set any disabled item... # ...multi selection control = get_control("foo") if compat: extra = ["7"] else: extra = [] # disabled items are not part of the submitted value, so "7" not # included (they are not "successful": # http://www.w3.org/TR/REC-html40/interact/forms.html#successful-controls # ). This behavior was confirmed in Firefox 1.0.4 at least. self.assertEqual(control.value, []+extra) control.value = ["1"] self.assertEqual(control.value, ["1"]) control = get_control("foo") self.assertRaises(AttributeError, setattr, control, 'value', ['8']) self.assertEqual(control.value, []+extra) # even though 7 is set already, attempt to set it fails self.assertRaises(AttributeError, setattr, control, 'value', ['7']) control.value = ["1", "3"] self.assertEqual(control.value, ["1", "3"]) control = get_control("foo") self.assertRaises(AttributeError, setattr, control, 'value', ['1', '7']) self.assertEqual(control.value, []+extra) # enable all items control.set_all_items_disabled(False) control.value = ['1', '7'] self.assertEqual(control.value, ["1", "7"]) control = get_control("foo") hide_deprecations() for name in 7, 8, 10: self.assert_(control.get_item_disabled(str(name))) if not compat: # a disabled option is never "successful" (see above) so never # in value self.assert_(str(name) not in control.value) # a disabled option always is always upset if you try to set it self.assertRaises(AttributeError, control.set, True, str(name)) 
self.assert_(str(name) not in control.value) self.assertRaises(AttributeError, control.set, False, str(name)) self.assert_(str(name) not in control.value) self.assertRaises(AttributeError, control.toggle, str(name)) self.assert_(str(name) not in control.value) else: self.assertRaises(AttributeError, control.set, True, str(name)) control.set(False, str(name)) self.assert_(str(name) not in control.value) control.set(False, str(name)) self.assert_(str(name) not in control.value) self.assertRaises(AttributeError, control.toggle, str(name)) self.assert_(str(name) not in control.value) self.assertRaises(AttributeError, control.set, True, str(name)) self.assert_(str(name) not in control.value) control = get_control("foo") for name in 1, 2, 3, 4, 5, 6, 9: self.assert_(not control.get_item_disabled(str(name))) control.set(False, str(name)) self.assert_(str(name) not in control.value) control.toggle(str(name)) self.assert_(str(name) in control.value) control.set(True, str(name)) self.assert_(str(name) in control.value) control.toggle(str(name)) self.assert_(str(name) not in control.value) control = get_control("foo") self.assert_(control.get_item_disabled("7")) control.set_item_disabled(True, "7") self.assert_(control.get_item_disabled("7")) self.assertRaises(AttributeError, control.set, True, "7") control.set_item_disabled(False, "7") self.assert_(not control.get_item_disabled("7")) control.set(True, "7") control.set(False, "7") control.toggle("7") control.toggle("7") reset_deprecations() # ...single-selection control = get_control("bar") # 7 is selected but disabled if compat: value = ["7"] else: value = [] self.assertEqual(control.value, value) self.assertEqual( [ii.name for ii in control.items if ii.selected], ["7"]) control.value = ["2"] control = get_control("bar") def assign_8(control=control): control.value = ["8"] self.assertRaises(AttributeError, assign_8) self.assertEqual(control.value, value) def assign_7(control=control): control.value = ["7"] 
self.assertRaises(AttributeError, assign_7) # enable all items control.set_all_items_disabled(False) assign_7() self.assertEqual(control.value, ['7']) control = get_control("bar") hide_deprecations() for name in 7, 8, 10: self.assert_(control.get_item_disabled(str(name))) if not compat: # a disabled option is never "successful" (see above) so never in # value self.assert_(str(name) not in control.value) # a disabled option always is always upset if you try to set it self.assertRaises(AttributeError, control.set, True, str(name)) self.assert_(str(name) not in control.value) self.assertRaises(AttributeError, control.set, False, str(name)) self.assert_(str(name) not in control.value) self.assertRaises(AttributeError, control.toggle, str(name)) self.assert_(str(name) not in control.value) else: self.assertRaises(AttributeError, control.set, True, str(name)) control.set(False, str(name)) self.assert_(str(name) != control.value) control.set(False, str(name)) self.assert_(str(name) != control.value) self.assertRaises(AttributeError, control.toggle, str(name)) self.assert_(str(name) != control.value) self.assertRaises(AttributeError, control.set, True, str(name)) self.assert_(str(name) != control.value) control = get_control("bar") for name in 1, 2, 3, 4, 5, 6, 9: self.assert_(not control.get_item_disabled(str(name))) control.set(False, str(name)) self.assert_(str(name) not in control.value) control.toggle(str(name)) self.assert_(str(name) == control.value[0]) control.set(True, str(name)) self.assert_(str(name) == control.value[0]) control.toggle(str(name)) self.assert_(str(name) not in control.value) control = get_control("bar") self.assert_(control.get_item_disabled("7")) control.set_item_disabled(True, "7") self.assert_(control.get_item_disabled("7")) self.assertRaises(AttributeError, control.set, True, "7") self.assertEqual(control.value, value) control.set_item_disabled(False, "7") self.assertEqual(control.value, ["7"]) self.assert_(not control.get_item_disabled("7")) 
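The rule these assertions pin down can be summarised by a tiny model (illustrative names, not mechanize's classes): with `backwards_compat=False`, a disabled item is never "successful" and so never appears in `.value`, and attempting to `set()` or `toggle()` it raises `AttributeError`.

```python
class Item:
    # A single OPTION: selected and disabled flags, as in the tests above.
    def __init__(self, name, selected=False, disabled=False):
        self.name = name
        self.selected = selected
        self.disabled = disabled


class SelectControl:
    # Minimal model of a multiple-selection SELECT control.
    def __init__(self, items):
        self.items = items

    @property
    def value(self):
        # A disabled item is never "successful", so it never appears in
        # .value, even if the markup marks it selected.
        return [i.name for i in self.items if i.selected and not i.disabled]

    def set(self, selected, name):
        item = next(i for i in self.items if i.name == name)
        if item.disabled:
            raise AttributeError("item %r is disabled" % name)
        item.selected = selected

    def toggle(self, name):
        item = next(i for i in self.items if i.name == name)
        self.set(not item.selected, name)


ctl = SelectControl([Item("1"), Item("7", selected=True, disabled=True)])
print(ctl.value)   # "7" is selected but disabled, hence not submitted: []
ctl.set(True, "1")
print(ctl.value)   # ["1"]
```

Calling `ctl.set(True, "7")` raises `AttributeError`, mirroring the behaviour asserted for item "7" above.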
control.set(True, "7") control.set(False, "7") control.toggle("7") control.toggle("7") # set_all_items_disabled for name in "foo", "bar": control = get_control(name) control.set_all_items_disabled(False) control.set(True, "7") control.set(True, "1") control.set_all_items_disabled(True) self.assertRaises(AttributeError, control.set, True, "7") self.assertRaises(AttributeError, control.set, True, "1") reset_deprecations() # XXX single select def testDisabledSelect(self): for compat in [False, True]: self._testDisabledSelect(compat) def _testDisabledSelect(self, compat): file = StringIO( """
""") hide_deprecations() forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=compat) reset_deprecations() form = forms[0] for name, control_disabled, item_disabled in [ ("foo", False, False), ("bar", False, True), ("baz", True, False), ("spam", True, True)]: control = form.find_control(name) self.assertEqual(bool(control.disabled), control_disabled) hide_deprecations() item = control.get_item_attrs("2") reset_deprecations() self.assertEqual(bool(item.has_key("disabled")), item_disabled) def bad_assign(value, control=control): control.value = value hide_deprecations() if control_disabled: for name in "1", "2", "3": self.assertRaises(AttributeError, control.set, True, name) self.assertRaises(AttributeError, bad_assign, [name]) elif item_disabled: self.assertRaises(AttributeError, control.set, True, "2") self.assertRaises(AttributeError, bad_assign, ["2"]) for name in "1", "3": control.set(True, name) else: control.value = ["1", "2", "3"] reset_deprecations() control = form.find_control("foo") # missing disabled arg hide_deprecations() self.assertRaises(TypeError, control.set_item_disabled, "1") # by_label self.assert_(not control.get_item_disabled("a", by_label=True)) control.set_item_disabled(True, "a", by_label=True) self.assert_(control.get_item_disabled("a", by_label=True)) reset_deprecations() def testDisabledRadio(self): for compat in False, True: self._testDisabledRadio(compat) def _testDisabledRadio(self, compat): file = StringIO( """
""") hide_deprecations() forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=compat) form = forms[0] control = form.find_control('foo') # since all items are disabled, .fixup() should not select # anything self.assertEquals( [item.name for item in control.items if item.selected], [], ) reset_deprecations() def testDisabledCheckbox(self): for compat in False, True: self._testDisabledCheckbox(compat) def _testDisabledCheckbox(self, compat): file = StringIO( """
""") hide_deprecations() forms = mechanize.ParseFile(file, "http://localhost/", backwards_compat=compat) reset_deprecations() form = forms[0] for name, control_disabled, item_disabled in [ ("foo", False, False), ("bar", False, True), ("baz", False, True)]: control = form.find_control(name) self.assert_(bool(control.disabled) == control_disabled) hide_deprecations() item = control.get_item_attrs("2") self.assert_(bool(item.has_key("disabled")) == item_disabled) self.assert_(control.get_item_disabled("2") == item_disabled) def bad_assign(value, control=control): control.value = value if item_disabled: self.assertRaises(AttributeError, control.set, True, "2") self.assertRaises(AttributeError, bad_assign, ["2"]) if not control.get_item_disabled("1"): control.set(True, "1") else: control.value = ["1", "2", "3"] reset_deprecations() control = form.find_control("foo") hide_deprecations() control.set_item_disabled(False, "1") # missing disabled arg self.assertRaises(TypeError, control.set_item_disabled, "1") # by_label self.failIf(control.get_item_disabled('a', by_label=True)) self.assert_(not control.get_item_disabled("1")) control.set_item_disabled(True, 'a', by_label=True) self.assert_(control.get_item_disabled("1")) reset_deprecations() class ControlTests(unittest.TestCase): def testTextControl(self): attrs = {"type": "this is ignored", "name": "ath_Uname", "value": "", "maxlength": "20", "id": "foo"} c = _form.TextControl("texT", "ath_Uname", attrs) c.fixup() self.assert_(c.type == "text") self.assert_(c.name == "ath_Uname") self.assert_(c.id == "foo") self.assert_(c.value == "") self.assert_(str(c) == "") self.assert_(c.pairs() == [("ath_Uname", "")]) def bad_assign(c=c): c.type = "sometype" self.assertRaises(AttributeError, bad_assign) self.assert_(c.type == "text") def bad_assign(c=c): c.name = "somename" self.assertRaises(AttributeError, bad_assign) self.assert_(c.name == "ath_Uname") c.value = "2" self.assert_(c.value == "2") c.readonly = True 
self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value is None) self.assert_(c.pairs() == []) c.value = "2" # reset value... self.assert_(str(c) == "") def bad_assign(c=c): c.value = ["foo"] self.assertRaises(TypeError, bad_assign) self.assert_(c.value == "2") self.assert_(not c.readonly) c.readonly = True def bad_assign(c=c): c.value = "foo" self.assertRaises(AttributeError, bad_assign) self.assert_(c.value == "2") c.disabled = True self.assert_(str(c) == "") c.readonly = False self.assert_(str(c) == "") self.assertRaises(AttributeError, bad_assign) self.assert_(c.value == "2") self.assert_(c.pairs() == []) c.disabled = False self.assert_(str(c) == "") self.assert_(c.attrs.has_key("maxlength")) for key in "name", "type", "value": self.assert_(c.attrs.has_key(key)) # initialisation of readonly and disabled attributes attrs["readonly"] = True c = _form.TextControl("text", "ath_Uname", attrs) def bad_assign(c=c): c.value = "foo" self.assertRaises(AttributeError, bad_assign) del attrs["readonly"] attrs["disabled"] = True c = _form.TextControl("text", "ath_Uname", attrs) def bad_assign(c=c): c.value = "foo" self.assertRaises(AttributeError, bad_assign) del attrs["disabled"] c = _form.TextControl("hidden", "ath_Uname", attrs) self.assert_(c.readonly) def bad_assign(c=c): c.value = "foo" self.assertRaises(AttributeError, bad_assign) def testFileControl(self): c = _form.FileControl("file", "test_file", {}) fp = StringIO() c.add_file(fp) fp2 = StringIO() c.add_file(fp2, None, "fp2 file test") self.assert_(str(c) == ', fp2 file test)>') c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(str(c) == ')>') def testIsindexControl(self): attrs = {"type": "this is ignored", "prompt": ">>>"} c = _form.IsindexControl("isIndex", None, attrs) c.fixup() self.assert_(c.type == "isindex") self.assert_(c.name is None) self.assert_(c.value == "") self.assert_(str(c) == "") self.assert_(c.pairs() == 
[]) def set_type(c=c): c.type = "sometype" self.assertRaises(AttributeError, set_type) self.assert_(c.type == "isindex") def set_name(c=c): c.name = "somename" self.assertRaises(AttributeError, set_name) def set_value(value, c=c): c.value = value self.assertRaises(TypeError, set_value, [None]) self.assert_(c.name is None) c.value = "2" self.assert_(c.value == "2") self.assert_(str(c) == "") c.disabled = True self.assert_(str(c) == "") self.assertRaises(AttributeError, set_value, "foo") self.assert_(c.value == "2") self.assert_(c.pairs() == []) c.readonly = True self.assert_(str(c) == "") self.assertRaises(AttributeError, set_value, "foo") c.disabled = False self.assert_(str(c) == "") self.assertRaises(AttributeError, set_value, "foo") c.readonly = False self.assert_(str(c) == "") self.assert_(c.attrs.has_key("type")) self.assert_(c.attrs.has_key("prompt")) self.assert_(c.attrs["prompt"] == ">>>") for key in "name", "value": self.assert_(not c.attrs.has_key(key)) c.value = "foo 1 bar 2" class FakeForm: action = "http://localhost/" form = FakeForm() self.assert_(c._click(form, (1,1), "request_data") == ("http://localhost/?foo+1+bar+2", None, [])) c.value = "foo 1 bar 2" c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value is None) def testIgnoreControl(self): attrs = {"type": "this is ignored"} c = _form.IgnoreControl("reset", None, attrs) self.assert_(c.type == "reset") self.assert_(c.value is None) self.assert_(str(c) == "=)>") def set_value(value, c=c): c.value = value self.assertRaises(AttributeError, set_value, "foo") self.assert_(c.value is None) # this is correct, but silly; basically nothing should happen c.clear() self.assert_(c.value is None) def testSubmitControl(self): attrs = {"type": "this is ignored", "name": "name_value", "value": "value_value", "img": "foo.gif"} c = _form.SubmitControl("submit", "name_value", attrs) self.assert_(c.type == "submit") self.assert_(c.name == "name_value") 
self.assert_(c.value == "value_value") self.assert_(str(c) == "") c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value is None) c.value = "value_value" c.readonly = True def set_value(value, c=c): c.value = value self.assertRaises(TypeError, set_value, ["foo"]) c.disabled = True self.assertRaises(AttributeError, set_value, "value_value") self.assert_(str(c) == "") c.disabled = False c.readonly = False set_value("value_value") self.assert_(str(c) == "") c.readonly = True # click on button form = _form.HTMLForm("http://foo.bar.com/") c.add_to_form(form) self.assert_(c.pairs() == []) pairs = c._click(form, (1,1), "pairs") request = c._click(form, (1,1), "request") data = c._click(form, (1,1), "request_data") self.assert_(c.pairs() == []) self.assert_(pairs == [("name_value", "value_value")]) self.assert_(request.get_full_url() == "http://foo.bar.com/?name_value=value_value") self.assert_(data == ("http://foo.bar.com/?name_value=value_value", None, [])) c.disabled = True pairs = c._click(form, (1,1), "pairs") request = c._click(form, (1,1), "request") data = c._click(form, (1,1), "request_data") self.assert_(pairs == []) # XXX not sure if should have '?' on end of this URL, or if it really matters... 
self.assert_(request.get_full_url() == "http://foo.bar.com/") self.assert_(data == ("http://foo.bar.com/", None, [])) def testImageControl(self): attrs = {"type": "this is ignored", "name": "name_value", "img": "foo.gif"} c = _form.ImageControl("image", "name_value", attrs, index=0) self.assert_(c.type == "image") self.assert_(c.name == "name_value") self.assert_(c.value == "") self.assert_(str(c) == "") c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value is None) c.value = "" # click, at coordinate (0, 55), on image form = _form.HTMLForm("http://foo.bar.com/") c.add_to_form(form) self.assert_(c.pairs() == []) request = c._click(form, (0, 55), "request") self.assert_(c.pairs() == []) self.assert_(request.get_full_url() == "http://foo.bar.com/?name_value.x=0&name_value.y=55") self.assert_(c._click(form, (0,55), return_type="request_data") == ("http://foo.bar.com/?name_value.x=0&name_value.y=55", None, [])) c.value = "blah" request = c._click(form, (0, 55), "request") self.assertEqual(request.get_full_url(), "http://foo.bar.com/?" 
"name_value.x=0&name_value.y=55&name_value=blah") c.disabled = True self.assertEqual(c.value, "blah") self.assert_(str(c) == "") def set_value(value, c=c): c.value = value self.assertRaises(AttributeError, set_value, "blah") self.assert_(c._click(form, (1,1), return_type="pairs") == []) c.readonly = True self.assert_(str(c) == "") self.assertRaises(AttributeError, set_value, "blah") self.assert_(c._click(form, (1,1), return_type="pairs") == []) c.disabled = c.readonly = False self.assert_(c._click(form, (1,1), return_type="pairs") == [("name_value.x", "1"), ("name_value.y", "1"), ('name_value', 'blah')]) def testCheckboxControl(self): attrs = {"type": "this is ignored", "name": "name_value", "value": "value_value", "alt": "some string"} form = DummyForm() c = _form.CheckboxControl("checkbox", "name_value", attrs) c.add_to_form(form) c.fixup() self.assert_(c.type == "checkbox") self.assert_(c.name == "name_value") self.assert_(c.value == []) hide_deprecations() self.assert_(c.possible_items() == ["value_value"]) reset_deprecations() def set_type(c=c): c.type = "sometype" self.assertRaises(AttributeError, set_type) self.assert_(c.type == "checkbox") def set_name(c=c): c.name = "somename" self.assertRaises(AttributeError, set_name) self.assert_(c.name == "name_value") # construct larger list from length-1 lists c = _form.CheckboxControl("checkbox", "name_value", attrs) attrs2 = attrs.copy() attrs2["value"] = "value_value2" c2 = _form.CheckboxControl("checkbox", "name_value", attrs2) c2.add_to_form(form) c.merge_control(c2) c.add_to_form(form) c.fixup() self.assert_(str(c) == "") hide_deprecations() self.assert_(c.possible_items() == ["value_value", "value_value2"]) attrs = c.get_item_attrs("value_value") for key in "alt", "name", "value", "type": self.assert_(attrs.has_key(key)) self.assertRaises(ItemNotFoundError, c.get_item_attrs, "oops") reset_deprecations() def set_value(value, c=c): c.value = value c.value = ["value_value", "value_value2"] self.assert_(c.value == 
["value_value", "value_value2"]) c.value = ["value_value"] self.assertEqual(c.value, ["value_value"]) self.assertRaises(ItemNotFoundError, set_value, ["oops"]) self.assertRaises(TypeError, set_value, "value_value") c.value = ["value_value2"] self.assert_(c.value == ["value_value2"]) hide_deprecations() c.toggle("value_value") self.assert_(c.value == ["value_value", "value_value2"]) c.toggle("value_value2") reset_deprecations() self.assert_(c.value == ["value_value"]) hide_deprecations() self.assertRaises(ItemNotFoundError, c.toggle, "oops") reset_deprecations() self.assert_(c.value == ["value_value"]) c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value == []) # set hide_deprecations() c.set(True, "value_value") self.assert_(c.value == ["value_value"]) c.set(True, "value_value2") self.assert_(c.value == ["value_value", "value_value2"]) c.set(True, "value_value2") self.assert_(c.value == ["value_value", "value_value2"]) c.set(False, "value_value2") self.assert_(c.value == ["value_value"]) c.set(False, "value_value2") self.assert_(c.value == ["value_value"]) self.assertRaises(ItemNotFoundError, c.set, True, "oops") self.assertRaises(TypeError, c.set, True, ["value_value"]) self.assertRaises(ItemNotFoundError, c.set, False, "oops") self.assertRaises(TypeError, c.set, False, ["value_value"]) reset_deprecations() self.assert_(str(c) == "") c.disabled = True self.assertRaises(AttributeError, set_value, ["value_value"]) self.assert_(str(c) == "") self.assert_(c.value == ["value_value"]) self.assert_(c.pairs() == []) c.readonly = True self.assertRaises(AttributeError, set_value, ["value_value"]) self.assert_(str(c) == "") self.assert_(c.value == ["value_value"]) self.assert_(c.pairs() == []) c.disabled = False self.assert_(str(c) == "") self.assertRaises(AttributeError, set_value, ["value_value"]) self.assert_(c.value == ["value_value"]) self.assert_(c.pairs() == [("name_value", "value_value")]) c.readonly = False 
c.value = [] self.assert_(c.value == []) def testSelectControlMultiple(self): attrs = {"type": "this is ignored", "name": "name_value", "value": "value_value", "alt": "some string", "label": "contents_value", "contents": "contents_value", "__select": {"type": "this is ignored", "name": "select_name", "multiple": "", "alt": "alt_text"}} form = DummyForm() # with Netscape / IE default selection... c = _form.SelectControl("select", "select_name", attrs) c.add_to_form(form) c.fixup() self.assert_(c.type == "select") self.assert_(c.name == "select_name") self.assert_(c.value == []) hide_deprecations() self.assert_(c.possible_items() == ["value_value"]) reset_deprecations() self.assert_(c.attrs.has_key("name")) self.assert_(c.attrs.has_key("type")) self.assert_(c.attrs["alt"] == "alt_text") # ... and with RFC 1866 default selection c = _form.SelectControl("select", "select_name", attrs, select_default=True) c.add_to_form(form) c.fixup() self.assert_(c.value == ["value_value"]) # construct larger list from length-1 lists c = _form.SelectControl("select", "select_name", attrs) attrs2 = attrs.copy() attrs2["value"] = "value_value2" c2 = _form.SelectControl("select", "select_name", attrs2) c2.add_to_form(form) c.merge_control(c2) c.add_to_form(form) c.fixup() self.assert_(str(c) == "") hide_deprecations() self.assert_(c.possible_items() == ["value_value", "value_value2"]) # get_item_attrs attrs3 = c.get_item_attrs("value_value") reset_deprecations() self.assert_(attrs3.has_key("alt")) self.assert_(not attrs3.has_key("multiple")) # HTML attributes dictionary should have been copied by ListControl # constructor. 
attrs["new_attr"] = "new" attrs2["new_attr2"] = "new2" for key in ("new_attr", "new_attr2"): self.assert_(not attrs3.has_key(key)) hide_deprecations() self.assertRaises(ItemNotFoundError, c.get_item_attrs, "oops") reset_deprecations() c.value = ["value_value", "value_value2"] self.assert_(c.value == ["value_value", "value_value2"]) c.value = ["value_value"] self.assertEqual(c.value, ["value_value"]) def set_value(value, c=c): c.value = value self.assertRaises(ItemNotFoundError, set_value, ["oops"]) self.assertRaises(TypeError, set_value, "value_value") self.assertRaises(TypeError, set_value, None) c.value = ["value_value2"] self.assert_(c.value == ["value_value2"]) hide_deprecations() c.toggle("value_value") self.assert_(c.value == ["value_value", "value_value2"]) c.toggle("value_value2") self.assert_(c.value == ["value_value"]) self.assertRaises(ItemNotFoundError, c.toggle, "oops") self.assert_(c.value == ["value_value"]) reset_deprecations() c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value == []) # test ordering of items c.value = ["value_value2", "value_value"] self.assert_(c.value == ["value_value", "value_value2"]) # set hide_deprecations() c.set(True, "value_value") self.assert_(c.value == ["value_value", "value_value2"]) c.set(True, "value_value2") self.assert_(c.value == ["value_value", "value_value2"]) c.set(False, "value_value") self.assert_(c.value == ["value_value2"]) c.set(False, "value_value") self.assert_(c.value == ["value_value2"]) self.assertRaises(ItemNotFoundError, c.set, True, "oops") self.assertRaises(TypeError, c.set, True, ["value_value"]) self.assertRaises(ItemNotFoundError, c.set, False, "oops") self.assertRaises(TypeError, c.set, False, ["value_value"]) reset_deprecations() c.value = [] self.assert_(c.value == []) def testSelectControlMultiple_label(self): ## attrs = {"type": "ignored", "name": "year", "value": "0", "label": "2002", "contents": "current year", "__select": 
{"type": "this is ignored", "name": "select_name", "multiple": ""}} attrs2 = {"type": "ignored", "name": "year", "value": "1", "label": "2001", # label defaults to contents "contents": "2001", "__select": {"type": "this is ignored", "name": "select_name", "multiple": ""}} attrs3 = {"type": "ignored", "name": "year", "value": "2000", # value defaults to contents "label": "2000", # label defaults to contents "contents": "2000", "__select": {"type": "this is ignored", "name": "select_name", "multiple": ""}} c = _form.SelectControl("select", "select_name", attrs) c2 = _form.SelectControl("select", "select_name", attrs2) c3 = _form.SelectControl("select", "select_name", attrs3) form = DummyForm() c.merge_control(c2) c.merge_control(c3) c.add_to_form(form) c.fixup() hide_deprecations() self.assert_(c.possible_items() == ["0", "1", "2000"]) self.assert_(c.possible_items(by_label=True) == ["2002", "2001", "2000"]) self.assert_(c.value == []) c.toggle("2002", by_label=True) self.assert_(c.value == ["0"]) c.toggle("0") self.assert_(c.value == []) c.toggle("0") self.assert_(c.value == ["0"]) self.assert_(c.get_value_by_label() == ["2002"]) c.toggle("2002", by_label=True) self.assertRaises(ItemNotFoundError, c.toggle, "blah", by_label=True) self.assert_(c.value == []) c.toggle("2000") reset_deprecations() self.assert_(c.value == ["2000"]) self.assert_(c.get_value_by_label() == ["2000"]) def set_value(value, c=c): c.value = value self.assertRaises(ItemNotFoundError, set_value, ["2002"]) self.assertRaises(TypeError, set_value, "1") self.assertRaises(TypeError, set_value, None) self.assert_(c.value == ["2000"]) c.value = ["0"] self.assertEqual(c.value, ["0"]) c.value = [] self.assertRaises(TypeError, c.set_value_by_label, "2002") c.set_value_by_label(["2002"]) self.assert_(c.value == ["0"]) self.assert_(c.get_value_by_label() == ["2002"]) c.set_value_by_label(["2000"]) self.assert_(c.value == ["2000"]) self.assert_(c.get_value_by_label() == ["2000"]) c.set_value_by_label(["2000", 
"2002"]) self.assert_(c.value == ["0", "2000"]) self.assert_(c.get_value_by_label() == ["2002", "2000"]) c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value == []) c.set_value_by_label(["2000", "2002"]) hide_deprecations() c.set(False, "2002", by_label=True) self.assert_(c.get_value_by_label() == c.value == ["2000"]) c.set(False, "2002", by_label=True) self.assert_(c.get_value_by_label() == c.value == ["2000"]) c.set(True, "2002", by_label=True) self.assert_(c.get_value_by_label() == ["2002", "2000"]) self.assert_(c.value == ["0", "2000"]) c.set(False, "2000", by_label=True) self.assert_(c.get_value_by_label() == ["2002"]) self.assert_(c.value == ["0"]) c.set(True, "2001", by_label=True) self.assert_(c.get_value_by_label() == ["2002", "2001"]) self.assert_(c.value == ["0", "1"]) self.assertRaises(ItemNotFoundError, c.set, True, "blah", by_label=True) self.assertRaises(ItemNotFoundError, c.set, False, "blah", by_label=True) reset_deprecations() def testSelectControlSingle_label(self): ## attrs = {"type": "ignored", "name": "year", "value": "0", "label": "2002", "contents": "current year", "__select": {"type": "this is ignored", "name": "select_name"}} attrs2 = {"type": "ignored", "name": "year", "value": "1", "label": "2001", # label defaults to contents "contents": "2001", "__select": {"type": "this is ignored", "name": "select_name"}} attrs3 = {"type": "ignored", "name": "year", "value": "2000", # value defaults to contents "label": "2000", # label defaults to contents "contents": "2000", "__select": {"type": "this is ignored", "name": "select_name"}} c = _form.SelectControl("select", "select_name", attrs) c2 = _form.SelectControl("select", "select_name", attrs2) c3 = _form.SelectControl("select", "select_name", attrs3) form = DummyForm() c.merge_control(c2) c.merge_control(c3) c.add_to_form(form) c.fixup() hide_deprecations() self.assert_(c.possible_items() == ["0", "1", "2000"]) 
self.assert_(c.possible_items(by_label=True) == ["2002", "2001", "2000"]) reset_deprecations() def set_value(value, c=c): c.value = value self.assertRaises(ItemNotFoundError, set_value, ["2002"]) self.assertRaises(TypeError, set_value, "1") self.assertRaises(TypeError, set_value, None) self.assert_(c.value == ["0"]) c.value = [] self.assert_(c.value == []) c.value = ["0"] self.assert_(c.value == ["0"]) c.value = [] self.assertRaises(TypeError, c.set_value_by_label, "2002") self.assertRaises(ItemCountError, c.set_value_by_label, ["2000", "2001"]) self.assertRaises(ItemNotFoundError, c.set_value_by_label, ["foo"]) c.set_value_by_label(["2002"]) self.assert_(c.value == ["0"]) self.assert_(c.get_value_by_label() == ["2002"]) c.set_value_by_label(["2000"]) self.assert_(c.value == ["2000"]) self.assert_(c.get_value_by_label() == ["2000"]) c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value == []) def testSelectControlSingle(self): attrs = {"type": "this is ignored", "name": "name_value", "value": "value_value", "label": "contents_value", "contents": "contents_value", "__select": {"type": "this is ignored", "name": "select_name", "alt": "alt_text"}} # Netscape and IE behaviour... c = _form.SelectControl("select", "select_name", attrs) form = DummyForm() c.add_to_form(form) c.fixup() self.assert_(c.type == "select") self.assert_(c.name == "select_name") self.assert_(c.value == ["value_value"]) hide_deprecations() self.assert_(c.possible_items() == ["value_value"]) reset_deprecations() self.assert_(c.attrs.has_key("name")) self.assert_(c.attrs.has_key("type")) self.assert_(c.attrs["alt"] == "alt_text") # ...and RFC 1866 behaviour are identical (unlike multiple SELECT). 
c = _form.SelectControl("select", "select_name", attrs, select_default=1) c.add_to_form(form) c.fixup() self.assert_(c.value == ["value_value"]) # construct larger list from length-1 lists c = _form.SelectControl("select", "select_name", attrs) attrs2 = attrs.copy() attrs2["value"] = "value_value2" c2 = _form.SelectControl("select", "select_name", attrs2) c.merge_control(c2) c.add_to_form(form) c.fixup() self.assert_(str(c) == "") c.value = [] self.assert_(c.value == []) self.assert_(str(c) == "") c.value = ["value_value"] self.assert_(c.value == ["value_value"]) self.assert_(str(c) == "") hide_deprecations() self.assert_(c.possible_items() == ["value_value", "value_value2"]) reset_deprecations() def set_value(value, c=c): c.value = value self.assertRaises(ItemCountError, set_value, ["value_value", "value_value2"]) self.assertRaises(TypeError, set_value, "value_value") self.assertRaises(TypeError, set_value, None) c.value = ["value_value2"] self.assert_(c.value == ["value_value2"]) c.value = ["value_value"] self.assert_(c.value == ["value_value"]) self.assertRaises(ItemNotFoundError, set_value, ["oops"]) self.assert_(c.value == ["value_value"]) hide_deprecations() c.toggle("value_value") self.assertRaises(ItemNotFoundError, c.toggle, "oops") self.assertRaises(TypeError, c.toggle, ["oops"]) reset_deprecations() self.assert_(c.value == []) c.value = ["value_value"] self.assert_(c.value == ["value_value"]) # nothing selected is allowed c.value = [] self.assert_(c.value == []) hide_deprecations() c.set(True, "value_value") self.assert_(c.value == ["value_value"]) c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assert_(c.value == []) # set c.set(True, "value_value") self.assert_(c.value == ["value_value"]) c.set(True, "value_value") self.assert_(c.value == ["value_value"]) c.set(True, "value_value2") self.assert_(c.value == ["value_value2"]) c.set(False, "value_value") self.assert_("value_value2") c.set(False, 
"value_value2") self.assert_(c.value == []) c.set(False, "value_value2") self.assert_(c.value == []) self.assertRaises(ItemNotFoundError, c.set, True, "oops") self.assertRaises(TypeError, c.set, True, ["value_value"]) self.assertRaises(ItemNotFoundError, c.set, False, "oops") self.assertRaises(TypeError, c.set, False, ["value_value"]) reset_deprecations() def testRadioControl(self): attrs = {"type": "this is ignored", "name": "name_value", "value": "value_value", "id": "blah"} # Netscape and IE behaviour... c = _form.RadioControl("radio", "name_value", attrs) form = DummyForm() c.add_to_form(form) c.fixup() self.assert_(c.type == "radio") self.assert_(c.name == "name_value") self.assert_(c.id == "blah") self.assert_(c.value == []) hide_deprecations() self.assert_(c.possible_items() == ["value_value"]) reset_deprecations() # ...and RFC 1866 behaviour c = _form.RadioControl("radio", "name_value", attrs, select_default=True) c.add_to_form(form) c.fixup() self.assert_(c.value == ["value_value"]) # construct larger list from length-1 lists c = _form.RadioControl("radio", "name_value", attrs, select_default=True) attrs2 = attrs.copy() attrs2["value"] = "value_value2" c2 = _form.RadioControl("radio", "name_value", attrs2, select_default=True) c.merge_control(c2) c.add_to_form(form) c.fixup() self.assert_(str(c) == "") hide_deprecations() self.assert_(c.possible_items() == ["value_value", "value_value2"]) reset_deprecations() def set_value(value, c=c): c.value = value self.assertRaises(ItemCountError, set_value, ["value_value", "value_value2"]) self.assertRaises(TypeError, set_value, "value_value") self.assertEqual(c.value, ["value_value"]) c.value = ["value_value2"] self.assertEqual(c.value, ["value_value2"]) c.value = ["value_value"] self.assertEqual(c.value, ["value_value"]) self.assertRaises(ItemNotFoundError, set_value, ["oops"]) self.assertEqual(c.value, ["value_value"]) hide_deprecations() c.toggle("value_value") self.assertEqual(c.value, []) c.toggle("value_value") 
self.assertEqual(c.value, ["value_value"]) self.assertRaises(TypeError, c.toggle, ["value_value"]) self.assertEqual(c.value, ["value_value"]) # nothing selected is allowed c.value = [] self.assertEqual(c.value, []) c.set(True, "value_value") reset_deprecations() self.assertEqual(c.value, ["value_value"]) c.readonly = True self.assertRaises(AttributeError, c.clear) c.readonly = False c.clear() self.assertEqual(c.value, []) # set hide_deprecations() c.set(True, "value_value") self.assertEqual(c.value, ["value_value"]) c.set(True, "value_value") self.assertEqual(c.value, ["value_value"]) c.set(True, "value_value2") self.assertEqual(c.value, ["value_value2"]) c.set(False, "value_value") self.assert_("value_value2") c.set(False, "value_value2") self.assertEqual(c.value, []) c.set(False, "value_value2") self.assertEqual(c.value, []) self.assertRaises(ItemNotFoundError, c.set, True, "oops") self.assertRaises(TypeError, c.set, True, ["value_value"]) self.assertRaises(ItemNotFoundError, c.set, False, "oops") self.assertRaises(TypeError, c.set, False, ["value_value"]) reset_deprecations() # tests for multiple identical values attrs = {"type": "this is ignored", "name": "name_value", "value": "value_value", "id": "name_value_1"} c1 = _form.RadioControl("radio", "name_value", attrs) attrs = {"type": "this is ignored", "name": "name_value", "value": "value_value", "id": "name_value_2", "checked": "checked"} c2 = _form.RadioControl("radio", "name_value", attrs) attrs = {"type": "this is ignored", "name": "name_value", "value": "another_value", "id": "name_value_3", "__label": {"__text": "Third Option"}} c3 = _form.RadioControl("radio", "name_value", attrs) form = DummyForm() c1.merge_control(c2) c1.merge_control(c3) c1.add_to_form(form) c1.fixup() self.assertEqual(c1.value, ['value_value']) hide_deprecations() self.assertEqual( c1.possible_items(), ['value_value', 'value_value', 'another_value']) reset_deprecations() self.assertEqual(c1.value, ['value_value']) 
self.failIf(c1.items[0].selected) self.failUnless(c1.items[1].selected) self.failIf(c1.items[2].selected) c1.value = ['value_value'] # should be no change self.failUnless(c1.items[1].selected) self.assertEqual(c1.value, ['value_value']) c1.value = ['another_value'] self.failUnless(c1.items[2].selected) self.assertEqual(c1.value, ['another_value']) c1.value = ['value_value'] self.failUnless(c1.items[0].selected) self.assertEqual(c1.value, ['value_value']) # id labels form._id_to_labels['name_value_1'] = [ _form.Label({'for': 'name_value_1', '__text':'First Option'})] form._id_to_labels['name_value_2'] = [ _form.Label({'for': 'name_value_2', '__text':'Second Option'})] form._id_to_labels['name_value_3'] = [ _form.Label({'for': 'name_value_3', '__text':'Last Option'})] # notice __label above self.assertEqual([l.text for l in c1.items[0].get_labels()], ['First Option']) self.assertEqual([l.text for l in c1.items[1].get_labels()], ['Second Option']) self.assertEqual([l.text for l in c1.items[2].get_labels()], ['Third Option', 'Last Option']) self.assertEqual(c1.get_value_by_label(), ['First Option']) c1.set_value_by_label(['Second Option']) self.assertEqual(c1.get_value_by_label(), ['Second Option']) self.assertEqual(c1.value, ['value_value']) c1.set_value_by_label(['Third Option']) self.assertEqual(c1.get_value_by_label(), ['Third Option']) self.assertEqual(c1.value, ['another_value']) c1.items[1].selected = True self.assertEqual(c1.get_value_by_label(), ['Second Option']) self.assertEqual(c1.value, ['value_value']) c1.set_value_by_label(['Last Option']) # by second label self.assertEqual(c1.get_value_by_label(), ['Third Option']) self.assertEqual(c1.value, ['another_value']) c1.set_value_by_label(['irst']) # by substring self.assertEqual(c1.get_value_by_label(), ['First Option']) class FormTests(unittest.TestCase): base_uri = "http://auth.athensams.net/" def _get_test_file(self, filename): import test_form this_dir = os.path.dirname(test_form.__file__) path = 
os.path.join(this_dir, "test_form_data", filename)
        return open(path)

    def test_find_control(self):
        f = StringIO("""\
""")
        form = mechanize.ParseFile(f, "http://example.com/",
                                   backwards_compat=False)[0]
        for compat in True, False:
            form.backwards_compat = compat
            fc = form.find_control
            self.assertEqual(fc("form.title").id, "form.title")
            self.assertEqual(fc("form.title", nr=0).id, "form.title")
            if compat:
                self.assertEqual(fc("password").id, "pswd1")
            else:
                self.assertRaises(AmbiguityError, fc, "password")
            self.assertEqual(fc("password", id="pswd2").id, "pswd2")
            self.assertEqual(fc("password", nr=0).id, "pswd1")
            self.assertRaises(ControlNotFoundError, fc, "form.title", nr=1)
            self.assertRaises(ControlNotFoundError, fc, nr=50)
            self.assertRaises(ValueError, fc, nr=-1)
            self.assertRaises(ControlNotFoundError, fc, label="Bananas")
            # label
            self.assertEqual(fc(label="Title").id, "form.title")
            self.assertEqual(fc(label="Book Title").id, "form.title")
            self.assertRaises(ControlNotFoundError, fc,
                              label=" Book Title ")
            self.assertRaises(ControlNotFoundError, fc, label="Bananas")
            self.assertRaises(ControlNotFoundError, fc, label="title")
            self.assertEqual(fc(label="Book", nr=0).id, "form.title")
            self.assertEqual(fc(label="Book", nr=1).id, "form.quality")
            if compat:
                self.assertEqual(fc(label="Book").id, "form.title")
            else:
                self.assertRaises(AmbiguityError, fc, label="Book")

    def test_find_nameless_control(self):
        data = """\
"""
        f = StringIO(data)
        form = mechanize.ParseFile(f, "http://example.com/",
                                   backwards_compat=False)[0]
        self.assertRaises(
            AmbiguityError,
            form.find_control, type="checkbox", name=mechanize.Missing)
        ctl = form.find_control(type="checkbox", name=mechanize.Missing,
                                nr=1)
        self.assertEqual(ctl.id, "a")

    def test_deselect_disabled(self):
        def get_new_form(f, compat):
            f.seek(0)
            form = mechanize.ParseFile(f, "http://example.com/",
                                       backwards_compat=False)[0]
            form.backwards_compat = compat
            return form
        f = StringIO("""\
""") for compat in [False]:#True, False: def new_form(compat=compat, f=f, get_new_form=get_new_form): form = get_new_form(f, compat) ctl = form.find_control("p") a = ctl.get("a") return ctl, a ctl, a = new_form() ctl.value = ["b"] # :-(( if compat: # rationale: allowed to deselect, but not select, disabled # items ctl, a = new_form() self.assertRaises(AttributeError, setattr, a, "selected", True) self.assertRaises(AttributeError, setattr, ctl, "value", ["a"]) a.selected = False ctl, a = new_form() ctl.value = ["b"] self.assertEqual(a.selected, False) self.assertEqual(ctl.value, ["b"]) ctl, a = new_form() self.assertRaises(AttributeError, setattr, ctl, "value", ["a", "b"]) else: # rationale: Setting an individual item's selected state to its # present value is a no-op, as is setting the whole control # value where an item name doesn't appear in the new value, but # that item is disabled anyway (but an item name that does # appear in the new value is treated an explicit request that # that item name get sent to the server). However, if the # item's state does change, both selecting and deselecting are # disallowed for disabled items. ctl, a = new_form() self.assertRaises(AttributeError, setattr, a, "selected", True) ctl, a = new_form() self.assertRaises(AttributeError, setattr, ctl, "value", ["a"]) ctl, a = new_form() self.assertRaises(AttributeError, setattr, a, "selected", False) ctl.value = ["b"] self.assertEqual(a.selected, True) self.assertEqual(ctl.value, ["b"]) ctl, a = new_form() self.assertRaises(AttributeError, setattr, ctl, "value", ["a", "b"]) f = StringIO("""\
""") for compat in [False]:#True, False: def new_form(compat=compat, f=f, get_new_form=get_new_form): form = get_new_form(f, compat) ctl = form.find_control("p") a = ctl.get("a") return ctl, a ctl, a = new_form() ctl.value = ["b"] if compat: ctl, a = new_form() self.assertRaises(AttributeError, setattr, a, "selected", True) self.assertRaises(AttributeError, setattr, ctl, "value", ["a"]) a.selected = False ctl, a = new_form() ctl.value = ["b"] self.assertEqual(a.selected, False) self.assertEqual(ctl.value, ["b"]) ctl, a = new_form() self.assertRaises(ItemCountError, setattr, ctl, "value", ["a", "b"]) else: ctl, a = new_form() self.assertRaises(AttributeError, setattr, a, "selected", True) ctl, a = new_form() self.assertRaises(AttributeError, setattr, ctl, "value", ["a"]) ctl, a = new_form() self.assertRaises(AttributeError, setattr, a, "selected", False) ctl.value = ["b"] self.assertEqual(a.selected, False) self.assertEqual(ctl.value, ["b"]) ctl, a = new_form() self.assertRaises(ItemCountError, setattr, ctl, "value", ["a", "b"]) def test_click(self): file = StringIO( """
""") form = mechanize.ParseFile(file, "http://blah/", backwards_compat=False)[0] self.assertRaises(ControlNotFoundError, form.click, nr=2) self.assert_(form.click().get_full_url() == "http://blah/abc?foo=") self.assert_(form.click(name="bar").get_full_url() == "http://blah/abc?bar=") for method in ["GET", "POST"]: file = StringIO( """
""" % method) # " (this line is here for emacs) form = mechanize.ParseFile(file, "http://blah/", backwards_compat=False)[0] if method == "GET": url = "http://blah/abc?foo=" else: url = "http://blah/abc?bang=whizz" self.assert_(form.click().get_full_url() == url) def testAuth(self): fh = self._get_test_file("Auth.html") forms = mechanize.ParseFile(fh, self.base_uri, backwards_compat=False) self.assert_(len(forms) == 1) form = forms[0] self.assert_(form.action == "http://auth.athensams.net/" "?ath_returl=%22http%3A%2F%2Ftame.mimas.ac.uk%2Fisicgi" "%2FWOS-login.cgi%22&ath_dspid=MIMAS.WOS") self.assertRaises(ControlNotFoundError, lambda form=form: form.toggle("d'oh", "oops")) self.assertRaises(ControlNotFoundError, lambda form=form: form["oops"]) def bad_assign(form=form): form["oops"] = ["d'oh"] self.assertRaises(ControlNotFoundError, bad_assign) self.assertRaises(ValueError, form.find_control) keys = ["ath_uname", "ath_passwd"] values = ["", ""] types = ["text", "password"] for i in range(len(keys)): key = keys[i] c = form.find_control(key) self.assert_(c.value == values[i]) self.assert_(c.type == types[i]) c = form.find_control(type="image") self.assert_(c.name is None) self.assert_(c.value == "") self.assert_(c.type == "image") form["ath_uname"] = "jbloggs" form["ath_passwd"] = "foobar" self.assert_(form.click_pairs() == [("ath_uname", "jbloggs"), ("ath_passwd", "foobar")]) def testSearchType(self): fh = self._get_test_file("SearchType.html") forms = mechanize.ParseFile(fh, self.base_uri, backwards_compat=False) self.assert_(len(forms) == 1) form = forms[0] keys = ["SID", "SESSION_DIR", "Full Search", "Easy Search", "New Session", "Log off", "Form", "JavaScript"] values = ["PMrU0IJYy4MAAELSXic_E2011300_PMrU0IJYy4MAAELSXic-0", "", "", "", "", "", "Welcome", "No"] types = ["hidden", "hidden", "image", "image", "image", "image", "hidden", "hidden"] for i in range(len(keys)): key = keys[i] self.assert_(form.find_control(key).value == values[i]) 
self.assert_(form.find_control(key).type == types[i]) pairs = form.click_pairs("Full Search") self.assert_(pairs == [ ("SID", "PMrU0IJYy4MAAELSXic_E2011300_PMrU0IJYy4MAAELSXic-0"), ("SESSION_DIR", ""), ("Full Search.x", "1"), ("Full Search.y", "1"), ("Form", "Welcome"), ("JavaScript", "No")]) def testFullSearch(self): pass # XXX def testGeneralSearch(self): fh = self._get_test_file("GeneralSearch.html") forms = mechanize.ParseFile(fh, self.base_uri, backwards_compat=False) self.assert_(len(forms) == 1) form = forms[0] keys = ["SID", "SESSION_DIR", "Home", "Date & Database Limits", "Cited Ref Search", "Log off", "Search", "topic", "titleonly", "author", "journal", "address", "Search", "Save query", "Clear", "languagetype", "doctype", "Sort", "Form", "Func"] values = ["PMrU0IJYy4MAAELSXic_E2011300_PMrU0IJYy4MAAELSXic-0", "", "", "", "", "", "", "", [], "", "", "", "", "", "", ["All languages"], ["All document types"], ["Latest date"], "General", "Search"] types = ["hidden", "hidden", "image", "image", "image", "image", "image", "text", "checkbox", "text", "text", "text", "image", "image", "image", "select", "select", "select", "hidden", "hidden"] fc = form.find_control for i in range(len(keys)): name = keys[i] type = types[i] self.assertEqual(fc(name, nr=0).value, form.get_value(name, nr=0)) self.assertEqual(fc(name, nr=0).value, values[i]) self.assertEqual(fc(name, nr=0).type, type) self.assertEqual(fc(name, type, nr=0).name, name) self.assert_(fc(type="hidden", nr=0).name == "SID") self.assert_(fc(type="image", nr=0).name == "Home") self.assert_(fc(nr=6).name == "Search") self.assertRaises(ControlNotFoundError, fc, nr=50) self.assertRaises(ValueError, fc, nr=-1) self.assert_(fc("Search", "image", nr=0).name == "Search") self.assertRaises(ControlNotFoundError, fc, "Search", "hidden") s0 = fc("Search", "image", nr=0) s0b = fc("Search", "image", nr=0) s1 = fc("Search", "image", nr=1) self.assert_(s0.name == s1.name == "Search") self.assert_(s0 is s0b) self.assert_(s0 
is not s1) self.assertRaises(ControlNotFoundError, fc, "Search", "image", nr=2) self.assert_(fc(type="text", nr=2).name == "journal") self.assert_(fc("Search", nr=0) is not fc("Search", nr=1)) form["topic"] = "foo" self.assert_(form["topic"] == "foo") form["author"] = "bar" form["journal"] = "" form["address"] = "baz" form["languagetype"] = ["English", "Catalan"] self.assert_(form["languagetype"] == ["English", "Catalan"]) form["titleonly"] = ["on"] self.assert_(form["titleonly"] == ["on"]) pairs = form.click_pairs("Search") self.assert_(pairs == [ ("SID", "PMrU0IJYy4MAAELSXic_E2011300_PMrU0IJYy4MAAELSXic-0"), ("SESSION_DIR", ""), ("Search.x", "1"), ("Search.y", "1"), ("topic", "foo"), ("titleonly", "on"), ("author", "bar"), ("journal", ""), ("address", "baz"), ("languagetype", "English"), ("languagetype", "Catalan"), ("doctype", "All document types"), ("Sort", "Latest date"), ("Form", "General"), ("Func", "Search")]) hide_deprecations() pvs = form.possible_items("languagetype") self.assert_(pvs[0] == "All languages") self.assert_(len(pvs) == 47) self.assertRaises( ItemNotFoundError, lambda form=form: form.toggle("d'oh", "languagetype")) form.toggle("English", "languagetype") self.assert_(form["languagetype"] == ["Catalan"]) self.assertRaises(TypeError, form.toggle, ["Catalan"], "languagetype") self.assertRaises(TypeError, form.toggle, "Catalan", ["languagetype"]) # XXX type, nr, by_label args self.assertRaises(ControlNotFoundError, form.set, True, "blah", "SID") # multiple select form["languagetype"] = [] self.assert_(form["languagetype"] == []) form.set(True, "Catalan", "languagetype") self.assert_(form["languagetype"] == ["Catalan"]) form.set(True, "English", "languagetype") self.assert_(form["languagetype"] == ["English", "Catalan"]) form.set(False, "English", "languagetype") self.assert_(form["languagetype"] == ["Catalan"]) form.set(False, "Catalan", "languagetype") self.assert_(form["languagetype"] == []) self.assertRaises(ItemNotFoundError, form.set, True, 
"doh", "languagetype") self.assertRaises(ItemNotFoundError, form.set, False, "doh", "languagetype") self.assertRaises(ControlNotFoundError, form.set, True, "blah", "oops") self.assertRaises(TypeError, form.set, True, ["Catalan"], "languagetype") self.assertRaises(TypeError, form.set, False, ["Catalan"], "languagetype") self.assertRaises(TypeError, form.set, True, "Catalan", ["languagetype"]) self.assertRaises(TypeError, form.set, False, "Catalan", ["languagetype"]) def setitem(name, value, form=form): form[name] = value form["languagetype"] = ["Catalan"] self.assert_(form["languagetype"] == ["Catalan"]) self.assertRaises(ItemNotFoundError, setitem, "languagetype", ["doh"]) self.assertRaises(ControlNotFoundError, setitem, "oops", ["blah"]) self.assertRaises(TypeError, setitem, ["languagetype"], "Catalan") # single select form["Sort"] = [] self.assert_(form["Sort"] == []) form.set(True, "Relevance", "Sort") self.assert_(form["Sort"] == ["Relevance"]) form.set(True, "Times Cited", "Sort") self.assert_(form["Sort"] == ["Times Cited"]) form.set(False, "Times Cited", "Sort") self.assert_(form["Sort"] == []) self.assertRaises(ItemNotFoundError, form.set, True, "doh", "Sort") self.assertRaises(ItemNotFoundError, form.set, False, "doh", "Sort") self.assertRaises(ControlNotFoundError, form.set, True, "blah", "oops") self.assertRaises(TypeError, form.set, True, ["Relevance"], "Sort") self.assertRaises(TypeError, form.set, False, ["Relevance"], "Sort") self.assertRaises(TypeError, form.set, True, "Relevance", ["Sort"]) self.assertRaises(TypeError, form.set, False, "Relevance", ["Sort"]) reset_deprecations() form["Sort"] = ["Relevance"] self.assert_(form["Sort"] == ["Relevance"]) self.assertRaises(ItemNotFoundError, setitem, "Sort", ["doh"]) self.assertRaises(ControlNotFoundError, setitem, "oops", ["blah"]) self.assertRaises(TypeError, setitem, ["Sort"], ["Relevance"]) def testSetValueByLabelIgnoringAmbiguity(self): # regression test: follow ClientForm 0.1 behaviour # also test 
that backwards_compat argument to ParseFile works f = StringIO("""\
""") for kwds, backwards_compat in [({}, True), ({"backwards_compat": True}, True), ({"backwards_compat": False}, False), ]: hide_deprecations() form = mechanize.ParseFile(f, "http://localhost/", **kwds)[0] reset_deprecations() f.seek(0) c = form.find_control("form.grocery") #for item in c.items: # print [label.text for label in item.get_labels()] c.set_value_by_label( ["Loaf of Bread", "Loaf of Bread", "Loaf of Challah"]) if backwards_compat: # select first item of ambiguous set self.assertEqual( c.get_value_by_label(), ["Loaf of Bread", "Loaf of Challah"]) self.assertEqual( [item.id for item in c.items if item.selected], ["1", None]) # disabled items still part of 'value by label' c.get(label="Loaf of Challah").disabled = True self.assertEqual( c.get_value_by_label(), ["Loaf of Bread", "Loaf of Challah"]) else: self.assertEqual( c.get_value_by_label(), ["Loaf of Bread", "Loaf of Bread", "Loaf of Challah"]) self.assertEqual( [item.id for item in c.items if item.selected], ["1", "2", None]) # disabled items NOT part of 'value by label' c.get(label="Challah").disabled = True self.assertEqual( c.get_value_by_label(), ["Loaf of Bread", "Loaf of Bread"]) def testClearValue(self): # regression test: follow ClientForm 0.1 behaviour # assigning [] to value is implemented as a special case f = StringIO("""\
""") for kwds, backwards_compat in [ ({}, True), ({"backwards_compat": True}, True), ({"backwards_compat": False}, False), ]: hide_deprecations() form = mechanize.ParseFile(f, "http://localhost/", **kwds)[0] reset_deprecations() f.seek(0) cc = form.find_control("s") if backwards_compat: self.assertEqual(cc.value, ["a", "b"]) cc.value = [] self.assertEqual( [ii.name for ii in cc.items if ii.selected], []) else: self.assertEqual(cc.value, ["b"]) cc.value = [] # first is disabled, so no need to deselect self.assertEqual( [ii.name for ii in cc.items if ii.selected], ["a"]) def testSearchByLabel(self): f = StringIO("""\
Quality
Genre
In this grocery list of requested food items, mark the items you intend to purchase:  |  |  |  |  |  |  |  |  |  |
""") form = mechanize.ParseFile(f, "http://localhost/", backwards_compat=False)[0] # basic tests self.assertEqual(form.find_control(label="Title").value, "The Grapes of Wrath") self.assertEqual(form.find_control(label="Submit").value, "Submit") self.assertEqual( form.find_control(label="Country").get( label="Britain").name, "EU: Great Britain") self.assertEqual( form.find_control(label="Origin").get( label="GB").name, "EU: Great Britain") self.assertEqual(form.find_control(label="Password").value, "123") self.assertEqual(form.find_control(label="Title").value, "The Grapes of Wrath") # Test item ambiguity, get, get_items, and set_value_by_label. # A form can be in two states: either ignoring ambiguity or being # careful about it. Currently, by default, a form's backwards_compat # attribute is True, so ambiguity is ignored. For instance, notice # that the form.grocery checkboxes include some loaves of bread and # a loaf of challah. The code just guesses what you mean: form.backwards_compat = True c = form.find_control("form.grocery") # label substring matching is turned off for compat mode self.assertRaises(ItemNotFoundError, c.get, label="Loaf") self.assertEqual(c.get(label="Loaf of Bread"), c.items[0]) c.set_value_by_label(["Loaf of Bread"]) self.assertEqual(c.get_value_by_label(), ["Loaf of Bread"]) self.assertEqual(c.items[0].id, "1") # However, if the form's backwards_compat attribute is False, Ambiguity # Errors may be raised. This is generally a preferred approach, but is # not backwards compatible. form.backwards_compat = False self.assertRaises(mechanize.AmbiguityError, c.get, label="Loaf") self.assertRaises( mechanize.AmbiguityError, c.set_value_by_label, ["Loaf"]) # If items have the same name (value), set_value_by_label will # be happy (since it is just setting the value anyway). 
c.set_value_by_label(["Loaf of Bread"]) self.assertEqual(c.get_value_by_label(), ["Loaf of Bread"]) c.set_value_by_label( ["Loaf of Bread", "Loaf of Bread", "Loaf of Challah"]) self.assertEqual( c.get_value_by_label(), ["Loaf of Bread", "Loaf of Bread", "Loaf of Challah"]) # "get" will still raise an exception, though. self.assertRaises( mechanize.AmbiguityError, c.get, label="Loaf of Bread") # If you want an item, you need to specify which one you want (or use # get_items to explicitly get all of them). self.assertEqual(c.get(label="Loaf of Bread", nr=0).selected, True) self.assertEqual(c.get(label="Loaf of Bread", nr=1).selected, True) self.assertEqual(c.get(label="Loaf of Bread", nr=2).selected, False) self.assertEqual(c.get(label="Loaf of Challah").selected, True) self.assertEqual( [i.selected for i in c.get_items(label="Loaf of Bread")], [True, True, False]) self.assertEqual( [i.selected for i in c.get_items(label="Loaf of Challah")], [True]) self.assertEqual( [i.name for i in c.get_items(label="Loaf")], ["bread", "bread", "bread", "challah"]) self.assertEqual( [i.get_labels()[0].text for i in c.get_items("bread")], ["Loaf of Bread", "Loaf of Bread", "Loaf of Bread"]) # test deprecation if warnings_imported: try: for c, f in ( (form.find_control("form.genre"), "western"), (form.find_control("form.country"), "zimbabwe"), (form.find_control("form.quality"), "good")): # warnings are nasty. 
:-( raise_deprecations() # clear onceregistry try: c.possible_items() except DeprecationWarning: pass else: self.fail("deprecation failed") try: c.toggle_single() except DeprecationWarning: pass else: self.fail("deprecation failed") try: c.set_single(True) except DeprecationWarning: pass else: self.fail("deprecation failed") try: c.toggle(f) except DeprecationWarning: pass else: self.fail("deprecation failed") try: c.get_item_disabled(f) except DeprecationWarning: pass else: self.fail("deprecation failed") try: c.set_item_disabled(True, f) except DeprecationWarning: pass else: self.fail("deprecation failed") try: c.get_item_attrs(True, f) except DeprecationWarning: pass else: self.fail("deprecation failed") finally: reset_deprecations() def testResults(self): fh = self._get_test_file("Results.html") forms = mechanize.ParseFile(fh, self.base_uri, backwards_compat=False) self.assert_(len(forms) == 1) form = forms[0] hide_deprecations() pvs = form.possible_items("marked_list_candidates") reset_deprecations() self.assert_(pvs == [ "000174872000059/1", "000174858300003/2", "000174827900006/3"]) def bad_setitem(form=form): form["marked_list_candidates"] = ["blah"] self.assertRaises(ItemNotFoundError, bad_setitem) form["marked_list_candidates"] = [pvs[0]] # I've removed most of the INPUT elements from this page, and # corrected an HTML error keys = ["Add marked records to list", "Add records on page to list", "Add all records retrieved to list", "marked_list_candidates", "Add marked records to list", "Add records on page to list", "Add all records retrieved to list" ] types = ["image", "image", "image", "checkbox", "image", "image", "image"] values = ["", "", "", [pvs[0]], "", "", "", ] for i in range(len(keys)): key = keys[i] control = form.find_control(key, nr=0) self.assert_(control.value == values[i]) self.assert_(control.type == types[i]) pairs = form.click_pairs("Add all records retrieved to list") self.assert_(pairs == [ ("Add all records retrieved to list.x", 
"1"), ("Add all records retrieved to list.y", "1"), ("marked_list_candidates", pvs[0])]) def testMarkedResults(self): fh = self._get_test_file("MarkedResults.html") forms = mechanize.ParseFile(fh, self.base_uri, backwards_compat=False) self.assert_(len(forms) == 1) form = forms[0] pairs = form.click_pairs() # I've removed most of the INPUT elements from this page, and # corrected an HTML error self.assert_(pairs == [ ("Add marked records to list.x", "1"), ("Add marked records to list.y", "1"), ("marked_list_candidates", "000174872000059/1"), ("marked_list_candidates", "000174858300003/2"), ("marked_list_candidates", "000174827900006/3") ]) def testMarkedRecords(self): pass # XXX def make_form(html): global_form, form = mechanize.ParseFileEx(StringIO(html), "http://example.com/") assert len(global_form.controls) == 0 return form def make_form_global(html): return get1(mechanize.ParseFileEx(StringIO(html), "http://example.com/")) class MoreFormTests(unittest.TestCase): def test_interspersed_controls(self): # must preserve item ordering even across controls f = StringIO("""\
""") form = mechanize.ParseFile(f, "http://blah/", backwards_compat=False)[0] form["murphy"] = ["a", "b", "c"] form["woof"] = ["d"] self.assertEqual(form.click_pairs(), [ ("murphy", "a"), ("woof", "d"), ("murphy", "b"), ("murphy", "c"), ]) form.method = "POST" form.enctype = "multipart/form-data" lines = [line for line in form.click_request_data()[1].split("\r\n") if line != '' and not line.startswith("--")] self.assertEqual( lines, ['Content-Disposition: form-data; name="murphy"', 'a', 'Content-Disposition: form-data; name="woof"', 'd', 'Content-Disposition: form-data; name="murphy"', 'b', 'Content-Disposition: form-data; name="murphy"', 'c', ] ) def make_form(self): f = StringIO("""\
""") return mechanize.ParseFile(f, "http://blah/", backwards_compat=False)[0] def test_value(self): form = self.make_form() form.set_value(["v3"], type="select", kind="multilist") self.assert_(form.get_value("d") == ["v3"]) hide_deprecations() form.set_value(["l2"], type="select", kind="multilist", by_label=True) self.assert_(form.get_value("d", by_label=True) == ["l2"]) self.assert_(form.get_value( "b", "radio", "singlelist", None, 0, False) == []) form.set_value(["One"], "b", by_label=True) self.assertEqual( form.get_value("b", "radio", "singlelist", None, 0, False), ["1"]) form.set_value(["Three"], "b", by_label=True) reset_deprecations() self.assertEqual( form.get_value("b", "radio", "singlelist", None, 0, False), ["3"]) def test_id(self): form = self.make_form() self.assert_(form.find_control("c").id == "cselect") self.assert_(form.find_control("a").id == "1a") self.assert_(form.find_control("b").id is None) self.assert_(form.find_control(id="cselect").id == "cselect") self.assertRaises(ControlNotFoundError, form.find_control, id="coption1") self.assert_(form.find_control(id="1a").id == "1a") self.assertRaises(ControlNotFoundError, form.find_control, id="1") def test_single(self): form = self.make_form() hide_deprecations() self.assertRaises(ItemCountError, form.set_single, True, "d") form.set_single(True, 'e', by_label=True) self.assertEqual(form.get_value("e"), ["1"]) form.set_single(False, 'e', by_label=True) self.assertEqual(form.get_value("e"), []) form.toggle_single("e", "checkbox", "list", nr=0) self.assert_("1" in form.get_value("e")) form.set_single(False, "e", "checkbox", "list", nr=0) self.assert_("1" not in form.get_value("e")) form.set_single(True, "e", "checkbox", "list", nr=0) self.assert_("1" in form.get_value("e")) reset_deprecations() def test_possible_items(self): form = self.make_form() hide_deprecations() self.assert_(form.possible_items("c") == ["1", "2", "3"]) self.assert_(form.possible_items("d", by_label=True) == ["l1", "l2", "l3"]) 
self.assert_(form.possible_items("a") == ["1", "2", "3"]) self.assertEqual(form.possible_items('e', by_label=True), [None]) self.assertEqual(form.possible_items('a', by_label=True), ['One', 'Two', 'Three']) self.assertEqual(form.possible_items('b', by_label=True), ['One', 'Two', 'Three', 'Four']) reset_deprecations() def test_set_all_readonly(self): form = self.make_form() form.set_all_readonly(True) for c in form.controls: self.assert_(c.readonly) form.set_all_readonly(False) for c in form.controls: self.assert_(not c.readonly) def test_clear_all(self): form = self.make_form() form.set_all_readonly(True) self.assertRaises(AttributeError, form.clear_all) form.set_all_readonly(False) form.clear_all() for c in form.controls: self.assert_(not c.value) def test_clear(self): form = self.make_form() form.set_all_readonly(True) self.assertRaises(AttributeError, form.clear, "b") form.set_all_readonly(False) form["b"] = ["1"] self.assertEqual(form["b"], ["1"]) form.clear("b") self.assertEqual(form["b"], []) def test_attrs(self): form = self.make_form() self.assert_(form.attrs["blah"] == "nonsense") self.assert_(form.attrs["name"] == "formname") a = form.find_control("a") self.assertRaises(AttributeError, getattr, a, 'attrs') hide_deprecations() self.assert_(a.get_item_attrs("1")["blah"] == "spam") self.assert_(a.get_item_attrs("2")["blah"] == "eggs") self.assert_(not a.get_item_attrs("3").has_key("blah")) c = form.find_control("c") self.assert_(c.attrs["blah"] == "foo") self.assert_(c.get_item_attrs("1")["blah"] == "bar") self.assert_(c.get_item_attrs("2")["blah"] == "baz") self.assert_(not c.get_item_attrs("3").has_key("blah")) reset_deprecations() def test_select_control_nr_and_label(self): for compat in [False, True]: self._test_select_control_nr_and_label(compat) def _test_select_control_nr_and_label(self, compat): f = StringIO("""\
""") if compat: hide_deprecations() form = mechanize.ParseFile(f, "http://example.com/", backwards_compat=compat)[0] if compat: reset_deprecations() ctl = form.find_control("form.grocery") # ordinary case self.assertEqual(ctl.get("p", nr=1).id, "3") # nr too high self.assertRaises(ItemNotFoundError, ctl.get, "p", nr=50) # first having label "a" self.assertEqual(ctl.get(label="a", nr=0).id, "1") # second having label "a"... item = ctl.get(label="a", nr=1) # ...as opposed to second with label attribute "a"! -- each item # has multiple labels accessible by .get_labels(), but only one # label HTML-attribute self.assertEqual(item.id, "2") self.assertEqual(item.attrs.get("label"), "b") # ! # third having label "a" (but only the second whose label is "a") self.assertEqual(ctl.get(label="a", nr=1).id, "2") # nr too high again self.assertRaises(ItemNotFoundError, ctl.get, label="a", nr=3) self.assertEqual(ctl.get(id="2").id, "2") self.assertRaises(ItemNotFoundError, ctl.get, id="4") self.assertRaises(ItemNotFoundError, ctl.get, id="4") def test_label_whitespace(self): for compat in [False, True]: f = StringIO("""\
""") if compat: hide_deprecations() form = mechanize.ParseFile(f, "http://example.com/", backwards_compat=compat)[0] ctl = form.find_control("eg") p = ctl.get("p") q = ctl.get("q") self.assertEqual(p.get_labels()[0].text, (compat and "a b c" or "a b c")) self.assertEqual(q.get_labels()[0].text, "b") if compat: reset_deprecations() def test_nameless_list_control(self): # ListControls are built up from elements that match by name and type # attributes. Nameless controls cause some tricky cases. We should # get a new control for nameless controls. for data in [ """\
""", """\
""", """\
""", ]: f = StringIO(data) form = mechanize.ParseFile(f, "http://example.com/", backwards_compat=False)[0] bar = form.find_control(type="checkbox", id="a") # should have value "on", but not be successful self.assertEqual([item.name for item in bar.items], ["on"]) self.assertEqual(bar.value, []) self.assertEqual(form.click_pairs(), []) def test_action_with_fragment(self): for method in ["GET", "POST"]: data = ('
' '
' % method ) f = StringIO(data) form = mechanize.ParseFile(f, "http://example.com/", backwards_compat=False)[0] self.assertEqual( form.click().get_full_url(), "http://example.com/"+(method=="GET" and "?s=" or ""), ) data = '
' f = StringIO(data) form = mechanize.ParseFile(f, "http://example.com/", backwards_compat=False)[0] form.find_control(type="isindex").value = "blah" self.assertEqual(form.click(type="isindex").get_full_url(), "http://example.com/?blah") def test_click_empty_form_by_label(self): # http://github.com/jjlee/mechanize/issues#issue/16 form = make_form_global("") assert len(form.controls) == 0 self.assertRaises(mechanize.ControlNotFoundError, form.click, label="no control has this label") class ContentTypeTests(unittest.TestCase): def test_content_type(self): class OldStyleRequest: def __init__(self, url, data=None, hdrs=None): self.ah = self.auh = False def add_header(self, key, val): self.ah = True class NewStyleRequest(OldStyleRequest): def add_unredirected_header(self, key, val): self.auh = True class FakeForm(_form.HTMLForm): def __init__(self, hdr): self.hdr = hdr def _request_data(self): return "http://example.com", "", [(self.hdr, "spam")] for request_class, hdr, auh in [ (OldStyleRequest, "Foo", False), (NewStyleRequest, "Foo", False), (OldStyleRequest, "Content-type", False), (NewStyleRequest, "Content-type", True), ]: form = FakeForm(hdr) req = form._switch_click("request", request_class) self.assertEqual(req.auh, auh) self.assertEqual(req.ah, not auh) class FunctionTests(unittest.TestCase): def test_normalize_line_endings(self): def check(text, expected, self=self): got = _form.normalize_line_endings(text) self.assertEqual(got, expected) # unix check("foo\nbar", "foo\r\nbar") check("foo\nbar\n", "foo\r\nbar\r\n") # mac check("foo\rbar", "foo\r\nbar") check("foo\rbar\r", "foo\r\nbar\r\n") # dos check("foo\r\nbar", "foo\r\nbar") check("foo\r\nbar\r\n", "foo\r\nbar\r\n") # inconsistent -- we just blithely convert anything that looks like a # line ending to the DOS convention, following Firefox's behaviour when # normalizing textarea content check("foo\r\nbar\nbaz\rblah\r\n", "foo\r\nbar\r\nbaz\r\nblah\r\n") # pathological ;-O check("\r\n\n\r\r\r\n", "\r\n"*5) 
class CaseInsensitiveDict:

    def __init__(self, items):
        self._dict = {}
        for key, val in items:
            self._dict[string.lower(key)] = val

    def __getitem__(self, key):
        return self._dict[key]

    def __getattr__(self, name):
        return getattr(self._dict, name)


class UploadTests(_testcase.TestCase):

    def test_choose_boundary(self):
        bndy = _form.choose_boundary()
        ii = string.find(bndy, '.')
        self.assert_(ii < 0)

    def make_form(self):
        html = """\

""" return mechanize.ParseFile(StringIO(html), "http://localhost/cgi-bin/upload.cgi", backwards_compat=False)[0] def test_file_request(self): import cgi # fill in a file upload form... form = self.make_form() form["user"] = "john" data_control = form.find_control("data") data = "blah\nbaz\n" data_control.add_file(StringIO(data)) #print "data_control._upload_data", data_control._upload_data req = form.click() self.assertTrue(get_header(req, "Content-type").startswith( "multipart/form-data; boundary=")) #print "req.get_data()\n>>%s<<" % req.get_data() # ...and check the resulting request is understood by cgi module fs = cgi.FieldStorage(StringIO(req.get_data()), CaseInsensitiveDict(header_items(req)), environ={"REQUEST_METHOD": "POST"}) self.assert_(fs["user"].value == "john") self.assert_(fs["data"].value == data) self.assertEquals(fs["data"].filename, "") def test_file_request_with_filename(self): import cgi # fill in a file upload form... form = self.make_form() form["user"] = "john" data_control = form.find_control("data") data = "blah\nbaz\n" data_control.add_file(StringIO(data), filename="afilename") req = form.click() self.assert_(get_header(req, "Content-type").startswith( "multipart/form-data; boundary=")) # ...and check the resulting request is understood by cgi module fs = cgi.FieldStorage(StringIO(req.get_data()), CaseInsensitiveDict(header_items(req)), environ={"REQUEST_METHOD": "POST"}) self.assert_(fs["user"].value == "john") self.assert_(fs["data"].value == data) self.assert_(fs["data"].filename == "afilename") def test_multipart_file_request(self): import cgi # fill in a file upload form... 
form = self.make_form() form["user"] = "john" data_control = form.find_control("data") data = "blah\nbaz\n" data_control.add_file(StringIO(data), filename="filenamea") more_data = "rhubarb\nrhubarb\n" data_control.add_file(StringIO(more_data)) yet_more_data = "rheum\nrhaponicum\n" data_control.add_file(StringIO(yet_more_data), filename="filenamec") req = form.click() self.assertTrue(get_header(req, "Content-type").startswith( "multipart/form-data; boundary=")) #print "req.get_data()\n>>%s<<" % req.get_data() # ...and check the resulting request is understood by cgi module fs = cgi.FieldStorage(StringIO(req.get_data()), CaseInsensitiveDict(header_items(req)), environ={"REQUEST_METHOD": "POST"}) self.assert_(fs["user"].value == "john") fss = fs["data"][None] filenames = "filenamea", "", "filenamec" datas = data, more_data, yet_more_data for i in range(len(fss)): fs = fss[i] filename = filenames[i] data = datas[i] self.assert_(fs.filename == filename) self.assert_(fs.value == data) def test_upload_data(self): form = self.make_form() data = form.click().get_data() self.assertTrue(data.startswith("--")) def test_empty_upload(self): # no controls except for INPUT/SUBMIT forms = mechanize.ParseFile(StringIO("""
"""), ".", backwards_compat=False) form = forms[0] data = form.click().get_data() lines = string.split(data, "\r\n") self.assertTrue(lines[0].startswith("--")) self.assertEqual(lines[1], 'Content-Disposition: form-data; name="submit"') self.assertEqual(lines[2], "") self.assertEqual(lines[3], "") self.assertTrue(lines[4].startswith("--")) def test_no_files(self): # no files uploaded self.monkey_patch(_form, "choose_boundary", lambda: "123") forms = mechanize.ParseFileEx(StringIO("""
"""), ".") form = forms[1] data = form.click().get_data() self.assertEquals(data, """\ --123\r Content-Disposition: form-data; name="spam"; filename=""\r Content-Type: application/octet-stream\r \r \r --123--\r """) if __name__ == "__main__": unittest.main() mechanize-0.2.5/test/test_request.doctest0000644000175000017500000000431311545150644017242 0ustar johnjohn>>> from mechanize import Request >>> Request("http://example.com/foo#frag").get_selector() '/foo' >>> Request("http://example.com?query").get_selector() '/?query' >>> Request("http://example.com").get_selector() '/' Request Headers Dictionary -------------------------- The Request.headers dictionary is not a documented interface. It should stay that way, because the complete set of headers are only accessible through the .get_header(), .has_header(), .header_items() interface. However, .headers pre-dates those methods, and so real code will be using the dictionary. The introduction in 2.4 of those methods was a mistake for the same reason: code that previously saw all (urllib2 user)-provided headers in .headers now sees only a subset (and the function interface is ugly and incomplete). A better change would have been to replace .headers dict with a dict subclass (or UserDict.DictMixin instance?) that preserved the .headers interface and also provided access to the "unredirected" headers. It's probably too late to fix that, though. Check .capitalize() case normalization: >>> url = "http://example.com" >>> Request(url, headers={"Spam-eggs": "blah"}).headers["Spam-eggs"] 'blah' >>> Request(url, headers={"spam-EggS": "blah"}).headers["Spam-eggs"] 'blah' Currently, Request(url, "Spam-eggs").headers["Spam-Eggs"] raises KeyError, but that could be changed in future. Request Headers Methods ----------------------- Note the case normalization of header names here, to .capitalize()-case. This should be preserved for backwards-compatibility. 
(In the HTTP case, normalization to .title()-case is done by urllib2 before
sending headers to httplib.)

>>> url = "http://example.com"
>>> r = Request(url, headers={"Spam-eggs": "blah"})
>>> r.has_header("Spam-eggs")
True
>>> r.header_items()
[('Spam-eggs', 'blah')]
>>> r.add_header("Foo-Bar", "baz")
>>> items = r.header_items()
>>> items.sort()
>>> items
[('Foo-bar', 'baz'), ('Spam-eggs', 'blah')]

Note that e.g. r.has_header("spam-EggS") is currently False, and
r.get_header("spam-EggS") returns None, but that could be changed in future.

>>> r.has_header("Not-there")
False
>>> print r.get_header("Not-there")
None
>>> r.get_header("Not-there", "default")
'default'
mechanize-0.2.5/test/test_rfc3986.doctest

>>> from mechanize._rfc3986 import urlsplit, urljoin, remove_dot_segments

Some common cases:

>>> urlsplit("http://example.com/spam/eggs/spam.html?apples=pears&a=b#foo")
('http', 'example.com', '/spam/eggs/spam.html', 'apples=pears&a=b', 'foo')
>>> urlsplit("http://example.com/spam.html#foo")
('http', 'example.com', '/spam.html', None, 'foo')
>>> urlsplit("ftp://example.com/foo.gif")
('ftp', 'example.com', '/foo.gif', None, None)
>>> urlsplit('ftp://joe:password@example.com:port')
('ftp', 'joe:password@example.com:port', '', None, None)
>>> urlsplit("mailto:jjl@pobox.com")
('mailto', None, 'jjl@pobox.com', None, None)

The five path productions

path-abempty:

>>> urlsplit("http://www.example.com")
('http', 'www.example.com', '', None, None)
>>> urlsplit("http://www.example.com/foo")
('http', 'www.example.com', '/foo', None, None)

path-absolute:

>>> urlsplit("a:/")
('a', None, '/', None, None)
>>> urlsplit("a:/b:/c/")
('a', None, '/b:/c/', None, None)

path-noscheme:

>>> urlsplit("a:b/:c/")
('a', None, 'b/:c/', None, None)

path-rootless:

>>> urlsplit("a:b:/c/")
('a', None, 'b:/c/', None, None)

path-empty:

>>> urlsplit("quack:")
('quack', None, '', None, None)

>>> remove_dot_segments("/a/b/c/./../../g")
'/a/g'
>>> remove_dot_segments("mid/content=5/../6")
'mid/6'
>>> remove_dot_segments("/b/c/.")
'/b/c/'
>>> remove_dot_segments("/b/c/./.")
'/b/c/'
>>> remove_dot_segments(".")
''
>>> remove_dot_segments("/.")
'/'
>>> remove_dot_segments("./")
''
>>> remove_dot_segments("/..")
'/'
>>> remove_dot_segments("/../")
'/'

Examples from RFC 3986 section 5.4

Normal Examples

>>> base = "http://a/b/c/d;p?q"
>>> def join(uri): return urljoin(base, uri)
>>> join("g:h")
'g:h'
>>> join("g")
'http://a/b/c/g'
>>> join("./g")
'http://a/b/c/g'
>>> join("g/")
'http://a/b/c/g/'
>>> join("/g")
'http://a/g'
>>> join("//g")
'http://g'
>>> join("?y")
'http://a/b/c/d;p?y'
>>> join("g?y")
'http://a/b/c/g?y'
>>> join("#s")
'http://a/b/c/d;p?q#s'
>>> join("g#s")
'http://a/b/c/g#s'
>>> join("g?y#s")
'http://a/b/c/g?y#s'
>>> join(";x")
'http://a/b/c/;x'
>>> join("g;x")
'http://a/b/c/g;x'
>>> join("g;x?y#s")
'http://a/b/c/g;x?y#s'
>>> join("")
'http://a/b/c/d;p?q'
>>> join(".")
'http://a/b/c/'
>>> join("./")
'http://a/b/c/'
>>> join("..")
'http://a/b/'
>>> join("../")
'http://a/b/'
>>> join("../g")
'http://a/b/g'
>>> join("../..")
'http://a/'
>>> join("../../")
'http://a/'
>>> join("../../g")
'http://a/g'

Abnormal Examples

>>> join("../../../g")
'http://a/g'
>>> join("../../../../g")
'http://a/g'
>>> join("/./g")
'http://a/g'
>>> join("/../g")
'http://a/g'
>>> join("g.")
'http://a/b/c/g.'
>>> join(".g")
'http://a/b/c/.g'
>>> join("g..")
'http://a/b/c/g..'
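mechanize ships its own RFC 3986 implementation because old stdlib urlparse predated the RFC. As a cross-check (not part of mechanize), Python 3.6+'s urllib.parse.urljoin resolves the same section 5.4 examples identically, including the "dot-dot above root" abnormal cases:

```python
from urllib.parse import urljoin

base = "http://a/b/c/d;p?q"

# A sample of the RFC 3986 section 5.4 cases, normal and abnormal,
# as resolved by the Python 3.6+ stdlib:
cases = {
    "g":          "http://a/b/c/g",
    "./g":        "http://a/b/c/g",
    "//g":        "http://g",
    "?y":         "http://a/b/c/d;p?y",
    "#s":         "http://a/b/c/d;p?q#s",
    "":           "http://a/b/c/d;p?q",
    "../..":      "http://a/",
    "../../../g": "http://a/g",  # ".." above the root is clamped
}
for ref, expected in cases.items():
    assert urljoin(base, ref) == expected, (ref, urljoin(base, ref))
```

Pythons older than 3.6 resolved some of the abnormal cases differently (e.g. leaving `..` segments in the path), which is part of why mechanize avoided the stdlib here.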
>>> join("..g") 'http://a/b/c/..g' >>> join("./../g") 'http://a/b/g' >>> join("./g/.") 'http://a/b/c/g/' >>> join("g/./h") 'http://a/b/c/g/h' >>> join("g/../h") 'http://a/b/c/h' >>> join("g;x=1/./y") 'http://a/b/c/g;x=1/y' >>> join("g;x=1/../y") 'http://a/b/c/y' >>> join("g?y/./x") 'http://a/b/c/g?y/./x' >>> join("g?y/../x") 'http://a/b/c/g?y/../x' >>> join("g#s/./x") 'http://a/b/c/g#s/./x' >>> join("g#s/../x") 'http://a/b/c/g#s/../x' >>> join("http:g") 'http://a/b/c/g' Additional urljoin tests, not taken from RFC: >>> join("/..") 'http://a/' >>> join("/../") 'http://a/' mechanize-0.2.5/test/test_html.py0000644000175000017500000001247211545150644015506 0ustar johnjohn#!/usr/bin/env python from unittest import TestCase import mechanize import mechanize._form from mechanize._response import test_html_response class RegressionTests(TestCase): def test_close_base_tag(self): # any document containing a tag used to cause an exception br = mechanize.Browser() response = test_html_response("") br.set_response(response) list(br.links()) def test_bad_base_tag(self): # a document with a base tag with no href used to cause an exception for factory in [mechanize.DefaultFactory(), mechanize.RobustFactory()]: br = mechanize.Browser(factory=factory) response = test_html_response( "eg") br.set_response(response) list(br.links()) def test_robust_form_parser_uses_beautifulsoup(self): factory = mechanize.RobustFormsFactory() self.assertIs(factory.form_parser_class, mechanize._form.RobustFormParser) def test_form_parser_does_not_use_beautifulsoup(self): factory = mechanize.FormsFactory() self.assertIs(factory.form_parser_class, mechanize._form.FormParser) def _make_forms_from_bad_html(self, factory): bad_html = "" factory.set_response(test_html_response(bad_html), "utf-8") return list(factory.forms()) def test_robust_form_parser_does_not_raise_on_bad_html(self): self._make_forms_from_bad_html(mechanize.RobustFormsFactory()) def test_form_parser_fails_on_bad_html(self): 
self.assertRaises( mechanize.ParseError, self._make_forms_from_bad_html, mechanize.FormsFactory()) class CachingGeneratorFunctionTests(TestCase): def _get_simple_cgenf(self, log): from mechanize._html import CachingGeneratorFunction todo = [] for ii in range(2): def work(ii=ii): log.append(ii) return ii todo.append(work) def genf(): for a in todo: yield a() return CachingGeneratorFunction(genf()) def test_cache(self): log = [] cgenf = self._get_simple_cgenf(log) for repeat in range(2): for ii, jj in zip(cgenf(), range(2)): self.assertEqual(ii, jj) self.assertEqual(log, range(2)) # work only done once def test_interleaved(self): log = [] cgenf = self._get_simple_cgenf(log) cgen = cgenf() self.assertEqual(cgen.next(), 0) self.assertEqual(log, [0]) cgen2 = cgenf() self.assertEqual(cgen2.next(), 0) self.assertEqual(log, [0]) self.assertEqual(cgen.next(), 1) self.assertEqual(log, [0, 1]) self.assertEqual(cgen2.next(), 1) self.assertEqual(log, [0, 1]) self.assertRaises(StopIteration, cgen.next) self.assertRaises(StopIteration, cgen2.next) class UnescapeTests(TestCase): def test_unescape_charref(self): from mechanize._html import unescape_charref mdash_utf8 = u"\u2014".encode("utf-8") for ref, codepoint, utf8, latin1 in [ ("38", 38, u"&".encode("utf-8"), "&"), ("x2014", 0x2014, mdash_utf8, "—"), ("8212", 8212, mdash_utf8, "—"), ]: self.assertEqual(unescape_charref(ref, None), unichr(codepoint)) self.assertEqual(unescape_charref(ref, 'latin-1'), latin1) self.assertEqual(unescape_charref(ref, 'utf-8'), utf8) def test_unescape(self): import htmlentitydefs from mechanize._html import unescape data = "& < — — —" mdash_utf8 = u"\u2014".encode("utf-8") ue = unescape(data, htmlentitydefs.name2codepoint, "utf-8") self.assertEqual("& < %s %s %s" % ((mdash_utf8,)*3), ue) for text, expect in [ ("&a&", "&a&"), ("a&", "a&"), ]: got = unescape(text, htmlentitydefs.name2codepoint, "latin-1") self.assertEqual(got, expect) class EncodingFinderTests(TestCase): def make_response(self, 
encodings): return mechanize._response.test_response( headers=[("Content-type", "text/html; charset=\"%s\"" % encoding) for encoding in encodings]) def test_known_encoding(self): encoding_finder = mechanize._html.EncodingFinder("default") response = self.make_response(["utf-8"]) self.assertEqual(encoding_finder.encoding(response), "utf-8") def test_unknown_encoding(self): encoding_finder = mechanize._html.EncodingFinder("default") response = self.make_response(["bogus"]) self.assertEqual(encoding_finder.encoding(response), "default") def test_precedence(self): encoding_finder = mechanize._html.EncodingFinder("default") response = self.make_response(["latin-1", "utf-8"]) self.assertEqual(encoding_finder.encoding(response), "latin-1") def test_fallback(self): encoding_finder = mechanize._html.EncodingFinder("default") response = self.make_response(["bogus", "utf-8"]) self.assertEqual(encoding_finder.encoding(response), "utf-8") if __name__ == "__main__": import unittest unittest.main() mechanize-0.2.5/test/test_headers.py0000644000175000017500000001256711545150644016162 0ustar johnjohn"""Tests for ClientCookie._HeadersUtil.""" import mechanize._headersutil from mechanize._testcase import TestCase class IsHtmlTests(TestCase): def test_is_html(self): def check(headers, extension, is_html): url = "http://example.com/foo" + extension self.assertEqual( mechanize._headersutil.is_html(headers, url, allow_xhtml), is_html) for allow_xhtml in False, True: check(["text/html"], ".html", True), check(["text/html", "text/plain"], ".html", True) # Content-type takes priority over file extension from URL check(["text/html"], ".txt", True) check(["text/plain"], ".html", False) # use extension if no Content-Type check([], ".html", True) check([], ".gif", False) # don't regard XHTML as HTML (unless user explicitly asks for it), # since we don't yet handle XML properly check([], ".xhtml", allow_xhtml) check(["text/xhtml"], ".xhtml", allow_xhtml) # header with empty value check([""], 
".txt", False) class HeaderTests(TestCase): def test_parse_ns_headers_expires(self): from mechanize._headersutil import parse_ns_headers # quotes should be stripped assert parse_ns_headers(['foo=bar; expires=01 Jan 2040 22:23:32 GMT']) == \ [[('foo', 'bar'), ('expires', 2209069412L), ('version', '0')]] assert parse_ns_headers(['foo=bar; expires="01 Jan 2040 22:23:32 GMT"']) == \ [[('foo', 'bar'), ('expires', 2209069412L), ('version', '0')]] def test_parse_ns_headers_version(self): from mechanize._headersutil import parse_ns_headers # quotes should be stripped expected = [[('foo', 'bar'), ('version', '1')]] for hdr in [ 'foo=bar; version="1"', 'foo=bar; Version="1"', ]: self.assertEquals(parse_ns_headers([hdr]), expected) def test_parse_ns_headers_special_names(self): # names such as 'expires' are not special in first name=value pair # of Set-Cookie: header from mechanize._headersutil import parse_ns_headers # Cookie with name 'expires' hdr = 'expires=01 Jan 2040 22:23:32 GMT' expected = [[("expires", "01 Jan 2040 22:23:32 GMT"), ("version", "0")]] self.assertEquals(parse_ns_headers([hdr]), expected) def test_join_header_words(self): from mechanize._headersutil import join_header_words assert join_header_words([[ ("foo", None), ("bar", "baz"), (None, "value") ]]) == "foo; bar=baz; value" assert join_header_words([[]]) == "" def test_split_header_words(self): from mechanize._headersutil import split_header_words tests = [ ("foo", [[("foo", None)]]), ("foo=bar", [[("foo", "bar")]]), (" foo ", [[("foo", None)]]), (" foo= ", [[("foo", "")]]), (" foo=", [[("foo", "")]]), (" foo= ; ", [[("foo", "")]]), (" foo= ; bar= baz ", [[("foo", ""), ("bar", "baz")]]), ("foo=bar bar=baz", [[("foo", "bar"), ("bar", "baz")]]), # doesn't really matter if this next fails, but it works ATM ("foo= bar=baz", [[("foo", "bar=baz")]]), ("foo=bar;bar=baz", [[("foo", "bar"), ("bar", "baz")]]), ('foo bar baz', [[("foo", None), ("bar", None), ("baz", None)]]), ("a, b, c", [[("a", None)], [("b", 
None)], [("c", None)]]), (r'foo; bar=baz, spam=, foo="\,\;\"", bar= ', [[("foo", None), ("bar", "baz")], [("spam", "")], [("foo", ',;"')], [("bar", "")]]), ] for arg, expect in tests: try: result = split_header_words([arg]) except: import traceback, StringIO f = StringIO.StringIO() traceback.print_exc(None, f) result = "(error -- traceback follows)\n\n%s" % f.getvalue() assert result == expect, """ When parsing: '%s' Expected: '%s' Got: '%s' """ % (arg, expect, result) def test_roundtrip(self): from mechanize._headersutil import split_header_words, join_header_words tests = [ ("foo", "foo"), ("foo=bar", "foo=bar"), (" foo ", "foo"), ("foo=", 'foo=""'), ("foo=bar bar=baz", "foo=bar; bar=baz"), ("foo=bar;bar=baz", "foo=bar; bar=baz"), ('foo bar baz', "foo; bar; baz"), (r'foo="\"" bar="\\"', r'foo="\""; bar="\\"'), ('foo,,,bar', 'foo, bar'), ('foo=bar,bar=baz', 'foo=bar, bar=baz'), ('text/html; charset=iso-8859-1', 'text/html; charset="iso-8859-1"'), ('foo="bar"; port="80,81"; discard, bar=baz', 'foo=bar; port="80,81"; discard, bar=baz'), (r'Basic realm="\"foo\\\\bar\""', r'Basic; realm="\"foo\\\\bar\""') ] for arg, expect in tests: input = split_header_words([arg]) res = join_header_words(input) assert res == expect, """ When parsing: '%s' Expected: '%s' Got: '%s' Input was: '%s'""" % (arg, expect, res, input) if __name__ == "__main__": import unittest unittest.main() mechanize-0.2.5/test/test_history.doctest0000644000175000017500000000043011545150644017247 0ustar johnjohn>>> from mechanize import History If nothing has been added, .close should work. >>> history = History() >>> history.close() Under some circumstances response can be None, in that case this method should not raise an exception. 
>>> history.add(None, None) >>> history.close() mechanize-0.2.5/test/test_password_manager.special_doctest0000644000175000017500000001067311545150644022614 0ustar johnjohnFeatures common to HTTPPasswordMgr and HTTPProxyPasswordMgr =========================================================== (mgr_class gets here through globs argument) >>> mgr = mgr_class() >>> add = mgr.add_password >>> add("Some Realm", "http://example.com/", "joe", "password") >>> add("Some Realm", "http://example.com/ni", "ni", "ni") >>> add("c", "http://example.com/foo", "foo", "ni") >>> add("c", "http://example.com/bar", "bar", "nini") >>> add("b", "http://example.com/", "first", "blah") >>> add("b", "http://example.com/", "second", "spam") >>> add("a", "http://example.com", "1", "a") >>> add("Some Realm", "http://c.example.com:3128", "3", "c") >>> add("Some Realm", "d.example.com", "4", "d") >>> add("Some Realm", "e.example.com:3128", "5", "e") >>> mgr.find_user_password("Some Realm", "example.com") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com/") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com/spam") ('joe', 'password') >>> mgr.find_user_password("Some Realm", "http://example.com/spam/spam") ('joe', 'password') >>> mgr.find_user_password("c", "http://example.com/foo") ('foo', 'ni') >>> mgr.find_user_password("c", "http://example.com/bar") ('bar', 'nini') Actually, this is really undefined ATM #Currently, we use the highest-level path where more than one match: # #>>> mgr.find_user_password("Some Realm", "http://example.com/ni") #('joe', 'password') Use latest add_password() in case of conflict: >>> mgr.find_user_password("b", "http://example.com/") ('second', 'spam') No special relationship between a.example.com and example.com: >>> mgr.find_user_password("a", "http://example.com/") ('1', 'a') >>> mgr.find_user_password("a", 
"http://a.example.com/") (None, None) Ports: >>> mgr.find_user_password("Some Realm", "c.example.com") (None, None) >>> mgr.find_user_password("Some Realm", "c.example.com:3128") ('3', 'c') >>> mgr.find_user_password("Some Realm", "http://c.example.com:3128") ('3', 'c') >>> mgr.find_user_password("Some Realm", "d.example.com") ('4', 'd') >>> mgr.find_user_password("Some Realm", "e.example.com:3128") ('5', 'e') Default port tests ------------------ >>> mgr = mgr_class() >>> add = mgr.add_password The point to note here is that we can't guess the default port if there's no scheme. This applies to both add_password and find_user_password. >>> add("f", "http://g.example.com:80", "10", "j") >>> add("g", "http://h.example.com", "11", "k") >>> add("h", "i.example.com:80", "12", "l") >>> add("i", "j.example.com", "13", "m") >>> mgr.find_user_password("f", "g.example.com:100") (None, None) >>> mgr.find_user_password("f", "g.example.com:80") ('10', 'j') >>> mgr.find_user_password("f", "g.example.com") (None, None) >>> mgr.find_user_password("f", "http://g.example.com:100") (None, None) >>> mgr.find_user_password("f", "http://g.example.com:80") ('10', 'j') >>> mgr.find_user_password("f", "http://g.example.com") ('10', 'j') >>> mgr.find_user_password("g", "h.example.com") ('11', 'k') >>> mgr.find_user_password("g", "h.example.com:80") ('11', 'k') >>> mgr.find_user_password("g", "http://h.example.com:80") ('11', 'k') >>> mgr.find_user_password("h", "i.example.com") (None, None) >>> mgr.find_user_password("h", "i.example.com:80") ('12', 'l') >>> mgr.find_user_password("h", "http://i.example.com:80") ('12', 'l') >>> mgr.find_user_password("i", "j.example.com") ('13', 'm') >>> mgr.find_user_password("i", "j.example.com:80") (None, None) >>> mgr.find_user_password("i", "http://j.example.com") ('13', 'm') >>> mgr.find_user_password("i", "http://j.example.com:80") (None, None) Features specific to HTTPProxyPasswordMgr ========================================= Default realm: >>> mgr = 
mechanize.HTTPProxyPasswordMgr() >>> add = mgr.add_password >>> mgr.find_user_password("d", "f.example.com") (None, None) >>> add(None, "f.example.com", "6", "f") >>> mgr.find_user_password("d", "f.example.com") ('6', 'f') Default host/port: >>> mgr.find_user_password("e", "g.example.com") (None, None) >>> add("e", None, "7", "g") >>> mgr.find_user_password("e", "g.example.com") ('7', 'g') Default realm and host/port: >>> mgr.find_user_password("f", "h.example.com") (None, None) >>> add(None, None, "8", "h") >>> mgr.find_user_password("f", "h.example.com") ('8', 'h') Default realm beats default host/port: >>> add("d", None, "9", "i") >>> mgr.find_user_password("d", "f.example.com") ('6', 'f') mechanize-0.2.5/test/test_robotfileparser.doctest0000644000175000017500000000041011545150644020746 0ustar johnjohn>>> from mechanize._http import MechanizeRobotFileParser Calling .set_opener() without args sets a default opener. >>> rfp = MechanizeRobotFileParser() >>> rfp.set_opener() >>> rfp._opener # doctest: +ELLIPSIS mechanize-0.2.5/test/test_pullparser.py0000644000175000017500000003173511545150644016736 0ustar johnjohn#!/usr/bin/env python from unittest import TestCase def peek_token(p): tok = p.get_token() p.unget_token(tok) return tok class PullParserTests(TestCase): from mechanize._pullparser import PullParser, TolerantPullParser PARSERS = [(PullParser, False), (TolerantPullParser, True)] def data_and_file(self): from StringIO import StringIO data = """ Title

This is a data blah & a & that was an entityref and this a is a charref. . """ #" f = StringIO(data) return data, f def test_encoding(self): #from mechanize import _pullparser #for pc, tolerant in [(_pullparser.PullParser, False)]:#PullParserTests.PARSERS: for pc, tolerant in PullParserTests.PARSERS: self._test_encoding(pc, tolerant) def _test_encoding(self, parser_class, tolerant): from StringIO import StringIO datas = ["ф", "ф"] def get_text(data, encoding): p = _get_parser(data, encoding) p.get_tag("a") return p.get_text() def get_attr(data, encoding, et_name, attr_name): p = _get_parser(data, encoding) while True: tag = p.get_tag(et_name) attrs = tag.attrs if attrs is not None: break return dict(attrs)[attr_name] def _get_parser(data, encoding): f = StringIO(data) p = parser_class(f, encoding=encoding) #print 'p._entitydefs>>%s<<' % p._entitydefs['—'] return p for data in datas: self.assertEqual(get_text(data, "KOI8-R"), "\xc6") self.assertEqual(get_text(data, "UTF-8"), "\xd1\x84") self.assertEqual(get_text("", "UTF-8"), u"\u2014".encode('utf8')) self.assertEqual( get_attr('blah', "UTF-8", "a", "name"), u"\u2014".encode('utf8')) self.assertEqual(get_text("", "ascii"), "—") # response = urllib.addinfourl(f, {"content-type": "text/html; charset=XXX"}, req.get_full_url()) def test_get_token(self): for pc, tolerant in PullParserTests.PARSERS: self._test_get_token(pc, tolerant) def _test_get_token(self, parser_class, tolerant): data, f = self.data_and_file() p = parser_class(f) from mechanize._pullparser import NoMoreTokensError self.assertEqual( p.get_token(), ("decl", '''DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"''', None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("starttag", "html", [])) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("starttag", "head", [])) self.assertEqual(p.get_token(), ("data", "\n", None)) 
self.assertEqual(p.get_token(), ("starttag", "title", [("an", "attr")])) self.assertEqual(p.get_token(), ("data", "Title", None)) self.assertEqual(p.get_token(), ("endtag", "title", None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("endtag", "head", None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("starttag", "body", [])) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("starttag", "p", [])) self.assertEqual(p.get_token(), ("data", "This is a data ", None)) self.assertEqual(p.get_token(), ("starttag", "img", [("alt", "blah & a")])) self.assertEqual(p.get_token(), ("data", " ", None)) self.assertEqual(p.get_token(), ("entityref", "amp", None)) self.assertEqual(p.get_token(), ("data", " that was an entityref and this ", None)) self.assertEqual(p.get_token(), ("charref", "097", None)) self.assertEqual(p.get_token(), ("data", " is\na charref. ", None)) self.assertEqual(p.get_token(), ("starttag", "blah", [("foo", "bing"), ("blam", "wallop")])) self.assertEqual(p.get_token(), ("data", ".\n", None)) self.assertEqual(p.get_token(), ( "comment", " comment blah blah\n" "still a comment , blah and a space at the end \n", None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("decl", "rheum", None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("pi", "rhaponicum", None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ( (tolerant and "starttag" or "startendtag"), "randomtag", [("spam", "eggs")])) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("endtag", "body", None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertEqual(p.get_token(), ("endtag", "html", None)) self.assertEqual(p.get_token(), ("data", "\n", None)) self.assertRaises(NoMoreTokensError, p.get_token) # print "token", 
p.get_token() # sys.exit() def test_unget_token(self): for pc, tolerant in PullParserTests.PARSERS: self._test_unget_token(pc, tolerant) def _test_unget_token(self, parser_class, tolerant): data, f = self.data_and_file() p = parser_class(f) p.get_token() tok = p.get_token() self.assertEqual(tok, ("data", "\n", None)) p.unget_token(tok) self.assertEqual(p.get_token(), ("data", "\n", None)) tok = p.get_token() self.assertEqual(tok, ("starttag", "html", [])) p.unget_token(tok) self.assertEqual(tok, ("starttag", "html", [])) def test_get_tag(self): for pc, tolerant in PullParserTests.PARSERS: self._test_get_tag(pc, tolerant) def _test_get_tag(self, parser_class, tolerant): from mechanize._pullparser import NoMoreTokensError data, f = self.data_and_file() p = parser_class(f) self.assertEqual(p.get_tag(), ("starttag", "html", [])) self.assertEqual(p.get_tag("blah", "body", "title"), ("starttag", "title", [("an", "attr")])) self.assertEqual(p.get_tag(), ("endtag", "title", None)) self.assertEqual(p.get_tag("randomtag"), ((tolerant and "starttag" or "startendtag"), "randomtag", [("spam", "eggs")])) self.assertEqual(p.get_tag(), ("endtag", "body", None)) self.assertEqual(p.get_tag(), ("endtag", "html", None)) self.assertRaises(NoMoreTokensError, p.get_tag) # print "tag", p.get_tag() # sys.exit() def test_get_text(self): for pc, tolerant in PullParserTests.PARSERS: self._test_get_text(pc, tolerant) def _test_get_text(self, parser_class, tolerant): from mechanize._pullparser import NoMoreTokensError data, f = self.data_and_file() p = parser_class(f) self.assertEqual(p.get_text(), "\n") self.assertEqual(peek_token(p).data, "html") self.assertEqual(p.get_text(), "") self.assertEqual(peek_token(p).data, "html"); p.get_token() self.assertEqual(p.get_text(), "\n"); p.get_token() self.assertEqual(p.get_text(), "\n"); p.get_token() self.assertEqual(p.get_text(), "Title"); p.get_token() self.assertEqual(p.get_text(), "\n"); p.get_token() self.assertEqual(p.get_text(), "\n"); 
p.get_token() self.assertEqual(p.get_text(), "\n"); p.get_token() self.assertEqual(p.get_text(), "This is a data blah & a[IMG]"); p.get_token() self.assertEqual(p.get_text(), " & that was an entityref " "and this a is\na charref. "); p.get_token() self.assertEqual(p.get_text(), ".\n\n\n\n"); p.get_token() self.assertEqual(p.get_text(), "\n"); p.get_token() self.assertEqual(p.get_text(), "\n"); p.get_token() self.assertEqual(p.get_text(), "\n"); p.get_token() # no more tokens, so we just get empty string self.assertEqual(p.get_text(), "") self.assertEqual(p.get_text(), "") self.assertRaises(NoMoreTokensError, p.get_token) #print "text", `p.get_text()` #sys.exit() def test_get_text_2(self): for pc, tolerant in PullParserTests.PARSERS: self._test_get_text_2(pc, tolerant) def _test_get_text_2(self, parser_class, tolerant): # more complicated stuff # endat data, f = self.data_and_file() p = parser_class(f) self.assertEqual(p.get_text(endat=("endtag", "html")), u"\n\n\nTitle\n\n\nThis is a data blah & a[IMG]" " & that was an entityref and this a is\na charref. ." "\n\n\n\n\n\n") f.close() data, f = self.data_and_file() p = parser_class(f) self.assertEqual(p.get_text(endat=("endtag", "title")), "\n\n\nTitle") self.assertEqual(p.get_text(endat=("starttag", "img")), "\n\n\nThis is a data blah & a[IMG]") f.close() # textify arg data, f = self.data_and_file() p = parser_class(f, textify={"title": "an", "img": lambda x: "YYY"}) self.assertEqual(p.get_text(endat=("endtag", "title")), "\n\n\nattr[TITLE]Title") self.assertEqual(p.get_text(endat=("starttag", "img")), "\n\n\nThis is a data YYY") f.close() # get_compressed_text data, f = self.data_and_file() p = parser_class(f) self.assertEqual(p.get_compressed_text(endat=("endtag", "html")), u"Title This is a data blah & a[IMG]" " & that was an entityref and this a is a charref. 
.") f.close() def test_tags(self): for pc, tolerant in PullParserTests.PARSERS: self._test_tags(pc, tolerant) def _test_tags(self, parser_class, tolerant): # no args data, f = self.data_and_file() p = parser_class(f) expected_tag_names = [ "html", "head", "title", "title", "head", "body", "p", "img", "blah", "randomtag", "body", "html" ] for i, token in enumerate(p.tags()): self.assertEquals(token.data, expected_tag_names[i]) f.close() # tag name args data, f = self.data_and_file() p = parser_class(f) expected_tokens = [ ("starttag", "head", []), ("endtag", "head", None), ("starttag", "p", []), ] for i, token in enumerate(p.tags("head", "p")): self.assertEquals(token, expected_tokens[i]) f.close() def test_tokens(self): for pc, tolerant in PullParserTests.PARSERS: self._test_tokens(pc, tolerant) def _test_tokens(self, parser_class, tolerant): # no args data, f = self.data_and_file() p = parser_class(f) expected_token_types = [ "decl", "data", "starttag", "data", "starttag", "data", "starttag", "data", "endtag", "data", "endtag", "data", "starttag", "data", "starttag", "data", "starttag", "data", "entityref", "data", "charref", "data", "starttag", "data", "comment", "data", "decl", "data", "pi", "data", (tolerant and "starttag" or "startendtag"), "data", "endtag", "data", "endtag", "data" ] for i, token in enumerate(p.tokens()): self.assertEquals(token.type, expected_token_types[i]) f.close() # token type args data, f = self.data_and_file() p = parser_class(f) expected_tokens = [ ("entityref", "amp", None), ("charref", "097", None), ] for i, token in enumerate(p.tokens("charref", "entityref")): self.assertEquals(token, expected_tokens[i]) f.close() def test_token_eq(self): from mechanize._pullparser import Token for (a, b) in [ (Token('endtag', 'html', None), ('endtag', 'html', None)), (Token('endtag', 'html', {'woof': 'bark'}), ('endtag', 'html', {'woof': 'bark'})), ]: self.assertEquals(a, a) self.assertEquals(a, b) self.assertEquals(b, a) if __name__ == 
"__main__": import unittest unittest.main() mechanize-0.2.5/test/test_forms.doctest0000644000175000017500000000373411545150644016706 0ustar johnjohnIntegration regression test for case where ClientForm handled RFC 3986 url unparsing incorrectly (it was using "" in place of None for fragment, due to continuing to support use of stdlib module urlparse as well as mechanize._rfc3986). Fixed in ClientForm r33622 . >>> import mechanize >>> from mechanize._response import test_response >>> def forms(): ... forms = [] ... for method in ["GET", "POST"]: ... data = ('

' ... '
' % method ... ) ... br = mechanize.Browser() ... response = test_response(data, [("content-type", "text/html")]) ... br.set_response(response) ... br.select_form(nr=0) ... forms.append(br.form) ... return forms >>> getform, postform = forms() >>> getform.click().get_full_url() 'http://example.com/?s=' >>> postform.click().get_full_url() 'http://example.com/' >>> data = '
' >>> br = mechanize.Browser() >>> response = test_response(data, [("content-type", "text/html")]) >>> br.set_response(response) >>> br.select_form(nr=0) >>> br.find_control(type="isindex").value = "blah" >>> br.click(type="isindex").get_full_url() 'http://example.com/?blah' If something (e.g. calling .forms() triggers parsing, and parsing fails, the next attempt should not succeed! This used to happen because the response held by LinksFactory etc was stale, since it had already been .read(). Fixed by calling Factory.set_response() on error. >>> import mechanize >>> br = mechanize.Browser() >>> r = mechanize._response.test_html_response("""\ ...
... ... ...
... """) >>> br.set_response(r) >>> try: ... br.select_form(nr=0) ... except mechanize.ParseError: ... pass >>> br.select_form(nr=0) # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): ParseError: expected name token mechanize-0.2.5/test/test_opener.py0000644000175000017500000002355711545150644016040 0ustar johnjohn#!/usr/bin/env python import os import math import stat import unittest import mechanize import mechanize._response as _response import mechanize._sockettimeout as _sockettimeout def killfile(filename): try: os.remove(filename) except OSError: if os.name=='nt': try: os.chmod(filename, stat.S_IWRITE) os.remove(filename) except OSError: pass class CloseVerifier(object): def __init__(self): self.count = 0 def opened(self): self.count += 1 def closed(self): self.count -= 1 def verify(self, assert_equals): assert_equals(self.count, 0) class ResponseCloseWrapper(object): def __init__(self, response, closed_callback, read): self._response = response self._closed_callback = closed_callback if read is None: self.read = response.read else: self.read = read def __getattr__(self, name): return getattr(self._response, name) def close(self): self._closed_callback() class ResponseCloseVerifier(CloseVerifier): def __init__(self, read=None): CloseVerifier.__init__(self) self._read = read def open(self): self.opened() response = _response.test_response("spam") return ResponseCloseWrapper(response, self.closed, self._read) class URLOpener(mechanize.OpenerDirector): def __init__(self, urlopen): self._urlopen = urlopen def open(self, *args, **kwds): return self._urlopen() class FakeFile(object): def __init__(self, closed_callback): self._closed_callback = closed_callback def write(self, *args, **kwds): pass def close(self): self._closed_callback() class FakeFilesystem(CloseVerifier): def open(self, path, mode="r"): self.opened() return FakeFile(self.closed) class OpenerTests(unittest.TestCase): def _check_retrieve(self, urlopen): opener = 
URLOpener(urlopen=urlopen) fs = FakeFilesystem() try: filename, headers = opener.retrieve("http://example.com", "dummy filename", open=fs.open) except mechanize.URLError: pass fs.verify(self.assertEquals) def test_retrieve_closes_on_success(self): response_verifier = ResponseCloseVerifier() self._check_retrieve(urlopen=response_verifier.open) response_verifier.verify(self.assertEquals) def test_retrieve_closes_on_failure(self): def fail_to_open(): raise mechanize.URLError("dummy reason") self._check_retrieve(fail_to_open) def test_retrieve_closes_on_read_failure(self): def fail_to_read(*args, **kwds): raise mechanize.URLError("dummy reason") response_verifier = ResponseCloseVerifier(read=fail_to_read) self._check_retrieve(urlopen=response_verifier.open) response_verifier.verify(self.assertEquals) def test_retrieve(self): # The .retrieve() method deals with a number of different cases. In # each case, .read() should be called the expected number of times, the # progress callback should be called as expected, and we should end up # with a filename and some headers. 
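The callback arithmetic that CallbackVerifier checks in test_retrieve — one call before the first block, then one per block read, so ceil(nr_blocks) + 1 calls in total — can be sketched without mechanize. `copy_with_reporthook` is a hypothetical name; only the hook signature `(block_nr, block_size, total_size)` is taken from the test:

```python
import io
import math

def copy_with_reporthook(fp, out, reporthook, block_size=8192,
                         total_size=-1):
    # Call the hook once up front (block 0), then once per non-empty
    # block read -- the schedule CallbackVerifier asserts on.
    block_nr = 0
    reporthook(block_nr, block_size, total_size)
    while True:
        block = fp.read(block_size)
        if not block:
            break
        out.write(block)
        block_nr += 1
        reporthook(block_nr, block_size, total_size)

calls = []
nr_blocks = 2.5
data = b"x" * int(8192 * nr_blocks)   # 2.5 blocks' worth of data
copy_with_reporthook(io.BytesIO(data), io.BytesIO(),
                     lambda *args: calls.append(args))
# Reads of 8192, 8192, 4096 bytes plus the initial call:
print(len(calls) == math.ceil(nr_blocks) + 1)  # prints "True"
```

Python 3's urllib.request.urlretrieve still uses this same reporthook signature, which is why total_size defaults to -1 when no Content-Length is available.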
class Opener(mechanize.OpenerDirector): def __init__(self, content_length=None): mechanize.OpenerDirector.__init__(self) self.calls = [] self.block_size = mechanize.OpenerDirector.BLOCK_SIZE self.nr_blocks = 2.5 self.data = int((self.block_size/8)*self.nr_blocks)*"01234567" self.total_size = len(self.data) self._content_length = content_length def open(self, fullurl, data=None, timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT): self.calls.append((fullurl, data, timeout)) headers = [("Foo", "Bar")] if self._content_length is not None: if self._content_length is True: content_length = str(len(self.data)) else: content_length = str(self._content_length) headers.append(("content-length", content_length)) return _response.test_response(self.data, headers) class CallbackVerifier: def __init__(self, testcase, total_size, block_size): self.count = 0 self._testcase = testcase self._total_size = total_size self._block_size = block_size def callback(self, block_nr, block_size, total_size): self._testcase.assertEqual(block_nr, self.count) self._testcase.assertEqual(block_size, self._block_size) self._testcase.assertEqual(total_size, self._total_size) self.count += 1 # ensure we start without the test file present tfn = "mechanize_test_73940ukewrl.txt" killfile(tfn) # case 1: filename supplied op = Opener() verif = CallbackVerifier(self, -1, op.block_size) url = "http://example.com/" filename, headers = op.retrieve( url, tfn, reporthook=verif.callback) try: self.assertEqual(filename, tfn) self.assertEqual(headers["foo"], 'Bar') self.assertEqual(open(filename, "rb").read(), op.data) self.assertEqual(len(op.calls), 1) self.assertEqual(verif.count, math.ceil(op.nr_blocks) + 1) op.close() # .close()ing the opener does NOT remove non-temporary files self.assert_(os.path.isfile(filename)) finally: killfile(filename) # case 2: no filename supplied, use a temporary file op = Opener(content_length=True) # We asked the Opener to add a content-length header to the response # this time. 
Verify the total size passed to the callback is that case # is according to the content-length (rather than -1). verif = CallbackVerifier(self, op.total_size, op.block_size) url = "http://example.com/" filename, headers = op.retrieve(url, reporthook=verif.callback) self.assertNotEqual(filename, tfn) # (some temp filename instead) self.assertEqual(headers["foo"], 'Bar') self.assertEqual(open(filename, "rb").read(), op.data) self.assertEqual(len(op.calls), 1) # .close()ing the opener removes temporary files self.assert_(os.path.exists(filename)) op.close() self.failIf(os.path.exists(filename)) self.assertEqual(verif.count, math.ceil(op.nr_blocks) + 1) # case 3: "file:" URL with no filename supplied # we DON'T create a temporary file, since there's a file there already op = Opener() verif = CallbackVerifier(self, -1, op.block_size) tifn = "input_for_"+tfn try: f = open(tifn, 'wb') try: f.write(op.data) finally: f.close() url = "file://" + tifn filename, headers = op.retrieve(url, reporthook=verif.callback) self.assertEqual(filename, None) # this may change self.assertEqual(headers["foo"], 'Bar') self.assertEqual(open(tifn, "rb").read(), op.data) # no .read()s took place, since we already have the disk file, # and we weren't asked to write it to another filename self.assertEqual(verif.count, 0) op.close() # .close()ing the opener does NOT remove the file! 
self.assert_(os.path.isfile(tifn)) finally: killfile(tifn) # case 4: "file:" URL and filename supplied # we DO create a new file in this case op = Opener() verif = CallbackVerifier(self, -1, op.block_size) tifn = "input_for_"+tfn try: f = open(tifn, 'wb') try: f.write(op.data) finally: f.close() url = "file://" + tifn try: filename, headers = op.retrieve( url, tfn, reporthook=verif.callback) self.assertEqual(filename, tfn) self.assertEqual(headers["foo"], 'Bar') self.assertEqual(open(tifn, "rb").read(), op.data) self.assertEqual(verif.count, math.ceil(op.nr_blocks) + 1) op.close() # .close()ing the opener does NOT remove non-temporary files self.assert_(os.path.isfile(tfn)) finally: killfile(tfn) finally: killfile(tifn) # Content-Length mismatch with real file length gives URLError big = 1024*32 op = Opener(content_length=big) verif = CallbackVerifier(self, big, op.block_size) url = "http://example.com/" try: try: op.retrieve(url, reporthook=verif.callback) except mechanize.ContentTooShortError, exc: filename, headers = exc.result self.assertNotEqual(filename, tfn) self.assertEqual(headers["foo"], 'Bar') # We still read and wrote to disk everything available, despite # the exception. 
self.assertEqual(open(filename, "rb").read(), op.data) self.assertEqual(len(op.calls), 1) self.assertEqual(verif.count, math.ceil(op.nr_blocks) + 1) # cleanup should still take place self.assert_(os.path.isfile(filename)) op.close() self.failIf(os.path.isfile(filename)) else: self.fail() finally: killfile(filename) mechanize-0.2.5/test/test_performance.py0000644000175000017500000000501511545150644017036 0ustar johnjohnimport os import time import sys import unittest import mechanize from mechanize._testcase import TestCase, TempDirMaker from mechanize._rfc3986 import urljoin KB = 1024 MB = 1024**2 GB = 1024**3 def time_it(operation): t = time.time() operation() return time.time() - t def write_data(filename, nr_bytes): block_size = 4096 block = "01234567" * (block_size // 8) fh = open(filename, "w") try: for i in range(nr_bytes // block_size): fh.write(block) finally: fh.close() def time_retrieve_local_file(temp_maker, size, retrieve_fn): temp_dir = temp_maker.make_temp_dir() filename = os.path.join(temp_dir, "data") write_data(filename, size) def operation(): retrieve_fn(urljoin("file://", filename), os.path.join(temp_dir, "retrieved")) return time_it(operation) class PerformanceTests(TestCase): def test_retrieve_local_file(self): def retrieve(url, filename): br = mechanize.Browser() br.retrieve(url, filename) size = 100 * MB # size = 1 * KB desired_rate = 2*MB # per second desired_time = size / float(desired_rate) fudge_factor = 2. 
self.assert_less_than( time_retrieve_local_file(self, size, retrieve), desired_time * fudge_factor) def show_plot(rows): import matplotlib.pyplot figure = matplotlib.pyplot.figure() axes = figure.add_subplot(111) axes.plot([row[0] for row in rows], [row[1] for row in rows]) matplotlib.pyplot.show() def power_2_range(start, stop): n = start while n <= stop: yield n n *= 2 def performance_plot(): def retrieve(url, filename): br = mechanize.Browser() br.retrieve(url, filename) # import urllib2 # def retrieve(url, filename): # urllib2.urlopen(url).read() # from mechanize import _useragent # ua = _useragent.UserAgent() # ua.set_seekable_responses(True) # ua.set_handle_equiv(False) # def retrieve(url, filename): # ua.retrieve(url, filename) rows = [] for size in power_2_range(256 * KB, 256 * MB): temp_maker = TempDirMaker() try: elapsed = time_retrieve_local_file(temp_maker, size, retrieve) finally: temp_maker.tear_down() rows.append((size//float(MB), elapsed)) show_plot(rows) if __name__ == "__main__": args = sys.argv[1:] if "--plot" in args: performance_plot() else: unittest.main() mechanize-0.2.5/test/test_html.doctest0000644000175000017500000001421711545150644016522 0ustar johnjohn>>> import mechanize >>> from mechanize._response import test_html_response >>> from mechanize._html import LinksFactory, FormsFactory, TitleFactory, \ ... MechanizeBs, \ ... RobustLinksFactory, RobustFormsFactory, RobustTitleFactory mechanize.ParseError should be raised on parsing erroneous HTML. For backwards compatibility, mechanize.ParseError derives from exception classes that mechanize used to raise, prior to version 0.1.6. >>> import sgmllib >>> import HTMLParser >>> issubclass(mechanize.ParseError, sgmllib.SGMLParseError) True >>> issubclass(mechanize.ParseError, HTMLParser.HTMLParseError) True >>> def create_response(error=True): ... extra = "" ... if error: ... extra = "" ... html = """\ ... ... ... Title ... %s ... ... ...

Hello world ... ... ... """ % extra ... return test_html_response(html) >>> f = LinksFactory() >>> f.set_response(create_response(), "http://example.com", "latin-1") >>> list(f.links()) # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): ParseError: >>> f = FormsFactory() >>> f.set_response(create_response(), "latin-1") >>> list(f.forms()) # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): ParseError: >>> f = TitleFactory() >>> f.set_response(create_response(), "latin-1") >>> f.title() # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): ParseError: Accessing attributes on Factory may also raise ParseError >>> def factory_getattr(attr_name): ... fact = mechanize.DefaultFactory() ... fact.set_response(create_response()) ... getattr(fact, attr_name) >>> factory_getattr("title") # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): ParseError: >>> factory_getattr("global_form") # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): ParseError: BeautifulSoup ParseErrors: XXX If I could come up with examples that break links and forms parsing, I'd uncomment these! >>> def create_soup(html): ... r = test_html_response(html) ... return MechanizeBs("latin-1", r.read()) #>>> f = RobustLinksFactory() #>>> html = """\ #... #... #... #... 
yada yada rhubarb """, {"content-type": "text/html"}) b.add_handler(make_mock_handler()([("http_open", r)])) r = b.open(url) exp_links = [ # base_url, url, text, tag, attrs Link(url, "http://example.com/foo/bar.html", "", "a", [("href", "http://example.com/foo/bar.html"), ("name", "apples")]), Link(url, "spam", "", "a", [("href", "spam"), ("name", "pears")]), Link(url, "blah", None, "area", [("href", "blah"), ("name", "foo")]), Link(url, "src", None, "frame", [("name", "name"), ("href", "href"), ("src", "src")]), Link(url, "src", None, "iframe", [("name", "name2"), ("href", "href"), ("src", "src")]), Link(url, "one", "yada yada", "a", [("name", "name3"), ("href", "one")]), Link(url, "two", "rhubarb", "a", [("name", "pears"), ("href", "two"), ("weird", "stuff")]), Link(url, "foo", None, "iframe", [("src", "foo")]), ] links = list(b.links()) self.assertEqual(len(links), len(exp_links)) for got, expect in zip(links, exp_links): self.assertEqual(got, expect) # nr l = b.find_link() self.assertEqual(l.url, "http://example.com/foo/bar.html") l = b.find_link(nr=1) self.assertEqual(l.url, "spam") # text l = b.find_link(text="yada yada") self.assertEqual(l.url, "one") self.assertRaises(mechanize.LinkNotFoundError, b.find_link, text="da ya") l = b.find_link(text_regex=re.compile("da ya")) self.assertEqual(l.url, "one") l = b.find_link(text_regex="da ya") self.assertEqual(l.url, "one") # name l = b.find_link(name="name3") self.assertEqual(l.url, "one") l = b.find_link(name_regex=re.compile("oo")) self.assertEqual(l.url, "blah") l = b.find_link(name_regex="oo") self.assertEqual(l.url, "blah") # url l = b.find_link(url="spam") self.assertEqual(l.url, "spam") l = b.find_link(url_regex=re.compile("pam")) self.assertEqual(l.url, "spam") l = b.find_link(url_regex="pam") self.assertEqual(l.url, "spam") # tag l = b.find_link(tag="area") self.assertEqual(l.url, "blah") # predicate l = b.find_link(predicate= lambda l: dict(l.attrs).get("weird") == "stuff") self.assertEqual(l.url, "two") 
# combinations l = b.find_link(name="pears", nr=1) self.assertEqual(l.text, "rhubarb") l = b.find_link(url="src", nr=0, name="name2") self.assertEqual(l.tag, "iframe") self.assertEqual(l.url, "src") self.assertRaises(mechanize.LinkNotFoundError, b.find_link, url="src", nr=1, name="name2") l = b.find_link(tag="a", predicate= lambda l: dict(l.attrs).get("weird") == "stuff") self.assertEqual(l.url, "two") # .links() self.assertEqual(list(b.links(url="src")), [ Link(url, url="src", text=None, tag="frame", attrs=[("name", "name"), ("href", "href"), ("src", "src")]), Link(url, url="src", text=None, tag="iframe", attrs=[("name", "name2"), ("href", "href"), ("src", "src")]), ]) def test_base_uri(self): url = "http://example.com/" for html, urls in [ ( """ """, [ "http://www.python.org/foo/bar/baz.html", "http://www.python.org/bar/baz.html", "http://example.com/bar%20%2f%2Fblah;/baz@~._-.html", ]), ( """ """, [ "http://example.com/bar/baz.html", "http://example.com/bar/baz.html", "http://example.com/bar/baz.html", ] ), ]: b = TestBrowser() r = MockResponse(url, html, {"content-type": "text/html"}) b.add_handler(make_mock_handler()([("http_open", r)])) r = b.open(url) self.assertEqual([link.absolute_url for link in b.links()], urls) def test_set_cookie(self): class CookieTestBrowser(TestBrowser): default_features = list(TestBrowser.default_features)+["_cookies"] # have to be visiting HTTP/HTTPS URL url = "ftp://example.com/" br = CookieTestBrowser() r = mechanize.make_response( "Title", [("content-type", "text/html")], url, 200, "OK", ) br.add_handler(make_mock_handler()([("http_open", r)])) handler = br._ua_handlers["_cookies"] cj = handler.cookiejar self.assertRaises(mechanize.BrowserStateError, br.set_cookie, "foo=bar") self.assertEqual(len(cj), 0) url = "http://example.com/" br = CookieTestBrowser() r = mechanize.make_response( "Title", [("content-type", "text/html")], url, 200, "OK", ) br.add_handler(make_mock_handler()([("http_open", r)])) handler = 
br._ua_handlers["_cookies"] cj = handler.cookiejar # have to be visiting a URL self.assertRaises(mechanize.BrowserStateError, br.set_cookie, "foo=bar") self.assertEqual(len(cj), 0) # normal case br.open(url) br.set_cookie("foo=bar") self.assertEqual(len(cj), 1) self.assertEqual(cj._cookies["example.com"]["/"]["foo"].value, "bar") class ResponseTests(TestCase): def test_set_response(self): import copy from mechanize import response_seek_wrapper br = TestBrowser() url = "http://example.com/" html = """click me""" headers = {"content-type": "text/html"} r = response_seek_wrapper(MockResponse(url, html, headers)) br.add_handler(make_mock_handler()([("http_open", r)])) r = br.open(url) self.assertEqual(r.read(), html) r.seek(0) self.assertEqual(copy.copy(r).read(), html) self.assertEqual(list(br.links())[0].url, "spam") newhtml = """click me""" r.set_data(newhtml) self.assertEqual(r.read(), newhtml) self.assertEqual(br.response().read(), html) br.response().set_data(newhtml) self.assertEqual(br.response().read(), html) self.assertEqual(list(br.links())[0].url, "spam") r.seek(0) br.set_response(r) self.assertEqual(br.response().read(), newhtml) self.assertEqual(list(br.links())[0].url, "eggs") def test_str(self): import mimetools from mechanize import _response br = TestBrowser() self.assertEqual( str(br), "" ) fp = StringIO.StringIO('

') headers = mimetools.Message( StringIO.StringIO("Content-type: text/html")) response = _response.response_seek_wrapper( _response.closeable_response( fp, headers, "http://example.com/", 200, "OK")) br.set_response(response) self.assertEqual( str(br), "" ) br.select_form(nr=0) self.assertEqual( str(br), """\ =)>> >""") if __name__ == "__main__": import unittest unittest.main() mechanize-0.2.5/test/test_cookie.py0000644000175000017500000000361611545150644016013 0ustar johnjohnimport mechanize._clientcookie import mechanize._testcase def cookie_args( version=1, name="spam", value="eggs", port="80", port_specified=True, domain="example.com", domain_specified=False, domain_initial_dot=False, path="/", path_specified=False, secure=False, expires=0, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False, ): return locals() def make_cookie(*args, **kwds): return mechanize._clientcookie.Cookie(**cookie_args(*args, **kwds)) class Test(mechanize._testcase.TestCase): def test_equality(self): # not using assertNotEqual here since operator used varies across # Python versions self.assertEqual(make_cookie(), make_cookie()) self.assertFalse(make_cookie(name="ham") == make_cookie()) def test_inequality(self): # not using assertNotEqual here since operator used varies across # Python versions self.assertTrue(make_cookie(name="ham") != make_cookie()) self.assertFalse(make_cookie() != make_cookie()) def test_all_state_included(self): def non_equal_value(value): if value is None: new_value = "80" elif isinstance(value, basestring): new_value = value + "1" elif isinstance(value, bool): new_value = not value elif isinstance(value, dict): new_value = dict(value) new_value["spam"] = "eggs" elif isinstance(value, int): new_value = value + 1 else: assert False, value assert new_value != value, value return new_value cookie = make_cookie() for arg, default_value in cookie_args().iteritems(): new_value = non_equal_value(default_value) self.assertNotEqual(make_cookie(**{arg: 
new_value}), cookie) mechanize-0.2.5/test/test_unittest.py0000644000175000017500000042454511545150644016431 0ustar johnjohn"""Test script for unittest. By Collin Winter Still need testing: TestCase.{assert,fail}* methods (some are tested implicitly) """ from StringIO import StringIO import __builtin__ import os import re import sys import unittest from unittest import TestCase, TestProgram import types from copy import deepcopy from cStringIO import StringIO import pickle ### Support code ################################################################ class LoggingResult(unittest.TestResult): def __init__(self, log): self._events = log super(LoggingResult, self).__init__() def startTest(self, test): self._events.append('startTest') super(LoggingResult, self).startTest(test) def startTestRun(self): self._events.append('startTestRun') super(LoggingResult, self).startTestRun() def stopTest(self, test): self._events.append('stopTest') super(LoggingResult, self).stopTest(test) def stopTestRun(self): self._events.append('stopTestRun') super(LoggingResult, self).stopTestRun() def addFailure(self, *args): self._events.append('addFailure') super(LoggingResult, self).addFailure(*args) def addSuccess(self, *args): self._events.append('addSuccess') super(LoggingResult, self).addSuccess(*args) def addError(self, *args): self._events.append('addError') super(LoggingResult, self).addError(*args) def addSkip(self, *args): self._events.append('addSkip') super(LoggingResult, self).addSkip(*args) def addExpectedFailure(self, *args): self._events.append('addExpectedFailure') super(LoggingResult, self).addExpectedFailure(*args) def addUnexpectedSuccess(self, *args): self._events.append('addUnexpectedSuccess') super(LoggingResult, self).addUnexpectedSuccess(*args) class TestEquality(object): """Used as a mixin for TestCase""" # Check for a valid __eq__ implementation def test_eq(self): for obj_1, obj_2 in self.eq_pairs: self.assertEqual(obj_1, obj_2) self.assertEqual(obj_2, obj_1) # 
Check for a valid __ne__ implementation def test_ne(self): for obj_1, obj_2 in self.ne_pairs: self.assertNotEqual(obj_1, obj_2) self.assertNotEqual(obj_2, obj_1) class TestHashing(object): """Used as a mixin for TestCase""" # Check for a valid __hash__ implementation def test_hash(self): for obj_1, obj_2 in self.eq_pairs: try: if not hash(obj_1) == hash(obj_2): self.fail("%r and %r do not hash equal" % (obj_1, obj_2)) except KeyboardInterrupt: raise except Exception, e: self.fail("Problem hashing %r and %r: %s" % (obj_1, obj_2, e)) for obj_1, obj_2 in self.ne_pairs: try: if hash(obj_1) == hash(obj_2): self.fail("%s and %s hash equal, but shouldn't" % (obj_1, obj_2)) except KeyboardInterrupt: raise except Exception, e: self.fail("Problem hashing %s and %s: %s" % (obj_1, obj_2, e)) # List subclass we can add attributes to. class MyClassSuite(list): def __init__(self, tests): super(MyClassSuite, self).__init__(tests) ################################################################ ### /Support code class Test_TestLoader(TestCase): ### Tests for TestLoader.loadTestsFromTestCase ################################################################ # "Return a suite of all tests cases contained in the TestCase-derived # class testCaseClass" def test_loadTestsFromTestCase(self): class Foo(unittest.TestCase): def test_1(self): pass def test_2(self): pass def foo_bar(self): pass tests = unittest.TestSuite([Foo('test_1'), Foo('test_2')]) loader = unittest.TestLoader() self.assertEqual(loader.loadTestsFromTestCase(Foo), tests) # "Return a suite of all tests cases contained in the TestCase-derived # class testCaseClass" # # Make sure it does the right thing even if no tests were found def test_loadTestsFromTestCase__no_matches(self): class Foo(unittest.TestCase): def foo_bar(self): pass empty_suite = unittest.TestSuite() loader = unittest.TestLoader() self.assertEqual(loader.loadTestsFromTestCase(Foo), empty_suite) # "Return a suite of all tests cases contained in the 
TestCase-derived # class testCaseClass" # # What happens if loadTestsFromTestCase() is given an object # that isn't a subclass of TestCase? Specifically, what happens # if testCaseClass is a subclass of TestSuite? # # This is checked for specifically in the code, so we better add a # test for it. def test_loadTestsFromTestCase__TestSuite_subclass(self): class NotATestCase(unittest.TestSuite): pass loader = unittest.TestLoader() try: loader.loadTestsFromTestCase(NotATestCase) except TypeError: pass else: self.fail('Should raise TypeError') # "Return a suite of all tests cases contained in the TestCase-derived # class testCaseClass" # # Make sure loadTestsFromTestCase() picks up the default test method # name (as specified by TestCase), even though the method name does # not match the default TestLoader.testMethodPrefix string def test_loadTestsFromTestCase__default_method_name(self): class Foo(unittest.TestCase): def runTest(self): pass loader = unittest.TestLoader() # This has to be false for the test to succeed self.assertFalse('runTest'.startswith(loader.testMethodPrefix)) suite = loader.loadTestsFromTestCase(Foo) self.assertTrue(isinstance(suite, loader.suiteClass)) self.assertEqual(list(suite), [Foo('runTest')]) ################################################################ ### /Tests for TestLoader.loadTestsFromTestCase ### Tests for TestLoader.loadTestsFromModule ################################################################ # "This method searches `module` for classes derived from TestCase" def test_loadTestsFromModule__TestCase_subclass(self): m = types.ModuleType('m') class MyTestCase(unittest.TestCase): def test(self): pass m.testcase_1 = MyTestCase loader = unittest.TestLoader() suite = loader.loadTestsFromModule(m) self.assertTrue(isinstance(suite, loader.suiteClass)) expected = [loader.suiteClass([MyTestCase('test')])] self.assertEqual(list(suite), expected) # "This method searches `module` for classes derived from TestCase" # # What happens if no 
tests are found (no TestCase instances)? def test_loadTestsFromModule__no_TestCase_instances(self): m = types.ModuleType('m') loader = unittest.TestLoader() suite = loader.loadTestsFromModule(m) self.assertTrue(isinstance(suite, loader.suiteClass)) self.assertEqual(list(suite), []) # "This method searches `module` for classes derived from TestCase" # # What happens if no tests are found (TestCases instances, but no tests)? def test_loadTestsFromModule__no_TestCase_tests(self): m = types.ModuleType('m') class MyTestCase(unittest.TestCase): pass m.testcase_1 = MyTestCase loader = unittest.TestLoader() suite = loader.loadTestsFromModule(m) self.assertTrue(isinstance(suite, loader.suiteClass)) self.assertEqual(list(suite), [loader.suiteClass()]) # "This method searches `module` for classes derived from TestCase"s # # What happens if loadTestsFromModule() is given something other # than a module? # # XXX Currently, it succeeds anyway. This flexibility # should either be documented or loadTestsFromModule() should # raise a TypeError # # XXX Certain people are using this behaviour. We'll add a test for it def test_loadTestsFromModule__not_a_module(self): class MyTestCase(unittest.TestCase): def test(self): pass class NotAModule(object): test_2 = MyTestCase loader = unittest.TestLoader() suite = loader.loadTestsFromModule(NotAModule) reference = [unittest.TestSuite([MyTestCase('test')])] self.assertEqual(list(suite), reference) # Check that loadTestsFromModule honors (or not) a module # with a load_tests function. 
def test_loadTestsFromModule__load_tests(self): m = types.ModuleType('m') class MyTestCase(unittest.TestCase): def test(self): pass m.testcase_1 = MyTestCase load_tests_args = [] def load_tests(loader, tests, pattern): load_tests_args.extend((loader, tests, pattern)) return tests m.load_tests = load_tests loader = unittest.TestLoader() suite = loader.loadTestsFromModule(m) self.assertEquals(load_tests_args, [loader, suite, None]) load_tests_args = [] suite = loader.loadTestsFromModule(m, use_load_tests=False) self.assertEquals(load_tests_args, []) ################################################################ ### /Tests for TestLoader.loadTestsFromModule() ### Tests for TestLoader.loadTestsFromName() ################################################################ # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance." # # Is ValueError raised in response to an empty name? def test_loadTestsFromName__empty_name(self): loader = unittest.TestLoader() try: loader.loadTestsFromName('') except ValueError, e: self.assertEqual(str(e), "Empty module name") else: self.fail("TestLoader.loadTestsFromName failed to raise ValueError") # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance." # # What happens when the name contains invalid characters? def test_loadTestsFromName__malformed_name(self): loader = unittest.TestLoader() # XXX Should this raise ValueError or ImportError? try: loader.loadTestsFromName('abc () //') except ValueError: pass except ImportError: pass else: self.fail("TestLoader.loadTestsFromName failed to raise ValueError") # "The specifier name is a ``dotted name'' that may resolve ... 
to a # module" # # What happens when a module by that name can't be found? def test_loadTestsFromName__unknown_module_name(self): loader = unittest.TestLoader() try: loader.loadTestsFromName('sdasfasfasdf') except ImportError, e: self.assertEqual(str(e), "No module named sdasfasfasdf") else: self.fail("TestLoader.loadTestsFromName failed to raise ImportError") # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance." # # What happens when the module is found, but the attribute can't? def test_loadTestsFromName__unknown_attr_name(self): loader = unittest.TestLoader() try: loader.loadTestsFromName('unittest.sdasfasfasdf') except AttributeError, e: self.assertEqual(str(e), "'module' object has no attribute 'sdasfasfasdf'") else: self.fail("TestLoader.loadTestsFromName failed to raise AttributeError") # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance." # # What happens when we provide the module, but the attribute can't be # found? def test_loadTestsFromName__relative_unknown_name(self): loader = unittest.TestLoader() try: loader.loadTestsFromName('sdasfasfasdf', unittest) except AttributeError, e: self.assertEqual(str(e), "'module' object has no attribute 'sdasfasfasdf'") else: self.fail("TestLoader.loadTestsFromName failed to raise AttributeError") # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance." # ... 
# "The method optionally resolves name relative to the given module" # # Does loadTestsFromName raise ValueError when passed an empty # name relative to a provided module? # # XXX Should probably raise a ValueError instead of an AttributeError def test_loadTestsFromName__relative_empty_name(self): loader = unittest.TestLoader() try: loader.loadTestsFromName('', unittest) except AttributeError, e: pass else: self.fail("Failed to raise AttributeError") # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance." # ... # "The method optionally resolves name relative to the given module" # # What happens when an impossible name is given, relative to the provided # `module`? def test_loadTestsFromName__relative_malformed_name(self): loader = unittest.TestLoader() # XXX Should this raise AttributeError or ValueError? try: loader.loadTestsFromName('abc () //', unittest) except ValueError: pass except AttributeError: pass else: self.fail("TestLoader.loadTestsFromName failed to raise ValueError") # "The method optionally resolves name relative to the given module" # # Does loadTestsFromName raise TypeError when the `module` argument # isn't a module object? 
# # XXX Accepts the not-a-module object, ignoring the object's type # This should raise an exception or the method name should be changed # # XXX Some people are relying on this, so keep it for now def test_loadTestsFromName__relative_not_a_module(self): class MyTestCase(unittest.TestCase): def test(self): pass class NotAModule(object): test_2 = MyTestCase loader = unittest.TestLoader() suite = loader.loadTestsFromName('test_2', NotAModule) reference = [MyTestCase('test')] self.assertEqual(list(suite), reference) # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance." # # Does it raise an exception if the name resolves to an invalid # object? def test_loadTestsFromName__relative_bad_object(self): m = types.ModuleType('m') m.testcase_1 = object() loader = unittest.TestLoader() try: loader.loadTestsFromName('testcase_1', m) except TypeError: pass else: self.fail("Should have raised TypeError") # "The specifier name is a ``dotted name'' that may # resolve either to ... a test case class" def test_loadTestsFromName__relative_TestCase_subclass(self): m = types.ModuleType('m') class MyTestCase(unittest.TestCase): def test(self): pass m.testcase_1 = MyTestCase loader = unittest.TestLoader() suite = loader.loadTestsFromName('testcase_1', m) self.assertTrue(isinstance(suite, loader.suiteClass)) self.assertEqual(list(suite), [MyTestCase('test')]) # "The specifier name is a ``dotted name'' that may resolve either to # a module, a test case class, a TestSuite instance, a test method # within a test case class, or a callable object which returns a # TestCase or TestSuite instance."
    def test_loadTestsFromName__relative_TestSuite(self):
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testsuite = unittest.TestSuite([MyTestCase('test')])

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromName('testsuite', m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        self.assertEqual(list(suite), [MyTestCase('test')])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a test method within a test case class"
    def test_loadTestsFromName__relative_testmethod(self):
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testcase_1 = MyTestCase

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromName('testcase_1.test', m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        self.assertEqual(list(suite), [MyTestCase('test')])

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    #
    # Does loadTestsFromName() raise the proper exception when trying to
    # resolve "a test method within a test case class" that doesn't exist
    # for the given name (relative to a provided module)?
    def test_loadTestsFromName__relative_invalid_testmethod(self):
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testcase_1 = MyTestCase

        loader = unittest.TestLoader()
        try:
            loader.loadTestsFromName('testcase_1.testfoo', m)
        except AttributeError, e:
            self.assertEqual(str(e),
                "type object 'MyTestCase' has no attribute 'testfoo'")
        else:
            self.fail("Failed to raise AttributeError")

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a ... TestSuite instance"
    def test_loadTestsFromName__callable__TestSuite(self):
        m = types.ModuleType('m')
        testcase_1 = unittest.FunctionTestCase(lambda: None)
        testcase_2 = unittest.FunctionTestCase(lambda: None)
        def return_TestSuite():
            return unittest.TestSuite([testcase_1, testcase_2])
        m.return_TestSuite = return_TestSuite

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromName('return_TestSuite', m)
        self.assertTrue(isinstance(suite, loader.suiteClass))
        self.assertEqual(list(suite), [testcase_1, testcase_2])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a TestCase ... instance"
    def test_loadTestsFromName__callable__TestCase_instance(self):
        m = types.ModuleType('m')
        testcase_1 = unittest.FunctionTestCase(lambda: None)
        def return_TestCase():
            return testcase_1
        m.return_TestCase = return_TestCase

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromName('return_TestCase', m)
        self.assertTrue(isinstance(suite, loader.suiteClass))
        self.assertEqual(list(suite), [testcase_1])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a TestCase ... instance"
    #*****************************************************************
    # Override the suiteClass attribute to ensure that the suiteClass
    # attribute is used
    def test_loadTestsFromName__callable__TestCase_instance_ProperSuiteClass(self):
        class SubTestSuite(unittest.TestSuite):
            pass
        m = types.ModuleType('m')
        testcase_1 = unittest.FunctionTestCase(lambda: None)
        def return_TestCase():
            return testcase_1
        m.return_TestCase = return_TestCase

        loader = unittest.TestLoader()
        loader.suiteClass = SubTestSuite
        suite = loader.loadTestsFromName('return_TestCase', m)
        self.assertTrue(isinstance(suite, loader.suiteClass))
        self.assertEqual(list(suite), [testcase_1])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a test method within a test case class"
    #*****************************************************************
    # Override the suiteClass attribute to ensure that the suiteClass
    # attribute is used
    def test_loadTestsFromName__relative_testmethod_ProperSuiteClass(self):
        class SubTestSuite(unittest.TestSuite):
            pass
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testcase_1 = MyTestCase

        loader = unittest.TestLoader()
        loader.suiteClass = SubTestSuite
        suite = loader.loadTestsFromName('testcase_1.test', m)
        self.assertTrue(isinstance(suite, loader.suiteClass))
        self.assertEqual(list(suite), [MyTestCase('test')])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a TestCase or TestSuite instance"
    #
    # What happens if the callable returns something else?
    def test_loadTestsFromName__callable__wrong_type(self):
        m = types.ModuleType('m')
        def return_wrong():
            return 6
        m.return_wrong = return_wrong

        loader = unittest.TestLoader()
        try:
            suite = loader.loadTestsFromName('return_wrong', m)
        except TypeError:
            pass
        else:
            self.fail("TestLoader.loadTestsFromName failed to raise TypeError")

    # "The specifier can refer to modules and packages which have not been
    # imported; they will be imported as a side-effect"
    def test_loadTestsFromName__module_not_loaded(self):
        # We're going to try to load this module as a side-effect, so it
        # better not be loaded before we try.
        #
        # Why pick audioop? Google shows it isn't used very often, so there's
        # a good chance that it won't be imported when this test is run
        module_name = 'audioop'

        import sys
        if module_name in sys.modules:
            del sys.modules[module_name]

        loader = unittest.TestLoader()
        try:
            suite = loader.loadTestsFromName(module_name)

            self.assertTrue(isinstance(suite, loader.suiteClass))
            self.assertEqual(list(suite), [])

            # audioop should now be loaded, thanks to loadTestsFromName()
            self.assertTrue(module_name in sys.modules)
        finally:
            if module_name in sys.modules:
                del sys.modules[module_name]

    ################################################################
    ### /Tests for TestLoader.loadTestsFromName()

    ### Tests for TestLoader.loadTestsFromNames()
    ################################################################

    # "Similar to loadTestsFromName(), but takes a sequence of names rather
    # than a single name."
    #
    # What happens if that sequence of names is empty?
    def test_loadTestsFromNames__empty_name_list(self):
        loader = unittest.TestLoader()

        suite = loader.loadTestsFromNames([])
        self.assertTrue(isinstance(suite, loader.suiteClass))
        self.assertEqual(list(suite), [])

    # "Similar to loadTestsFromName(), but takes a sequence of names rather
    # than a single name."
    # ...
    # "The method optionally resolves name relative to the given module"
    #
    # What happens if that sequence of names is empty?
    #
    # XXX Should this raise a ValueError or just return an empty TestSuite?
    def test_loadTestsFromNames__relative_empty_name_list(self):
        loader = unittest.TestLoader()

        suite = loader.loadTestsFromNames([], unittest)
        self.assertTrue(isinstance(suite, loader.suiteClass))
        self.assertEqual(list(suite), [])

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    #
    # Is ValueError raised in response to an empty name?
    def test_loadTestsFromNames__empty_name(self):
        loader = unittest.TestLoader()

        try:
            loader.loadTestsFromNames([''])
        except ValueError, e:
            self.assertEqual(str(e), "Empty module name")
        else:
            self.fail("TestLoader.loadTestsFromNames failed to raise ValueError")

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    #
    # What happens when presented with an impossible module name?
    def test_loadTestsFromNames__malformed_name(self):
        loader = unittest.TestLoader()

        # XXX Should this raise ValueError or ImportError?
        try:
            loader.loadTestsFromNames(['abc () //'])
        except ValueError:
            pass
        except ImportError:
            pass
        else:
            self.fail("TestLoader.loadTestsFromNames failed to raise ValueError")

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    #
    # What happens when no module can be found for the given name?
    def test_loadTestsFromNames__unknown_module_name(self):
        loader = unittest.TestLoader()

        try:
            loader.loadTestsFromNames(['sdasfasfasdf'])
        except ImportError, e:
            self.assertEqual(str(e), "No module named sdasfasfasdf")
        else:
            self.fail("TestLoader.loadTestsFromNames failed to raise ImportError")

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    #
    # What happens when the module can be found, but not the attribute?
    def test_loadTestsFromNames__unknown_attr_name(self):
        loader = unittest.TestLoader()

        try:
            loader.loadTestsFromNames(['unittest.sdasfasfasdf', 'unittest'])
        except AttributeError, e:
            self.assertEqual(str(e),
                "'module' object has no attribute 'sdasfasfasdf'")
        else:
            self.fail("TestLoader.loadTestsFromNames failed to raise AttributeError")

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    # ...
    # "The method optionally resolves name relative to the given module"
    #
    # What happens when given an unknown attribute on a specified `module`
    # argument?
    def test_loadTestsFromNames__unknown_name_relative_1(self):
        loader = unittest.TestLoader()

        try:
            loader.loadTestsFromNames(['sdasfasfasdf'], unittest)
        except AttributeError, e:
            self.assertEqual(str(e),
                "'module' object has no attribute 'sdasfasfasdf'")
        else:
            self.fail("TestLoader.loadTestsFromName failed to raise AttributeError")

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    # ...
    # "The method optionally resolves name relative to the given module"
    #
    # Do unknown attributes (relative to a provided module) still raise an
    # exception even in the presence of valid attribute names?
    def test_loadTestsFromNames__unknown_name_relative_2(self):
        loader = unittest.TestLoader()

        try:
            loader.loadTestsFromNames(['TestCase', 'sdasfasfasdf'], unittest)
        except AttributeError, e:
            self.assertEqual(str(e),
                "'module' object has no attribute 'sdasfasfasdf'")
        else:
            self.fail("TestLoader.loadTestsFromName failed to raise AttributeError")

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    # ...
    # "The method optionally resolves name relative to the given module"
    #
    # What happens when faced with the empty string?
    #
    # XXX This currently raises AttributeError, though ValueError is probably
    # more appropriate
    def test_loadTestsFromNames__relative_empty_name(self):
        loader = unittest.TestLoader()

        try:
            loader.loadTestsFromNames([''], unittest)
        except AttributeError:
            pass
        else:
            self.fail("Failed to raise ValueError")

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    # ...
    # "The method optionally resolves name relative to the given module"
    #
    # What happens when presented with an impossible attribute name?
    def test_loadTestsFromNames__relative_malformed_name(self):
        loader = unittest.TestLoader()

        # XXX Should this raise AttributeError or ValueError?
        try:
            loader.loadTestsFromNames(['abc () //'], unittest)
        except AttributeError:
            pass
        except ValueError:
            pass
        else:
            self.fail("TestLoader.loadTestsFromNames failed to raise ValueError")

    # "The method optionally resolves name relative to the given module"
    #
    # Does loadTestsFromNames() make sure the provided `module` is in fact
    # a module?
    #
    # XXX This validation is currently not done. This flexibility should
    # either be documented or a TypeError should be raised.
    def test_loadTestsFromNames__relative_not_a_module(self):
        class MyTestCase(unittest.TestCase):
            def test(self): pass

        class NotAModule(object):
            test_2 = MyTestCase

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromNames(['test_2'], NotAModule)

        reference = [unittest.TestSuite([MyTestCase('test')])]
        self.assertEqual(list(suite), reference)

    # "The specifier name is a ``dotted name'' that may resolve either to
    # a module, a test case class, a TestSuite instance, a test method
    # within a test case class, or a callable object which returns a
    # TestCase or TestSuite instance."
    #
    # Does it raise an exception if the name resolves to an invalid
    # object?
    def test_loadTestsFromNames__relative_bad_object(self):
        m = types.ModuleType('m')
        m.testcase_1 = object()

        loader = unittest.TestLoader()
        try:
            loader.loadTestsFromNames(['testcase_1'], m)
        except TypeError:
            pass
        else:
            self.fail("Should have raised TypeError")

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a test case class"
    def test_loadTestsFromNames__relative_TestCase_subclass(self):
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testcase_1 = MyTestCase

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromNames(['testcase_1'], m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        expected = loader.suiteClass([MyTestCase('test')])
        self.assertEqual(list(suite), [expected])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a TestSuite instance"
    def test_loadTestsFromNames__relative_TestSuite(self):
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testsuite = unittest.TestSuite([MyTestCase('test')])

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromNames(['testsuite'], m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        self.assertEqual(list(suite), [m.testsuite])

    # "The specifier name is a ``dotted name'' that may resolve ... to ... a
    # test method within a test case class"
    def test_loadTestsFromNames__relative_testmethod(self):
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testcase_1 = MyTestCase

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromNames(['testcase_1.test'], m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        ref_suite = unittest.TestSuite([MyTestCase('test')])
        self.assertEqual(list(suite), [ref_suite])

    # "The specifier name is a ``dotted name'' that may resolve ... to ... a
    # test method within a test case class"
    #
    # Does the method gracefully handle names that initially look like they
    # resolve to "a test method within a test case class" but don't?
    def test_loadTestsFromNames__relative_invalid_testmethod(self):
        m = types.ModuleType('m')
        class MyTestCase(unittest.TestCase):
            def test(self): pass
        m.testcase_1 = MyTestCase

        loader = unittest.TestLoader()
        try:
            loader.loadTestsFromNames(['testcase_1.testfoo'], m)
        except AttributeError, e:
            self.assertEqual(str(e),
                "type object 'MyTestCase' has no attribute 'testfoo'")
        else:
            self.fail("Failed to raise AttributeError")

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a ... TestSuite instance"
    def test_loadTestsFromNames__callable__TestSuite(self):
        m = types.ModuleType('m')
        testcase_1 = unittest.FunctionTestCase(lambda: None)
        testcase_2 = unittest.FunctionTestCase(lambda: None)
        def return_TestSuite():
            return unittest.TestSuite([testcase_1, testcase_2])
        m.return_TestSuite = return_TestSuite

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromNames(['return_TestSuite'], m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        expected = unittest.TestSuite([testcase_1, testcase_2])
        self.assertEqual(list(suite), [expected])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a TestCase ... instance"
    def test_loadTestsFromNames__callable__TestCase_instance(self):
        m = types.ModuleType('m')
        testcase_1 = unittest.FunctionTestCase(lambda: None)
        def return_TestCase():
            return testcase_1
        m.return_TestCase = return_TestCase

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromNames(['return_TestCase'], m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        ref_suite = unittest.TestSuite([testcase_1])
        self.assertEqual(list(suite), [ref_suite])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a TestCase or TestSuite instance"
    #
    # Are staticmethods handled correctly?
    def test_loadTestsFromNames__callable__call_staticmethod(self):
        m = types.ModuleType('m')
        class Test1(unittest.TestCase):
            def test(self): pass

        testcase_1 = Test1('test')
        class Foo(unittest.TestCase):
            @staticmethod
            def foo():
                return testcase_1
        m.Foo = Foo

        loader = unittest.TestLoader()
        suite = loader.loadTestsFromNames(['Foo.foo'], m)
        self.assertTrue(isinstance(suite, loader.suiteClass))

        ref_suite = unittest.TestSuite([testcase_1])
        self.assertEqual(list(suite), [ref_suite])

    # "The specifier name is a ``dotted name'' that may resolve ... to
    # ... a callable object which returns a TestCase or TestSuite instance"
    #
    # What happens when the callable returns something else?
    def test_loadTestsFromNames__callable__wrong_type(self):
        m = types.ModuleType('m')
        def return_wrong():
            return 6
        m.return_wrong = return_wrong

        loader = unittest.TestLoader()
        try:
            suite = loader.loadTestsFromNames(['return_wrong'], m)
        except TypeError:
            pass
        else:
            self.fail("TestLoader.loadTestsFromNames failed to raise TypeError")

    # "The specifier can refer to modules and packages which have not been
    # imported; they will be imported as a side-effect"
    def test_loadTestsFromNames__module_not_loaded(self):
        # We're going to try to load this module as a side-effect, so it
        # better not be loaded before we try.
        #
        # Why pick audioop? Google shows it isn't used very often, so there's
        # a good chance that it won't be imported when this test is run
        module_name = 'audioop'

        import sys
        if module_name in sys.modules:
            del sys.modules[module_name]

        loader = unittest.TestLoader()
        try:
            suite = loader.loadTestsFromNames([module_name])

            self.assertTrue(isinstance(suite, loader.suiteClass))
            self.assertEqual(list(suite), [unittest.TestSuite()])

            # audioop should now be loaded, thanks to loadTestsFromName()
            self.assertTrue(module_name in sys.modules)
        finally:
            if module_name in sys.modules:
                del sys.modules[module_name]

    ################################################################
    ### /Tests for TestLoader.loadTestsFromNames()

    ### Tests for TestLoader.getTestCaseNames()
    ################################################################

    # "Return a sorted sequence of method names found within testCaseClass"
    #
    # Test.foobar is defined to make sure getTestCaseNames() respects
    # loader.testMethodPrefix
    def test_getTestCaseNames(self):
        class Test(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foobar(self): pass

        loader = unittest.TestLoader()

        self.assertEqual(loader.getTestCaseNames(Test), ['test_1', 'test_2'])

    # "Return a sorted sequence of method names found within testCaseClass"
    #
    # Does getTestCaseNames() behave appropriately if no tests are found?
    def test_getTestCaseNames__no_tests(self):
        class Test(unittest.TestCase):
            def foobar(self): pass

        loader = unittest.TestLoader()

        self.assertEqual(loader.getTestCaseNames(Test), [])

    # "Return a sorted sequence of method names found within testCaseClass"
    #
    # Are not-TestCases handled gracefully?
    #
    # XXX This should raise a TypeError, not return a list
    #
    # XXX It's too late in the 2.5 release cycle to fix this, but it should
    # probably be revisited for 2.6
    def test_getTestCaseNames__not_a_TestCase(self):
        class BadCase(int):
            def test_foo(self):
                pass

        loader = unittest.TestLoader()
        names = loader.getTestCaseNames(BadCase)

        self.assertEqual(names, ['test_foo'])

    # "Return a sorted sequence of method names found within testCaseClass"
    #
    # Make sure inherited names are handled.
    #
    # TestP.foobar is defined to make sure getTestCaseNames() respects
    # loader.testMethodPrefix
    def test_getTestCaseNames__inheritance(self):
        class TestP(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foobar(self): pass

        class TestC(TestP):
            def test_1(self): pass
            def test_3(self): pass

        loader = unittest.TestLoader()

        names = ['test_1', 'test_2', 'test_3']
        self.assertEqual(loader.getTestCaseNames(TestC), names)

    ################################################################
    ### /Tests for TestLoader.getTestCaseNames()

    ### Tests for TestLoader.testMethodPrefix
    ################################################################

    # "String giving the prefix of method names which will be interpreted as
    # test methods"
    #
    # Implicit in the documentation is that testMethodPrefix is respected by
    # all loadTestsFrom* methods.
    def test_testMethodPrefix__loadTestsFromTestCase(self):
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass

        tests_1 = unittest.TestSuite([Foo('foo_bar')])
        tests_2 = unittest.TestSuite([Foo('test_1'), Foo('test_2')])

        loader = unittest.TestLoader()
        loader.testMethodPrefix = 'foo'
        self.assertEqual(loader.loadTestsFromTestCase(Foo), tests_1)

        loader.testMethodPrefix = 'test'
        self.assertEqual(loader.loadTestsFromTestCase(Foo), tests_2)

    # "String giving the prefix of method names which will be interpreted as
    # test methods"
    #
    # Implicit in the documentation is that testMethodPrefix is respected by
    # all loadTestsFrom* methods.
    def test_testMethodPrefix__loadTestsFromModule(self):
        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass
        m.Foo = Foo

        tests_1 = [unittest.TestSuite([Foo('foo_bar')])]
        tests_2 = [unittest.TestSuite([Foo('test_1'), Foo('test_2')])]

        loader = unittest.TestLoader()
        loader.testMethodPrefix = 'foo'
        self.assertEqual(list(loader.loadTestsFromModule(m)), tests_1)

        loader.testMethodPrefix = 'test'
        self.assertEqual(list(loader.loadTestsFromModule(m)), tests_2)

    # "String giving the prefix of method names which will be interpreted as
    # test methods"
    #
    # Implicit in the documentation is that testMethodPrefix is respected by
    # all loadTestsFrom* methods.
    def test_testMethodPrefix__loadTestsFromName(self):
        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass
        m.Foo = Foo

        tests_1 = unittest.TestSuite([Foo('foo_bar')])
        tests_2 = unittest.TestSuite([Foo('test_1'), Foo('test_2')])

        loader = unittest.TestLoader()
        loader.testMethodPrefix = 'foo'
        self.assertEqual(loader.loadTestsFromName('Foo', m), tests_1)

        loader.testMethodPrefix = 'test'
        self.assertEqual(loader.loadTestsFromName('Foo', m), tests_2)

    # "String giving the prefix of method names which will be interpreted as
    # test methods"
    #
    # Implicit in the documentation is that testMethodPrefix is respected by
    # all loadTestsFrom* methods.
    def test_testMethodPrefix__loadTestsFromNames(self):
        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass
        m.Foo = Foo

        tests_1 = unittest.TestSuite([unittest.TestSuite([Foo('foo_bar')])])
        tests_2 = unittest.TestSuite([Foo('test_1'), Foo('test_2')])
        tests_2 = unittest.TestSuite([tests_2])

        loader = unittest.TestLoader()
        loader.testMethodPrefix = 'foo'
        self.assertEqual(loader.loadTestsFromNames(['Foo'], m), tests_1)

        loader.testMethodPrefix = 'test'
        self.assertEqual(loader.loadTestsFromNames(['Foo'], m), tests_2)

    # "The default value is 'test'"
    def test_testMethodPrefix__default_value(self):
        loader = unittest.TestLoader()
        self.assertTrue(loader.testMethodPrefix == 'test')

    ################################################################
    ### /Tests for TestLoader.testMethodPrefix

    ### Tests for TestLoader.sortTestMethodsUsing
    ################################################################

    # "Function to be used to compare method names when sorting them in
    # getTestCaseNames() and all the loadTestsFromX() methods"
    def test_sortTestMethodsUsing__loadTestsFromTestCase(self):
        def reversed_cmp(x, y):
            return -cmp(x, y)

        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass

        loader = unittest.TestLoader()
        loader.sortTestMethodsUsing = reversed_cmp

        tests = loader.suiteClass([Foo('test_2'), Foo('test_1')])
        self.assertEqual(loader.loadTestsFromTestCase(Foo), tests)

    # "Function to be used to compare method names when sorting them in
    # getTestCaseNames() and all the loadTestsFromX() methods"
    def test_sortTestMethodsUsing__loadTestsFromModule(self):
        def reversed_cmp(x, y):
            return -cmp(x, y)

        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
        m.Foo = Foo

        loader = unittest.TestLoader()
        loader.sortTestMethodsUsing = reversed_cmp

        tests = [loader.suiteClass([Foo('test_2'), Foo('test_1')])]
        self.assertEqual(list(loader.loadTestsFromModule(m)), tests)

    # "Function to be used to compare method names when sorting them in
    # getTestCaseNames() and all the loadTestsFromX() methods"
    def test_sortTestMethodsUsing__loadTestsFromName(self):
        def reversed_cmp(x, y):
            return -cmp(x, y)

        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
        m.Foo = Foo

        loader = unittest.TestLoader()
        loader.sortTestMethodsUsing = reversed_cmp

        tests = loader.suiteClass([Foo('test_2'), Foo('test_1')])
        self.assertEqual(loader.loadTestsFromName('Foo', m), tests)

    # "Function to be used to compare method names when sorting them in
    # getTestCaseNames() and all the loadTestsFromX() methods"
    def test_sortTestMethodsUsing__loadTestsFromNames(self):
        def reversed_cmp(x, y):
            return -cmp(x, y)

        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
        m.Foo = Foo

        loader = unittest.TestLoader()
        loader.sortTestMethodsUsing = reversed_cmp

        tests = [loader.suiteClass([Foo('test_2'), Foo('test_1')])]
        self.assertEqual(list(loader.loadTestsFromNames(['Foo'], m)), tests)

    # "Function to be used to compare method names when sorting them in
    # getTestCaseNames()"
    #
    # Does it actually affect getTestCaseNames()?
    def test_sortTestMethodsUsing__getTestCaseNames(self):
        def reversed_cmp(x, y):
            return -cmp(x, y)

        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass

        loader = unittest.TestLoader()
        loader.sortTestMethodsUsing = reversed_cmp

        test_names = ['test_2', 'test_1']
        self.assertEqual(loader.getTestCaseNames(Foo), test_names)

    # "The default value is the built-in cmp() function"
    def test_sortTestMethodsUsing__default_value(self):
        loader = unittest.TestLoader()
        self.assertTrue(loader.sortTestMethodsUsing is cmp)

    # "it can be set to None to disable the sort."
    #
    # XXX How is this different from reassigning cmp? Are the tests returned
    # in a random order or something? This behaviour should die
    def test_sortTestMethodsUsing__None(self):
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass

        loader = unittest.TestLoader()
        loader.sortTestMethodsUsing = None

        test_names = ['test_2', 'test_1']
        self.assertEqual(set(loader.getTestCaseNames(Foo)), set(test_names))

    ################################################################
    ### /Tests for TestLoader.sortTestMethodsUsing

    ### Tests for TestLoader.suiteClass
    ################################################################

    # "Callable object that constructs a test suite from a list of tests."
    def test_suiteClass__loadTestsFromTestCase(self):
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass

        tests = [Foo('test_1'), Foo('test_2')]

        loader = unittest.TestLoader()
        loader.suiteClass = list
        self.assertEqual(loader.loadTestsFromTestCase(Foo), tests)

    # It is implicit in the documentation for TestLoader.suiteClass that
    # all TestLoader.loadTestsFrom* methods respect it. Let's make sure
    def test_suiteClass__loadTestsFromModule(self):
        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass
        m.Foo = Foo

        tests = [[Foo('test_1'), Foo('test_2')]]

        loader = unittest.TestLoader()
        loader.suiteClass = list
        self.assertEqual(loader.loadTestsFromModule(m), tests)

    # It is implicit in the documentation for TestLoader.suiteClass that
    # all TestLoader.loadTestsFrom* methods respect it. Let's make sure
    def test_suiteClass__loadTestsFromName(self):
        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass
        m.Foo = Foo

        tests = [Foo('test_1'), Foo('test_2')]

        loader = unittest.TestLoader()
        loader.suiteClass = list
        self.assertEqual(loader.loadTestsFromName('Foo', m), tests)

    # It is implicit in the documentation for TestLoader.suiteClass that
    # all TestLoader.loadTestsFrom* methods respect it. Let's make sure
    def test_suiteClass__loadTestsFromNames(self):
        m = types.ModuleType('m')
        class Foo(unittest.TestCase):
            def test_1(self): pass
            def test_2(self): pass
            def foo_bar(self): pass
        m.Foo = Foo

        tests = [[Foo('test_1'), Foo('test_2')]]

        loader = unittest.TestLoader()
        loader.suiteClass = list
        self.assertEqual(loader.loadTestsFromNames(['Foo'], m), tests)

    # "The default value is the TestSuite class"
    def test_suiteClass__default_value(self):
        loader = unittest.TestLoader()
        self.assertTrue(loader.suiteClass is unittest.TestSuite)

    ################################################################
    ### /Tests for TestLoader.suiteClass

    ### Support code for Test_TestSuite
    ################################################################

class Foo(unittest.TestCase):
    def test_1(self): pass
    def test_2(self): pass
    def test_3(self): pass
    def runTest(self): pass

def _mk_TestSuite(*names):
    return unittest.TestSuite(Foo(n) for n in names)

################################################################
### /Support code for Test_TestSuite

class Test_TestSuite(TestCase, TestEquality):

    ### Set up attributes needed by inherited tests
    ################################################################

    # Used by TestEquality.test_eq
    eq_pairs = [(unittest.TestSuite(), unittest.TestSuite())
               ,(unittest.TestSuite(), unittest.TestSuite([]))
               ,(_mk_TestSuite('test_1'), _mk_TestSuite('test_1'))]

    # Used by TestEquality.test_ne
    ne_pairs = [(unittest.TestSuite(), _mk_TestSuite('test_1'))
               ,(unittest.TestSuite([]), _mk_TestSuite('test_1'))
               ,(_mk_TestSuite('test_1', 'test_2'), _mk_TestSuite('test_1', 'test_3'))
               ,(_mk_TestSuite('test_1'), _mk_TestSuite('test_2'))]

    ################################################################
    ### /Set up attributes needed by inherited tests

    ### Tests for TestSuite.__init__
    ################################################################

    # "class TestSuite([tests])"
    #
    # The tests iterable should be optional
    def test_init__tests_optional(self):
        suite = unittest.TestSuite()

        self.assertEqual(suite.countTestCases(), 0)

    # "class TestSuite([tests])"
    # ...
    # "If tests is given, it must be an iterable of individual test cases
    # or other test suites that will be used to build the suite initially"
    #
    # TestSuite should deal with empty tests iterables by allowing the
    # creation of an empty suite
    def test_init__empty_tests(self):
        suite = unittest.TestSuite([])

        self.assertEqual(suite.countTestCases(), 0)

    # "class TestSuite([tests])"
    # ...
# "If tests is given, it must be an iterable of individual test cases # or other test suites that will be used to build the suite initially" # # TestSuite should allow any iterable to provide tests def test_init__tests_from_any_iterable(self): def tests(): yield unittest.FunctionTestCase(lambda: None) yield unittest.FunctionTestCase(lambda: None) suite_1 = unittest.TestSuite(tests()) self.assertEqual(suite_1.countTestCases(), 2) suite_2 = unittest.TestSuite(suite_1) self.assertEqual(suite_2.countTestCases(), 2) suite_3 = unittest.TestSuite(set(suite_1)) self.assertEqual(suite_3.countTestCases(), 2) # "class TestSuite([tests])" # ... # "If tests is given, it must be an iterable of individual test cases # or other test suites that will be used to build the suite initially" # # Does TestSuite() also allow other TestSuite() instances to be present # in the tests iterable? def test_init__TestSuite_instances_in_tests(self): def tests(): ftc = unittest.FunctionTestCase(lambda: None) yield unittest.TestSuite([ftc]) yield unittest.FunctionTestCase(lambda: None) suite = unittest.TestSuite(tests()) self.assertEqual(suite.countTestCases(), 2) ################################################################ ### /Tests for TestSuite.__init__ # Container types should support the iter protocol def test_iter(self): test1 = unittest.FunctionTestCase(lambda: None) test2 = unittest.FunctionTestCase(lambda: None) suite = unittest.TestSuite((test1, test2)) self.assertEqual(list(suite), [test1, test2]) # "Return the number of tests represented by the this test object. # ...this method is also implemented by the TestSuite class, which can # return larger [greater than 1] values" # # Presumably an empty TestSuite returns 0? def test_countTestCases_zero_simple(self): suite = unittest.TestSuite() self.assertEqual(suite.countTestCases(), 0) # "Return the number of tests represented by the this test object. 
    # ...this method is also implemented by the TestSuite class, which can
    # return larger [greater than 1] values"
    #
    # Presumably an empty TestSuite (even if it contains other empty
    # TestSuite instances) returns 0?
    def test_countTestCases_zero_nested(self):
        class Test1(unittest.TestCase):
            def test(self): pass

        suite = unittest.TestSuite([unittest.TestSuite()])

        self.assertEqual(suite.countTestCases(), 0)

    # "Return the number of tests represented by the this test object.
    # ...this method is also implemented by the TestSuite class, which can
    # return larger [greater than 1] values"
    def test_countTestCases_simple(self):
        test1 = unittest.FunctionTestCase(lambda: None)
        test2 = unittest.FunctionTestCase(lambda: None)
        suite = unittest.TestSuite((test1, test2))

        self.assertEqual(suite.countTestCases(), 2)

    # "Return the number of tests represented by the this test object.
    # ...this method is also implemented by the TestSuite class, which can
    # return larger [greater than 1] values"
    #
    # Make sure this holds for nested TestSuite instances, too
    def test_countTestCases_nested(self):
        class Test1(unittest.TestCase):
            def test1(self): pass
            def test2(self): pass

        test2 = unittest.FunctionTestCase(lambda: None)
        test3 = unittest.FunctionTestCase(lambda: None)
        child = unittest.TestSuite((Test1('test2'), test2))
        parent = unittest.TestSuite((test3, child, Test1('test1')))

        self.assertEqual(parent.countTestCases(), 4)

    # "Run the tests associated with this suite, collecting the result into
    # the test result object passed as result."
    #
    # And if there are no tests? What then?
    def test_run__empty_suite(self):
        events = []
        result = LoggingResult(events)

        suite = unittest.TestSuite()

        suite.run(result)

        self.assertEqual(events, [])

    # "Note that unlike TestCase.run(), TestSuite.run() requires the
    # result object to be passed in."
    def test_run__requires_result(self):
        suite = unittest.TestSuite()

        try:
            suite.run()
        except TypeError:
            pass
        else:
            self.fail("Failed to raise TypeError")

    # "Run the tests associated with this suite, collecting the result into
    # the test result object passed as result."
    def test_run(self):
        events = []
        result = LoggingResult(events)

        class LoggingCase(unittest.TestCase):
            def run(self, result):
                events.append('run %s' % self._testMethodName)

            def test1(self): pass
            def test2(self): pass

        tests = [LoggingCase('test1'), LoggingCase('test2')]

        unittest.TestSuite(tests).run(result)

        self.assertEqual(events, ['run test1', 'run test2'])

    # "Add a TestCase ... to the suite"
    def test_addTest__TestCase(self):
        class Foo(unittest.TestCase):
            def test(self): pass

        test = Foo('test')
        suite = unittest.TestSuite()

        suite.addTest(test)

        self.assertEqual(suite.countTestCases(), 1)
        self.assertEqual(list(suite), [test])

    # "Add a ... TestSuite to the suite"
    def test_addTest__TestSuite(self):
        class Foo(unittest.TestCase):
            def test(self): pass

        suite_2 = unittest.TestSuite([Foo('test')])

        suite = unittest.TestSuite()
        suite.addTest(suite_2)

        self.assertEqual(suite.countTestCases(), 1)
        self.assertEqual(list(suite), [suite_2])

    # "Add all the tests from an iterable of TestCase and TestSuite
    # instances to this test suite."
    #
    # "This is equivalent to iterating over tests, calling addTest() for
    # each element"
    def test_addTests(self):
        class Foo(unittest.TestCase):
            def test_1(self):
                pass
            def test_2(self):
                pass

        test_1 = Foo('test_1')
        test_2 = Foo('test_2')
        inner_suite = unittest.TestSuite([test_2])

        def gen():
            yield test_1
            yield test_2
            yield inner_suite

        suite_1 = unittest.TestSuite()
        suite_1.addTests(gen())

        self.assertEqual(list(suite_1), list(gen()))

        # "This is equivalent to iterating over tests, calling addTest() for
        # each element"
        suite_2 = unittest.TestSuite()
        for t in gen():
            suite_2.addTest(t)

        self.assertEqual(suite_1, suite_2)

    # "Add all the tests from an iterable of TestCase and TestSuite
    # instances to this test suite."
    #
    # What happens if it doesn't get an iterable?
    def test_addTest__noniterable(self):
        suite = unittest.TestSuite()

        try:
            suite.addTests(5)
        except TypeError:
            pass
        else:
            self.fail("Failed to raise TypeError")

    def test_addTest__noncallable(self):
        suite = unittest.TestSuite()
        self.assertRaises(TypeError, suite.addTest, 5)

    def test_addTest__casesuiteclass(self):
        suite = unittest.TestSuite()
        self.assertRaises(TypeError, suite.addTest, Test_TestSuite)
        self.assertRaises(TypeError, suite.addTest, unittest.TestSuite)

    def test_addTests__string(self):
        suite = unittest.TestSuite()
        self.assertRaises(TypeError, suite.addTests, "foo")


class Test_FunctionTestCase(TestCase):

    # "Return the number of tests represented by the this test object. For
    # TestCase instances, this will always be 1"
    def test_countTestCases(self):
        test = unittest.FunctionTestCase(lambda: None)

        self.assertEqual(test.countTestCases(), 1)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if setUp() raises
    # an exception.
    def test_run_call_order__error_in_setUp(self):
        events = []
        result = LoggingResult(events)

        def setUp():
            events.append('setUp')
            raise RuntimeError('raised by setUp')

        def test():
            events.append('test')

        def tearDown():
            events.append('tearDown')

        expected = ['startTest', 'setUp', 'addError', 'stopTest']
        unittest.FunctionTestCase(test, setUp, tearDown).run(result)
        self.assertEqual(events, expected)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if the test raises
    # an error (as opposed to a failure).
    def test_run_call_order__error_in_test(self):
        events = []
        result = LoggingResult(events)

        def setUp():
            events.append('setUp')

        def test():
            events.append('test')
            raise RuntimeError('raised by test')

        def tearDown():
            events.append('tearDown')

        expected = ['startTest', 'setUp', 'test', 'addError', 'tearDown',
                    'stopTest']
        unittest.FunctionTestCase(test, setUp, tearDown).run(result)
        self.assertEqual(events, expected)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if the test signals
    # a failure (as opposed to an error).
    def test_run_call_order__failure_in_test(self):
        events = []
        result = LoggingResult(events)

        def setUp():
            events.append('setUp')

        def test():
            events.append('test')
            self.fail('raised by test')

        def tearDown():
            events.append('tearDown')

        expected = ['startTest', 'setUp', 'test', 'addFailure', 'tearDown',
                    'stopTest']
        unittest.FunctionTestCase(test, setUp, tearDown).run(result)
        self.assertEqual(events, expected)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if tearDown() raises
    # an exception.
    def test_run_call_order__error_in_tearDown(self):
        events = []
        result = LoggingResult(events)

        def setUp():
            events.append('setUp')

        def test():
            events.append('test')

        def tearDown():
            events.append('tearDown')
            raise RuntimeError('raised by tearDown')

        expected = ['startTest', 'setUp', 'test', 'tearDown', 'addError',
                    'stopTest']
        unittest.FunctionTestCase(test, setUp, tearDown).run(result)
        self.assertEqual(events, expected)

    # "Return a string identifying the specific test case."
    #
    # Because of the vague nature of the docs, I'm not going to lock this
    # test down too much. Really all that can be asserted is that the id()
    # will be a string (either 8-bit or unicode -- again, because the docs
    # just say "string")
    def test_id(self):
        test = unittest.FunctionTestCase(lambda: None)

        self.assertTrue(isinstance(test.id(), basestring))

    # "Returns a one-line description of the test, or None if no description
    # has been provided. The default implementation of this method returns
    # the first line of the test method's docstring, if available, or None."
    def test_shortDescription__no_docstring(self):
        test = unittest.FunctionTestCase(lambda: None)

        self.assertEqual(test.shortDescription(), None)

    # "Returns a one-line description of the test, or None if no description
    # has been provided. The default implementation of this method returns
    # the first line of the test method's docstring, if available, or None."
    def test_shortDescription__singleline_docstring(self):
        desc = "this tests foo"
        test = unittest.FunctionTestCase(lambda: None, description=desc)

        self.assertEqual(test.shortDescription(), "this tests foo")


class Test_TestResult(TestCase):
    # Note: there are not separate tests for TestResult.wasSuccessful(),
    # TestResult.errors, TestResult.failures, TestResult.testsRun or
    # TestResult.shouldStop because these only have meaning in terms of
    # other TestResult methods.
    #
    # Accordingly, tests for the aforenamed attributes are incorporated
    # in with the tests for the defining methods.
    ################################################################

    def test_init(self):
        result = unittest.TestResult()

        self.assertTrue(result.wasSuccessful())
        self.assertEqual(len(result.errors), 0)
        self.assertEqual(len(result.failures), 0)
        self.assertEqual(result.testsRun, 0)
        self.assertEqual(result.shouldStop, False)

    # "This method can be called to signal that the set of tests being
    # run should be aborted by setting the TestResult's shouldStop
    # attribute to True."
    def test_stop(self):
        result = unittest.TestResult()

        result.stop()

        self.assertEqual(result.shouldStop, True)

    # "Called when the test case test is about to be run. The default
    # implementation simply increments the instance's testsRun counter."
    def test_startTest(self):
        class Foo(unittest.TestCase):
            def test_1(self):
                pass

        test = Foo('test_1')

        result = unittest.TestResult()

        result.startTest(test)

        self.assertTrue(result.wasSuccessful())
        self.assertEqual(len(result.errors), 0)
        self.assertEqual(len(result.failures), 0)
        self.assertEqual(result.testsRun, 1)
        self.assertEqual(result.shouldStop, False)

        result.stopTest(test)

    # "Called after the test case test has been executed, regardless of
    # the outcome. The default implementation does nothing."
    def test_stopTest(self):
        class Foo(unittest.TestCase):
            def test_1(self):
                pass

        test = Foo('test_1')

        result = unittest.TestResult()

        result.startTest(test)

        self.assertTrue(result.wasSuccessful())
        self.assertEqual(len(result.errors), 0)
        self.assertEqual(len(result.failures), 0)
        self.assertEqual(result.testsRun, 1)
        self.assertEqual(result.shouldStop, False)

        result.stopTest(test)

        # Same tests as above; make sure nothing has changed
        self.assertTrue(result.wasSuccessful())
        self.assertEqual(len(result.errors), 0)
        self.assertEqual(len(result.failures), 0)
        self.assertEqual(result.testsRun, 1)
        self.assertEqual(result.shouldStop, False)

    # "Called before and after tests are run. The default implementation
    # does nothing."
    def test_startTestRun_stopTestRun(self):
        result = unittest.TestResult()
        result.startTestRun()
        result.stopTestRun()

    # "addSuccess(test)"
    # ...
    # "Called when the test case test succeeds"
    # ...
    # "wasSuccessful() - Returns True if all tests run so far have passed,
    # otherwise returns False"
    # ...
    # "testsRun - The total number of tests run so far."
    # ...
    # "errors - A list containing 2-tuples of TestCase instances and
    # formatted tracebacks. Each tuple represents a test which raised an
    # unexpected exception. Contains formatted
    # tracebacks instead of sys.exc_info() results."
    # ...
    # "failures - A list containing 2-tuples of TestCase instances and
    # formatted tracebacks. Each tuple represents a test where a failure was
    # explicitly signalled using the TestCase.fail*() or TestCase.assert*()
    # methods. Contains formatted tracebacks instead
    # of sys.exc_info() results."
    def test_addSuccess(self):
        class Foo(unittest.TestCase):
            def test_1(self):
                pass

        test = Foo('test_1')

        result = unittest.TestResult()

        result.startTest(test)
        result.addSuccess(test)
        result.stopTest(test)

        self.assertTrue(result.wasSuccessful())
        self.assertEqual(len(result.errors), 0)
        self.assertEqual(len(result.failures), 0)
        self.assertEqual(result.testsRun, 1)
        self.assertEqual(result.shouldStop, False)

    # "addFailure(test, err)"
    # ...
    # "Called when the test case test signals a failure. err is a tuple of
    # the form returned by sys.exc_info(): (type, value, traceback)"
    # ...
    # "wasSuccessful() - Returns True if all tests run so far have passed,
    # otherwise returns False"
    # ...
    # "testsRun - The total number of tests run so far."
    # ...
    # "errors - A list containing 2-tuples of TestCase instances and
    # formatted tracebacks. Each tuple represents a test which raised an
    # unexpected exception. Contains formatted
    # tracebacks instead of sys.exc_info() results."
    # ...
    # "failures - A list containing 2-tuples of TestCase instances and
    # formatted tracebacks. Each tuple represents a test where a failure was
    # explicitly signalled using the TestCase.fail*() or TestCase.assert*()
    # methods. Contains formatted tracebacks instead
    # of sys.exc_info() results."
    def test_addFailure(self):
        import sys

        class Foo(unittest.TestCase):
            def test_1(self):
                pass

        test = Foo('test_1')
        try:
            test.fail("foo")
        except:
            exc_info_tuple = sys.exc_info()

        result = unittest.TestResult()

        result.startTest(test)
        result.addFailure(test, exc_info_tuple)
        result.stopTest(test)

        self.assertFalse(result.wasSuccessful())
        self.assertEqual(len(result.errors), 0)
        self.assertEqual(len(result.failures), 1)
        self.assertEqual(result.testsRun, 1)
        self.assertEqual(result.shouldStop, False)

        test_case, formatted_exc = result.failures[0]
        self.assertTrue(test_case is test)
        self.assertTrue(isinstance(formatted_exc, str))

    # "addError(test, err)"
    # ...
    # "Called when the test case test raises an unexpected exception err
    # is a tuple of the form returned by sys.exc_info():
    # (type, value, traceback)"
    # ...
    # "wasSuccessful() - Returns True if all tests run so far have passed,
    # otherwise returns False"
    # ...
    # "testsRun - The total number of tests run so far."
    # ...
    # "errors - A list containing 2-tuples of TestCase instances and
    # formatted tracebacks. Each tuple represents a test which raised an
    # unexpected exception. Contains formatted
    # tracebacks instead of sys.exc_info() results."
    # ...
    # "failures - A list containing 2-tuples of TestCase instances and
    # formatted tracebacks. Each tuple represents a test where a failure was
    # explicitly signalled using the TestCase.fail*() or TestCase.assert*()
    # methods. Contains formatted tracebacks instead
    # of sys.exc_info() results."
    def test_addError(self):
        import sys

        class Foo(unittest.TestCase):
            def test_1(self):
                pass

        test = Foo('test_1')
        try:
            raise TypeError()
        except:
            exc_info_tuple = sys.exc_info()

        result = unittest.TestResult()

        result.startTest(test)
        result.addError(test, exc_info_tuple)
        result.stopTest(test)

        self.assertFalse(result.wasSuccessful())
        self.assertEqual(len(result.errors), 1)
        self.assertEqual(len(result.failures), 0)
        self.assertEqual(result.testsRun, 1)
        self.assertEqual(result.shouldStop, False)

        test_case, formatted_exc = result.errors[0]
        self.assertTrue(test_case is test)
        self.assertTrue(isinstance(formatted_exc, str))


### Support code for Test_TestCase
################################################################

class Foo(unittest.TestCase):

    def runTest(self):
        pass
    def test1(self):
        pass

class Bar(Foo):
    def test2(self):
        pass

class LoggingTestCase(unittest.TestCase):
    """A test case which logs its calls."""

    def __init__(self, events):
        super(LoggingTestCase, self).__init__('test')
        self.events = events

    def setUp(self):
        if self.__class__ is LoggingTestCase:
            # evade test discovery
            raise unittest.SkipTest
        self.events.append('setUp')

    def test(self):
        self.events.append('test')

    def tearDown(self):
        self.events.append('tearDown')

class ResultWithNoStartTestRunStopTestRun(object):
    """An object honouring TestResult before startTestRun/stopTestRun."""

    def __init__(self):
        self.failures = []
        self.errors = []
        self.testsRun = 0
        self.skipped = []
        self.expectedFailures = []
        self.unexpectedSuccesses = []
        self.shouldStop = False

    def startTest(self, test):
        pass

    def stopTest(self, test):
        pass

    def addError(self, test):
        pass

    def addFailure(self, test):
        pass

    def addSuccess(self, test):
        pass

    def wasSuccessful(self):
        return True

################################################################
### /Support code for Test_TestCase


class Test_TestCase(TestCase, TestEquality, TestHashing):

    ### Set up attributes used by inherited tests
    ################################################################

    # Used by TestHashing.test_hash and TestEquality.test_eq
    eq_pairs = [(Foo('test1'), Foo('test1'))]

    # Used by TestEquality.test_ne
    ne_pairs = [(Foo('test1'), Foo('runTest'))
               ,(Foo('test1'), Bar('test1'))
               ,(Foo('test1'), Bar('test2'))]

    ################################################################
    ### /Set up attributes used by inherited tests

    # "class TestCase([methodName])"
    # ...
    # "Each instance of TestCase will run a single test method: the
    # method named methodName."
    # ...
    # "methodName defaults to "runTest"."
    #
    # Make sure it really is optional, and that it defaults to the proper
    # thing.
    def test_init__no_test_name(self):
        class Test(unittest.TestCase):
            def runTest(self):
                raise MyException()
            def test(self):
                pass

        self.assertEqual(Test().id()[-13:], '.Test.runTest')

    # "class TestCase([methodName])"
    # ...
    # "Each instance of TestCase will run a single test method: the
    # method named methodName."
    def test_init__test_name__valid(self):
        class Test(unittest.TestCase):
            def runTest(self):
                raise MyException()
            def test(self):
                pass

        self.assertEqual(Test('test').id()[-10:], '.Test.test')

    # "class TestCase([methodName])"
    # ...
    # "Each instance of TestCase will run a single test method: the
    # method named methodName."
    def test_init__test_name__invalid(self):
        class Test(unittest.TestCase):
            def runTest(self):
                raise MyException()
            def test(self):
                pass

        try:
            Test('testfoo')
        except ValueError:
            pass
        else:
            self.fail("Failed to raise ValueError")

    # "Return the number of tests represented by the this test object. For
    # TestCase instances, this will always be 1"
    def test_countTestCases(self):
        class Foo(unittest.TestCase):
            def test(self):
                pass

        self.assertEqual(Foo('test').countTestCases(), 1)

    # "Return the default type of test result object to be used to run this
    # test. For TestCase instances, this will always be
    # unittest.TestResult; subclasses of TestCase should
    # override this as necessary."
    def test_defaultTestResult(self):
        class Foo(unittest.TestCase):
            def runTest(self):
                pass

        result = Foo().defaultTestResult()
        self.assertEqual(type(result), unittest.TestResult)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if setUp() raises
    # an exception.
    def test_run_call_order__error_in_setUp(self):
        events = []
        result = LoggingResult(events)

        class Foo(LoggingTestCase):
            def setUp(self):
                super(Foo, self).setUp()
                raise RuntimeError('raised by Foo.setUp')

        Foo(events).run(result)
        expected = ['startTest', 'setUp', 'addError', 'stopTest']
        self.assertEqual(events, expected)

    # "With a temporary result stopTestRun is called when setUp errors."
    def test_run_call_order__error_in_setUp_default_result(self):
        events = []

        class Foo(LoggingTestCase):
            def defaultTestResult(self):
                return LoggingResult(self.events)

            def setUp(self):
                super(Foo, self).setUp()
                raise RuntimeError('raised by Foo.setUp')

        Foo(events).run()
        expected = ['startTestRun', 'startTest', 'setUp', 'addError',
                    'stopTest', 'stopTestRun']
        self.assertEqual(events, expected)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if the test raises
    # an error (as opposed to a failure).
    def test_run_call_order__error_in_test(self):
        events = []
        result = LoggingResult(events)

        class Foo(LoggingTestCase):
            def test(self):
                super(Foo, self).test()
                raise RuntimeError('raised by Foo.test')

        expected = ['startTest', 'setUp', 'test', 'addError', 'tearDown',
                    'stopTest']
        Foo(events).run(result)
        self.assertEqual(events, expected)

    # "With a default result, an error in the test still results in
    # stopTestRun being called."
    def test_run_call_order__error_in_test_default_result(self):
        events = []

        class Foo(LoggingTestCase):
            def defaultTestResult(self):
                return LoggingResult(self.events)

            def test(self):
                super(Foo, self).test()
                raise RuntimeError('raised by Foo.test')

        expected = ['startTestRun', 'startTest', 'setUp', 'test', 'addError',
                    'tearDown', 'stopTest', 'stopTestRun']
        Foo(events).run()
        self.assertEqual(events, expected)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if the test signals
    # a failure (as opposed to an error).
    def test_run_call_order__failure_in_test(self):
        events = []
        result = LoggingResult(events)

        class Foo(LoggingTestCase):
            def test(self):
                super(Foo, self).test()
                self.fail('raised by Foo.test')

        expected = ['startTest', 'setUp', 'test', 'addFailure', 'tearDown',
                    'stopTest']
        Foo(events).run(result)
        self.assertEqual(events, expected)

    # "When a test fails with a default result stopTestRun is still called."
    def test_run_call_order__failure_in_test_default_result(self):

        class Foo(LoggingTestCase):
            def defaultTestResult(self):
                return LoggingResult(self.events)

            def test(self):
                super(Foo, self).test()
                self.fail('raised by Foo.test')

        expected = ['startTestRun', 'startTest', 'setUp', 'test', 'addFailure',
                    'tearDown', 'stopTest', 'stopTestRun']
        events = []
        Foo(events).run()
        self.assertEqual(events, expected)

    # "When a setUp() method is defined, the test runner will run that method
    # prior to each test. Likewise, if a tearDown() method is defined, the
    # test runner will invoke that method after each test. In the example,
    # setUp() was used to create a fresh sequence for each test."
    #
    # Make sure the proper call order is maintained, even if tearDown() raises
    # an exception.
    def test_run_call_order__error_in_tearDown(self):
        events = []
        result = LoggingResult(events)

        class Foo(LoggingTestCase):
            def tearDown(self):
                super(Foo, self).tearDown()
                raise RuntimeError('raised by Foo.tearDown')

        Foo(events).run(result)
        expected = ['startTest', 'setUp', 'test', 'tearDown', 'addError',
                    'stopTest']
        self.assertEqual(events, expected)

    # "When tearDown errors with a default result stopTestRun is still called."
    def test_run_call_order__error_in_tearDown_default_result(self):

        class Foo(LoggingTestCase):
            def defaultTestResult(self):
                return LoggingResult(self.events)

            def tearDown(self):
                super(Foo, self).tearDown()
                raise RuntimeError('raised by Foo.tearDown')

        events = []
        Foo(events).run()
        expected = ['startTestRun', 'startTest', 'setUp', 'test', 'tearDown',
                    'addError', 'stopTest', 'stopTestRun']
        self.assertEqual(events, expected)

    # "TestCase.run() still works when the defaultTestResult is a TestResult
    # that does not support startTestRun and stopTestRun."
    def test_run_call_order_default_result(self):

        class Foo(unittest.TestCase):
            def defaultTestResult(self):
                return ResultWithNoStartTestRunStopTestRun()

            def test(self):
                pass

        Foo('test').run()

    # "This class attribute gives the exception raised by the test() method.
    # If a test framework needs to use a specialized exception, possibly to
    # carry additional information, it must subclass this exception in
    # order to ``play fair'' with the framework. The initial value of this
    # attribute is AssertionError"
    def test_failureException__default(self):
        class Foo(unittest.TestCase):
            def test(self):
                pass

        self.assertTrue(Foo('test').failureException is AssertionError)

    # "This class attribute gives the exception raised by the test() method.
    # If a test framework needs to use a specialized exception, possibly to
    # carry additional information, it must subclass this exception in
    # order to ``play fair'' with the framework."
    #
    # Make sure TestCase.run() respects the designated failureException
    def test_failureException__subclassing__explicit_raise(self):
        events = []
        result = LoggingResult(events)

        class Foo(unittest.TestCase):
            def test(self):
                raise RuntimeError()

            failureException = RuntimeError

        self.assertTrue(Foo('test').failureException is RuntimeError)

        Foo('test').run(result)
        expected = ['startTest', 'addFailure', 'stopTest']
        self.assertEqual(events, expected)

    # "This class attribute gives the exception raised by the test() method.
    # If a test framework needs to use a specialized exception, possibly to
    # carry additional information, it must subclass this exception in
    # order to ``play fair'' with the framework."
    #
    # Make sure TestCase.run() respects the designated failureException
    def test_failureException__subclassing__implicit_raise(self):
        events = []
        result = LoggingResult(events)

        class Foo(unittest.TestCase):
            def test(self):
                self.fail("foo")

            failureException = RuntimeError

        self.assertTrue(Foo('test').failureException is RuntimeError)

        Foo('test').run(result)
        expected = ['startTest', 'addFailure', 'stopTest']
        self.assertEqual(events, expected)

    # "The default implementation does nothing."
    def test_setUp(self):
        class Foo(unittest.TestCase):
            def runTest(self):
                pass

        # ... and nothing should happen
        Foo().setUp()

    # "The default implementation does nothing."
    def test_tearDown(self):
        class Foo(unittest.TestCase):
            def runTest(self):
                pass

        # ... and nothing should happen
        Foo().tearDown()

    # "Return a string identifying the specific test case."
    #
    # Because of the vague nature of the docs, I'm not going to lock this
    # test down too much. Really all that can be asserted is that the id()
    # will be a string (either 8-bit or unicode -- again, because the docs
    # just say "string")
    def test_id(self):
        class Foo(unittest.TestCase):
            def runTest(self):
                pass

        self.assertTrue(isinstance(Foo().id(), basestring))

    # "If result is omitted or None, a temporary result object is created
    # and used, but is not made available to the caller. As TestCase owns the
    # temporary result startTestRun and stopTestRun are called."
    def test_run__uses_defaultTestResult(self):
        events = []

        class Foo(unittest.TestCase):
            def test(self):
                events.append('test')

            def defaultTestResult(self):
                return LoggingResult(events)

        # Make run() find a result object on its own
        Foo('test').run()

        expected = ['startTestRun', 'startTest', 'test', 'addSuccess',
                    'stopTest', 'stopTestRun']
        self.assertEqual(events, expected)

    def testShortDescriptionWithoutDocstring(self):
        self.assertEqual(
                self.shortDescription(),
                'testShortDescriptionWithoutDocstring (' + __name__ +
                '.Test_TestCase)')

    def testShortDescriptionWithOneLineDocstring(self):
        """Tests shortDescription() for a method with a docstring."""
        self.assertEqual(
                self.shortDescription(),
                ('testShortDescriptionWithOneLineDocstring '
                 '(' + __name__ + '.Test_TestCase)\n'
                 'Tests shortDescription() for a method with a docstring.'))

    def testShortDescriptionWithMultiLineDocstring(self):
        """Tests shortDescription() for a method with a longer docstring.

        This method ensures that only the first line of a docstring is
        used in the short description, no matter how long the whole
        thing is.
        """
        self.assertEqual(
                self.shortDescription(),
                ('testShortDescriptionWithMultiLineDocstring '
                 '(' + __name__ + '.Test_TestCase)\n'
                 'Tests shortDescription() for a method with a longer '
                 'docstring.'))

    def testAddTypeEqualityFunc(self):
        class SadSnake(object):
            """Dummy class for test_addTypeEqualityFunc."""
        s1, s2 = SadSnake(), SadSnake()
        self.assertFalse(s1 == s2)
        def AllSnakesCreatedEqual(a, b, msg=None):
            return type(a) == type(b) == SadSnake
        self.addTypeEqualityFunc(SadSnake, AllSnakesCreatedEqual)
        self.assertEqual(s1, s2)
        # No, this doesn't clean up and remove the SadSnake equality func
        # from this TestCase instance, but since it's a local nothing else
        # will ever notice that.
    def testAssertIs(self):
        thing = object()
        self.assertIs(thing, thing)
        self.assertRaises(self.failureException, self.assertIs, thing,
                          object())

    def testAssertIsNot(self):
        thing = object()
        self.assertIsNot(thing, object())
        self.assertRaises(self.failureException, self.assertIsNot, thing,
                          thing)

    def testAssertIsInstance(self):
        thing = []
        self.assertIsInstance(thing, list)
        self.assertRaises(self.failureException, self.assertIsInstance,
                          thing, dict)

    def testAssertNotIsInstance(self):
        thing = []
        self.assertNotIsInstance(thing, dict)
        self.assertRaises(self.failureException, self.assertNotIsInstance,
                          thing, list)

    def testAssertIn(self):
        animals = {'monkey': 'banana', 'cow': 'grass', 'seal': 'fish'}

        self.assertIn('a', 'abc')
        self.assertIn(2, [1, 2, 3])
        self.assertIn('monkey', animals)

        self.assertNotIn('d', 'abc')
        self.assertNotIn(0, [1, 2, 3])
        self.assertNotIn('otter', animals)

        self.assertRaises(self.failureException, self.assertIn, 'x', 'abc')
        self.assertRaises(self.failureException, self.assertIn, 4, [1, 2, 3])
        self.assertRaises(self.failureException, self.assertIn, 'elephant',
                          animals)

        self.assertRaises(self.failureException, self.assertNotIn, 'c', 'abc')
        self.assertRaises(self.failureException, self.assertNotIn, 1,
                          [1, 2, 3])
        self.assertRaises(self.failureException, self.assertNotIn, 'cow',
                          animals)

    def testAssertDictContainsSubset(self):
        self.assertDictContainsSubset({}, {})
        self.assertDictContainsSubset({}, {'a': 1})
        self.assertDictContainsSubset({'a': 1}, {'a': 1})
        self.assertDictContainsSubset({'a': 1}, {'a': 1, 'b': 2})
        self.assertDictContainsSubset({'a': 1, 'b': 2}, {'a': 1, 'b': 2})

        self.assertRaises(unittest.TestCase.failureException,
                          self.assertDictContainsSubset, {'a': 2}, {'a': 1},
                          '.*Mismatched values:.*')

        self.assertRaises(unittest.TestCase.failureException,
                          self.assertDictContainsSubset, {'c': 1}, {'a': 1},
                          '.*Missing:.*')

        self.assertRaises(unittest.TestCase.failureException,
                          self.assertDictContainsSubset, {'a': 1, 'c': 1},
                          {'a': 1}, '.*Missing:.*')
        self.assertRaises(unittest.TestCase.failureException,
                          self.assertDictContainsSubset, {'a': 1, 'c': 1},
                          {'a': 1}, '.*Missing:.*Mismatched values:.*')

    def testAssertEqual(self):
        equal_pairs = [
                ((), ()),
                ({}, {}),
                ([], []),
                (set(), set()),
                (frozenset(), frozenset())]
        for a, b in equal_pairs:
            # This mess of try excepts is to test the assertEqual behavior
            # itself.
            try:
                self.assertEqual(a, b)
            except self.failureException:
                self.fail('assertEqual(%r, %r) failed' % (a, b))
            try:
                self.assertEqual(a, b, msg='foo')
            except self.failureException:
                self.fail('assertEqual(%r, %r) with msg= failed' % (a, b))
            try:
                self.assertEqual(a, b, 'foo')
            except self.failureException:
                self.fail('assertEqual(%r, %r) with third parameter failed' %
                          (a, b))

        unequal_pairs = [
                ((), []),
                ({}, set()),
                (set([4,1]), frozenset([4,2])),
                (frozenset([4,5]), set([2,3])),
                (set([3,4]), set([5,4]))]
        for a, b in unequal_pairs:
            self.assertRaises(self.failureException, self.assertEqual, a, b)
            self.assertRaises(self.failureException, self.assertEqual, a, b,
                              'foo')
            self.assertRaises(self.failureException, self.assertEqual, a, b,
                              msg='foo')

    def testEquality(self):
        self.assertListEqual([], [])
        self.assertTupleEqual((), ())
        self.assertSequenceEqual([], ())

        a = [0, 'a', []]
        b = []
        self.assertRaises(unittest.TestCase.failureException,
                          self.assertListEqual, a, b)
        self.assertRaises(unittest.TestCase.failureException,
                          self.assertListEqual, tuple(a), tuple(b))
        self.assertRaises(unittest.TestCase.failureException,
                          self.assertSequenceEqual, a, tuple(b))

        b.extend(a)
        self.assertListEqual(a, b)
        self.assertTupleEqual(tuple(a), tuple(b))
        self.assertSequenceEqual(a, tuple(b))
        self.assertSequenceEqual(tuple(a), b)

        self.assertRaises(self.failureException, self.assertListEqual,
                          a, tuple(b))
        self.assertRaises(self.failureException, self.assertTupleEqual,
                          tuple(a), b)
        self.assertRaises(self.failureException, self.assertListEqual, None, b)
        self.assertRaises(self.failureException, self.assertTupleEqual, None,
                          tuple(b))
        self.assertRaises(self.failureException, self.assertSequenceEqual,
                          None, tuple(b))
        self.assertRaises(self.failureException, self.assertListEqual, 1, 1)
        self.assertRaises(self.failureException, self.assertTupleEqual, 1, 1)
        self.assertRaises(self.failureException, self.assertSequenceEqual,
                          1, 1)

        self.assertDictEqual({}, {})

        c = { 'x': 1 }
        d = {}
        self.assertRaises(unittest.TestCase.failureException,
                          self.assertDictEqual, c, d)

        d.update(c)
        self.assertDictEqual(c, d)

        d['x'] = 0
        self.assertRaises(unittest.TestCase.failureException,
                          self.assertDictEqual, c, d, 'These are unequal')

        self.assertRaises(self.failureException, self.assertDictEqual, None, d)
        self.assertRaises(self.failureException, self.assertDictEqual, [], d)
        self.assertRaises(self.failureException, self.assertDictEqual, 1, 1)

        self.assertSameElements([1, 2, 3], [3, 2, 1])
        self.assertSameElements([1, 2] + [3] * 100, [1] * 100 + [2, 3])
        self.assertSameElements(['foo', 'bar', 'baz'], ['bar', 'baz', 'foo'])
        self.assertRaises(self.failureException, self.assertSameElements,
                          [10], [10, 11])
        self.assertRaises(self.failureException, self.assertSameElements,
                          [10, 11], [10])

        # Test that sequences of unhashable objects can be tested for sameness:
        self.assertSameElements([[1, 2], [3, 4]], [[3, 4], [1, 2]])

        self.assertSameElements([{'a': 1}, {'b': 2}], [{'b': 2}, {'a': 1}])
        self.assertRaises(self.failureException, self.assertSameElements,
                          [[1]], [[2]])

    def testAssertSetEqual(self):
        set1 = set()
        set2 = set()
        self.assertSetEqual(set1, set2)

        self.assertRaises(self.failureException, self.assertSetEqual,
                          None, set2)
        self.assertRaises(self.failureException, self.assertSetEqual, [], set2)
        self.assertRaises(self.failureException, self.assertSetEqual,
                          set1, None)
        self.assertRaises(self.failureException, self.assertSetEqual, set1, [])

        set1 = set(['a'])
        set2 = set()
        self.assertRaises(self.failureException, self.assertSetEqual,
                          set1, set2)

        set1 = set(['a'])
        set2 = set(['a'])
        self.assertSetEqual(set1, set2)

        set1 = set(['a'])
        set2 = set(['a', 'b'])
        self.assertRaises(self.failureException, self.assertSetEqual,
                          set1, set2)

        set1 = set(['a'])
        set2 = frozenset(['a', 'b'])
        self.assertRaises(self.failureException, self.assertSetEqual,
                          set1, set2)

        set1 = set(['a', 'b'])
        set2 = frozenset(['a', 'b'])
        self.assertSetEqual(set1, set2)

        set1 = set()
        set2 = "foo"
        self.assertRaises(self.failureException, self.assertSetEqual,
                          set1, set2)
        self.assertRaises(self.failureException, self.assertSetEqual,
                          set2, set1)

        # make sure any string formatting is tuple-safe
        set1 = set([(0, 1), (2, 3)])
        set2 = set([(4, 5)])
        self.assertRaises(self.failureException, self.assertSetEqual,
                          set1, set2)

    def testInequality(self):
        # Try ints
        self.assertGreater(2, 1)
        self.assertGreaterEqual(2, 1)
        self.assertGreaterEqual(1, 1)
        self.assertLess(1, 2)
        self.assertLessEqual(1, 2)
        self.assertLessEqual(1, 1)
        self.assertRaises(self.failureException, self.assertGreater, 1, 2)
        self.assertRaises(self.failureException, self.assertGreater, 1, 1)
        self.assertRaises(self.failureException, self.assertGreaterEqual, 1, 2)
        self.assertRaises(self.failureException, self.assertLess, 2, 1)
        self.assertRaises(self.failureException, self.assertLess, 1, 1)
        self.assertRaises(self.failureException, self.assertLessEqual, 2, 1)

        # Try Floats
        self.assertGreater(1.1, 1.0)
        self.assertGreaterEqual(1.1, 1.0)
        self.assertGreaterEqual(1.0, 1.0)
        self.assertLess(1.0, 1.1)
        self.assertLessEqual(1.0, 1.1)
        self.assertLessEqual(1.0, 1.0)
        self.assertRaises(self.failureException, self.assertGreater, 1.0, 1.1)
        self.assertRaises(self.failureException, self.assertGreater, 1.0, 1.0)
        self.assertRaises(self.failureException, self.assertGreaterEqual,
                          1.0, 1.1)
        self.assertRaises(self.failureException, self.assertLess, 1.1, 1.0)
        self.assertRaises(self.failureException, self.assertLess, 1.0, 1.0)
        self.assertRaises(self.failureException, self.assertLessEqual,
                          1.1, 1.0)

        # Try Strings
        self.assertGreater('bug', 'ant')
        self.assertGreaterEqual('bug', 'ant')
        self.assertGreaterEqual('ant', 'ant')
        self.assertLess('ant', 'bug')
        self.assertLessEqual('ant', 'bug')
        self.assertLessEqual('ant', 'ant')
        self.assertRaises(self.failureException, self.assertGreater, 'ant', 'bug')
        self.assertRaises(self.failureException, self.assertGreater, 'ant', 'ant')
        self.assertRaises(self.failureException, self.assertGreaterEqual, 'ant', 'bug')
        self.assertRaises(self.failureException, self.assertLess, 'bug', 'ant')
        self.assertRaises(self.failureException, self.assertLess, 'ant', 'ant')
        self.assertRaises(self.failureException, self.assertLessEqual, 'bug', 'ant')

        # Try Unicode
        self.assertGreater(u'bug', u'ant')
        self.assertGreaterEqual(u'bug', u'ant')
        self.assertGreaterEqual(u'ant', u'ant')
        self.assertLess(u'ant', u'bug')
        self.assertLessEqual(u'ant', u'bug')
        self.assertLessEqual(u'ant', u'ant')
        self.assertRaises(self.failureException, self.assertGreater, u'ant', u'bug')
        self.assertRaises(self.failureException, self.assertGreater, u'ant', u'ant')
        self.assertRaises(self.failureException, self.assertGreaterEqual, u'ant', u'bug')
        self.assertRaises(self.failureException, self.assertLess, u'bug', u'ant')
        self.assertRaises(self.failureException, self.assertLess, u'ant', u'ant')
        self.assertRaises(self.failureException, self.assertLessEqual, u'bug', u'ant')

        # Try Mixed String/Unicode
        self.assertGreater('bug', u'ant')
        self.assertGreater(u'bug', 'ant')
        self.assertGreaterEqual('bug', u'ant')
        self.assertGreaterEqual(u'bug', 'ant')
        self.assertGreaterEqual('ant', u'ant')
        self.assertGreaterEqual(u'ant', 'ant')
        self.assertLess('ant', u'bug')
        self.assertLess(u'ant', 'bug')
        self.assertLessEqual('ant', u'bug')
        self.assertLessEqual(u'ant', 'bug')
        self.assertLessEqual('ant', u'ant')
        self.assertLessEqual(u'ant', 'ant')
        self.assertRaises(self.failureException, self.assertGreater, 'ant', u'bug')
        self.assertRaises(self.failureException, self.assertGreater, u'ant', 'bug')
        self.assertRaises(self.failureException, self.assertGreater, 'ant', u'ant')
        self.assertRaises(self.failureException, self.assertGreater, u'ant', 'ant')
        self.assertRaises(self.failureException, self.assertGreaterEqual, 'ant', u'bug')
        self.assertRaises(self.failureException, self.assertGreaterEqual, u'ant', 'bug')
        self.assertRaises(self.failureException, self.assertLess, 'bug', u'ant')
        self.assertRaises(self.failureException, self.assertLess, u'bug', 'ant')
        self.assertRaises(self.failureException, self.assertLess, 'ant', u'ant')
        self.assertRaises(self.failureException, self.assertLess, u'ant', 'ant')
        self.assertRaises(self.failureException, self.assertLessEqual, 'bug', u'ant')
        self.assertRaises(self.failureException, self.assertLessEqual, u'bug', 'ant')

    def testAssertMultiLineEqual(self):
        sample_text = """\
http://www.python.org/doc/2.3/lib/module-unittest.html
test case
A test case is the smallest unit of testing. [...]
"""
        revised_sample_text = """\
http://www.python.org/doc/2.4.1/lib/module-unittest.html
test case
A test case is the smallest unit of testing. [...]
You may provide your own implementation that does not subclass from TestCase,
of course.
"""
        sample_text_error = """
- http://www.python.org/doc/2.3/lib/module-unittest.html
?                             ^
+ http://www.python.org/doc/2.4.1/lib/module-unittest.html
?                             ^^^
  test case
- A test case is the smallest unit of testing. [...]
+ A test case is the smallest unit of testing. [...] You may provide your
?                                                   +++++++++++++++++++++
+ own implementation that does not subclass from TestCase, of course.
"""
        for type_changer in (lambda x: x, lambda x: x.decode('utf8')):
            try:
                self.assertMultiLineEqual(type_changer(sample_text),
                                          type_changer(revised_sample_text))
            except self.failureException, e:
                # no fair testing ourself with ourself, use assertEqual..
                self.assertEqual(sample_text_error, str(e).encode('utf8'))

    def testAssertIsNone(self):
        self.assertIsNone(None)
        self.assertRaises(self.failureException, self.assertIsNone, False)
        self.assertIsNotNone('DjZoPloGears on Rails')
        self.assertRaises(self.failureException, self.assertIsNotNone, None)

    def testAssertRegexpMatches(self):
        self.assertRegexpMatches('asdfabasdf', r'ab+')
        self.assertRaises(self.failureException, self.assertRegexpMatches,
                          'saaas', r'aaaa')

    def testAssertRaisesRegexp(self):
        class ExceptionMock(Exception):
            pass

        def Stub():
            raise ExceptionMock('We expect')

        self.assertRaisesRegexp(ExceptionMock, re.compile('expect$'), Stub)
        self.assertRaisesRegexp(ExceptionMock, 'expect$', Stub)
        self.assertRaisesRegexp(ExceptionMock, u'expect$', Stub)

    def testAssertNotRaisesRegexp(self):
        self.assertRaisesRegexp(
            self.failureException, '^Exception not raised$',
            self.assertRaisesRegexp, Exception, re.compile('x'),
            lambda: None)
        self.assertRaisesRegexp(
            self.failureException, '^Exception not raised$',
            self.assertRaisesRegexp, Exception, 'x',
            lambda: None)
        self.assertRaisesRegexp(
            self.failureException, '^Exception not raised$',
            self.assertRaisesRegexp, Exception, u'x',
            lambda: None)

    def testAssertRaisesRegexpMismatch(self):
        def Stub():
            raise Exception('Unexpected')

        self.assertRaisesRegexp(
            self.failureException,
            r'"\^Expected\$" does not match "Unexpected"',
            self.assertRaisesRegexp, Exception, '^Expected$',
            Stub)
        self.assertRaisesRegexp(
            self.failureException,
            r'"\^Expected\$" does not match "Unexpected"',
            self.assertRaisesRegexp, Exception, u'^Expected$',
            Stub)
        self.assertRaisesRegexp(
            self.failureException,
            r'"\^Expected\$" does not match "Unexpected"',
            self.assertRaisesRegexp, Exception,
            re.compile('^Expected$'), Stub)

#     def testAssertRaisesExcValue(self):
#         class ExceptionMock(Exception):
#             pass
#
#         def Stub(foo):
#             raise ExceptionMock(foo)
#         v = "particular value"
#
#         ctx = self.assertRaises(ExceptionMock)
#         with ctx:
#             Stub(v)
#         e = ctx.exc_value
#         self.assertTrue(isinstance(e, ExceptionMock))
#         self.assertEqual(e.args[0], v)

    def testSynonymAssertMethodNames(self):
        """Test undocumented method name synonyms.

        Please do not use these method names in your own code.

        This test confirms their continued existence and functionality
        in order to avoid breaking existing code.
        """
        self.assertNotEquals(3, 5)
        self.assertEquals(3, 3)
        self.assertAlmostEquals(2.0, 2.0)
        self.assertNotAlmostEquals(3.0, 5.0)
        self.assert_(True)

    def testPendingDeprecationMethodNames(self):
        """Test fail* methods pending deprecation, they will warn in 3.2.

        Do not use these methods.  They will go away in 3.3.
        """
        self.failIfEqual(3, 5)
        self.failUnlessEqual(3, 3)
        self.failUnlessAlmostEqual(2.0, 2.0)
        self.failIfAlmostEqual(3.0, 5.0)
        self.failUnless(True)
        self.failUnlessRaises(TypeError, lambda _: 3.14 + u'spam')
        self.failIf(False)

    # not sure why this is broken, don't care
#     def testDeepcopy(self):
#         # Issue: 5660
#         class TestableTest(TestCase):
#             def testNothing(self):
#                 pass
#
#         test = TestableTest('testNothing')
#
#         # This shouldn't blow up
#         deepcopy(test)


class Test_TestSkipping(TestCase):

    def test_skipping(self):
        class Foo(unittest.TestCase):
            def test_skip_me(self):
                self.skipTest("skip")
        events = []
        result = LoggingResult(events)
        test = Foo("test_skip_me")
        test.run(result)
        self.assertEqual(events, ['startTest', 'addSkip', 'stopTest'])
        self.assertEqual(result.skipped, [(test, "skip")])

        # Try letting setUp skip the test now.
        class Foo(unittest.TestCase):
            def setUp(self):
                self.skipTest("testing")
            def test_nothing(self):
                pass
        events = []
        result = LoggingResult(events)
        test = Foo("test_nothing")
        test.run(result)
        self.assertEqual(events, ['startTest', 'addSkip', 'stopTest'])
        self.assertEqual(result.skipped, [(test, "testing")])
        self.assertEqual(result.testsRun, 1)

    def test_skipping_decorators(self):
        op_table = ((unittest.skipUnless, False, True),
                    (unittest.skipIf, True, False))
        for deco, do_skip, dont_skip in op_table:
            class Foo(unittest.TestCase):
                @deco(do_skip, "testing")
                def test_skip(self):
                    pass

                @deco(dont_skip, "testing")
                def test_dont_skip(self):
                    pass
            test_do_skip = Foo("test_skip")
            test_dont_skip = Foo("test_dont_skip")
            suite = unittest.TestSuite([test_do_skip, test_dont_skip])
            events = []
            result = LoggingResult(events)
            suite.run(result)
            self.assertEqual(len(result.skipped), 1)
            expected = ['startTest', 'addSkip', 'stopTest',
                        'startTest', 'addSuccess', 'stopTest']
            self.assertEqual(events, expected)
            self.assertEqual(result.testsRun, 2)
            self.assertEqual(result.skipped, [(test_do_skip, "testing")])
            self.assertTrue(result.wasSuccessful())

    def test_skip_class(self):
        class Foo(unittest.TestCase):
            def test_1(self):
                record.append(1)
        Foo = unittest.skip("testing")(Foo)
        record = []
        result = unittest.TestResult()
        test = Foo("test_1")
        suite = unittest.TestSuite([test])
        suite.run(result)
        self.assertEqual(result.skipped, [(test, "testing")])
        self.assertEqual(record, [])

    def test_expected_failure(self):
        class Foo(unittest.TestCase):
            @unittest.expectedFailure
            def test_die(self):
                self.fail("help me!")
        events = []
        result = LoggingResult(events)
        test = Foo("test_die")
        test.run(result)
        self.assertEqual(events,
                         ['startTest', 'addExpectedFailure', 'stopTest'])
        self.assertEqual(result.expectedFailures[0][0], test)
        self.assertTrue(result.wasSuccessful())

    def test_unexpected_success(self):
        class Foo(unittest.TestCase):
            @unittest.expectedFailure
            def test_die(self):
                pass
        events = []
        result = LoggingResult(events)
        test = Foo("test_die")
        test.run(result)
        self.assertEqual(events,
                         ['startTest', 'addUnexpectedSuccess', 'stopTest'])
        self.assertFalse(result.failures)
        self.assertEqual(result.unexpectedSuccesses, [test])
        self.assertTrue(result.wasSuccessful())


class Test_Assertions(TestCase):

    def test_AlmostEqual(self):
        self.assertAlmostEqual(1.00000001, 1.0)
        self.assertNotAlmostEqual(1.0000001, 1.0)
        self.assertRaises(self.failureException,
                          self.assertAlmostEqual, 1.0000001, 1.0)
        self.assertRaises(self.failureException,
                          self.assertNotAlmostEqual, 1.00000001, 1.0)

        self.assertAlmostEqual(1.1, 1.0, places=0)
        self.assertRaises(self.failureException,
                          self.assertAlmostEqual, 1.1, 1.0, places=1)

        self.assertAlmostEqual(0, .1+.1j, places=0)
        self.assertNotAlmostEqual(0, .1+.1j, places=1)
        self.assertRaises(self.failureException,
                          self.assertAlmostEqual, 0, .1+.1j, places=1)
        self.assertRaises(self.failureException,
                          self.assertNotAlmostEqual, 0, .1+.1j, places=0)

        self.assertAlmostEqual(float('inf'), float('inf'))
        self.assertRaises(self.failureException, self.assertNotAlmostEqual,
                          float('inf'), float('inf'))

    def test_assertRaises(self):
        def _raise(e):
            raise e
        self.assertRaises(KeyError, _raise, KeyError)
        self.assertRaises(KeyError, _raise, KeyError("key"))
        try:
            self.assertRaises(KeyError, lambda: None)
        except self.failureException, e:
            self.assert_("KeyError not raised" in e, str(e))
        else:
            self.fail("assertRaises() didn't fail")
        try:
            self.assertRaises(KeyError, _raise, ValueError)
        except ValueError:
            pass
        else:
            self.fail("assertRaises() didn't let exception pass through")
#         with self.assertRaises(KeyError):
#             raise KeyError
#         with self.assertRaises(KeyError):
#             raise KeyError("key")
#         try:
#             with self.assertRaises(KeyError):
#                 pass
#         except self.failureException as e:
#             self.assert_("KeyError not raised" in e, str(e))
#         else:
#             self.fail("assertRaises() didn't fail")
#         try:
#             with self.assertRaises(KeyError):
#                 raise ValueError
#         except ValueError:
#             pass
#         else:
#             self.fail("assertRaises() didn't let exception pass through")


class TestLongMessage(TestCase):
    """Test that the individual asserts honour longMessage.
    This actually tests all the message behaviour for
    asserts that use longMessage."""

    def setUp(self):
        class TestableTestFalse(TestCase):
            longMessage = False
            failureException = self.failureException

            def testTest(self):
                pass

        class TestableTestTrue(TestCase):
            longMessage = True
            failureException = self.failureException

            def testTest(self):
                pass

        self.testableTrue = TestableTestTrue('testTest')
        self.testableFalse = TestableTestFalse('testTest')

    def testDefault(self):
        self.assertFalse(TestCase.longMessage)

    def test_formatMsg(self):
        self.assertEquals(self.testableFalse._formatMessage(None, "foo"), "foo")
        self.assertEquals(self.testableFalse._formatMessage("foo", "bar"), "foo")

        self.assertEquals(self.testableTrue._formatMessage(None, "foo"), "foo")
        self.assertEquals(self.testableTrue._formatMessage("foo", "bar"), "bar : foo")

    def assertMessages(self, methodName, args, errors):
        def getMethod(i):
            useTestableFalse = i < 2
            if useTestableFalse:
                test = self.testableFalse
            else:
                test = self.testableTrue
            return getattr(test, methodName)

        for i, expected_regexp in enumerate(errors):
            testMethod = getMethod(i)
            kwargs = {}
            withMsg = i % 2
            if withMsg:
                kwargs = {"msg": "oops"}

            self.assertRaisesRegexp(self.failureException,
                                    expected_regexp,
                                    lambda: testMethod(*args, **kwargs))

    def testAssertTrue(self):
        self.assertMessages('assertTrue', (False,),
                            ["^False is not True$", "^oops$",
                             "^False is not True$",
                             "^False is not True : oops$"])

    def testAssertFalse(self):
        self.assertMessages('assertFalse', (True,),
                            ["^True is not False$", "^oops$",
                             "^True is not False$",
                             "^True is not False : oops$"])

    def testNotEqual(self):
        self.assertMessages('assertNotEqual', (1, 1),
                            ["^1 == 1$", "^oops$",
                             "^1 == 1$",
                             "^1 == 1 : oops$"])

    def testAlmostEqual(self):
        self.assertMessages('assertAlmostEqual', (1, 2),
                            ["^1 != 2 within 7 places$", "^oops$",
                             "^1 != 2 within 7 places$",
                             "^1 != 2 within 7 places : oops$"])

    def testNotAlmostEqual(self):
        self.assertMessages('assertNotAlmostEqual', (1, 1),
                            ["^1 == 1 within 7 places$", "^oops$",
                             "^1 == 1 within 7 places$",
                             "^1 == 1 within 7 places : oops$"])

    def test_baseAssertEqual(self):
        self.assertMessages('_baseAssertEqual', (1, 2),
                            ["^1 != 2$", "^oops$", "^1 != 2$", "^1 != 2 : oops$"])

    def testAssertSequenceEqual(self):
        # Error messages are multiline so not testing on full message
        # assertTupleEqual and assertListEqual delegate to this method
        self.assertMessages('assertSequenceEqual', ([], [None]),
                            ["\+ \[None\]$", "^oops$",
                             r"\+ \[None\]$",
                             r"\+ \[None\] : oops$"])

    def testAssertSetEqual(self):
        self.assertMessages('assertSetEqual', (set(), set([None])),
                            ["None$", "^oops$", "None$", "None : oops$"])

    def testAssertIn(self):
        self.assertMessages('assertIn', (None, []),
                            ['^None not found in \[\]$', "^oops$",
                             '^None not found in \[\]$',
                             '^None not found in \[\] : oops$'])

    def testAssertNotIn(self):
        self.assertMessages('assertNotIn', (None, [None]),
                            ['^None unexpectedly found in \[None\]$', "^oops$",
                             '^None unexpectedly found in \[None\]$',
                             '^None unexpectedly found in \[None\] : oops$'])

    def testAssertDictEqual(self):
        self.assertMessages('assertDictEqual', ({}, {'key': 'value'}),
                            [r"\+ \{'key': 'value'\}$", "^oops$",
                             "\+ \{'key': 'value'\}$",
                             "\+ \{'key': 'value'\} : oops$"])

    def testAssertDictContainsSubset(self):
        self.assertMessages('assertDictContainsSubset', ({'key': 'value'}, {}),
                            ["^Missing: 'key'$", "^oops$",
                             "^Missing: 'key'$",
                             "^Missing: 'key' : oops$"])

    def testAssertSameElements(self):
        self.assertMessages('assertSameElements', ([], [None]),
                            [r"\[None\]$", "^oops$",
                             r"\[None\]$",
                             r"\[None\] : oops$"])

    def testAssertMultiLineEqual(self):
        self.assertMessages('assertMultiLineEqual', ("", "foo"),
                            [r"\+ foo$", "^oops$",
                             r"\+ foo$",
                             r"\+ foo : oops$"])

    def testAssertLess(self):
        self.assertMessages('assertLess', (2, 1),
                            ["^2 not less than 1$", "^oops$",
                             "^2 not less than 1$",
                             "^2 not less than 1 : oops$"])

    def testAssertLessEqual(self):
        self.assertMessages('assertLessEqual', (2, 1),
                            ["^2 not less than or equal to 1$", "^oops$",
                             "^2 not less than or equal to 1$",
                             "^2 not less than or equal to 1 : oops$"])

    def testAssertGreater(self):
        self.assertMessages('assertGreater', (1, 2),
                            ["^1 not greater than 2$", "^oops$",
                             "^1 not greater than 2$",
                             "^1 not greater than 2 : oops$"])

    def testAssertGreaterEqual(self):
        self.assertMessages('assertGreaterEqual', (1, 2),
                            ["^1 not greater than or equal to 2$", "^oops$",
                             "^1 not greater than or equal to 2$",
                             "^1 not greater than or equal to 2 : oops$"])

    def testAssertIsNone(self):
        self.assertMessages('assertIsNone', ('not None',),
                            ["^'not None' is not None$", "^oops$",
                             "^'not None' is not None$",
                             "^'not None' is not None : oops$"])

    def testAssertIsNotNone(self):
        self.assertMessages('assertIsNotNone', (None,),
                            ["^unexpectedly None$", "^oops$",
                             "^unexpectedly None$",
                             "^unexpectedly None : oops$"])

    def testAssertIs(self):
        self.assertMessages('assertIs', (None, 'foo'),
                            ["^None is not 'foo'$", "^oops$",
                             "^None is not 'foo'$",
                             "^None is not 'foo' : oops$"])

    def testAssertIsNot(self):
        self.assertMessages('assertIsNot', (None, None),
                            ["^unexpectedly identical: None$", "^oops$",
                             "^unexpectedly identical: None$",
                             "^unexpectedly identical: None : oops$"])


class TestCleanUp(TestCase):

    def testCleanUp(self):
        class TestableTest(TestCase):
            def testNothing(self):
                pass

        test = TestableTest('testNothing')
        self.assertEqual(test._cleanups, [])

        cleanups = []

        def cleanup1(*args, **kwargs):
            cleanups.append((1, args, kwargs))

        def cleanup2(*args, **kwargs):
            cleanups.append((2, args, kwargs))

        test.addCleanup(cleanup1, 1, 2, 3, four='hello', five='goodbye')
        test.addCleanup(cleanup2)

        self.assertEqual(test._cleanups,
                         [(cleanup1, (1, 2, 3), dict(four='hello', five='goodbye')),
                          (cleanup2, (), {})])

        result = test.doCleanups()
        self.assertTrue(result)

        self.assertEqual(cleanups, [(2, (), {}), (1, (1, 2, 3),
                                                  dict(four='hello', five='goodbye'))])

    def testCleanUpWithErrors(self):
        class TestableTest(TestCase):
            def testNothing(self):
                pass

        class MockResult(object):
            errors = []
            def addError(self, test, exc_info):
                self.errors.append((test, exc_info))

        result = MockResult()
        test = TestableTest('testNothing')
        test._resultForDoCleanups = result

        exc1 = Exception('foo')
        exc2 = Exception('bar')

        def cleanup1():
            raise exc1

        def cleanup2():
            raise exc2

        test.addCleanup(cleanup1)
        test.addCleanup(cleanup2)

        self.assertFalse(test.doCleanups())

        (test1, (Type1, instance1, _)), (test2, (Type2, instance2, _)) = reversed(MockResult.errors)
        self.assertEqual((test1, Type1, instance1), (test, Exception, exc1))
        self.assertEqual((test2, Type2, instance2), (test, Exception, exc2))

    def testCleanupInRun(self):
        blowUp = False
        ordering = []

        class TestableTest(TestCase):
            def setUp(self):
                ordering.append('setUp')
                if blowUp:
                    raise Exception('foo')

            def testNothing(self):
                ordering.append('test')

            def tearDown(self):
                ordering.append('tearDown')

        test = TestableTest('testNothing')

        def cleanup1():
            ordering.append('cleanup1')

        def cleanup2():
            ordering.append('cleanup2')
        test.addCleanup(cleanup1)
        test.addCleanup(cleanup2)

        def success(some_test):
            self.assertEqual(some_test, test)
            ordering.append('success')

        result = unittest.TestResult()
        result.addSuccess = success

        test.run(result)
        self.assertEqual(ordering, ['setUp', 'test', 'tearDown',
                                    'cleanup2', 'cleanup1', 'success'])

        blowUp = True
        ordering = []
        test = TestableTest('testNothing')
        test.addCleanup(cleanup1)
        test.run(result)
        self.assertEqual(ordering, ['setUp', 'cleanup1'])


class Test_TestProgram(TestCase):

    # Horrible white box test
    def testNoExit(self):
        result = object()
        test = object()

        class FakeRunner(object):
            def run(self, test):
                self.test = test
                return result

        runner = FakeRunner()

        oldParseArgs = TestProgram.parseArgs
        def restoreParseArgs():
            TestProgram.parseArgs = oldParseArgs
        TestProgram.parseArgs = lambda *args: None
        self.addCleanup(restoreParseArgs)

        def removeTest():
            del TestProgram.test
        TestProgram.test = test
        self.addCleanup(removeTest)

        program = TestProgram(testRunner=runner, exit=False, verbosity=2)

        self.assertEqual(program.result, result)
        self.assertEqual(runner.test, test)
        self.assertEqual(program.verbosity, 2)

    class FooBar(unittest.TestCase):
        def testPass(self):
            assert True
        def testFail(self):
            assert False

    class FooBarLoader(unittest.TestLoader):
        """Test loader that returns a suite containing FooBar."""
        def loadTestsFromModule(self, module):
            return self.suiteClass(
                [self.loadTestsFromTestCase(Test_TestProgram.FooBar)])

    def test_NonExit(self):
        program = unittest.main(exit=False,
                                argv=["foobar"],
                                testRunner=unittest.TextTestRunner(stream=StringIO()),
                                testLoader=self.FooBarLoader())
        self.assertTrue(hasattr(program, 'result'))

    def test_Exit(self):
        self.assertRaises(
            SystemExit,
            unittest.main,
            argv=["foobar"],
            testRunner=unittest.TextTestRunner(stream=StringIO()),
            exit=True,
            testLoader=self.FooBarLoader())

    def test_ExitAsDefault(self):
        self.assertRaises(
            SystemExit,
            unittest.main,
            argv=["foobar"],
            testRunner=unittest.TextTestRunner(stream=StringIO()),
            testLoader=self.FooBarLoader())


class Test_TextTestRunner(TestCase):
    """Tests for TextTestRunner."""

    def test_works_with_result_without_startTestRun_stopTestRun(self):
        class OldTextResult(ResultWithNoStartTestRunStopTestRun):
            separator2 = ''
            def printErrors(self):
                pass

        class Runner(unittest.TextTestRunner):
            def __init__(self):
                super(Runner, self).__init__(StringIO())

            def _makeResult(self):
                return OldTextResult()

        runner = Runner()
        runner.run(unittest.TestSuite())

    def test_startTestRun_stopTestRun_called(self):
        class LoggingTextResult(LoggingResult):
            separator2 = ''
            def printErrors(self):
                pass

        class LoggingRunner(unittest.TextTestRunner):
            def __init__(self, events):
                super(LoggingRunner, self).__init__(StringIO())
                self._events = events

            def _makeResult(self):
                return LoggingTextResult(self._events)

        events = []
        runner = LoggingRunner(events)
        runner.run(unittest.TestSuite())
        expected = ['startTestRun', 'stopTestRun']
        self.assertEqual(events, expected)

    def test_pickle_unpickle(self):
        # Issue #7197: a TextTestRunner should be (un)pickleable. This is
        # required by test_multiprocessing under Windows (in verbose mode).
        import StringIO
        # cStringIO objects are not pickleable, but StringIO objects are.
        stream = StringIO.StringIO("foo")
        runner = unittest.TextTestRunner(stream)
        for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
            s = pickle.dumps(runner, protocol=protocol)
            obj = pickle.loads(s)
            # StringIO objects never compare equal, a cheap test instead.
            self.assertEqual(obj.stream.getvalue(), stream.getvalue())


class TestDiscovery(TestCase):

    # Heavily mocked tests so I can avoid hitting the filesystem
    def test_get_name_from_path(self):
        loader = unittest.TestLoader()

        loader._top_level_dir = '/foo'
        name = loader._get_name_from_path('/foo/bar/baz.py')
        self.assertEqual(name, 'bar.baz')

        if not __debug__:
            # asserts are off
            return

        self.assertRaises(AssertionError,
                          loader._get_name_from_path,
                          '/bar/baz.py')

    def test_find_tests(self):
        loader = unittest.TestLoader()

        original_listdir = os.listdir
        def restore_listdir():
            os.listdir = original_listdir
        original_isfile = os.path.isfile
        def restore_isfile():
            os.path.isfile = original_isfile
        original_isdir = os.path.isdir
        def restore_isdir():
            os.path.isdir = original_isdir

        path_lists = [['test1.py', 'test2.py', 'not_a_test.py', 'test_dir',
                       'test.foo', 'test-not-a-module.py', 'another_dir'],
                      ['test3.py', 'test4.py', ]]
        os.listdir = lambda path: path_lists.pop(0)
        self.addCleanup(restore_listdir)

        def isdir(path):
            return path.endswith('dir')
        os.path.isdir = isdir
        self.addCleanup(restore_isdir)

        def isfile(path):
            # another_dir is not a package and so shouldn't be recursed into
            return not path.endswith('dir') and not 'another_dir' in path
        os.path.isfile = isfile
        self.addCleanup(restore_isfile)

        loader._get_module_from_name = lambda path: path + ' module'
        loader.loadTestsFromModule = lambda module: module + ' tests'

        loader._top_level_dir = '/foo'
        suite = list(loader._find_tests('/foo', 'test*.py'))

        expected = [name + ' module tests' for name in ('test1', 'test2')]
        expected.extend([('test_dir.%s' % name) + ' module tests' for name in
                         ('test3', 'test4')])
        self.assertEqual(suite, expected)

    def test_find_tests_with_package(self):
        loader = unittest.TestLoader()

        original_listdir = os.listdir
        def restore_listdir():
            os.listdir = original_listdir
        original_isfile = os.path.isfile
        def restore_isfile():
            os.path.isfile = original_isfile
        original_isdir = os.path.isdir
        def restore_isdir():
            os.path.isdir = original_isdir

        directories = ['a_directory', 'test_directory', 'test_directory2']
        path_lists = [directories, [], [], []]
        os.listdir = lambda path: path_lists.pop(0)
        self.addCleanup(restore_listdir)

        os.path.isdir = lambda path: True
        self.addCleanup(restore_isdir)

        os.path.isfile = lambda path: os.path.basename(path) not in directories
        self.addCleanup(restore_isfile)

        class Module(object):
            paths = []
            load_tests_args = []

            def __init__(self, path):
                self.path = path
                self.paths.append(path)
                if os.path.basename(path) == 'test_directory':
                    def load_tests(loader, tests, pattern):
                        self.load_tests_args.append((loader, tests, pattern))
                        return 'load_tests'
                    self.load_tests = load_tests

            def __eq__(self, other):
                return self.path == other.path

        loader._get_module_from_name = lambda name: Module(name)
        def loadTestsFromModule(module, use_load_tests):
            if use_load_tests:
                raise self.failureException('use_load_tests should be False for packages')
            return module.path + ' module tests'
        loader.loadTestsFromModule = loadTestsFromModule

        loader._top_level_dir = '/foo'
        # this time no '.py' on the pattern so that it can match
        # a test package
        suite = list(loader._find_tests('/foo', 'test*'))

        # We should have loaded tests from the test_directory package by calling load_tests
        # and directly from the test_directory2 package
        self.assertEqual(suite,
                         ['load_tests', 'test_directory2' + ' module tests'])
        self.assertEqual(Module.paths, ['test_directory', 'test_directory2'])

        # load_tests should have been called once with loader, tests and pattern
        self.assertEqual(Module.load_tests_args,
                         [(loader, 'test_directory' + ' module tests', 'test*')])

    def test_discover(self):
        loader = unittest.TestLoader()

        original_isfile = os.path.isfile
        def restore_isfile():
            os.path.isfile = original_isfile

        os.path.isfile = lambda path: False
        self.addCleanup(restore_isfile)

        orig_sys_path = sys.path[:]
        def restore_path():
            sys.path[:] = orig_sys_path
        self.addCleanup(restore_path)

        full_path = os.path.abspath(os.path.normpath('/foo'))
        self.assertRaises(ImportError,
                          loader.discover, '/foo/bar', top_level_dir='/foo')

        self.assertEqual(loader._top_level_dir, full_path)
        self.assertIn(full_path, sys.path)

        os.path.isfile = lambda path: True
        _find_tests_args = []
        def test():
            pass
        tests = [test]
        def _find_tests(start_dir, pattern):
            _find_tests_args.append((start_dir, pattern))
            return [tests]
        loader._find_tests = _find_tests

        suite = loader.discover('/foo/bar/baz', 'pattern', '/foo/bar')

        top_level_dir = os.path.abspath(os.path.normpath('/foo/bar'))
        start_dir = os.path.abspath(os.path.normpath('/foo/bar/baz'))
        self.assertEqual(list(suite), tests)
        self.assertEqual(loader._top_level_dir, top_level_dir)
        self.assertEqual(_find_tests_args, [(start_dir, 'pattern')])
        self.assertIn(top_level_dir, sys.path)

    def test_discover_with_modules_that_fail_to_import(self):
        loader = unittest.TestLoader()

        listdir = os.listdir
        os.listdir = lambda _: ['test_this_does_not_exist.py']
        isfile = os.path.isfile
        os.path.isfile = lambda _: True
        orig_sys_path = sys.path[:]
        def restore():
            os.path.isfile = isfile
            os.listdir = listdir
            sys.path[:] = orig_sys_path
        self.addCleanup(restore)

        suite = loader.discover('.')
        self.assertIn(os.getcwd(), sys.path)
        self.assertEqual(suite.countTestCases(), 1)
        test = list(suite)[0]  # extract test from suite

        self.assertRaises(ImportError, test.test_this_does_not_exist)

    def test_command_line_handling_parseArgs(self):
        # Haha - take that uninstantiable class
        program = object.__new__(TestProgram)

        args = []
        def do_discovery(argv):
            args.extend(argv)
        program._do_discovery = do_discovery

        program.parseArgs(['something', 'discover'])
        self.assertEqual(args, [])

        program.parseArgs(['something', 'discover', 'foo', 'bar'])
        self.assertEqual(args, ['foo', 'bar'])

    def test_command_line_handling_do_discovery_too_many_arguments(self):
        class Stop(Exception):
            pass
        def usageExit():
            raise Stop

        program = object.__new__(TestProgram)
        program.usageExit = usageExit

        # too many args
        self.assertRaises(
            Stop,
            lambda: program._do_discovery(['one', 'two', 'three', 'four']))

    def test_command_line_handling_do_discovery_calls_loader(self):
        program = object.__new__(TestProgram)

        class Loader(object):
            args = []
            def discover(self, start_dir, pattern, top_level_dir):
                self.args.append((start_dir, pattern, top_level_dir))
                return 'tests'

        program._do_discovery(['-v'], Loader=Loader)
        self.assertEqual(program.verbosity, 2)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('.', 'test*.py', None)])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['--verbose'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('.', 'test*.py', None)])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery([], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('.', 'test*.py', None)])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['fish'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('fish', 'test*.py', None)])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['fish', 'eggs'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('fish', 'eggs', None)])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['fish', 'eggs', 'ham'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('fish', 'eggs', 'ham')])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['-s', 'fish'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('fish', 'test*.py', None)])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['-t', 'fish'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('.', 'test*.py', 'fish')])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['-p', 'fish'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('.', 'fish', None)])

        Loader.args = []
        program = object.__new__(TestProgram)
        program._do_discovery(['-p', 'eggs', '-s', 'fish', '-v'], Loader=Loader)
        self.assertEqual(program.test, 'tests')
        self.assertEqual(Loader.args, [('fish', 'eggs', None)])
        self.assertEqual(program.verbosity, 2)


######################################################################
## Main
######################################################################

if __name__ == "__main__":
    unittest.main()
mechanize-0.2.5/test/test_api.py0000644000175000017500000000041111545150644015301 0ustar johnjohn
import unittest


class ImportTests(unittest.TestCase):

    def test_import_all(self):
        # the following will raise an exception if __all__ contains undefined
        # classes
        from mechanize import *


if __name__ == "__main__":
    unittest.main()
mechanize-0.2.5/test/test_date.py0000644000175000017500000000673211545150644015461 0ustar johnjohn
"""Tests for ClientCookie._HTTPDate."""

import re, time
from unittest import TestCase


class DateTimeTests(TestCase):

    def test_time2isoz(self):
        from mechanize._util import time2isoz

        base = 1019227000
        day = 24*3600
        assert time2isoz(base) == "2002-04-19 14:36:40Z"
        assert time2isoz(base+day) == "2002-04-20 14:36:40Z"
        assert time2isoz(base+2*day) == "2002-04-21 14:36:40Z"
        assert time2isoz(base+3*day) == "2002-04-22 14:36:40Z"

        az = time2isoz()
        bz = time2isoz(500000)
        for text in (az, bz):
            assert re.search(r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\dZ$", text), \
                   "bad time2isoz format: %s %s" % (az, bz)

    def test_parse_date(self):
        from mechanize._util import http2time

        def parse_date(text, http2time=http2time):
            return time.gmtime(http2time(text))[:6]

        assert parse_date("01 Jan 2001") == (2001, 1, 1, 0, 0, 0.0)

        # this test will break around year 2070
        assert parse_date("03-Feb-20") == (2020, 2, 3, 0, 0, 0.0)

        # this test will break around year 2048
        assert parse_date("03-Feb-98") == (1998, 2, 3, 0, 0, 0.0)

    def test_http2time_formats(self):
        from mechanize._util import http2time, time2isoz

        # test http2time for supported dates.  Test cases with 2 digit year
        # will probably break in year 2044.
        tests = [
         'Thu, 03 Feb 1994 00:00:00 GMT',  # proposed new HTTP format
         'Thursday, 03-Feb-94 00:00:00 GMT',  # old rfc850 HTTP format
         'Thursday, 03-Feb-1994 00:00:00 GMT',  # broken rfc850 HTTP format

         '03 Feb 1994 00:00:00 GMT',  # HTTP format (no weekday)
         '03-Feb-94 00:00:00 GMT',  # old rfc850 (no weekday)
         '03-Feb-1994 00:00:00 GMT',  # broken rfc850 (no weekday)
         '03-Feb-1994 00:00 GMT',  # broken rfc850 (no weekday, no seconds)
         '03-Feb-1994 00:00',  # broken rfc850 (no weekday, no seconds, no tz)

         '03-Feb-94',  # old rfc850 HTTP format (no weekday, no time)
         '03-Feb-1994',  # broken rfc850 HTTP format (no weekday, no time)
         '03 Feb 1994',  # proposed new HTTP format (no weekday, no time)

         # A few tests with extra space at various places
         ' 03 Feb 1994 0:00 ',
         ' 03-Feb-1994 ',
        ]

        test_t = 760233600  # assume broken POSIX counting of seconds
        result = time2isoz(test_t)
        expected = "1994-02-03 00:00:00Z"
        assert result == expected, \
               "%s => '%s' (%s)" % (test_t, result, expected)

        for s in tests:
            t = http2time(s)
            t2 = http2time(s.lower())
            t3 = http2time(s.upper())

            assert t == t2 == t3 == test_t, \
                   "'%s' => %s, %s, %s (%s)" % (s, t, t2, t3, test_t)

    def test_http2time_garbage(self):
        from mechanize._util import http2time

        for test in ['',
                     'Garbage',
                     'Mandag 16. September 1996',
                     '01-00-1980',
                     '01-13-1980',
                     '00-01-1980',
                     '32-01-1980',
                     '01-01-1980 25:00:00',
                     '01-01-1980 00:61:00',
                     '01-01-1980 00:00:62']:
            bad = False
            if http2time(test) is not None:
                print "http2time(%s) is not None" % (test,)
                print "http2time(test)", http2time(test)
                bad = True
            assert not bad


if __name__ == "__main__":
    import unittest
    unittest.main()
mechanize-0.2.5/release.py0000644000175000017500000013013711545150644014143 0ustar johnjohn
"""%prog RELEASE_AREA [action ...]

Perform needed actions to release mechanize, doing the work in directory
RELEASE_AREA.

If no actions are given, print the tree of actions and do nothing.

This is only intended to work on Unix (unlike mechanize itself).  Some of it
only works on Ubuntu 10.04 (lucid).

Warning:

 * Many actions do rm -rf on RELEASE_AREA or subdirectories of RELEASE_AREA.

 * The install_deps action installs some debian packages system-wide.  The
clean action doesn't uninstall them.

 * The install_deps action adds a PPA.

 * The install_deps action downloads and installs software to RELEASE_AREA.
The clean action uninstalls (by rm -rf).
"""

# This script depends on the code from this git repository:
# git://github.com/jjlee/mechanize-build-tools.git

# TODO
#  * 0install package?
#  * test in a Windows VM

import glob
import optparse
import os
import re
import shutil
import smtplib
import subprocess
import sys
import tempfile
import time
import unittest

# Stop the test runner from reporting import failure if these modules aren't
# available or not running under Python >= 2.6.  AttributeError occurs if run
# with Python < 2.6, due to lack of collections.namedtuple
try:
    import email.mime.text
    import action_tree
    import cmd_env
    import buildtools.release as release
except (ImportError, AttributeError):
    # fake module
    class action_tree(object):
        @staticmethod
        def action_node(func):
            return func

# based on Mark Seaborn's plash build-tools (action_tree) and Cmed's in-chroot
# (cmd_env) -- which is also Mark's idea


class WrongVersionError(Exception):

    def __init__(self, version):
        Exception.__init__(self, version)
        self.version = version

    def __str__(self):
        return str(self.version)


class MissingVersionError(Exception):

    def __init__(self, path, release_version):
        Exception.__init__(self, path, release_version)
        self.path = path
        self.release_version = release_version

    def __str__(self):
        return ("Release version string not found in %s: should be %s" %
                (self.path, self.release_version))


class CSSValidationError(Exception):

    def __init__(self, path, details):
        Exception.__init__(self, path, details)
        self.path = path
        self.details = details

    def __str__(self):
        return ("CSS validation of %s failed:\n%s" %
                (self.path, self.details))


def run_performance_tests(path):
    # TODO: use a better/standard test runner
    sys.path.insert(0, os.path.join(path, "test"))
    test_runner = unittest.TextTestRunner(verbosity=1)
    test_loader = unittest.defaultTestLoader
    modules = []
    for module_name in ["test_performance"]:
        module = __import__(module_name)
        for part in module_name.split('.')[1:]:
            module = getattr(module, part)
        modules.append(module)
    suite = unittest.TestSuite()
    for module in modules:
        test = test_loader.loadTestsFromModule(module)
        suite.addTest(test)
    result = test_runner.run(test)
    return result


def send_email(from_address, to_address, subject, body):
    msg = email.mime.text.MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = from_address
    msg['To'] = to_address
#     print "from_address %r" % from_address
#     print "to_address %r" % to_address
#     print "msg.as_string():\n%s" % 
msg.as_string() s = smtplib.SMTP() s.connect() s.sendmail(from_address, [to_address], msg.as_string()) s.quit() def is_git_repository(path): return os.path.exists(os.path.join(path, ".git")) def ensure_unmodified(env, path): # raise if working tree differs from HEAD release.CwdEnv(env, path).cmd(["git", "diff", "--exit-code", "HEAD"]) def add_to_path_cmd(value): set_path_script = """\ if [ -n "$PATH" ] then export PATH="$PATH":%(value)s else export PATH=%(value)s fi exec "$@" """ % dict(value=value) return ["sh", "-c", set_path_script, "inline_script"] def clean_environ_env(env): return cmd_env.PrefixCmdEnv( ["sh", "-c", 'env -i HOME="$HOME" PATH="$PATH" "$@"', "clean_environ_env"], env) def ensure_trailing_slash(path): return path.rstrip("/") + "/" def clean_dir(env, path): env.cmd(release.rm_rf_cmd(path)) env.cmd(["mkdir", "-p", path]) def check_version_equals(env, version, python): try: output = release.get_cmd_stdout( env, [python, "-c", "import mechanize; print mechanize.__version__"], stderr=subprocess.PIPE) except cmd_env.CommandFailedError: raise WrongVersionError(None) else: version_tuple_string = output.strip() assert len(version.tuple) == 6, len(version.tuple) if not(version_tuple_string == str(version.tuple) or version_tuple_string == str(version.tuple[:-1])): raise WrongVersionError(version_tuple_string) def check_not_installed(env, python): bogus_version = release.parse_version("0.0.0") try: check_version_equals(env, bogus_version, python) except WrongVersionError, exc: if exc.version is not None: raise else: raise WrongVersionError(bogus_version) class EasyInstallTester(object): def __init__(self, env, install_dir, project_name, test_cmd, expected_version, easy_install_cmd=("easy_install",), python="python"): self._env = env self._install_dir = install_dir self._project_name = project_name self._test_cmd = test_cmd self._expected_version = expected_version self._easy_install_cmd = list(easy_install_cmd) self._python = python 
self._install_dir_on_pythonpath = cmd_env.set_environ_vars_env( [("PYTHONPATH", self._install_dir)], env) def easy_install(self, log): clean_dir(self._env, self._install_dir) check_not_installed(self._install_dir_on_pythonpath, self._python) output = release.get_cmd_stdout( self._install_dir_on_pythonpath, self._easy_install_cmd + ["-d", self._install_dir, self._project_name]) # easy_install doesn't fail properly :-( if "SyntaxError" in output: raise Exception(output) check_version_equals(self._install_dir_on_pythonpath, self._expected_version, self._python) def test(self, log): self._install_dir_on_pythonpath.cmd(self._test_cmd) @action_tree.action_node def easy_install_test(self): return [ self.easy_install, self.test, ] def make_source_dist_easy_install_test_step(env, install_dir, source_dir, test_cmd, expected_version, python_version): python = "python%d.%d" % python_version tester = EasyInstallTester( env, install_dir, project_name=".", test_cmd=test_cmd, expected_version=expected_version, easy_install_cmd=(cmd_env.in_dir(source_dir) + [python, "setup.py", "easy_install"]), python=python) return tester.easy_install_test def make_pypi_easy_install_test_step(env, install_dir, test_cmd, expected_version, python_version): easy_install = "easy_install-%d.%d" % python_version python = "python%d.%d" % python_version tester = EasyInstallTester( env, install_dir, project_name="mechanize", test_cmd=test_cmd, expected_version=expected_version, easy_install_cmd=[easy_install], python=python) return tester.easy_install_test def make_tarball_easy_install_test_step(env, install_dir, tarball_path, test_cmd, expected_version, python_version): easy_install = "easy_install-%d.%d" % python_version python = "python%d.%d" % python_version tester = EasyInstallTester( env, install_dir, project_name=tarball_path, test_cmd=test_cmd, expected_version=expected_version, easy_install_cmd=[easy_install], python=python) return tester.easy_install_test class Releaser(object): def 
__init__(self, env, git_repository_path, release_area, mirror_path, build_tools_repo_path=None, run_in_repository=False, tag_name=None, test_uri=None): self._release_area = release_area self._release_dir = release_dir = os.path.join(release_area, "release") self._opt_dir = os.path.join(release_dir, "opt") self._bin_dir = os.path.join(self._opt_dir, "bin") AddToPathEnv = release.make_env_maker(add_to_path_cmd) self._env = AddToPathEnv(release.GitPagerWrapper(env), self._bin_dir) self._source_repo_path = git_repository_path self._in_source_repo = release.CwdEnv(self._env, self._source_repo_path) self._tag_name = tag_name self._set_next_release_version() self._clone_path = os.path.join(release_dir, "clone") self._in_clone = release.CwdEnv(self._env, self._clone_path) if run_in_repository: self._in_repo = self._in_source_repo self._repo_path = self._source_repo_path else: self._in_repo = self._in_clone self._repo_path = self._clone_path self._docs_dir = os.path.join(self._repo_path, "docs") self._in_docs_dir = release.CwdEnv(self._env, self._docs_dir) self._in_release_dir = release.CwdEnv(self._env, self._release_dir) self._build_tools_path = build_tools_repo_path if self._build_tools_path is not None: self._website_source_path = os.path.join(self._build_tools_path, "website") self._mirror_path = mirror_path self._in_mirror = release.CwdEnv(self._env, self._mirror_path) self._css_validator_path = "css_validator" self._test_uri = test_uri self._test_deps_dir = os.path.join(release_dir, "test_deps") self._easy_install_test_dir = os.path.join(release_dir, "easy_install_test") self._in_easy_install_dir = release.CwdEnv(self._env, self._easy_install_test_dir) # prevent anything other than functional test dependencies being on # sys.path due to cwd or PYTHONPATH self._easy_install_env = clean_environ_env( release.CwdEnv(env, self._test_deps_dir)) self._zope_testbrowser_dir = os.path.join(release_dir, "zope_testbrowser_test") def _mkdtemp(self): temp_dir = 
tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__) def tear_down(): shutil.rmtree(temp_dir) return temp_dir, tear_down def _get_next_release_version(self): # --pretend / git not installed most_recent, next = "dummy version", "dummy version" try: tags = release.get_cmd_stdout(self._in_source_repo, ["git", "tag", "-l"]).split() except cmd_env.CommandFailedError: pass else: versions = [release.parse_version(tag) for tag in tags] if versions: most_recent = max(versions) next = most_recent.next_version() return most_recent, next def _set_next_release_version(self): self._previous_version, self._release_version = \ self._get_next_release_version() if self._tag_name is not None: self._release_version = release.parse_version(self._tag_name) self._source_distributions = self._get_source_distributions( self._release_version) def _get_source_distributions(self, version): def dist_basename(version, format): return "mechanize-%s.%s" % (version, format) return set([dist_basename(version, "zip"), dist_basename(version, "tar.gz")]) def git_fetch(self, log): # for tags self._in_source_repo.cmd(["git", "fetch"]) self._set_next_release_version() def print_next_tag(self, log): print self._release_version def _verify_version(self, path): if str(self._release_version) not in \ release.read_file_from_env(self._in_repo, path): raise MissingVersionError(path, self._release_version) def _verify_versions(self): for path in ["ChangeLog", "mechanize/_version.py"]: self._verify_version(path) def clone(self, log): self._env.cmd(["git", "clone", self._source_repo_path, self._clone_path]) def checks(self, log): self._verify_versions() def _ensure_installed(self, package_name, ppa): release.ensure_installed(self._env, cmd_env.PrefixCmdEnv(["sudo"], self._env), package_name, ppa=ppa) def install_css_validator_in_release_area(self, log): jar_dir = os.path.join(self._release_area, self._css_validator_path) clean_dir(self._env, jar_dir) in_jar_dir = release.CwdEnv(self._env, jar_dir) 
in_jar_dir.cmd([ "wget", "http://www.w3.org/QA/Tools/css-validator/css-validator.jar"]) in_jar_dir.cmd(["wget", "http://jigsaw.w3.org/Distrib/jigsaw_2.2.6.tar.bz2"]) in_jar_dir.cmd(["sh", "-c", "tar xf jigsaw_*.tar.bz2"]) in_jar_dir.cmd(["ln", "-s", "Jigsaw/classes/jigsaw.jar"]) @action_tree.action_node def install_deps(self): dependency_actions = [] standard_dependency_actions = [] def add_dependency(package_name, ppa=None): if ppa is None: actions = standard_dependency_actions else: actions = dependency_actions actions.append( (package_name.replace(".", ""), lambda log: self._ensure_installed(package_name, ppa))) add_dependency("python2.6") # required, but ubuntu doesn't have them any more :-( I installed these # (and zope.interface and twisted SVN trunk) by hand # add_dependency("python2.4"), # add_dependency("python2.5") # add_dependency("python2.7") add_dependency("python-setuptools") add_dependency("git-core") # for running zope_testbrowser tests add_dependency("python-virtualenv") add_dependency("python2.6-dev") # for deployment to SF and local collation of files for release add_dependency("rsync") # for running functional tests against local web server add_dependency("python-twisted-web2") # for generating .html docs from .txt markdown files add_dependency("pandoc") # for generating docs from .in templates add_dependency("python-empy") # for post-processing generated HTML add_dependency("python-lxml") # for the validate command add_dependency("wdg-html-validator") # for collecting code coverage data and generating coverage reports # no 64 bit .deb ATM # add_dependency("python-figleaf", ppa="jjl/figleaf") # for css validator add_dependency("default-jre") add_dependency("libcommons-collections3-java") add_dependency("libcommons-lang-java") add_dependency("libxerces2-java") add_dependency("libtagsoup-java") # OMG, it depends on piles of java web server stuff, even for local # command-line validation. You're doing it wrong! 
add_dependency("velocity") dependency_actions.append(self.install_css_validator_in_release_area) dependency_actions.insert(0, action_tree.make_node( standard_dependency_actions, "standard_dependencies")) return dependency_actions def copy_test_dependencies(self, log): # so test.py can be run without the mechanize alongside it being on # sys.path # TODO: move mechanize package into a top-level directory, so it's not # automatically on sys.path def copy_in(src): self._env.cmd(["cp", "-r", src, self._test_deps_dir]) clean_dir(self._env, self._test_deps_dir) copy_in(os.path.join(self._repo_path, "test.py")) copy_in(os.path.join(self._repo_path, "test")) copy_in(os.path.join(self._repo_path, "test-tools")) copy_in(os.path.join(self._repo_path, "examples")) def _make_test_cmd(self, python_version, local_server=True, uri=None, coverage=False): python = "python%d.%d" % python_version if coverage: # python-figleaf only supports Python 2.6 ATM assert python_version == (2, 6), python_version python = "figleaf" test_cmd = [python, "test.py"] if not local_server: test_cmd.append("--no-local-server") # running against wwwsearch.sourceforge.net is slow, want to # see where it failed test_cmd.append("-v") if coverage: # TODO: Fix figleaf traceback with doctests test_cmd.append("--skip-doctests") if uri is not None: test_cmd.extend(["--uri", uri]) return test_cmd def performance_test(self, log): result = run_performance_tests(self._repo_path) if not result.wasSuccessful(): raise Exception("performance tests failed") def clean_coverage(self, log): self._in_repo.cmd(["rm", "-f", ".figleaf"]) self._in_repo.cmd(release.rm_rf_cmd("html")) def _make_test_step(self, env, **kwds): test_cmd = self._make_test_cmd(**kwds) def test_step(log): env.cmd(test_cmd) return test_step def _make_easy_install_test_cmd(self, **kwds): test_cmd = self._make_test_cmd(**kwds) test_cmd.extend(["discover", "--start-directory", self._test_deps_dir]) return test_cmd def 
_make_source_dist_easy_install_test_step(self, env, **kwds): test_cmd = self._make_easy_install_test_cmd(**kwds) return make_source_dist_easy_install_test_step( self._easy_install_env, self._easy_install_test_dir, self._repo_path, test_cmd, self._release_version, kwds["python_version"]) def _make_pypi_easy_install_test_step(self, env, **kwds): test_cmd = self._make_easy_install_test_cmd(**kwds) return make_pypi_easy_install_test_step( self._easy_install_env, self._easy_install_test_dir, test_cmd, self._release_version, kwds["python_version"]) def _make_tarball_easy_install_test_step(self, env, **kwds): test_cmd = self._make_easy_install_test_cmd(**kwds) [tarball] = list(d for d in self._source_distributions if d.endswith(".tar.gz")) return make_tarball_easy_install_test_step( self._easy_install_env, self._easy_install_test_dir, os.path.abspath(os.path.join(self._repo_path, "dist", tarball)), test_cmd, self._release_version, kwds["python_version"]) def _make_unpacked_tarball_test_step(self, env, **kwds): # This catches mistakes in listing test files in MANIFEST.in (the tests # don't get installed, so these don't get caught by testing installed # code). 
test_cmd = self._make_test_cmd(**kwds) [tarball] = list(d for d in self._source_distributions if d.endswith(".tar.gz")) tarball_path = os.path.abspath( os.path.join(self._repo_path, "dist", tarball)) def test_step(log): target_dir, tear_down = self._mkdtemp() try: env.cmd(["tar", "-C", target_dir, "-xf", tarball_path]) [source_dir] = glob.glob( os.path.join(target_dir, "mechanize-*")) test_env = clean_environ_env(release.CwdEnv(env, source_dir)) test_env.cmd(test_cmd) finally: tear_down() return test_step @action_tree.action_node def test(self): r = [] r.append(("python27_test", self._make_test_step(self._in_repo, python_version=(2, 7)))) r.append(("python27_easy_install_test", self._make_source_dist_easy_install_test_step( self._in_repo, python_version=(2, 7)))) r.append(("python26_test", self._make_test_step(self._in_repo, python_version=(2, 6)))) # disabled for the moment -- think I probably built the launchpad .deb # from wrong branch, without bug fixes # r.append(("python26_coverage", # self._make_test_step(self._in_repo, python_version=(2, 6), # coverage=True))) r.append(("python25_easy_install_test", self._make_source_dist_easy_install_test_step( self._in_repo, python_version=(2, 5)))) r.append(("python24_easy_install_test", self._make_source_dist_easy_install_test_step( self._in_repo, python_version=(2, 4)))) r.append(self.performance_test) return r def make_coverage_html(self, log): self._in_repo.cmd(["figleaf2html"]) def tag(self, log): self._in_repo.cmd(["git", "checkout", "master"]) self._in_repo.cmd(["git", "tag", "-m", "Tagging release %s" % self._release_version, str(self._release_version)]) def clean_docs(self, log): self._in_docs_dir.cmd(release.rm_rf_cmd("html")) def make_docs(self, log): self._in_docs_dir.cmd(["mkdir", "-p", "html"]) site_map = release.site_map() def pandoc(filename, source_filename): last_modified = release.last_modified(source_filename, self._in_docs_dir) if filename == "download.txt": last_modified = time.gmtime() variables = 
[ ("last_modified_iso", time.strftime("%Y-%m-%d", last_modified)), ("last_modified_month_year", time.strftime("%B %Y", last_modified))] page_name = os.path.splitext(os.path.basename(filename))[0] variables.append(("nav", release.nav_html(site_map, page_name))) variables.append(("subnav", release.subnav_html(site_map, page_name))) release.pandoc(self._in_docs_dir, filename, variables=variables) release.empy(self._in_docs_dir, "forms.txt.in") release.empy(self._in_docs_dir, "download.txt.in", defines=["version=%r" % str(self._release_version)]) for page in site_map.iter_pages(): if page.name in ["Root", "Changelog"]: continue source_filename = filename = page.name + ".txt" if page.name in ["forms", "download"]: source_filename += ".in" pandoc(filename, source_filename) self._in_repo.cmd(["cp", "-r", "ChangeLog", "docs/html/ChangeLog.txt"]) if self._build_tools_path is not None: styles = ensure_trailing_slash( os.path.join(self._website_source_path, "styles")) self._env.cmd(["rsync", "-a", styles, os.path.join(self._docs_dir, "styles")]) def setup_py_sdist(self, log): self._in_repo.cmd(release.rm_rf_cmd("dist")) # write empty setup.cfg so source distribution is built using a version # number without ".dev" and today's date appended self._in_repo.cmd(cmd_env.write_file_cmd("setup.cfg", "")) self._in_repo.cmd(["python", "setup.py", "sdist", "--formats=gztar,zip"]) archives = set(os.listdir(os.path.join(self._repo_path, "dist"))) assert archives == self._source_distributions, \ (archives, self._source_distributions) @action_tree.action_node def build_sdist(self): return [ self.clean_docs, self.make_docs, self.setup_py_sdist, ] def _stage(self, path, dest_dir, dest_basename=None, source_base_path=None): # IIRC not using rsync because didn't see easy way to avoid updating # timestamp of unchanged files, which was upsetting git # note: files in the website repository that are no longer generated # must be manually deleted from the repository if source_base_path is None: 
source_base_path = self._repo_path full_path = os.path.join(source_base_path, path) try: self._env.cmd(["readlink", "-e", full_path], stdout=open(os.devnull, "w")) except cmd_env.CommandFailedError: print "not staging (does not exist):", full_path return if dest_basename is None: dest_basename = os.path.basename(path) dest = os.path.join(self._mirror_path, dest_dir, dest_basename) try: self._env.cmd(["cmp", full_path, dest]) except cmd_env.CommandFailedError: print "staging: %s -> %s" % (full_path, dest) self._env.cmd(["cp", full_path, dest]) else: print "not staging (unchanged): %s -> %s" % (full_path, dest) def ensure_unmodified(self, log): if self._build_tools_path: ensure_unmodified(self._env, self._website_source_path) ensure_unmodified(self._env, self._mirror_path) def _stage_flat_dir(self, path, dest): self._env.cmd(["mkdir", "-p", os.path.join(self._mirror_path, dest)]) for filename in os.listdir(path): self._stage(os.path.join(path, filename), dest) def _symlink_flat_dir(self, path, exclude): for filename in os.listdir(path): if filename in exclude: continue link_dir = os.path.dirname(path) target = os.path.relpath(os.path.join(path, filename), link_dir) link_path = os.path.join(link_dir, filename) if not os.path.islink(link_path) or \ os.path.realpath(link_path) != target: self._env.cmd(["ln", "-f", "-s", "-t", link_dir, target]) def collate_from_mechanize(self, log): html_dir = os.path.join(self._docs_dir, "html") self._stage_flat_dir(html_dir, "htdocs/mechanize/docs") self._symlink_flat_dir( os.path.join(self._mirror_path, "htdocs/mechanize/docs"), exclude=[".git", ".htaccess", ".svn", "CVS"]) self._stage("test-tools/cookietest.cgi", "cgi-bin") self._stage("examples/forms/echo.cgi", "cgi-bin") self._stage("examples/forms/example.html", "htdocs/mechanize") for archive in self._source_distributions: placeholder = os.path.join("htdocs/mechanize/src", archive) self._in_mirror.cmd(["touch", placeholder]) def collate_from_build_tools(self, log): 
self._stage(os.path.join(self._website_source_path, "frontpage.html"), "htdocs", "index.html") self._stage_flat_dir( os.path.join(self._website_source_path, "styles"), "htdocs/styles") @action_tree.action_node def collate(self): r = [self.collate_from_mechanize] if self._build_tools_path is not None: r.append(self.collate_from_build_tools) return r def collate_pypi_upload_built_items(self, log): for archive in self._source_distributions: self._stage(os.path.join("dist", archive), "htdocs/mechanize/src") def commit_staging_website(self, log): self._in_mirror.cmd(["git", "add", "--all"]) self._in_mirror.cmd( ["git", "commit", "-m", "Automated update for release %s" % self._release_version]) def validate_html(self, log): exclusions = set(f for f in """\ ./cookietest.html htdocs/basic_auth/index.html htdocs/digest_auth/index.html htdocs/mechanize/example.html htdocs/test_fixtures/index.html htdocs/test_fixtures/mechanize_reload_test.html htdocs/test_fixtures/referertest.html """.splitlines() if not f.startswith("#")) for dirpath, dirnames, filenames in os.walk(self._mirror_path): try: # archived website dirnames.remove("old") except ValueError: pass for filename in filenames: if filename.endswith(".html"): page_path = os.path.join( os.path.relpath(dirpath, self._mirror_path), filename) if page_path not in exclusions: self._in_mirror.cmd(["validate", page_path]) def _classpath_cmd(self): from_packages = ["/usr/share/java/commons-collections3.jar", "/usr/share/java/commons-lang.jar", "/usr/share/java/xercesImpl.jar", "/usr/share/java/tagsoup.jar", "/usr/share/java/velocity.jar", ] jar_dir = os.path.join(self._release_area, self._css_validator_path) local = glob.glob(os.path.join(jar_dir, "*.jar")) path = ":".join(local + from_packages) return ["env", "CLASSPATH=%s" % path] def _sanitise_css(self, path): temp_dir, tear_down = self._mkdtemp() temp_path = os.path.join(temp_dir, os.path.basename(path)) temp = open(temp_path, "w") try: for line in open(path): if 
line.rstrip().endswith("/*novalidate*/"): # temp.write("/*%s*/\n" % line.rstrip()) temp.write("/*sanitised*/\n") else: temp.write(line) finally: temp.close() return temp_path, tear_down def validate_css(self, log): env = cmd_env.PrefixCmdEnv(self._classpath_cmd(), self._in_release_dir) # env.cmd(["java", "org.w3c.css.css.CssValidator", "--help"]) """ Usage: java org.w3c.css.css.CssValidator [OPTIONS] | [URL]* OPTIONS -p, --printCSS Prints the validated CSS (only with text output, the CSS is printed with other outputs) -profile PROFILE, --profile=PROFILE Checks the Stylesheet against PROFILE Possible values for PROFILE are css1, css2, css21 (default), css3, svg, svgbasic, svgtiny, atsc-tv, mobile, tv -medium MEDIUM, --medium=MEDIUM Checks the Stylesheet using the medium MEDIUM Possible values for MEDIUM are all (default), aural, braille, embossed, handheld, print, projection, screen, tty, tv, presentation -output OUTPUT, --output=OUTPUT Prints the result in the selected format Possible values for OUTPUT are text (default), xhtml, html (same result as xhtml), soap12 -lang LANG, --lang=LANG Prints the result in the specified language Possible values for LANG are de, en (default), es, fr, ja, ko, nl, zh-cn, pl, it -warning WARN, --warning=WARN Warnings verbosity level Possible values for WARN are -1 (no warning), 0, 1, 2 (default, all the warnings URL URL can either represent a distant web resource (http://) or a local file (file:/) """ validate_cmd = ["java", "org.w3c.css.css.CssValidator"] for dirpath, dirnames, filenames in os.walk(self._mirror_path): for filename in filenames: if filename.endswith(".css"): path = os.path.join(dirpath, filename) temp_path, tear_down = self._sanitise_css(path) try: page_url = "file://" + temp_path output = release.get_cmd_stdout( env, validate_cmd + [page_url]) finally: tear_down() # the validator doesn't fail properly: it exits # successfully on validation failure if "Sorry! 
We found the following errors" in output: raise CSSValidationError(path, output) def fetch_zope_testbrowser(self, log): clean_dir(self._env, self._zope_testbrowser_dir) in_testbrowser = release.CwdEnv(self._env, self._zope_testbrowser_dir) in_testbrowser.cmd(["easy_install", "--editable", "--build-directory", ".", "zope.testbrowser[test]"]) in_testbrowser.cmd( ["virtualenv", "--no-site-packages", "zope.testbrowser"]) project_dir = os.path.join(self._zope_testbrowser_dir, "zope.testbrowser") in_project_dir = clean_environ_env( release.CwdEnv(self._env, project_dir)) check_not_installed(in_project_dir, "bin/python") in_project_dir.cmd( ["sed", "-i", "-e", "s/mechanize[^\"']*/mechanize/", "setup.py"]) in_project_dir.cmd(["bin/easy_install", "zc.buildout"]) in_project_dir.cmd(["bin/buildout", "init"]) [mechanize_tarball] = list(d for d in self._source_distributions if d.endswith(".tar.gz")) tarball_path = os.path.join(self._repo_path, "dist", mechanize_tarball) in_project_dir.cmd(["bin/easy_install", tarball_path]) in_project_dir.cmd(["bin/buildout", "install"]) def test_zope_testbrowser(self, log): project_dir = os.path.join(self._zope_testbrowser_dir, "zope.testbrowser") env = clean_environ_env(release.CwdEnv(self._env, project_dir)) check_version_equals(env, self._release_version, "bin/python") env.cmd(["bin/test"]) @action_tree.action_node def zope_testbrowser(self): return [self.fetch_zope_testbrowser, self.test_zope_testbrowser, ] def upload_to_pypi(self, log): self._in_repo.cmd(["python", "setup.py", "sdist", "--formats=gztar,zip", "upload"]) def sync_to_sf(self, log): assert os.path.isdir( os.path.join(self._mirror_path, "htdocs/mechanize")) self._env.cmd(["rsync", "-rlptvuz", "--exclude", "*~", "--delete", ensure_trailing_slash(self._mirror_path), "jjlee,wwwsearch@web.sourceforge.net:"]) @action_tree.action_node def upload(self): r = [] r.append(self.upload_to_pypi) # setup.py upload requires sdist command to upload zip files, and the # sdist comment insists 
on rebuilding source distributions, so it's not # possible to use the upload command to upload the already-built zip # file. Work around that by copying the rebuilt source distributions # into website repository only now (rather than at build/test time), so # don't end up with two different sets of source distributions with # different md5 sums due to timestamps in the archives. r.append(self.collate_pypi_upload_built_items) r.append(self.commit_staging_website) if self._mirror_path is not None: r.append(self.sync_to_sf) return r def clean(self, log): clean_dir(self._env, self._release_area) def clean_most(self, log): # not dependencies installed in release area (css validator) clean_dir(self._env, self._release_dir) def write_email(self, log): log = release.get_cmd_stdout(self._in_repo, ["git", "log", '--pretty=format: * %s', "%s..HEAD" % self._previous_version]) # filter out some uninteresting commits log = "".join(line for line in log.splitlines(True) if not re.match("^ \* Update (?:changelog|version)$", line, re.I)) self._in_release_dir.cmd(cmd_env.write_file_cmd( "announce_email.txt", u"""\ ANN: mechanize {version} released http://wwwsearch.sourceforge.net/mechanize/ This is a stable bugfix release. Changes since {previous_version}: {log} About mechanize ============================================= Requires Python 2.4, 2.5, 2.6, or 2.7. Stateful programmatic web browsing, after Andy Lester's Perl module WWW::Mechanize. 
Example:

import re
from mechanize import Browser

b = Browser()
b.open("http://www.example.com/")
# follow second link with element text matching regular expression
response = b.follow_link(text_regex=re.compile(r"cheese\s*shop"), nr=1)

b.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm
b["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
response2 = b.submit()  # submit current form

response3 = b.back()  # back to cheese shop
response4 = b.reload()

for form in b.forms():
    print form
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in b.links(url_regex=re.compile("python.org")):
    print link
    b.follow_link(link)  # can be EITHER Link instance OR keyword args
    b.back()


John
""".format(log=log, version=self._release_version,
           previous_version=self._previous_version)))

    def edit_email(self, log):
        self._in_release_dir.cmd(["sensible-editor", "announce_email.txt"])

    def push_tag(self, log):
        self._in_repo.cmd(["git", "push",
                           "git@github.com:jjlee/mechanize.git",
                           "tag", str(self._release_version)])

    def send_email(self, log):
        text = release.read_file_from_env(self._in_release_dir,
                                          "announce_email.txt")
        subject, sep, body = text.partition("\n")
        body = body.lstrip()
        assert len(body) > 0, body
        send_email(from_address="John J Lee ",
                   to_address="wwwsearch-general@lists.sourceforge.net",
                   subject=subject,
                   body=body)

    @action_tree.action_node
    def build(self):
        return [
            self.clean,
            self.install_deps,
            self.clean_most,
            self.git_fetch,
            self.print_next_tag,
            self.clone,
            self.checks,
            # self.clean_coverage,
            self.copy_test_dependencies,
            self.test,
            # self.make_coverage_html,
            self.tag,
            self.build_sdist,
            ("unpacked_tarball_test",
             self._make_unpacked_tarball_test_step(
                    self._env, python_version=(2, 6))),
            ("easy_install_test",
             self._make_tarball_easy_install_test_step(
                    self._in_repo, python_version=(2, 6),
                    local_server=False, uri=self._test_uri)),
            self.zope_testbrowser,
self.write_email, self.edit_email, ] def update_version(self, log): version_path = "mechanize/_version.py" template = """\ "%(text)s" __version__ = %(tuple)s """ old_text = release.read_file_from_env(self._in_source_repo, version_path) old_version = old_text.splitlines()[0].strip(' "') assert old_version == str(self._release_version), \ (old_version, str(self._release_version)) def version_text(version): return template % {"text": str(version), "tuple": repr(tuple(version.tuple[:-1]))} assert old_text == version_text(release.parse_version(old_version)), \ (old_text, version_text(release.parse_version(old_version))) self._in_source_repo.cmd(cmd_env.write_file_cmd( version_path, version_text(self._release_version.next_version()))) self._in_source_repo.cmd(["git", "commit", "-m", "Update version", version_path]) @action_tree.action_node def update_staging_website(self): if self._mirror_path is None: return [] return [ self.ensure_unmodified, self.collate, self.validate_html, self.validate_css, self.commit_staging_website, ] @action_tree.action_node def tell_the_world(self): return [ self.push_tag, self.upload, ("easy_install_test_internet", self._make_pypi_easy_install_test_step( self._in_repo, python_version=(2, 6), local_server=False, uri="http://wwwsearch.sourceforge.net/")), self.send_email, ] @action_tree.action_node def all(self): return [ self.build, self.update_staging_website, self.update_version, self.tell_the_world, ] def parse_options(args): parser = optparse.OptionParser(usage=__doc__.strip()) release.add_basic_env_options(parser) action_tree.add_options(parser) parser.add_option("--mechanize-repository", metavar="DIRECTORY", dest="git_repository_path", help="path to mechanize git repository (default is cwd)") parser.add_option("--build-tools-repository", metavar="DIRECTORY", help=("path of mechanize-build-tools git repository, " "from which to get other website source files " "(default is not to build those files)")) 
parser.add_option("--website-repository", metavar="DIRECTORY", dest="mirror_path", help=("path of local website mirror git repository into " "which built files will be copied (default is not " "to copy the files)")) parser.add_option("--in-source-repository", action="store_true", dest="in_repository", help=("run all commands in original repository " "(specified by --git-repository), rather than in " "the clone of it in the release area")) parser.add_option("--tag-name", metavar="TAG_NAME") parser.add_option("--uri", default="http://wwwsearch.sourceforge.net/", help=("base URI to run tests against when not using a " "built-in web server")) options, remaining_args = parser.parse_args(args) nr_args = len(remaining_args) try: options.release_area = remaining_args.pop(0) except IndexError: parser.error("Expected at least 1 argument, got %d" % nr_args) if options.git_repository_path is None: options.git_repository_path = os.getcwd() if not is_git_repository(options.git_repository_path): parser.error("incorrect git repository path") if options.build_tools_repository is not None and \ not is_git_repository(options.build_tools_repository): parser.error("incorrect mechanize-build-tools repository path") mirror_path = options.mirror_path if mirror_path is not None: if not is_git_repository(options.mirror_path): parser.error("mirror path is not a git reporsitory") mirror_path = os.path.join(mirror_path, "mirror") if not os.path.isdir(mirror_path): parser.error("%r does not exist" % mirror_path) options.mirror_path = mirror_path return options, remaining_args def main(argv): if not hasattr(action_tree, "action_main"): sys.exit("failed to import required modules") options, action_tree_args = parse_options(argv[1:]) env = release.get_env_from_options(options) releaser = Releaser(env, options.git_repository_path, options.release_area, options.mirror_path, options.build_tools_repository, options.in_repository, options.tag_name, options.uri) action_tree.action_main_(releaser.all, 
options, action_tree_args) if __name__ == "__main__": main(sys.argv) mechanize-0.2.5/test.py0000755000175000017500000000276611545150644013513 0ustar johnjohn#!/usr/bin/env python """ Note that the functional tests and doctests require test-tools to be on sys.path before the stdlib. One way to ensure that is to use this script to run tests. """ import os import sys def mutate_sys_path(): this_dir = os.path.dirname(__file__) sys.path.insert(0, os.path.join(this_dir, "test")) sys.path.insert(0, os.path.join(this_dir, "test-tools")) def main(argv): # test-tools/ dir includes a bundled Python 2.5 doctest / linecache, and a # bundled & modified Python trunk (2.7 vintage) unittest. This is only for # testing purposes, and these don't get installed. # unittest revision 77209, modified (probably I should have used PyPI # project discover, which is already backported to 2.4, but since I've # already done that and made changes, I won't bother for now) # doctest.py revision 45701 and linecache.py revision 45940. Since # linecache is used by Python itself, linecache.py is renamed # linecache_copy.py, and this copy of doctest is modified (only) to use # that renamed module. mutate_sys_path() assert "doctest" not in sys.modules import testprogram # *.py to catch doctests in docstrings this_dir = os.path.dirname(__file__) prog = testprogram.TestProgram( argv=argv, default_discovery_args=(this_dir, "*.py", None), module=None) result = prog.runTests() success = result.wasSuccessful() sys.exit(int(not success)) if __name__ == "__main__": main(sys.argv) mechanize-0.2.5/PKG-INFO0000644000175000017500000000577411545173600013253 0ustar johnjohnMetadata-Version: 1.0 Name: mechanize Version: 0.2.5 Summary: Stateful programmatic web browsing. Home-page: http://wwwsearch.sourceforge.net/mechanize/ Author: John J. 
Lee Author-email: jjl@pobox.com License: BSD Download-URL: http://pypi.python.org/packages/source/m/mechanize/mechanize-0.2.5.tar.gz Description: Stateful programmatic web browsing, after Andy Lester's Perl module WWW::Mechanize. mechanize.Browser implements the urllib2.OpenerDirector interface. Browser objects have state, including navigation history, HTML form state, cookies, etc. The set of features and URL schemes handled by Browser objects is configurable. The library also provides an API that is mostly compatible with urllib2: your urllib2 program will likely still work if you replace "urllib2" with "mechanize" everywhere. Features include: ftp:, http: and file: URL schemes, browser history, hyperlink and HTML form support, HTTP cookies, HTTP-EQUIV and Refresh, Referer [sic] header, robots.txt, redirections, proxies, and Basic and Digest HTTP authentication. Much of the code originally derived from Perl code by Gisle Aas (libwww-perl), Johnny Lee (MSIE Cookie support) and last but not least Andy Lester (WWW::Mechanize). urllib2 was written by Jeremy Hylton. 
Platform: any Classifier: Development Status :: 5 - Production/Stable Classifier: Intended Audience :: Developers Classifier: Intended Audience :: System Administrators Classifier: License :: OSI Approved :: BSD License Classifier: License :: OSI Approved :: Zope Public License Classifier: Natural Language :: English Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 2 Classifier: Programming Language :: Python :: 2.4 Classifier: Programming Language :: Python :: 2.5 Classifier: Programming Language :: Python :: 2.6 Classifier: Programming Language :: Python :: 2.7 Classifier: Topic :: Internet Classifier: Topic :: Internet :: File Transfer Protocol (FTP) Classifier: Topic :: Internet :: WWW/HTTP Classifier: Topic :: Internet :: WWW/HTTP :: Browsers Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search Classifier: Topic :: Internet :: WWW/HTTP :: Site Management Classifier: Topic :: Internet :: WWW/HTTP :: Site Management :: Link Checking Classifier: Topic :: Software Development :: Libraries Classifier: Topic :: Software Development :: Libraries :: Python Modules Classifier: Topic :: Software Development :: Testing Classifier: Topic :: Software Development :: Testing :: Traffic Generation Classifier: Topic :: System :: Archiving :: Mirroring Classifier: Topic :: System :: Networking :: Monitoring Classifier: Topic :: System :: Systems Administration Classifier: Topic :: Text Processing Classifier: Topic :: Text Processing :: Markup Classifier: Topic :: Text Processing :: Markup :: HTML Classifier: Topic :: Text Processing :: Markup :: XML mechanize-0.2.5/mechanize/0000755000175000017500000000000011545173600014104 5ustar johnjohnmechanize-0.2.5/mechanize/_response.py0000644000175000017500000004261311545150644016464 0ustar johnjohn"""Response classes. 
The seek_wrapper code is not used if you're using UserAgent with .set_seekable_responses(False), or if you're using the urllib2-level interface HTTPEquivProcessor. Class closeable_response is instantiated by some handlers (AbstractHTTPHandler), but the closeable_response interface is only depended upon by Browser-level code. Function upgrade_response is only used if you're using Browser. Copyright 2006 John J. Lee This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). """ import copy, mimetools, urllib2 from cStringIO import StringIO def len_of_seekable(file_): # this function exists because evaluation of len(file_.getvalue()) on every # .read() from seek_wrapper would be O(N**2) in number of .read()s pos = file_.tell() file_.seek(0, 2) # to end try: return file_.tell() finally: file_.seek(pos) # XXX Andrew Dalke kindly sent me a similar class in response to my request on # comp.lang.python, which I then proceeded to lose. I wrote this class # instead, but I think he's released his code publicly since, could pinch the # tests from it, at least... # For testing seek_wrapper invariant (note that # test_urllib2.HandlerTest.test_seekable is expected to fail when this # invariant checking is turned on). The invariant checking is done by module # ipdc, which is available here: # http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/436834 ## from ipdbc import ContractBase ## class seek_wrapper(ContractBase): class seek_wrapper: """Adds a seek method to a file object. This is only designed for seeking on readonly file-like objects. Wrapped file-like object must have a read method. The readline method is only supported if that method is present on the wrapped object. The readlines method is always supported. xreadlines and iteration are supported only for Python 2.2 and above. 
Public attributes: wrapped: the wrapped file object is_closed: true iff .close() has been called WARNING: All other attributes of the wrapped object (ie. those that are not one of wrapped, read, readline, readlines, xreadlines, __iter__ and next) are passed through unaltered, which may or may not make sense for your particular file object. """ # General strategy is to check that cache is full enough, then delegate to # the cache (self.__cache, which is a cStringIO.StringIO instance). A seek # position (self.__pos) is maintained independently of the cache, in order # that a single cache may be shared between multiple seek_wrapper objects. # Copying using module copy shares the cache in this way. def __init__(self, wrapped): self.wrapped = wrapped self.__read_complete_state = [False] self.__is_closed_state = [False] self.__have_readline = hasattr(self.wrapped, "readline") self.__cache = StringIO() self.__pos = 0 # seek position def invariant(self): # The end of the cache is always at the same place as the end of the # wrapped file (though the .tell() method is not required to be present # on wrapped file). return self.wrapped.tell() == len(self.__cache.getvalue()) def close(self): self.wrapped.close() self.is_closed = True def __getattr__(self, name): if name == "is_closed": return self.__is_closed_state[0] elif name == "read_complete": return self.__read_complete_state[0] wrapped = self.__dict__.get("wrapped") if wrapped: return getattr(wrapped, name) return getattr(self.__class__, name) def __setattr__(self, name, value): if name == "is_closed": self.__is_closed_state[0] = bool(value) elif name == "read_complete": if not self.is_closed: self.__read_complete_state[0] = bool(value) else: self.__dict__[name] = value def seek(self, offset, whence=0): assert whence in [0,1,2] # how much data, if any, do we need to read? 
if whence == 2: # 2: relative to end of *wrapped* file if offset < 0: raise ValueError("negative seek offset") # since we don't know yet where the end of that file is, we must # read everything to_read = None else: if whence == 0: # 0: absolute if offset < 0: raise ValueError("negative seek offset") dest = offset else: # 1: relative to current position pos = self.__pos if pos < offset: raise ValueError("seek to before start of file") dest = pos + offset end = len_of_seekable(self.__cache) to_read = dest - end if to_read < 0: to_read = 0 if to_read != 0: self.__cache.seek(0, 2) if to_read is None: assert whence == 2 self.__cache.write(self.wrapped.read()) self.read_complete = True self.__pos = self.__cache.tell() - offset else: data = self.wrapped.read(to_read) if not data: self.read_complete = True else: self.__cache.write(data) # Don't raise an exception even if we've seek()ed past the end # of .wrapped, since fseek() doesn't complain in that case. # Also like fseek(), pretend we have seek()ed past the end, # i.e. not: #self.__pos = self.__cache.tell() # but rather: self.__pos = dest else: self.__pos = dest def tell(self): return self.__pos def __copy__(self): cpy = self.__class__(self.wrapped) cpy.__cache = self.__cache cpy.__read_complete_state = self.__read_complete_state cpy.__is_closed_state = self.__is_closed_state return cpy def get_data(self): pos = self.__pos try: self.seek(0) return self.read(-1) finally: self.__pos = pos def read(self, size=-1): pos = self.__pos end = len_of_seekable(self.__cache) available = end - pos # enough data already cached? 
if size <= available and size != -1: self.__cache.seek(pos) self.__pos = pos+size return self.__cache.read(size) # no, so read sufficient data from wrapped file and cache it self.__cache.seek(0, 2) if size == -1: self.__cache.write(self.wrapped.read()) self.read_complete = True else: to_read = size - available assert to_read > 0 data = self.wrapped.read(to_read) if not data: self.read_complete = True else: self.__cache.write(data) self.__cache.seek(pos) data = self.__cache.read(size) self.__pos = self.__cache.tell() assert self.__pos == pos + len(data) return data def readline(self, size=-1): if not self.__have_readline: raise NotImplementedError("no readline method on wrapped object") # line we're about to read might not be complete in the cache, so # read another line first pos = self.__pos self.__cache.seek(0, 2) data = self.wrapped.readline() if not data: self.read_complete = True else: self.__cache.write(data) self.__cache.seek(pos) data = self.__cache.readline() if size != -1: r = data[:size] self.__pos = pos+size else: r = data self.__pos = pos+len(data) return r def readlines(self, sizehint=-1): pos = self.__pos self.__cache.seek(0, 2) self.__cache.write(self.wrapped.read()) self.read_complete = True self.__cache.seek(pos) data = self.__cache.readlines(sizehint) self.__pos = self.__cache.tell() return data def __iter__(self): return self def next(self): line = self.readline() if line == "": raise StopIteration return line xreadlines = __iter__ def __repr__(self): return ("<%s at %s whose wrapped object = %r>" % (self.__class__.__name__, hex(abs(id(self))), self.wrapped)) class response_seek_wrapper(seek_wrapper): """ Supports copying response objects and setting response body data. 
""" def __init__(self, wrapped): seek_wrapper.__init__(self, wrapped) self._headers = self.wrapped.info() def __copy__(self): cpy = seek_wrapper.__copy__(self) # copy headers from delegate cpy._headers = copy.copy(self.info()) return cpy # Note that .info() and .geturl() (the only two urllib2 response methods # that are not implemented by seek_wrapper) must be here explicitly rather # than by seek_wrapper's __getattr__ delegation) so that the nasty # dynamically-created HTTPError classes in get_seek_wrapper_class() get the # wrapped object's implementation, and not HTTPError's. def info(self): return self._headers def geturl(self): return self.wrapped.geturl() def set_data(self, data): self.seek(0) self.read() self.close() cache = self._seek_wrapper__cache = StringIO() cache.write(data) self.seek(0) class eoffile: # file-like object that always claims to be at end-of-file... def read(self, size=-1): return "" def readline(self, size=-1): return "" def __iter__(self): return self def next(self): return "" def close(self): pass class eofresponse(eoffile): def __init__(self, url, headers, code, msg): self._url = url self._headers = headers self.code = code self.msg = msg def geturl(self): return self._url def info(self): return self._headers class closeable_response: """Avoids unnecessarily clobbering urllib.addinfourl methods on .close(). Only supports responses returned by mechanize.HTTPHandler. After .close(), the following methods are supported: .read() .readline() .info() .geturl() .__iter__() .next() .close() and the following attributes are supported: .code .msg Also supports pickling (but the stdlib currently does something to prevent it: http://python.org/sf/1144636). 
""" # presence of this attr indicates is useable after .close() closeable_response = None def __init__(self, fp, headers, url, code, msg): self._set_fp(fp) self._headers = headers self._url = url self.code = code self.msg = msg def _set_fp(self, fp): self.fp = fp self.read = self.fp.read self.readline = self.fp.readline if hasattr(self.fp, "readlines"): self.readlines = self.fp.readlines if hasattr(self.fp, "fileno"): self.fileno = self.fp.fileno else: self.fileno = lambda: None self.__iter__ = self.fp.__iter__ self.next = self.fp.next def __repr__(self): return '<%s at %s whose fp = %r>' % ( self.__class__.__name__, hex(abs(id(self))), self.fp) def info(self): return self._headers def geturl(self): return self._url def close(self): wrapped = self.fp wrapped.close() new_wrapped = eofresponse( self._url, self._headers, self.code, self.msg) self._set_fp(new_wrapped) def __getstate__(self): # There are three obvious options here: # 1. truncate # 2. read to end # 3. close socket, pickle state including read position, then open # again on unpickle and use Range header # XXXX um, 4. refuse to pickle unless .close()d. This is better, # actually ("errors should never pass silently"). Pickling doesn't # work anyway ATM, because of http://python.org/sf/1144636 so fix # this later # 2 breaks pickle protocol, because one expects the original object # to be left unscathed by pickling. 3 is too complicated and # surprising (and too much work ;-) to happen in a sane __getstate__. # So we do 1. 
state = self.__dict__.copy() new_wrapped = eofresponse( self._url, self._headers, self.code, self.msg) state["wrapped"] = new_wrapped return state def test_response(data='test data', headers=[], url="http://example.com/", code=200, msg="OK"): return make_response(data, headers, url, code, msg) def test_html_response(data='test data', headers=[], url="http://example.com/", code=200, msg="OK"): headers += [("Content-type", "text/html")] return make_response(data, headers, url, code, msg) def make_response(data, headers, url, code, msg): """Convenient factory for objects implementing response interface. data: string containing response body data headers: sequence of (name, value) pairs url: URL of response code: integer response code (e.g. 200) msg: string response code message (e.g. "OK") """ mime_headers = make_headers(headers) r = closeable_response(StringIO(data), mime_headers, url, code, msg) return response_seek_wrapper(r) def make_headers(headers): """ headers: sequence of (name, value) pairs """ hdr_text = [] for name_value in headers: hdr_text.append("%s: %s" % name_value) return mimetools.Message(StringIO("\n".join(hdr_text))) # Rest of this module is especially horrible, but needed, at least until fork # urllib2. Even then, may want to preseve urllib2 compatibility. 
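The read-caching strategy that seek_wrapper uses above (serve reads from a growable in-memory cache, top it up from the wrapped stream on demand, and keep the seek position separate from the cache position) can be sketched standalone in modern Python. The class and method names here are illustrative only, not mechanize's API:

```python
import io

class SeekableReader:
    """Add seek() to a read-only stream by caching everything read so far.

    Illustrative sketch: reads are served from an in-memory cache that is
    topped up from the wrapped stream on demand; the seek position is
    tracked independently of the cache's own position.
    """

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self._cache = io.BytesIO()
        self._pos = 0  # seek position, independent of cache position

    def _fill(self, upto):
        # Pull bytes from the wrapped stream until the cache holds `upto`
        # bytes (or the stream is exhausted); upto=None means "read all".
        self._cache.seek(0, 2)
        if upto is None:
            self._cache.write(self.wrapped.read())
        else:
            need = upto - self._cache.tell()
            if need > 0:
                self._cache.write(self.wrapped.read(need))

    def read(self, size=-1):
        self._fill(None if size < 0 else self._pos + size)
        self._cache.seek(self._pos)
        data = self._cache.read() if size < 0 else self._cache.read(size)
        self._pos = self._cache.tell()
        return data

    def seek(self, offset, whence=0):
        if whence == 0:    # absolute
            self._pos = offset
        elif whence == 1:  # relative to current position
            self._pos += offset
        else:              # relative to end: must consume the wrapped stream
            self._fill(None)
            self._cache.seek(0, 2)
            self._pos = self._cache.tell() + offset

    def tell(self):
        return self._pos
```

After `r = SeekableReader(some_socket_like_stream)`, a `r.seek(0)` followed by a second `r.read()` is served entirely from the cache, which is the property the response classes above need.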
def get_seek_wrapper_class(response): # in order to wrap response objects that are also exceptions, we must # dynamically subclass the exception :-((( if (isinstance(response, urllib2.HTTPError) and not hasattr(response, "seek")): if response.__class__.__module__ == "__builtin__": exc_class_name = response.__class__.__name__ else: exc_class_name = "%s.%s" % ( response.__class__.__module__, response.__class__.__name__) class httperror_seek_wrapper(response_seek_wrapper, response.__class__): # this only derives from HTTPError in order to be a subclass -- # the HTTPError behaviour comes from delegation _exc_class_name = exc_class_name def __init__(self, wrapped): response_seek_wrapper.__init__(self, wrapped) # be compatible with undocumented HTTPError attributes :-( self.hdrs = wrapped.info() self.filename = wrapped.geturl() def __repr__(self): return ( "<%s (%s instance) at %s " "whose wrapped object = %r>" % ( self.__class__.__name__, self._exc_class_name, hex(abs(id(self))), self.wrapped) ) wrapper_class = httperror_seek_wrapper else: wrapper_class = response_seek_wrapper return wrapper_class def seek_wrapped_response(response): """Return a copy of response that supports seekable response interface. Accepts responses from both mechanize and urllib2 handlers. Copes with both ordinary response instances and HTTPError instances (which can't be simply wrapped due to the requirement of preserving the exception base class). """ if not hasattr(response, "seek"): wrapper_class = get_seek_wrapper_class(response) response = wrapper_class(response) assert hasattr(response, "get_data") return response def upgrade_response(response): """Return a copy of response that supports Browser response interface. Browser response interface is that of "seekable responses" (response_seek_wrapper), plus the requirement that responses must be useable after .close() (closeable_response). Accepts responses from both mechanize and urllib2 handlers. 
Copes with both ordinary response instances and HTTPError instances (which can't be simply wrapped due to the requirement of preserving the exception base class). """ wrapper_class = get_seek_wrapper_class(response) if hasattr(response, "closeable_response"): if not hasattr(response, "seek"): response = wrapper_class(response) assert hasattr(response, "get_data") return copy.copy(response) # a urllib2 handler constructed the response, i.e. the response is an # urllib.addinfourl or a urllib2.HTTPError, instead of a # _Util.closeable_response as returned by e.g. mechanize.HTTPHandler try: code = response.code except AttributeError: code = None try: msg = response.msg except AttributeError: msg = None # may have already-.read() data from .seek() cache data = None get_data = getattr(response, "get_data", None) if get_data: data = get_data() response = closeable_response( response.fp, response.info(), response.geturl(), code, msg) response = wrapper_class(response) if data: response.set_data(data) return response mechanize-0.2.5/mechanize/_rfc3986.py0000644000175000017500000001676111545150644015737 0ustar johnjohn"""RFC 3986 URI parsing and relative reference resolution / absolutization. (aka splitting and joining) Copyright 2006 John J. Lee This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). """ # XXX Wow, this is ugly. Overly-direct translation of the RFC ATM. 
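The relative-reference resolution this module implements hinges on dot-segment removal (RFC 3986 section 5.2.4). As a compact standalone sketch in modern Python -- illustrative only, not the module's own implementation, which appears further down:

```python
def remove_dot_segments(path):
    # RFC 3986 section 5.2.4: interpret and remove "." and ".." segments
    # from a URI path.  `out` collects finished output segments.
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = "/" + path[3:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = "/" + path[4:]
            if out:
                out.pop()
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # move the first path segment (with its leading "/", if any)
            # from the input to the output
            i = path.find("/", 1) if path.startswith("/") else path.find("/")
            if i == -1:
                out.append(path)
                path = ""
            else:
                out.append(path[:i])
                path = path[i:]
    return "".join(out)
```

The two vectors worth checking are the worked examples from the RFC itself: "/a/b/c/./../../g" resolves to "/a/g" and "mid/content=5/../6" to "mid/6".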
import re, urllib

## def chr_range(a, b):
##     return "".join(map(chr, range(ord(a), ord(b)+1)))

## UNRESERVED_URI_CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
##                         "abcdefghijklmnopqrstuvwxyz"
##                         "0123456789"
##                         "-_.~")
## RESERVED_URI_CHARS = "!*'();:@&=+$,/?#[]"
## URI_CHARS = RESERVED_URI_CHARS+UNRESERVED_URI_CHARS+'%'
# this re matches any character that's not in URI_CHARS
BAD_URI_CHARS_RE = re.compile("[^A-Za-z0-9\-_.~!*'();:@&=+$,/?%#[\]]")


def clean_url(url, encoding):
    # percent-encode illegal URI characters
    # Trying to come up with test cases for this gave me a headache, revisit
    # when we do switch to unicode.
    # Somebody else's comments (lost the attribution):
    ## - IE will return you the url in the encoding you send it
    ## - Mozilla/Firefox will send you latin-1 if there's no non latin-1
    ##   characters in your link.  It will send you utf-8 however if there
    ##   are...
    if type(url) == type(""):
        url = url.decode(encoding, "replace")
    url = url.strip()
    # for second param to urllib.quote(), we want URI_CHARS, minus the
    # 'always_safe' characters that urllib.quote() never percent-encodes
    return urllib.quote(url.encode(encoding), "!*'();:@&=+$,/?%#[]~")


def is_clean_uri(uri):
    """
    >>> is_clean_uri("ABC!")
    True
    >>> is_clean_uri(u"ABC!")
    True
    >>> is_clean_uri("ABC|")
    False
    >>> is_clean_uri(u"ABC|")
    False
    >>> is_clean_uri("http://example.com/0")
    True
    >>> is_clean_uri(u"http://example.com/0")
    True
    """
    # note module re treats bytestrings as though they were decoded as latin-1
    # so this function accepts both unicode and bytestrings
    return not bool(BAD_URI_CHARS_RE.search(uri))


SPLIT_MATCH = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?").match


def urlsplit(absolute_uri):
    """Return scheme, authority, path, query, fragment."""
    match = SPLIT_MATCH(absolute_uri)
    if match:
        g = match.groups()
        return g[1], g[3], g[4], g[6], g[8]


def urlunsplit(parts):
    scheme, authority, path, query, fragment = parts
    r = []
    append = r.append
    if scheme is not None:
        append(scheme)
append(":") if authority is not None: append("//") append(authority) append(path) if query is not None: append("?") append(query) if fragment is not None: append("#") append(fragment) return "".join(r) def urljoin(base_uri, uri_reference): """Join a base URI with a URI reference and return the resulting URI. See RFC 3986. """ return urlunsplit(urljoin_parts(urlsplit(base_uri), urlsplit(uri_reference))) # oops, this doesn't do the same thing as the literal translation # from the RFC below ## import posixpath ## def urljoin_parts(base_parts, reference_parts): ## scheme, authority, path, query, fragment = base_parts ## rscheme, rauthority, rpath, rquery, rfragment = reference_parts ## # compute target URI path ## if rpath == "": ## tpath = path ## else: ## tpath = rpath ## if not tpath.startswith("/"): ## tpath = merge(authority, path, tpath) ## tpath = posixpath.normpath(tpath) ## if rscheme is not None: ## return (rscheme, rauthority, tpath, rquery, rfragment) ## elif rauthority is not None: ## return (scheme, rauthority, tpath, rquery, rfragment) ## elif rpath == "": ## if rquery is not None: ## tquery = rquery ## else: ## tquery = query ## return (scheme, authority, tpath, tquery, rfragment) ## else: ## return (scheme, authority, tpath, rquery, rfragment) def urljoin_parts(base_parts, reference_parts): scheme, authority, path, query, fragment = base_parts rscheme, rauthority, rpath, rquery, rfragment = reference_parts if rscheme == scheme: rscheme = None if rscheme is not None: tscheme, tauthority, tpath, tquery = ( rscheme, rauthority, remove_dot_segments(rpath), rquery) else: if rauthority is not None: tauthority, tpath, tquery = ( rauthority, remove_dot_segments(rpath), rquery) else: if rpath == "": tpath = path if rquery is not None: tquery = rquery else: tquery = query else: if rpath.startswith("/"): tpath = remove_dot_segments(rpath) else: tpath = merge(authority, path, rpath) tpath = remove_dot_segments(tpath) tquery = rquery tauthority = authority tscheme 
= scheme tfragment = rfragment return (tscheme, tauthority, tpath, tquery, tfragment) # um, something *vaguely* like this is what I want, but I have to generate # lots of test cases first, if only to understand what it is that # remove_dot_segments really does... ## def remove_dot_segments(path): ## if path == '': ## return '' ## comps = path.split('/') ## new_comps = [] ## for comp in comps: ## if comp in ['.', '']: ## if not new_comps or new_comps[-1]: ## new_comps.append('') ## continue ## if comp != '..': ## new_comps.append(comp) ## elif new_comps: ## new_comps.pop() ## return '/'.join(new_comps) def remove_dot_segments(path): r = [] while path: # A if path.startswith("../"): path = path[3:] continue if path.startswith("./"): path = path[2:] continue # B if path.startswith("/./"): path = path[2:] continue if path == "/.": path = "/" continue # C if path.startswith("/../"): path = path[3:] if r: r.pop() continue if path == "/..": path = "/" if r: r.pop() continue # D if path == ".": path = path[1:] continue if path == "..": path = path[2:] continue # E start = 0 if path.startswith("/"): start = 1 ii = path.find("/", start) if ii < 0: ii = None r.append(path[:ii]) if ii is None: break path = path[ii:] return "".join(r) def merge(base_authority, base_path, ref_path): # XXXX Oddly, the sample Perl implementation of this by Roy Fielding # doesn't even take base_authority as a parameter, despite the wording in # the RFC suggesting otherwise. Perhaps I'm missing some obvious identity. #if base_authority is not None and base_path == "": if base_path == "": return "/" + ref_path ii = base_path.rfind("/") if ii >= 0: return base_path[:ii+1] + ref_path return ref_path if __name__ == "__main__": import doctest doctest.testmod() mechanize-0.2.5/mechanize/_useragent.py0000644000175000017500000003375011545150644016625 0ustar johnjohn"""Convenient HTTP UserAgent class. This is a subclass of urllib2.OpenerDirector. Copyright 2003-2006 John J. 
Lee This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). """ import warnings import _auth import _gzip import _opener import _response import _sockettimeout import _urllib2 class UserAgentBase(_opener.OpenerDirector): """Convenient user-agent class. Do not use .add_handler() to add a handler for something already dealt with by this code. The only reason at present for the distinction between UserAgent and UserAgentBase is so that classes that depend on .seek()able responses (e.g. mechanize.Browser) can inherit from UserAgentBase. The subclass UserAgent exposes a .set_seekable_responses() method that allows switching off the adding of a .seek() method to responses. Public attributes: addheaders: list of (name, value) pairs specifying headers to send with every request, unless they are overridden in the Request instance. >>> ua = UserAgentBase() >>> ua.addheaders = [ ... ("User-agent", "Mozilla/5.0 (compatible)"), ... 
("From", "responsible.person@example.com")] """ handler_classes = { # scheme handlers "http": _urllib2.HTTPHandler, # CacheFTPHandler is buggy, at least in 2.3, so we don't use it "ftp": _urllib2.FTPHandler, "file": _urllib2.FileHandler, # other handlers "_unknown": _urllib2.UnknownHandler, # HTTP{S,}Handler depend on HTTPErrorProcessor too "_http_error": _urllib2.HTTPErrorProcessor, "_http_default_error": _urllib2.HTTPDefaultErrorHandler, # feature handlers "_basicauth": _urllib2.HTTPBasicAuthHandler, "_digestauth": _urllib2.HTTPDigestAuthHandler, "_redirect": _urllib2.HTTPRedirectHandler, "_cookies": _urllib2.HTTPCookieProcessor, "_refresh": _urllib2.HTTPRefreshProcessor, "_equiv": _urllib2.HTTPEquivProcessor, "_proxy": _urllib2.ProxyHandler, "_proxy_basicauth": _urllib2.ProxyBasicAuthHandler, "_proxy_digestauth": _urllib2.ProxyDigestAuthHandler, "_robots": _urllib2.HTTPRobotRulesProcessor, "_gzip": _gzip.HTTPGzipProcessor, # experimental! # debug handlers "_debug_redirect": _urllib2.HTTPRedirectDebugProcessor, "_debug_response_body": _urllib2.HTTPResponseDebugProcessor, } default_schemes = ["http", "ftp", "file"] default_others = ["_unknown", "_http_error", "_http_default_error"] default_features = ["_redirect", "_cookies", "_refresh", "_equiv", "_basicauth", "_digestauth", "_proxy", "_proxy_basicauth", "_proxy_digestauth", "_robots", ] if hasattr(_urllib2, 'HTTPSHandler'): handler_classes["https"] = _urllib2.HTTPSHandler default_schemes.append("https") def __init__(self): _opener.OpenerDirector.__init__(self) ua_handlers = self._ua_handlers = {} for scheme in (self.default_schemes+ self.default_others+ self.default_features): klass = self.handler_classes[scheme] ua_handlers[scheme] = klass() for handler in ua_handlers.itervalues(): self.add_handler(handler) # Yuck. # Ensure correct default constructor args were passed to # HTTPRefreshProcessor and HTTPEquivProcessor. 
if "_refresh" in ua_handlers: self.set_handle_refresh(True) if "_equiv" in ua_handlers: self.set_handle_equiv(True) # Ensure default password managers are installed. pm = ppm = None if "_basicauth" in ua_handlers or "_digestauth" in ua_handlers: pm = _urllib2.HTTPPasswordMgrWithDefaultRealm() if ("_proxy_basicauth" in ua_handlers or "_proxy_digestauth" in ua_handlers): ppm = _auth.HTTPProxyPasswordMgr() self.set_password_manager(pm) self.set_proxy_password_manager(ppm) # set default certificate manager if "https" in ua_handlers: cm = _urllib2.HTTPSClientCertMgr() self.set_client_cert_manager(cm) def close(self): _opener.OpenerDirector.close(self) self._ua_handlers = None # XXX ## def set_timeout(self, timeout): ## self._timeout = timeout ## def set_http_connection_cache(self, conn_cache): ## self._http_conn_cache = conn_cache ## def set_ftp_connection_cache(self, conn_cache): ## # XXX ATM, FTP has cache as part of handler; should it be separate? ## self._ftp_conn_cache = conn_cache def set_handled_schemes(self, schemes): """Set sequence of URL scheme (protocol) strings. For example: ua.set_handled_schemes(["http", "ftp"]) If this fails (with ValueError) because you've passed an unknown scheme, the set of handled schemes will not be changed. 
""" want = {} for scheme in schemes: if scheme.startswith("_"): raise ValueError("not a scheme '%s'" % scheme) if scheme not in self.handler_classes: raise ValueError("unknown scheme '%s'") want[scheme] = None # get rid of scheme handlers we don't want for scheme, oldhandler in self._ua_handlers.items(): if scheme.startswith("_"): continue # not a scheme handler if scheme not in want: self._replace_handler(scheme, None) else: del want[scheme] # already got it # add the scheme handlers that are missing for scheme in want.keys(): self._set_handler(scheme, True) def set_cookiejar(self, cookiejar): """Set a mechanize.CookieJar, or None.""" self._set_handler("_cookies", obj=cookiejar) # XXX could use Greg Stein's httpx for some of this instead? # or httplib2?? def set_proxies(self, proxies=None, proxy_bypass=None): """Configure proxy settings. proxies: dictionary mapping URL scheme to proxy specification. None means use the default system-specific settings. proxy_bypass: function taking hostname, returning whether proxy should be used. None means use the default system-specific settings. The default is to try to obtain proxy settings from the system (see the documentation for urllib.urlopen for information about the system-specific methods used -- note that's urllib, not urllib2). To avoid all use of proxies, pass an empty proxies dict. >>> ua = UserAgentBase() >>> def proxy_bypass(hostname): ... return hostname == "noproxy.com" >>> ua.set_proxies( ... {"http": "joe:password@myproxy.example.com:3128", ... "ftp": "proxy.example.com"}, ... 
proxy_bypass) """ self._set_handler("_proxy", True, constructor_kwds=dict(proxies=proxies, proxy_bypass=proxy_bypass)) def add_password(self, url, user, password, realm=None): self._password_manager.add_password(realm, url, user, password) def add_proxy_password(self, user, password, hostport=None, realm=None): self._proxy_password_manager.add_password( realm, hostport, user, password) def add_client_certificate(self, url, key_file, cert_file): """Add an SSL client certificate, for HTTPS client auth. key_file and cert_file must be filenames of the key and certificate files, in PEM format. You can use e.g. OpenSSL to convert a p12 (PKCS 12) file to PEM format: openssl pkcs12 -clcerts -nokeys -in cert.p12 -out cert.pem openssl pkcs12 -nocerts -in cert.p12 -out key.pem Note that client certificate password input is very inflexible ATM. At the moment this seems to be console only, which is presumably the default behaviour of libopenssl. In future mechanize may support third-party libraries that (I assume) allow more options here. 
""" self._client_cert_manager.add_key_cert(url, key_file, cert_file) # the following are rarely useful -- use add_password / add_proxy_password # instead def set_password_manager(self, password_manager): """Set a mechanize.HTTPPasswordMgrWithDefaultRealm, or None.""" self._password_manager = password_manager self._set_handler("_basicauth", obj=password_manager) self._set_handler("_digestauth", obj=password_manager) def set_proxy_password_manager(self, password_manager): """Set a mechanize.HTTPProxyPasswordMgr, or None.""" self._proxy_password_manager = password_manager self._set_handler("_proxy_basicauth", obj=password_manager) self._set_handler("_proxy_digestauth", obj=password_manager) def set_client_cert_manager(self, cert_manager): """Set a mechanize.HTTPClientCertMgr, or None.""" self._client_cert_manager = cert_manager handler = self._ua_handlers["https"] handler.client_cert_manager = cert_manager # these methods all take a boolean parameter def set_handle_robots(self, handle): """Set whether to observe rules from robots.txt.""" self._set_handler("_robots", handle) def set_handle_redirect(self, handle): """Set whether to handle HTTP 30x redirections.""" self._set_handler("_redirect", handle) def set_handle_refresh(self, handle, max_time=None, honor_time=True): """Set whether to handle HTTP Refresh headers.""" self._set_handler("_refresh", handle, constructor_kwds= {"max_time": max_time, "honor_time": honor_time}) def set_handle_equiv(self, handle, head_parser_class=None): """Set whether to treat HTML http-equiv headers like HTTP headers. Response objects may be .seek()able if this is set (currently returned responses are, raised HTTPError exception responses are not). """ if head_parser_class is not None: constructor_kwds = {"head_parser_class": head_parser_class} else: constructor_kwds={} self._set_handler("_equiv", handle, constructor_kwds=constructor_kwds) def set_handle_gzip(self, handle): """Handle gzip transfer encoding. 
""" if handle: warnings.warn( "gzip transfer encoding is experimental!", stacklevel=2) self._set_handler("_gzip", handle) def set_debug_redirects(self, handle): """Log information about HTTP redirects (including refreshes). Logging is performed using module logging. The logger name is "mechanize.http_redirects". To actually print some debug output, eg: import sys, logging logger = logging.getLogger("mechanize.http_redirects") logger.addHandler(logging.StreamHandler(sys.stdout)) logger.setLevel(logging.INFO) Other logger names relevant to this module: "mechanize.http_responses" "mechanize.cookies" To turn on everything: import sys, logging logger = logging.getLogger("mechanize") logger.addHandler(logging.StreamHandler(sys.stdout)) logger.setLevel(logging.INFO) """ self._set_handler("_debug_redirect", handle) def set_debug_responses(self, handle): """Log HTTP response bodies. See docstring for .set_debug_redirects() for details of logging. Response objects may be .seek()able if this is set (currently returned responses are, raised HTTPError exception responses are not). 
""" self._set_handler("_debug_response_body", handle) def set_debug_http(self, handle): """Print HTTP headers to sys.stdout.""" level = int(bool(handle)) for scheme in "http", "https": h = self._ua_handlers.get(scheme) if h is not None: h.set_http_debuglevel(level) def _set_handler(self, name, handle=None, obj=None, constructor_args=(), constructor_kwds={}): if handle is None: handle = obj is not None if handle: handler_class = self.handler_classes[name] if obj is not None: newhandler = handler_class(obj) else: newhandler = handler_class( *constructor_args, **constructor_kwds) else: newhandler = None self._replace_handler(name, newhandler) def _replace_handler(self, name, newhandler=None): # first, if handler was previously added, remove it if name is not None: handler = self._ua_handlers.get(name) if handler: try: self.handlers.remove(handler) except ValueError: pass # then add the replacement, if any if newhandler is not None: self.add_handler(newhandler) self._ua_handlers[name] = newhandler class UserAgent(UserAgentBase): def __init__(self): UserAgentBase.__init__(self) self._seekable = False def set_seekable_responses(self, handle): """Make response objects .seek()able.""" self._seekable = bool(handle) def open(self, fullurl, data=None, timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT): if self._seekable: def bound_open(fullurl, data=None, timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT): return UserAgentBase.open(self, fullurl, data, timeout) response = _opener.wrapped_open( bound_open, _response.seek_wrapped_response, fullurl, data, timeout) else: response = UserAgentBase.open(self, fullurl, data) return response mechanize-0.2.5/mechanize/_debug.py0000644000175000017500000000165211545150644015712 0ustar johnjohnimport logging from _response import response_seek_wrapper from _urllib2_fork import BaseHandler class HTTPResponseDebugProcessor(BaseHandler): handler_order = 900 # before redirections, after everything else def http_response(self, request, response): if 
not hasattr(response, "seek"): response = response_seek_wrapper(response) info = logging.getLogger("mechanize.http_responses").info try: info(response.read()) finally: response.seek(0) info("*****************************************************") return response https_response = http_response class HTTPRedirectDebugProcessor(BaseHandler): def http_request(self, request): if hasattr(request, "redirect_dict"): info = logging.getLogger("mechanize.http_redirects").info info("redirecting to %s", request.get_full_url()) return request mechanize-0.2.5/mechanize/_sgmllib_copy.py0000644000175000017500000004351011545150644017306 0ustar johnjohn# Taken from Python 2.6.4 and regexp module constants modified """A parser for SGML, using the derived class as a static DTD.""" # XXX This only supports those SGML features used by HTML. # XXX There should be a way to distinguish between PCDATA (parsed # character data -- the normal case), RCDATA (replaceable character # data -- only char and entity references and end tags are special) # and CDATA (character data -- only end tags are special). RCDATA is # not supported at all. 
# from warnings import warnpy3k # warnpy3k("the sgmllib module has been removed in Python 3.0", # stacklevel=2) # del warnpy3k import markupbase import re __all__ = ["SGMLParser", "SGMLParseError"] # Regular expressions used for parsing interesting = re.compile('[&<]') incomplete = re.compile('&([a-zA-Z][a-zA-Z0-9]*|#[0-9]*)?|' '<([a-zA-Z][^<>]*|' '/([a-zA-Z][^<>]*)?|' '![^<>]*)?') entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]') # hack to fix http://bugs.python.org/issue803422 # charref = re.compile('&#([0-9]+)[^0-9]') charref = re.compile("&#(x?[0-9a-fA-F]+)[^0-9a-fA-F]") starttagopen = re.compile('<[>a-zA-Z]') shorttagopen = re.compile('<[a-zA-Z][-.a-zA-Z0-9]*/') shorttag = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/') piclose = re.compile('>') endbracket = re.compile('[<>]') # hack moved from _beautifulsoup.py (bundled BeautifulSoup version 2) #This code makes Beautiful Soup able to parse XML with namespaces # tagfind = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*') tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*') attrfind = re.compile( r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*' r'(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?') class SGMLParseError(RuntimeError): """Exception raised for all parse errors.""" pass # SGML parser base class -- find tags and call handler functions. # Usage: p = SGMLParser(); p.feed(data); ...; p.close(). # The dtd is defined by deriving a class which defines methods # with special names to handle tags: start_foo and end_foo to handle # and , respectively, or do_foo to handle by itself. # (Tags are converted to lower case for this purpose.) The data # between tags is passed to the parser by calling self.handle_data() # with some data as argument (the data may be split up in arbitrary # chunks). Entity references are passed by calling # self.handle_entityref() with the entity reference as argument. 
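The patched `charref` pattern above (the fix for http://bugs.python.org/issue803422) accepts hexadecimal as well as decimal character references, unlike stock sgmllib. A small self-contained sketch of the behavioural difference between the two patterns (helper name `extract_ref` is illustrative, not part of the module):

```python
import re

# Stock sgmllib pattern: decimal character references only.
old_charref = re.compile('&#([0-9]+)[^0-9]')
# Patched pattern used here: also accepts hex references such as &#x2014;.
new_charref = re.compile("&#(x?[0-9a-fA-F]+)[^0-9a-fA-F]")

def extract_ref(pattern, text):
    match = pattern.match(text)
    return match.group(1) if match else None

# The trailing ";" serves as the required terminating non-(hex-)digit.
assert extract_ref(old_charref, "&#38;") == "38"
assert extract_ref(old_charref, "&#x2014;") is None  # hex ref rejected
assert extract_ref(new_charref, "&#x2014;") == "x2014"
assert extract_ref(new_charref, "&#38;") == "38"
```

Note that both patterns require a terminating non-digit character, which is why a bare `&#38` at end of input stays in the incomplete-input buffer rather than matching.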
class SGMLParser(markupbase.ParserBase): # Definition of entities -- derived classes may override entity_or_charref = re.compile('&(?:' '([a-zA-Z][-.a-zA-Z0-9]*)|#([0-9]+)' ')(;?)') def __init__(self, verbose=0): """Initialize and reset this instance.""" self.verbose = verbose self.reset() def reset(self): """Reset this instance. Loses all unprocessed data.""" self.__starttag_text = None self.rawdata = '' self.stack = [] self.lasttag = '???' self.nomoretags = 0 self.literal = 0 markupbase.ParserBase.reset(self) def setnomoretags(self): """Enter literal mode (CDATA) till EOF. Intended for derived classes only. """ self.nomoretags = self.literal = 1 def setliteral(self, *args): """Enter literal mode (CDATA). Intended for derived classes only. """ self.literal = 1 def feed(self, data): """Feed some data to the parser. Call this as often as you want, with as little or as much text as you want (may include '\n'). (This just saves the text, all the processing is done by goahead().) """ self.rawdata = self.rawdata + data self.goahead(0) def close(self): """Handle the remaining data.""" self.goahead(1) def error(self, message): raise SGMLParseError(message) # Internal -- handle data as far as reasonable. May leave state # and data to be processed by a subsequent call. If 'end' is # true, force handling all data as if followed by EOF marker. 
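The comment above states the contract of goahead(): consume as much buffered input as can be parsed unambiguously, and leave any trailing incomplete construct in self.rawdata for the next feed() call (unless `end` is true, in which case everything is flushed). A much-simplified, hypothetical sketch of that buffering pattern — `TinyScanner` is not part of mechanize and ignores attributes, entities, and comments:

```python
class TinyScanner(object):
    """Toy sketch of sgmllib-style incremental buffering (not the real parser)."""

    def __init__(self):
        self.rawdata = ''
        self.events = []

    def feed(self, data):
        # Like SGMLParser.feed(): stash the text, then process what we can.
        self.rawdata = self.rawdata + data
        self._goahead(end=False)

    def close(self):
        # Like SGMLParser.close(): force handling of all remaining data.
        self._goahead(end=True)

    def _goahead(self, end):
        data = self.rawdata
        i = 0
        n = len(data)
        while i < n:
            lt = data.find('<', i)
            if lt == -1:
                self.events.append(('data', data[i:]))
                i = n
                break
            if lt > i:
                self.events.append(('data', data[i:lt]))
                i = lt
            gt = data.find('>', lt)
            if gt == -1:
                if end:  # EOF: emit the incomplete tag as plain data
                    self.events.append(('data', data[lt:]))
                    i = n
                break    # otherwise wait for more input
            self.events.append(('tag', data[lt + 1:gt]))
            i = gt + 1
        # anything unprocessed stays buffered for the next feed()
        self.rawdata = data[i:]

s = TinyScanner()
s.feed('foo<b')   # '<b' is incomplete: buffered, not emitted
s.feed('>bar')
s.close()
assert s.events == [('data', 'foo'), ('tag', 'b'), ('data', 'bar')]
```

The real goahead() below follows the same shape, with the `interesting` regex standing in for the naive `find('<')` and separate parse paths for start tags, end tags, declarations, and references.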
def goahead(self, end): rawdata = self.rawdata i = 0 n = len(rawdata) while i < n: if self.nomoretags: self.handle_data(rawdata[i:n]) i = n break match = interesting.search(rawdata, i) if match: j = match.start() else: j = n if i < j: self.handle_data(rawdata[i:j]) i = j if i == n: break if rawdata[i] == '<': if starttagopen.match(rawdata, i): if self.literal: self.handle_data(rawdata[i]) i = i+1 continue k = self.parse_starttag(i) if k < 0: break i = k continue if rawdata.startswith(" (i + 1): self.handle_data("<") i = i+1 else: # incomplete break continue if rawdata.startswith(" (Extraneous whitespace in declaration) You can pass in a custom list of (RE object, replace method) tuples to get Beautiful Soup to scrub your input the way you want.""" Tag.__init__(self, self.ROOT_TAG_NAME) if avoidParserProblems \ and not isList(avoidParserProblems): avoidParserProblems = self.PARSER_MASSAGE self.avoidParserProblems = avoidParserProblems SGMLParser.__init__(self) self.quoteStack = [] self.hidden = 1 self.reset() if hasattr(text, 'read'): #It's a file-type object. 
text = text.read() if text: self.feed(text) if initialTextIsEverything: self.done() def __getattr__(self, methodName): """This method routes method call requests to either the SGMLParser superclass or the Tag superclass, depending on the method name.""" if methodName.find('start_') == 0 or methodName.find('end_') == 0 \ or methodName.find('do_') == 0: return SGMLParser.__getattr__(self, methodName) elif methodName.find('__') != 0: return Tag.__getattr__(self, methodName) else: raise AttributeError def feed(self, text): if self.avoidParserProblems: for fix, m in self.avoidParserProblems: text = fix.sub(m, text) SGMLParser.feed(self, text) def done(self): """Called when you're done parsing, so that the unclosed tags can be correctly processed.""" self.endData() #NEW while self.currentTag.name != self.ROOT_TAG_NAME: self.popTag() def reset(self): SGMLParser.reset(self) self.currentData = [] self.currentTag = None self.tagStack = [] self.pushTag(self) def popTag(self): tag = self.tagStack.pop() # Tags with just one string-owning child get the child as a # 'string' property, so that soup.tag.string is shorthand for # soup.tag.contents[0] if len(self.currentTag.contents) == 1 and \ isinstance(self.currentTag.contents[0], NavigableText): self.currentTag.string = self.currentTag.contents[0] #print "Pop", tag.name if self.tagStack: self.currentTag = self.tagStack[-1] return self.currentTag def pushTag(self, tag): #print "Push", tag.name if self.currentTag: self.currentTag.append(tag) self.tagStack.append(tag) self.currentTag = self.tagStack[-1] def endData(self): currentData = ''.join(self.currentData) if currentData: if not currentData.strip(): if '\n' in currentData: currentData = '\n' else: currentData = ' ' c = NavigableString if type(currentData) == types.UnicodeType: c = NavigableUnicodeString o = c(currentData) o.setup(self.currentTag, self.previous) if self.previous: self.previous.next = o self.previous = o self.currentTag.contents.append(o) self.currentData = [] 
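The popTag/pushTag methods above maintain a stack of currently-open tags, and _popToTag below closes everything up to the most recent tag of a given name. A toy sketch of that stack discipline, using plain strings instead of Tag objects (`pop_to_tag` is a hypothetical stand-in, not the library function):

```python
def pop_to_tag(stack, name, inclusive=True):
    """Pop entries up to the most recent `name`; return them in pop order."""
    num_pops = 0
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            num_pops = len(stack) - i
            break
    if not inclusive:
        num_pops = num_pops - 1
    popped = []
    for _ in range(max(num_pops, 0)):
        popped.append(stack.pop())
    return popped

stack = ['html', 'body', 'p', 'b']
assert pop_to_tag(stack, 'p') == ['b', 'p']   # 'b' closes, then 'p'
assert stack == ['html', 'body']
```

The real method additionally skips index 0 (the soup root tag is never popped) and returns only the last Tag popped rather than the whole list.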
def _popToTag(self, name, inclusivePop=True): """Pops the tag stack up to and including the most recent instance of the given tag. If inclusivePop is false, pops the tag stack up to but *not* including the most recent instance of the given tag.""" if name == self.ROOT_TAG_NAME: return numPops = 0 mostRecentTag = None for i in range(len(self.tagStack)-1, 0, -1): if name == self.tagStack[i].name: numPops = len(self.tagStack)-i break if not inclusivePop: numPops = numPops - 1 for i in range(0, numPops): mostRecentTag = self.popTag() return mostRecentTag def _smartPop(self, name): """We need to pop up to the previous tag of this type, unless one of this tag's nesting reset triggers comes between this tag and the previous tag of this type, OR unless this tag is a generic nesting trigger and another generic nesting trigger comes between this tag and the previous tag of this type. Examples:
<p>Foo<b>Bar<p> should pop to 'p', not 'b'.
<p>Foo<table>Bar<p> should pop to 'table', not 'p'.
<p>Foo<table><tr>Bar<p> should pop to 'tr', not 'p'.
<p>Foo<b>Bar<p> should pop to 'p', not 'b'.
<li><ul><li> *<li> should pop to 'ul', not the first 'li'.
<tr><table><tr> *<tr> should pop to 'table', not the first 'tr'
<td><tr><td> *<td> should pop to 'tr', not the first 'td' """ nestingResetTriggers = self.NESTABLE_TAGS.get(name) isNestable = nestingResetTriggers != None isResetNesting = self.RESET_NESTING_TAGS.has_key(name) popTo = None inclusive = True for i in range(len(self.tagStack)-1, 0, -1): p = self.tagStack[i] if (not p or p.name == name) and not isNestable: #Non-nestable tags get popped to the top or to their #last occurrence. popTo = name break if (nestingResetTriggers != None and p.name in nestingResetTriggers) \ or (nestingResetTriggers == None and isResetNesting and self.RESET_NESTING_TAGS.has_key(p.name)): #If we encounter one of the nesting reset triggers #peculiar to this tag, or we encounter another tag #that causes nesting to reset, pop up to but not #including that tag. popTo = p.name inclusive = False break p = p.parent if popTo: self._popToTag(popTo, inclusive) def unknown_starttag(self, name, attrs, selfClosing=0): #print "Start tag %s" % name if self.quoteStack: #This is not a real tag. #print "<%s> is not real!" % name attrs = ''.join(map(lambda(x, y): ' %s="%s"' % (x, y), attrs)) self.handle_data('<%s%s>' % (name, attrs)) return self.endData() if not name in self.SELF_CLOSING_TAGS and not selfClosing: self._smartPop(name) tag = Tag(name, attrs, self.currentTag, self.previous) if self.previous: self.previous.next = tag self.previous = tag self.pushTag(tag) if selfClosing or name in self.SELF_CLOSING_TAGS: self.popTag() if name in self.QUOTE_TAGS: #print "Beginning quote (%s)" % name self.quoteStack.append(name) self.literal = 1 def unknown_endtag(self, name): if self.quoteStack and self.quoteStack[-1] != name: #This is not a real end tag. #print "</%s> is not real!" % name self.handle_data('</%s>' % name) return self.endData() self._popToTag(name) if self.quoteStack and self.quoteStack[-1] == name: self.quoteStack.pop() self.literal = (len(self.quoteStack) > 0) def handle_data(self, data): self.currentData.append(data) def handle_pi(self, text): "Propagate processing instructions right through." self.handle_data("<?%s>" % text) def handle_comment(self, text): "Propagate comments right through." self.handle_data("<!--%s-->" % text) def handle_charref(self, ref): "Propagate char refs right through." self.handle_data('&#%s;' % ref) def handle_entityref(self, ref): "Propagate entity refs right through." self.handle_data('&%s;' % ref) def handle_decl(self, data): "Propagate DOCTYPEs and the like right through." self.handle_data('<!%s>' % data) def parse_declaration(self, i): """Treat a bogus SGML declaration as raw data. Treat a CDATA declaration as regular data.""" j = None if self.rawdata[i:i+9] == '<![CDATA[': k = self.rawdata.find(']]>', i) if k == -1: k = len(self.rawdata) self.handle_data(self.rawdata[i+9:k]) j = k+3 else: try: j = SGMLParser.parse_declaration(self, i) except SGMLParseError: toHandle = self.rawdata[i:] self.handle_data(toHandle) j = i + len(toHandle) return j class BeautifulSoup(BeautifulStoneSoup): """This parser knows the following facts about HTML: * Some tags have no closing tag and should be interpreted as being closed as soon as they are encountered. * The text inside some tags (ie. 'script') may contain tags which are not really part of the document and which should be parsed as text, not tags. If you want to parse the text as tags, you can always fetch it and parse it explicitly. * Tag nesting rules: Most tags can't be nested at all. For instance, the occurrence of a <p> tag should implicitly close the previous <p> tag.
<p>Para1<p>Para2
should be transformed into:
<p>Para1</p><p>Para2
Some tags can be nested arbitrarily. For instance, the occurrence of a <blockquote> tag should _not_ implicitly close the previous <blockquote> tag.
Alice said: <blockquote>Bob said: <blockquote>Blah
should NOT be transformed into:
Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
Some tags can be nested, but the nesting is reset by the interposition of other tags. For instance, a <tr> tag should implicitly close the previous <tr> tag within the same <table>, but not close a <tr> tag in another table.
<table><tr>Blah<tr>Blah
should be transformed into:
<table><tr>Blah</tr><tr>Blah
but,
<tr>Blah<table><tr>Blah
should NOT be transformed into
<tr>Blah<table></tr><tr>Blah
Differing assumptions about tag nesting rules are a major source of problems with the BeautifulSoup class. If BeautifulSoup is not treating as nestable a tag your page author treats as nestable, try ICantBelieveItsBeautifulSoup before writing your own subclass.""" SELF_CLOSING_TAGS = buildTagMap(None, ['br' , 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base']) QUOTE_TAGS = {'script': None} #According to the HTML standard, each of these inline tags can #contain another tag of the same type. Furthermore, it's common #to actually use these tags this way. NESTABLE_INLINE_TAGS = ['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', 'center'] #According to the HTML standard, these block tags can contain #another tag of the same type. Furthermore, it's common #to actually use these tags this way. NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del'] #Lists can contain other lists, but there are restrictions. NESTABLE_LIST_TAGS = { 'ol' : [], 'ul' : [], 'li' : ['ul', 'ol'], 'dl' : [], 'dd' : ['dl'], 'dt' : ['dl'] } #Tables can contain other tables, but there are restrictions. NESTABLE_TABLE_TAGS = {'table' : [], 'tr' : ['table', 'tbody', 'tfoot', 'thead'], 'td' : ['tr'], 'th' : ['tr'], } NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre'] #If one of these tags is encountered, all tags up to the next tag of #this type are popped. RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript', NON_NESTABLE_BLOCK_TAGS, NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS) NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS, NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS) class ICantBelieveItsBeautifulSoup(BeautifulSoup): """The BeautifulSoup class is oriented towards skipping over common HTML errors like unclosed tags. However, sometimes it makes errors of its own. For instance, consider this fragment:
<b>Foo<b>Bar</b></b>
This is perfectly valid (if bizarre) HTML. 
However, the BeautifulSoup class will implicitly close the first b tag when it encounters the second 'b'. It will think the author wrote "<b>Foo<b>Bar", and didn't close the first 'b' tag, because there's no real-world reason to bold something that's already bold. When it encounters '</b></b>' it will close two more 'b' tags, for a grand total of three tags closed instead of two. This can throw off the rest of your document structure. The same is true of a number of other tags, listed below. It's much more common for someone to forget to close (eg.) a 'b' tag than to actually use nested 'b' tags, and the BeautifulSoup class handles the common case. This class handles the not-so-common case: where you can't believe someone wrote what they did, but it's valid HTML and BeautifulSoup screwed up by assuming it wouldn't be. If this doesn't do what you need, try subclassing this class or BeautifulSoup, and providing your own list of NESTABLE_TAGS.""" I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \ ['em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong', 'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b', 'big'] I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ['noscript'] NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS, I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS, I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS) class BeautifulSOAP(BeautifulStoneSoup): """This class will push a tag with only a single string child into the tag's parent as an attribute. The attribute's name is the tag name, and the value is the string child. An example should give the flavor of the change: <foo><bar>baz</bar></foo> => <foo bar="baz"><bar>baz</bar></foo> You can then access fooTag['bar'] instead of fooTag.barTag.string. This is, of course, useful for scraping structures that tend to use subelements instead of attributes, such as SOAP messages. Note that it modifies its input, so don't print the modified version out. I'm not sure how many people really want to use this class; let me know if you do. 
Mainly I like the name.""" def popTag(self): if len(self.tagStack) > 1: tag = self.tagStack[-1] parent = self.tagStack[-2] parent._getAttrMap() if (isinstance(tag, Tag) and len(tag.contents) == 1 and isinstance(tag.contents[0], NavigableText) and not parent.attrMap.has_key(tag.name)): parent[tag.name] = tag.contents[0] BeautifulStoneSoup.popTag(self) #Enterprise class names! It has come to our attention that some people #think the names of the Beautiful Soup parser classes are too silly #and "unprofessional" for use in enterprise screen-scraping. We feel #your pain! For such-minded folk, the Beautiful Soup Consortium And #All-Night Kosher Bakery recommends renaming this file to #"RobustParser.py" (or, in cases of extreme enterprisitude, #"RobustParserBeanInterface.class") and using the following #enterprise-friendly class aliases: class RobustXMLParser(BeautifulStoneSoup): pass class RobustHTMLParser(BeautifulSoup): pass class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup): pass class SimplifyingSOAPParser(BeautifulSOAP): pass ### #By default, act as an HTML pretty-printer. if __name__ == '__main__': import sys soup = BeautifulStoneSoup(sys.stdin.read()) print soup.prettify() mechanize-0.2.5/mechanize/_pullparser.py0000644000175000017500000003402011545150644017010 0ustar johnjohn"""A simple "pull API" for HTML parsing, after Perl's HTML::TokeParser. Examples This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> 
tags: import pullparser, sys f = file(sys.argv[1]) p = pullparser.PullParser(f) for token in p.tags("a"): if token.type == "endtag": continue url = dict(token.attrs).get("href", "-") text = p.get_compressed_text(endat=("endtag", "a")) print "%s\t%s" % (url, text) This program extracts the <TITLE> from the document: import pullparser, sys f = file(sys.argv[1]) p = pullparser.PullParser(f) if p.get_tag("title"): title = p.get_compressed_text() print "Title: %s" % title Copyright 2003-2006 John J. Lee <jjl@pobox.com> Copyright 1998-2001 Gisle Aas (original libwww-perl code) This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses. """ import re, htmlentitydefs import _sgmllib_copy as sgmllib import HTMLParser from xml.sax import saxutils from _html import unescape, unescape_charref class NoMoreTokensError(Exception): pass class Token: """Represents an HTML tag, declaration, processing instruction etc. Behaves as both a tuple-like object (ie. iterable) and has attributes .type, .data and .attrs. 
>>> t = Token("starttag", "a", [("href", "http://www.python.org/")]) >>> t == ("starttag", "a", [("href", "http://www.python.org/")]) True >>> (t.type, t.data) == ("starttag", "a") True >>> t.attrs == [("href", "http://www.python.org/")] True Public attributes type: one of "starttag", "endtag", "startendtag", "charref", "entityref", "data", "comment", "decl", "pi", after the corresponding methods of HTMLParser.HTMLParser data: For a tag, the tag name; otherwise, the relevant data carried by the tag, as a string attrs: list of (name, value) pairs representing HTML attributes (or None if token does not represent an opening tag) """ def __init__(self, type, data, attrs=None): self.type = type self.data = data self.attrs = attrs def __iter__(self): return iter((self.type, self.data, self.attrs)) def __eq__(self, other): type, data, attrs = other if (self.type == type and self.data == data and self.attrs == attrs): return True else: return False def __ne__(self, other): return not self.__eq__(other) def __repr__(self): args = ", ".join(map(repr, [self.type, self.data, self.attrs])) return self.__class__.__name__+"(%s)" % args def __str__(self): """ >>> print Token("starttag", "br") <br> >>> print Token("starttag", "a", ... 
[("href", "http://www.python.org/"), ("alt", '"foo"')]) <a href="http://www.python.org/" alt='"foo"'> >>> print Token("startendtag", "br") <br /> >>> print Token("startendtag", "br", [("spam", "eggs")]) <br spam="eggs" /> >>> print Token("endtag", "p") </p> >>> print Token("charref", "38") & >>> print Token("entityref", "amp") & >>> print Token("data", "foo\\nbar") foo bar >>> print Token("comment", "Life is a bowl\\nof cherries.") <!--Life is a bowl of cherries.--> >>> print Token("decl", "decl") <!decl> >>> print Token("pi", "pi") <?pi> """ if self.attrs is not None: attrs = "".join([" %s=%s" % (k, saxutils.quoteattr(v)) for k, v in self.attrs]) else: attrs = "" if self.type == "starttag": return "<%s%s>" % (self.data, attrs) elif self.type == "startendtag": return "<%s%s />" % (self.data, attrs) elif self.type == "endtag": return "</%s>" % self.data elif self.type == "charref": return "&#%s;" % self.data elif self.type == "entityref": return "&%s;" % self.data elif self.type == "data": return self.data elif self.type == "comment": return "<!--%s-->" % self.data elif self.type == "decl": return "<!%s>" % self.data elif self.type == "pi": return "<?%s>" % self.data assert False def iter_until_exception(fn, exception, *args, **kwds): while 1: try: yield fn(*args, **kwds) except exception: raise StopIteration class _AbstractParser: chunk = 1024 compress_re = re.compile(r"\s+") def __init__(self, fh, textify={"img": "alt", "applet": "alt"}, encoding="ascii", entitydefs=None): """ fh: file-like object (only a .read() method is required) from which to read HTML to be parsed textify: mapping used by .get_text() and .get_compressed_text() methods to represent opening tags as text encoding: encoding used to encode numeric character references by .get_text() and .get_compressed_text() ("ascii" by default) entitydefs: mapping like {"amp": "&", ...} containing HTML entity definitions (a sensible default is used). 
This is used to unescape entities in .get_text() (and .get_compressed_text()) and attribute values. If the encoding can not represent the character, the entity reference is left unescaped. Note that entity references (both numeric - e.g. &#123; or &#xabc; - and non-numeric - e.g. &amp;) are unescaped in attribute values and the return value of .get_text(), but not in data outside of tags. Instead, entity references outside of tags are represented as tokens. This is a bit odd, it's true :-/ If the element name of an opening tag matches a key in the textify mapping then that tag is converted to text. The corresponding value is used to specify which tag attribute to obtain the text from. textify maps from element names to either: - an HTML attribute name, in which case the HTML attribute value is used as its text value along with the element name in square brackets (e.g. "alt text goes here[IMG]", or, if the alt attribute were missing, just "[IMG]") - a callable object (e.g. a function) which takes a Token and returns the string to be used as its text value If textify has no key for an element name, nothing is substituted for the opening tag. Public attributes: encoding and textify: see above """ self._fh = fh self._tokenstack = [] # FIFO self.textify = textify self.encoding = encoding if entitydefs is None: entitydefs = htmlentitydefs.name2codepoint self._entitydefs = entitydefs def __iter__(self): return self def tags(self, *names): return iter_until_exception(self.get_tag, NoMoreTokensError, *names) def tokens(self, *tokentypes): return iter_until_exception(self.get_token, NoMoreTokensError, *tokentypes) def next(self): try: return self.get_token() except NoMoreTokensError: raise StopIteration() def get_token(self, *tokentypes): """Pop the next Token object from the stack of parsed tokens. If arguments are given, they are taken to be token types in which the caller is interested: tokens representing other elements will be skipped. Element names must be given in lower case. 
Raises NoMoreTokensError. """ while 1: while self._tokenstack: token = self._tokenstack.pop(0) if tokentypes: if token.type in tokentypes: return token else: return token data = self._fh.read(self.chunk) if not data: raise NoMoreTokensError() self.feed(data) def unget_token(self, token): """Push a Token back onto the stack.""" self._tokenstack.insert(0, token) def get_tag(self, *names): """Return the next Token that represents an opening or closing tag. If arguments are given, they are taken to be element names in which the caller is interested: tags representing other elements will be skipped. Element names must be given in lower case. Raises NoMoreTokensError. """ while 1: tok = self.get_token() if tok.type not in ["starttag", "endtag", "startendtag"]: continue if names: if tok.data in names: return tok else: return tok def get_text(self, endat=None): """Get some text. endat: stop reading text at this tag (the tag is included in the returned text); endtag is a tuple (type, name) where type is "starttag", "endtag" or "startendtag", and name is the element name of the tag (element names must be given in lower case) If endat is not given, .get_text() will stop at the next opening or closing tag, or when there are no more tokens (no exception is raised). Note that .get_text() includes the text representation (if any) of the opening tag, but pushes the opening tag back onto the stack. As a result, if you want to call .get_text() again, you need to call .get_tag() first (unless you want an empty string returned when you next call .get_text()). Entity references are translated using the value of the entitydefs constructor argument (a mapping from names to characters like that provided by the standard module htmlentitydefs). Named entity references that are not in this mapping are left unchanged. The textify attribute is used to translate opening tags into text: see the class docstring. 
""" text = [] tok = None while 1: try: tok = self.get_token() except NoMoreTokensError: # unget last token (not the one we just failed to get) if tok: self.unget_token(tok) break if tok.type == "data": text.append(tok.data) elif tok.type == "entityref": t = unescape("&%s;"%tok.data, self._entitydefs, self.encoding) text.append(t) elif tok.type == "charref": t = unescape_charref(tok.data, self.encoding) text.append(t) elif tok.type in ["starttag", "endtag", "startendtag"]: tag_name = tok.data if tok.type in ["starttag", "startendtag"]: alt = self.textify.get(tag_name) if alt is not None: if callable(alt): text.append(alt(tok)) elif tok.attrs is not None: for k, v in tok.attrs: if k == alt: text.append(v) text.append("[%s]" % tag_name.upper()) if endat is None or endat == (tok.type, tag_name): self.unget_token(tok) break return "".join(text) def get_compressed_text(self, *args, **kwds): """ As .get_text(), but collapses each group of contiguous whitespace to a single space character, and removes all initial and trailing whitespace. """ text = self.get_text(*args, **kwds) text = text.strip() return self.compress_re.sub(" ", text) def handle_startendtag(self, tag, attrs): self._tokenstack.append(Token("startendtag", tag, attrs)) def handle_starttag(self, tag, attrs): self._tokenstack.append(Token("starttag", tag, attrs)) def handle_endtag(self, tag): self._tokenstack.append(Token("endtag", tag)) def handle_charref(self, name): self._tokenstack.append(Token("charref", name)) def handle_entityref(self, name): self._tokenstack.append(Token("entityref", name)) def handle_data(self, data): self._tokenstack.append(Token("data", data)) def handle_comment(self, data): self._tokenstack.append(Token("comment", data)) def handle_decl(self, decl): self._tokenstack.append(Token("decl", decl)) def unknown_decl(self, data): # XXX should this call self.error instead? 
#self.error("unknown declaration: " + `data`) self._tokenstack.append(Token("decl", data)) def handle_pi(self, data): self._tokenstack.append(Token("pi", data)) def unescape_attr(self, name): return unescape(name, self._entitydefs, self.encoding) def unescape_attrs(self, attrs): escaped_attrs = [] for key, val in attrs: escaped_attrs.append((key, self.unescape_attr(val))) return escaped_attrs class PullParser(_AbstractParser, HTMLParser.HTMLParser): def __init__(self, *args, **kwds): HTMLParser.HTMLParser.__init__(self) _AbstractParser.__init__(self, *args, **kwds) def unescape(self, name): # Use the entitydefs passed into constructor, not # HTMLParser.HTMLParser's entitydefs. return self.unescape_attr(name) class TolerantPullParser(_AbstractParser, sgmllib.SGMLParser): def __init__(self, *args, **kwds): sgmllib.SGMLParser.__init__(self) _AbstractParser.__init__(self, *args, **kwds) def unknown_starttag(self, tag, attrs): attrs = self.unescape_attrs(attrs) self._tokenstack.append(Token("starttag", tag, attrs)) def unknown_endtag(self, tag): self._tokenstack.append(Token("endtag", tag)) def _test(): import doctest, _pullparser return doctest.testmod(_pullparser) if __name__ == "__main__": _test()
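The pull-parsing pattern implemented above — SAX-style callbacks push Token objects onto a FIFO, and callers pop them off on demand with get_token()/get_tag()/get_text() — can be sketched in a few lines. The following is an illustration only, written against Python 3's html.parser rather than the Python 2 modules this file targets; it mirrors the original names but omits incremental reads, entity handling, and textify.

```python
# Minimal Python 3 sketch of the _pullparser pattern: parser callbacks
# enqueue tokens, callers pull them off a FIFO on demand.
from collections import namedtuple
from html.parser import HTMLParser

Token = namedtuple("Token", "type data attrs")

class MiniPullParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self._tokenstack = []  # FIFO, as in _AbstractParser

    # parser callbacks just enqueue tokens
    def handle_starttag(self, tag, attrs):
        self._tokenstack.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self._tokenstack.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self._tokenstack.append(Token("data", data, None))

    def get_tag(self, *names):
        # skip tokens until a tag of interest turns up
        while self._tokenstack:
            tok = self._tokenstack.pop(0)
            if tok.type in ("starttag", "endtag"):
                if not names or tok.data in names:
                    return tok
        raise IndexError("no more tokens")  # stands in for NoMoreTokensError

    def get_text(self):
        # collect data tokens up to the next tag, then push the tag back
        text = []
        while self._tokenstack:
            tok = self._tokenstack.pop(0)
            if tok.type == "data":
                text.append(tok.data)
            else:
                self._tokenstack.insert(0, tok)  # unget_token()
                break
        return "".join(text)

p = MiniPullParser()
p.feed("<html><title>Example</title><p>hello</p></html>")
p.get_tag("title")
print(p.get_text())  # -> Example
```

Note the unget in get_text(): as the real docstring warns, the token that stops text collection is pushed back onto the stack, so a subsequent get_tag() call sees it again.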
mechanize-0.2.5/mechanize/_headersutil.py

"""Utility functions for HTTP header value parsing and construction. Copyright 1997-1998, Gisle Aas Copyright 2002-2006, John J. Lee This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). """ import os, re from types import StringType from types import UnicodeType STRING_TYPES = StringType, UnicodeType from _util import http2time import _rfc3986 def is_html_file_extension(url, allow_xhtml): ext = os.path.splitext(_rfc3986.urlsplit(url)[2])[1] html_exts = [".htm", ".html"] if allow_xhtml: html_exts += [".xhtml"] return ext in html_exts def is_html(ct_headers, url, allow_xhtml=False): """ ct_headers: Sequence of Content-Type headers url: Response URL """ if not ct_headers: return is_html_file_extension(url, allow_xhtml) headers = split_header_words(ct_headers) if len(headers) < 1: return is_html_file_extension(url, allow_xhtml) first_header = headers[0] first_parameter = first_header[0] ct = first_parameter[0] html_types = ["text/html"] if allow_xhtml: html_types += [ "text/xhtml", "text/xml", "application/xml", "application/xhtml+xml", ] return ct in html_types def unmatched(match): """Return unmatched part of re.Match object.""" start, end = match.span(0) return match.string[:start]+match.string[end:] token_re = re.compile(r"^\s*([^=\s;,]+)") quoted_value_re = re.compile(r"^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"") value_re = re.compile(r"^\s*=\s*([^\s;,]*)") escape_re = re.compile(r"\\(.)") def split_header_words(header_values): r"""Parse header values into a list of lists containing key,value pairs. The function knows how to deal with ",", ";" and "=" as well as quoted values after "=".
A list of space separated tokens are parsed as if they were separated by ";". If the header_values passed as argument contains multiple values, then they are treated as if they were a single value separated by comma ",". This means that this function is useful for parsing header fields that follow this syntax (BNF as from the HTTP/1.1 specification, but we relax the requirement for tokens). headers = #header header = (token | parameter) *( [";"] (token | parameter)) token = 1*<any CHAR except CTLs or separators> separators = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\" | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT quoted-string = ( <"> *(qdtext | quoted-pair ) <"> ) qdtext = <any TEXT except <">> quoted-pair = "\" CHAR parameter = attribute "=" value attribute = token value = token | quoted-string Each header is represented by a list of key/value pairs. The value for a simple token (not part of a parameter) is None. Syntactically incorrect headers will not necessarily be parsed as you would want. 
This is easier to describe with some examples: >>> split_header_words(['foo="bar"; port="80,81"; discard, bar=baz']) [[('foo', 'bar'), ('port', '80,81'), ('discard', None)], [('bar', 'baz')]] >>> split_header_words(['text/html; charset="iso-8859-1"']) [[('text/html', None), ('charset', 'iso-8859-1')]] >>> split_header_words([r'Basic realm="\"foo\bar\""']) [[('Basic', None), ('realm', '"foobar"')]] """ assert type(header_values) not in STRING_TYPES result = [] for text in header_values: orig_text = text pairs = [] while text: m = token_re.search(text) if m: text = unmatched(m) name = m.group(1) m = quoted_value_re.search(text) if m: # quoted value text = unmatched(m) value = m.group(1) value = escape_re.sub(r"\1", value) else: m = value_re.search(text) if m: # unquoted value text = unmatched(m) value = m.group(1) value = value.rstrip() else: # no value, a lone token value = None pairs.append((name, value)) elif text.lstrip().startswith(","): # concatenated headers, as per RFC 2616 section 4.2 text = text.lstrip()[1:] if pairs: result.append(pairs) pairs = [] else: # skip junk non_junk, nr_junk_chars = re.subn("^[=\s;]*", "", text) assert nr_junk_chars > 0, ( "split_header_words bug: '%s', '%s', %s" % (orig_text, text, pairs)) text = non_junk if pairs: result.append(pairs) return result join_escape_re = re.compile(r"([\"\\])") def join_header_words(lists): """Do the inverse of the conversion done by split_header_words. Takes a list of lists of (key, value) pairs and produces a single header value. Attribute values are quoted if needed. 
>>> join_header_words([[("text/plain", None), ("charset", "iso-8859/1")]]) 'text/plain; charset="iso-8859/1"' >>> join_header_words([[("text/plain", None)], [("charset", "iso-8859/1")]]) 'text/plain, charset="iso-8859/1"' """ headers = [] for pairs in lists: attr = [] for k, v in pairs: if v is not None: if not re.search(r"^\w+$", v): v = join_escape_re.sub(r"\\\1", v) # escape " and \ v = '"%s"' % v if k is None: # Netscape cookies may have no name k = v else: k = "%s=%s" % (k, v) attr.append(k) if attr: headers.append("; ".join(attr)) return ", ".join(headers) def strip_quotes(text): if text.startswith('"'): text = text[1:] if text.endswith('"'): text = text[:-1] return text def parse_ns_headers(ns_headers): """Ad-hoc parser for Netscape protocol cookie-attributes. The old Netscape cookie format for Set-Cookie can for instance contain an unquoted "," in the expires field, so we have to use this ad-hoc parser instead of split_header_words. XXX This may not make the best possible effort to parse all the crap that Netscape Cookie headers contain. Ronald Tschalar's HTTPClient parser is probably better, so could do worse than following that if this ever gives any trouble. Currently, this is also used for parsing RFC 2109 cookies. """ known_attrs = ("expires", "domain", "path", "secure", # RFC 2109 attrs (may turn up in Netscape cookies, too) "version", "port", "max-age") result = [] for ns_header in ns_headers: pairs = [] version_set = False params = re.split(r";\s*", ns_header) for ii in range(len(params)): param = params[ii] param = param.rstrip() if param == "": continue if "=" not in param: k, v = param, None else: k, v = re.split(r"\s*=\s*", param, 1) k = k.lstrip() if ii != 0: lc = k.lower() if lc in known_attrs: k = lc if k == "version": # This is an RFC 2109 cookie. 
v = strip_quotes(v) version_set = True if k == "expires": # convert expires date to seconds since epoch v = http2time(strip_quotes(v)) # None if invalid pairs.append((k, v)) if pairs: if not version_set: pairs.append(("version", "0")) result.append(pairs) return result def _test(): import doctest, _headersutil return doctest.testmod(_headersutil) if __name__ == "__main__": _test()

mechanize-0.2.5/mechanize/_auth.py

"""HTTP Authentication and Proxy support. Copyright 2006 John J. Lee <jjl@pobox.com> This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution).
""" from _urllib2_fork import HTTPPasswordMgr # TODO: stop deriving from HTTPPasswordMgr class HTTPProxyPasswordMgr(HTTPPasswordMgr): # has default realm and host/port def add_password(self, realm, uri, user, passwd): # uri could be a single URI or a sequence if uri is None or isinstance(uri, basestring): uris = [uri] else: uris = uri passwd_by_domain = self.passwd.setdefault(realm, {}) for uri in uris: for default_port in True, False: reduced_uri = self.reduce_uri(uri, default_port) passwd_by_domain[reduced_uri] = (user, passwd) def find_user_password(self, realm, authuri): attempts = [(realm, authuri), (None, authuri)] # bleh, want default realm to take precedence over default # URI/authority, hence this outer loop for default_uri in False, True: for realm, authuri in attempts: authinfo_by_domain = self.passwd.get(realm, {}) for default_port in True, False: reduced_authuri = self.reduce_uri(authuri, default_port) for uri, authinfo in authinfo_by_domain.iteritems(): if uri is None and not default_uri: continue if self.is_suburi(uri, reduced_authuri): return authinfo user, password = None, None if user is not None: break return user, password def reduce_uri(self, uri, default_port=True): if uri is None: return None return HTTPPasswordMgr.reduce_uri(self, uri, default_port) def is_suburi(self, base, test): if base is None: # default to the proxy's host/port hostport, path = test base = (hostport, "/") return HTTPPasswordMgr.is_suburi(self, base, test) class HTTPSClientCertMgr(HTTPPasswordMgr): # implementation inheritance: this is not a proper subclass def add_key_cert(self, uri, key_file, cert_file): self.add_password(None, uri, key_file, cert_file) def find_key_cert(self, authuri): return HTTPPasswordMgr.find_user_password(self, None, authuri) 
mechanize-0.2.5/mechanize/_util.py

"""Utility functions and date/time routines. Copyright 2002-2006 John J Lee <jjl@pobox.com> This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution).
""" import re import time import warnings class ExperimentalWarning(UserWarning): pass def experimental(message): warnings.warn(message, ExperimentalWarning, stacklevel=3) def hide_experimental_warnings(): warnings.filterwarnings("ignore", category=ExperimentalWarning) def reset_experimental_warnings(): warnings.filterwarnings("default", category=ExperimentalWarning) def deprecation(message): warnings.warn(message, DeprecationWarning, stacklevel=3) def hide_deprecations(): warnings.filterwarnings("ignore", category=DeprecationWarning) def reset_deprecations(): warnings.filterwarnings("default", category=DeprecationWarning) def write_file(filename, data): f = open(filename, "wb") try: f.write(data) finally: f.close() def get1(sequence): assert len(sequence) == 1 return sequence[0] def isstringlike(x): try: x+"" except: return False else: return True ## def caller(): ## try: ## raise SyntaxError ## except: ## import sys ## return sys.exc_traceback.tb_frame.f_back.f_back.f_code.co_name from calendar import timegm # Date/time conversion routines for formats used by the HTTP protocol. EPOCH = 1970 def my_timegm(tt): year, month, mday, hour, min, sec = tt[:6] if ((year >= EPOCH) and (1 <= month <= 12) and (1 <= mday <= 31) and (0 <= hour <= 24) and (0 <= min <= 59) and (0 <= sec <= 61)): return timegm(tt) else: return None days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"] months_lower = [] for month in months: months_lower.append(month.lower()) def time2isoz(t=None): """Return a string representing time in seconds since epoch, t. If the function is called without an argument, it will use the current time. The format of the returned string is like "YYYY-MM-DD hh:mm:ssZ", representing Universal Time (UTC, aka GMT). 
An example of this format is: 1994-11-24 08:49:37Z """ if t is None: t = time.time() year, mon, mday, hour, min, sec = time.gmtime(t)[:6] return "%04d-%02d-%02d %02d:%02d:%02dZ" % ( year, mon, mday, hour, min, sec) def time2netscape(t=None): """Return a string representing time in seconds since epoch, t. If the function is called without an argument, it will use the current time. The format of the returned string is like this: Wed, DD-Mon-YYYY HH:MM:SS GMT """ if t is None: t = time.time() year, mon, mday, hour, min, sec, wday = time.gmtime(t)[:7] return "%s %02d-%s-%04d %02d:%02d:%02d GMT" % ( days[wday], mday, months[mon-1], year, hour, min, sec) UTC_ZONES = {"GMT": None, "UTC": None, "UT": None, "Z": None} timezone_re = re.compile(r"^([-+])?(\d\d?):?(\d\d)?$") def offset_from_tz_string(tz): offset = None if UTC_ZONES.has_key(tz): offset = 0 else: m = timezone_re.search(tz) if m: offset = 3600 * int(m.group(2)) if m.group(3): offset = offset + 60 * int(m.group(3)) if m.group(1) == '-': offset = -offset return offset def _str2time(day, mon, yr, hr, min, sec, tz): # translate month name to number # month numbers start with 1 (January) try: mon = months_lower.index(mon.lower())+1 except ValueError: # maybe it's already a number try: imon = int(mon) except ValueError: return None if 1 <= imon <= 12: mon = imon else: return None # make sure clock elements are defined if hr is None: hr = 0 if min is None: min = 0 if sec is None: sec = 0 yr = int(yr) day = int(day) hr = int(hr) min = int(min) sec = int(sec) if yr < 1000: # find "obvious" year cur_yr = time.localtime(time.time())[0] m = cur_yr % 100 tmp = yr yr = yr + cur_yr - m m = m - tmp if abs(m) > 50: if m > 0: yr = yr + 100 else: yr = yr - 100 # convert UTC time tuple to seconds since epoch (not timezone-adjusted) t = my_timegm((yr, mon, day, hr, min, sec, tz)) if t is not None: # adjust time using timezone string, to get absolute time since epoch if tz is None: tz = "UTC" tz = tz.upper() offset = 
offset_from_tz_string(tz) if offset is None: return None t = t - offset return t strict_re = re.compile(r"^[SMTWF][a-z][a-z], (\d\d) ([JFMASOND][a-z][a-z]) " r"(\d\d\d\d) (\d\d):(\d\d):(\d\d) GMT$") wkday_re = re.compile( r"^(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)[a-z]*,?\s*", re.I) loose_http_re = re.compile( r"""^ (\d\d?) # day (?:\s+|[-\/]) (\w+) # month (?:\s+|[-\/]) (\d+) # year (?: (?:\s+|:) # separator before clock (\d\d?):(\d\d) # hour:min (?::(\d\d))? # optional seconds )? # optional clock \s* ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone \s* (?:\(\w+\))? # ASCII representation of timezone in parens. \s*$""", re.X) def http2time(text): """Returns time in seconds since epoch of time represented by a string. Return value is an integer. None is returned if the format of str is unrecognized, the time is outside the representable range, or the timezone string is not recognized. If the string contains no timezone, UTC is assumed. The timezone in the string may be numerical (like "-0800" or "+0100") or a string timezone (like "UTC", "GMT", "BST" or "EST"). Currently, only the timezone strings equivalent to UTC (zero offset) are known to the function. The function loosely parses the following formats: Wed, 09 Feb 1994 22:23:32 GMT -- HTTP format Tuesday, 08-Feb-94 14:15:29 GMT -- old rfc850 HTTP format Tuesday, 08-Feb-1994 14:15:29 GMT -- broken rfc850 HTTP format 09 Feb 1994 22:23:32 GMT -- HTTP format (no weekday) 08-Feb-94 14:15:29 GMT -- rfc850 format (no weekday) 08-Feb-1994 14:15:29 GMT -- broken rfc850 format (no weekday) The parser ignores leading and trailing whitespace. The time may be absent. If the year is given with only 2 digits, the function will select the century that makes the year closest to the current date. 
""" # fast exit for strictly conforming string m = strict_re.search(text) if m: g = m.groups() mon = months_lower.index(g[1].lower()) + 1 tt = (int(g[2]), mon, int(g[0]), int(g[3]), int(g[4]), float(g[5])) return my_timegm(tt) # No, we need some messy parsing... # clean up text = text.lstrip() text = wkday_re.sub("", text, 1) # Useless weekday # tz is time zone specifier string day, mon, yr, hr, min, sec, tz = [None]*7 # loose regexp parse m = loose_http_re.search(text) if m is not None: day, mon, yr, hr, min, sec, tz = m.groups() else: return None # bad format return _str2time(day, mon, yr, hr, min, sec, tz) iso_re = re.compile( """^ (\d{4}) # year [-\/]? (\d\d?) # numerical month [-\/]? (\d\d?) # day (?: (?:\s+|[-:Tt]) # separator before clock (\d\d?):?(\d\d) # hour:min (?::?(\d\d(?:\.\d*)?))? # optional seconds (and fractional) )? # optional clock \s* ([-+]?\d\d?:?(:?\d\d)? |Z|z)? # timezone (Z is "zero meridian", i.e. GMT) \s*$""", re.X) def iso2time(text): """ As for http2time, but parses the ISO 8601 formats: 1994-02-03 14:15:29 -0100 -- ISO 8601 format 1994-02-03 14:15:29 -- zone is optional 1994-02-03 -- only date 1994-02-03T14:15:29 -- Use T as separator 19940203T141529Z -- ISO 8601 compact format 19940203 -- only date """ # clean up text = text.lstrip() # tz is time zone specifier string day, mon, yr, hr, min, sec, tz = [None]*7 # loose regexp parse m = iso_re.search(text) if m is not None: # XXX there's an extra bit of the timezone I'm ignoring here: is # this the right thing to do? 
yr, mon, day, hr, min, sec, tz, _ = m.groups() else: return None # bad format return _str2time(day, mon, yr, hr, min, sec, tz)

mechanize-0.2.5/mechanize/_urllib2_fork.py

"""Fork of urllib2. When reading this, don't assume that all code in here is reachable. Code in the rest of mechanize may be used instead. Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009 Python Software Foundation; All Rights Reserved Copyright 2002-2009 John J Lee <jjl@pobox.com> This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). """ # XXX issues: # If an authentication error handler that tries to perform # authentication for some reason but fails, how should the error be # signalled? The client needs to know the HTTP error code. But if # the handler knows that the problem was, e.g., that it didn't know # the hash algorithm requested in the challenge, it would be good to # pass that information along to the client, too. # ftp errors aren't handled cleanly # check digest against correct (i.e.
non-apache) implementation # Possible extensions: # complex proxies XXX not sure what exactly was meant by this # abstract factory for opener import copy import base64 import httplib import mimetools import logging import os import posixpath import random import re import socket import sys import time import urllib import urlparse import bisect try: from cStringIO import StringIO except ImportError: from StringIO import StringIO try: import hashlib except ImportError: # python 2.4 import md5 import sha def sha1_digest(bytes): return sha.new(bytes).hexdigest() def md5_digest(bytes): return md5.new(bytes).hexdigest() else: def sha1_digest(bytes): return hashlib.sha1(bytes).hexdigest() def md5_digest(bytes): return hashlib.md5(bytes).hexdigest() try: socket._fileobject("fake socket", close=True) except TypeError: # python <= 2.4 create_readline_wrapper = socket._fileobject else: def create_readline_wrapper(fh): return socket._fileobject(fh, close=True) # python 2.4 splithost has a bug in empty path component case _hostprog = None def splithost(url): """splithost('//host[:port]/path') --> 'host[:port]', '/path'.""" global _hostprog if _hostprog is None: import re _hostprog = re.compile('^//([^/?]*)(.*)$') match = _hostprog.match(url) if match: return match.group(1, 2) return None, url from urllib import (unwrap, unquote, splittype, quote, addinfourl, splitport, splitattr, ftpwrapper, splituser, splitpasswd, splitvalue) # support for FileHandler, proxies via environment variables from urllib import localhost, url2pathname, getproxies from urllib2 import HTTPError, URLError import _request import _rfc3986 import _sockettimeout from _clientcookie import CookieJar from _response import closeable_response # used in User-Agent header sent __version__ = sys.version[:3] _opener = None def urlopen(url, data=None, timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT): global _opener if _opener is None: _opener = build_opener() return _opener.open(url, data, timeout) def 
install_opener(opener): global _opener _opener = opener # copied from cookielib.py _cut_port_re = re.compile(r":\d+$") def request_host(request): """Return request-host, as defined by RFC 2965. Variation from RFC: returned value is lowercased, for convenient comparison. """ url = request.get_full_url() host = urlparse.urlparse(url)[1] if host == "": host = request.get_header("Host", "") # remove port, if present host = _cut_port_re.sub("", host, 1) return host.lower() class Request: def __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False): # unwrap('<URL:type://host/path>') --> 'type://host/path' self.__original = unwrap(url) self.type = None # self.__r_type is what's left after doing the splittype self.host = None self.port = None self._tunnel_host = None self.data = data self.headers = {} for key, value in headers.items(): self.add_header(key, value) self.unredirected_hdrs = {} if origin_req_host is None: origin_req_host = request_host(self) self.origin_req_host = origin_req_host self.unverifiable = unverifiable def __getattr__(self, attr): # XXX this is a fallback mechanism to guard against these # methods getting called in a non-standard order. this may be # too complicated and/or unnecessary. # XXX should the __r_XXX attributes be public? 
if attr[:12] == '_Request__r_': name = attr[12:] if hasattr(Request, 'get_' + name): getattr(self, 'get_' + name)() return getattr(self, attr) raise AttributeError, attr def get_method(self): if self.has_data(): return "POST" else: return "GET" # XXX these helper methods are lame def add_data(self, data): self.data = data def has_data(self): return self.data is not None def get_data(self): return self.data def get_full_url(self): return self.__original def get_type(self): if self.type is None: self.type, self.__r_type = splittype(self.__original) if self.type is None: raise ValueError, "unknown url type: %s" % self.__original return self.type def get_host(self): if self.host is None: self.host, self.__r_host = splithost(self.__r_type) if self.host: self.host = unquote(self.host) return self.host def get_selector(self): scheme, authority, path, query, fragment = _rfc3986.urlsplit( self.__r_host) if path == "": path = "/" # RFC 2616, section 3.2.2 fragment = None # RFC 3986, section 3.5 return _rfc3986.urlunsplit([scheme, authority, path, query, fragment]) def set_proxy(self, host, type): orig_host = self.get_host() if self.get_type() == 'https' and not self._tunnel_host: self._tunnel_host = orig_host else: self.type = type self.__r_host = self.__original self.host = host def has_proxy(self): """Private method.""" # has non-HTTPS proxy return self.__r_host == self.__original def get_origin_req_host(self): return self.origin_req_host def is_unverifiable(self): return self.unverifiable def add_header(self, key, val): # useful for something like authentication self.headers[key.capitalize()] = val def add_unredirected_header(self, key, val): # will not be added to a redirected request self.unredirected_hdrs[key.capitalize()] = val def has_header(self, header_name): return (header_name in self.headers or header_name in self.unredirected_hdrs) def get_header(self, header_name, default=None): return self.headers.get( header_name, self.unredirected_hdrs.get(header_name, 
default)) def header_items(self): hdrs = self.unredirected_hdrs.copy() hdrs.update(self.headers) return hdrs.items() class OpenerDirector: def __init__(self): client_version = "Python-urllib/%s" % __version__ self.addheaders = [('User-agent', client_version)] # manage the individual handlers self.handlers = [] self.handle_open = {} self.handle_error = {} self.process_response = {} self.process_request = {} def add_handler(self, handler): if not hasattr(handler, "add_parent"): raise TypeError("expected BaseHandler instance, got %r" % type(handler)) added = False for meth in dir(handler): if meth in ["redirect_request", "do_open", "proxy_open"]: # oops, coincidental match continue i = meth.find("_") protocol = meth[:i] condition = meth[i+1:] if condition.startswith("error"): j = condition.find("_") + i + 1 kind = meth[j+1:] try: kind = int(kind) except ValueError: pass lookup = self.handle_error.get(protocol, {}) self.handle_error[protocol] = lookup elif condition == "open": kind = protocol lookup = self.handle_open elif condition == "response": kind = protocol lookup = self.process_response elif condition == "request": kind = protocol lookup = self.process_request else: continue handlers = lookup.setdefault(kind, []) if handlers: bisect.insort(handlers, handler) else: handlers.append(handler) added = True if added: # the handlers must work in a specific order, the order # is specified in a Handler attribute bisect.insort(self.handlers, handler) handler.add_parent(self) def close(self): # Only exists for backwards compatibility. pass def _call_chain(self, chain, kind, meth_name, *args): # Handlers raise an exception if no one else should try to handle # the request, or return None if they can't but another handler # could. Otherwise, they return the response.
handlers = chain.get(kind, ()) for handler in handlers: func = getattr(handler, meth_name) result = func(*args) if result is not None: return result def _open(self, req, data=None): result = self._call_chain(self.handle_open, 'default', 'default_open', req) if result: return result protocol = req.get_type() result = self._call_chain(self.handle_open, protocol, protocol + '_open', req) if result: return result return self._call_chain(self.handle_open, 'unknown', 'unknown_open', req) def error(self, proto, *args): if proto in ('http', 'https'): # XXX http[s] protocols are special-cased dict = self.handle_error['http'] # https is not different than http proto = args[2] # YUCK! meth_name = 'http_error_%s' % proto http_err = 1 orig_args = args else: dict = self.handle_error meth_name = proto + '_error' http_err = 0 args = (dict, proto, meth_name) + args result = self._call_chain(*args) if result: return result if http_err: args = (dict, 'default', 'http_error_default') + orig_args return self._call_chain(*args) # XXX probably also want an abstract factory that knows when it makes # sense to skip a superclass in favor of a subclass and when it might # make sense to include both def build_opener(*handlers): """Create an opener object from a list of handlers. The opener will use several default handlers, including support for HTTP, FTP and when applicable, HTTPS. If any of the handlers passed as arguments are subclasses of the default handlers, the default handlers will not be used. 
""" import types def isclass(obj): return isinstance(obj, (types.ClassType, type)) opener = OpenerDirector() default_classes = [ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor] if hasattr(httplib, 'HTTPS'): default_classes.append(HTTPSHandler) skip = set() for klass in default_classes: for check in handlers: if isclass(check): if issubclass(check, klass): skip.add(klass) elif isinstance(check, klass): skip.add(klass) for klass in skip: default_classes.remove(klass) for klass in default_classes: opener.add_handler(klass()) for h in handlers: if isclass(h): h = h() opener.add_handler(h) return opener class BaseHandler: handler_order = 500 def add_parent(self, parent): self.parent = parent def close(self): # Only exists for backwards compatibility pass def __lt__(self, other): if not hasattr(other, "handler_order"): # Try to preserve the old behavior of having custom classes # inserted after default ones (works only for custom user # classes which are not aware of handler_order). return True return self.handler_order < other.handler_order class HTTPErrorProcessor(BaseHandler): """Process HTTP error responses. The purpose of this handler is to to allow other response processors a look-in by removing the call to parent.error() from AbstractHTTPHandler. For non-2xx error codes, this just passes the job on to the Handler.<proto>_error_<code> methods, via the OpenerDirector.error method. Eventually, HTTPDefaultErrorHandler will raise an HTTPError if no other handler handles the error. """ handler_order = 1000 # after all other processors def http_response(self, request, response): code, msg, hdrs = response.code, response.msg, response.info() # According to RFC 2616, "2xx" code indicates that the client's # request was successfully received, understood, and accepted. 
if not (200 <= code < 300): # hardcoded http is NOT a bug response = self.parent.error( 'http', request, response, code, msg, hdrs) return response https_response = http_response class HTTPDefaultErrorHandler(BaseHandler): def http_error_default(self, req, fp, code, msg, hdrs): # why these error methods took the code, msg, headers args in the first # place rather than a response object, I don't know, but to avoid # multiple wrapping, we're discarding them if isinstance(fp, HTTPError): response = fp else: response = HTTPError( req.get_full_url(), code, msg, hdrs, fp) assert code == response.code assert msg == response.msg assert hdrs == response.hdrs raise response class HTTPRedirectHandler(BaseHandler): # maximum number of redirections to any single URL # this is needed because of the state that cookies introduce max_repeats = 4 # maximum total number of redirections (regardless of URL) before # assuming we're in a loop max_redirections = 10 # Implementation notes: # To avoid the server sending us into an infinite loop, the request # object needs to track what URLs we have already seen. Do this by # adding a handler-specific attribute to the Request object. The value # of the dict is used to count the number of times the same URL has # been visited. This is needed because visiting the same URL twice # does not necessarily imply a loop, thanks to state introduced by # cookies. # Always unhandled redirection codes: # 300 Multiple Choices: should not handle this here. # 304 Not Modified: no need to handle here: only of interest to caches # that do conditional GETs # 305 Use Proxy: probably not worth dealing with here # 306 Unused: what was this for in the previous versions of protocol?? def redirect_request(self, req, fp, code, msg, headers, newurl): """Return a Request or None in response to a redirect. This is called by the http_error_30x methods when a redirection response is received. 
If a redirection should take place, return a new Request to allow http_error_30x to perform the redirect. Otherwise, raise HTTPError if no-one else should try to handle this url. Return None if you can't but another Handler might. """ m = req.get_method() if (code in (301, 302, 303, 307, "refresh") and m in ("GET", "HEAD") or code in (301, 302, 303, "refresh") and m == "POST"): # Strictly (according to RFC 2616), 301 or 302 in response # to a POST MUST NOT cause a redirection without confirmation # from the user (of urllib2, in this case). In practice, # essentially all clients do redirect in this case, so we do # the same. # TODO: really refresh redirections should be visiting; tricky to fix new = _request.Request( newurl, headers=req.headers, origin_req_host=req.get_origin_req_host(), unverifiable=True, visit=False, timeout=req.timeout) new._origin_req = getattr(req, "_origin_req", req) return new else: raise HTTPError(req.get_full_url(), code, msg, headers, fp) def http_error_302(self, req, fp, code, msg, headers): # Some servers (incorrectly) return multiple Location headers # (so probably same goes for URI). Use first header. if 'location' in headers: newurl = headers.getheaders('location')[0] elif 'uri' in headers: newurl = headers.getheaders('uri')[0] else: return newurl = _rfc3986.clean_url(newurl, "latin-1") newurl = _rfc3986.urljoin(req.get_full_url(), newurl) # XXX Probably want to forget about the state of the current # request, although that might interact poorly with other # handlers that also use handler-specific request attributes new = self.redirect_request(req, fp, code, msg, headers, newurl) if new is None: return # loop detection # .redirect_dict has a key url if url was previously visited. 
if hasattr(req, 'redirect_dict'): visited = new.redirect_dict = req.redirect_dict if (visited.get(newurl, 0) >= self.max_repeats or len(visited) >= self.max_redirections): raise HTTPError(req.get_full_url(), code, self.inf_msg + msg, headers, fp) else: visited = new.redirect_dict = req.redirect_dict = {} visited[newurl] = visited.get(newurl, 0) + 1 # Don't close the fp until we are sure that we won't use it # with HTTPError. fp.read() fp.close() return self.parent.open(new) http_error_301 = http_error_303 = http_error_307 = http_error_302 http_error_refresh = http_error_302 inf_msg = "The HTTP server returned a redirect error that would " \ "lead to an infinite loop.\n" \ "The last 30x error message was:\n" def _parse_proxy(proxy): """Return (scheme, user, password, host/port) given a URL or an authority. If a URL is supplied, it must have an authority (host:port) component. According to RFC 3986, having an authority component means the URL must have two slashes after the scheme: >>> _parse_proxy('file:/ftp.example.com/') Traceback (most recent call last): ValueError: proxy URL with no authority: 'file:/ftp.example.com/' The first three items of the returned tuple may be None. 
Examples of authority parsing: >>> _parse_proxy('proxy.example.com') (None, None, None, 'proxy.example.com') >>> _parse_proxy('proxy.example.com:3128') (None, None, None, 'proxy.example.com:3128') The authority component may optionally include userinfo (assumed to be username:password): >>> _parse_proxy('joe:password@proxy.example.com') (None, 'joe', 'password', 'proxy.example.com') >>> _parse_proxy('joe:password@proxy.example.com:3128') (None, 'joe', 'password', 'proxy.example.com:3128') Same examples, but with URLs instead: >>> _parse_proxy('http://proxy.example.com/') ('http', None, None, 'proxy.example.com') >>> _parse_proxy('http://proxy.example.com:3128/') ('http', None, None, 'proxy.example.com:3128') >>> _parse_proxy('http://joe:password@proxy.example.com/') ('http', 'joe', 'password', 'proxy.example.com') >>> _parse_proxy('http://joe:password@proxy.example.com:3128') ('http', 'joe', 'password', 'proxy.example.com:3128') Everything after the authority is ignored: >>> _parse_proxy('ftp://joe:password@proxy.example.com/rubbish:3128') ('ftp', 'joe', 'password', 'proxy.example.com') Test for no trailing '/' case: >>> _parse_proxy('http://joe:password@proxy.example.com') ('http', 'joe', 'password', 'proxy.example.com') """ scheme, r_scheme = splittype(proxy) if not r_scheme.startswith("/"): # authority scheme = None authority = proxy else: # URL if not r_scheme.startswith("//"): raise ValueError("proxy URL with no authority: %r" % proxy) # We have an authority, so for RFC 3986-compliant URLs (by ss 3. 
# and 3.3.), path is empty or starts with '/' end = r_scheme.find("/", 2) if end == -1: end = None authority = r_scheme[2:end] userinfo, hostport = splituser(authority) if userinfo is not None: user, password = splitpasswd(userinfo) else: user = password = None return scheme, user, password, hostport class ProxyHandler(BaseHandler): # Proxies must be in front handler_order = 100 def __init__(self, proxies=None, proxy_bypass=None): if proxies is None: proxies = getproxies() assert hasattr(proxies, 'has_key'), "proxies must be a mapping" self.proxies = proxies for type, url in proxies.items(): setattr(self, '%s_open' % type, lambda r, proxy=url, type=type, meth=self.proxy_open: \ meth(r, proxy, type)) if proxy_bypass is None: proxy_bypass = urllib.proxy_bypass self._proxy_bypass = proxy_bypass def proxy_open(self, req, proxy, type): orig_type = req.get_type() proxy_type, user, password, hostport = _parse_proxy(proxy) if proxy_type is None: proxy_type = orig_type if req.get_host() and self._proxy_bypass(req.get_host()): return None if user and password: user_pass = '%s:%s' % (unquote(user), unquote(password)) creds = base64.b64encode(user_pass).strip() req.add_header('Proxy-authorization', 'Basic ' + creds) hostport = unquote(hostport) req.set_proxy(hostport, proxy_type) if orig_type == proxy_type or orig_type == 'https': # let other handlers take care of it return None else: # need to start over, because the other handlers don't # grok the proxy's URL type # e.g. 
if we have a constructor arg proxies like so: # {'http': 'ftp://proxy.example.com'}, we may end up turning # a request for http://acme.example.com/a into one for # ftp://proxy.example.com/a return self.parent.open(req) class HTTPPasswordMgr: def __init__(self): self.passwd = {} def add_password(self, realm, uri, user, passwd): # uri could be a single URI or a sequence if isinstance(uri, basestring): uri = [uri] if not realm in self.passwd: self.passwd[realm] = {} for default_port in True, False: reduced_uri = tuple( [self.reduce_uri(u, default_port) for u in uri]) self.passwd[realm][reduced_uri] = (user, passwd) def find_user_password(self, realm, authuri): domains = self.passwd.get(realm, {}) for default_port in True, False: reduced_authuri = self.reduce_uri(authuri, default_port) for uris, authinfo in domains.iteritems(): for uri in uris: if self.is_suburi(uri, reduced_authuri): return authinfo return None, None def reduce_uri(self, uri, default_port=True): """Accept authority or URI and extract only the authority and path.""" # note HTTP URLs do not have a userinfo component parts = urlparse.urlsplit(uri) if parts[1]: # URI scheme = parts[0] authority = parts[1] path = parts[2] or '/' else: # host or host:port scheme = None authority = uri path = '/' host, port = splitport(authority) if default_port and port is None and scheme is not None: dport = {"http": 80, "https": 443, }.get(scheme) if dport is not None: authority = "%s:%d" % (host, dport) return authority, path def is_suburi(self, base, test): """Check if test is below base in a URI tree Both args must be URIs in reduced form. 
""" if base == test: return True if base[0] != test[0]: return False common = posixpath.commonprefix((base[1], test[1])) if len(common) == len(base[1]): return True return False class HTTPPasswordMgrWithDefaultRealm(HTTPPasswordMgr): def find_user_password(self, realm, authuri): user, password = HTTPPasswordMgr.find_user_password(self, realm, authuri) if user is not None: return user, password return HTTPPasswordMgr.find_user_password(self, None, authuri) class AbstractBasicAuthHandler: # XXX this allows for multiple auth-schemes, but will stupidly pick # the last one with a realm specified. # allow for double- and single-quoted realm values # (single quotes are a violation of the RFC, but appear in the wild) rx = re.compile('(?:.*,)*[ \t]*([^ \t]+)[ \t]+' 'realm=(["\'])(.*?)\\2', re.I) # XXX could pre-emptively send auth info already accepted (RFC 2617, # end of section 2, and section 1.2 immediately after "credentials" # production). def __init__(self, password_mgr=None): if password_mgr is None: password_mgr = HTTPPasswordMgr() self.passwd = password_mgr self.add_password = self.passwd.add_password def http_error_auth_reqed(self, authreq, host, req, headers): # host may be an authority (without userinfo) or a URL with an # authority # XXX could be multiple headers authreq = headers.get(authreq, None) if authreq: mo = AbstractBasicAuthHandler.rx.search(authreq) if mo: scheme, quote, realm = mo.groups() if scheme.lower() == 'basic': return self.retry_http_basic_auth(host, req, realm) def retry_http_basic_auth(self, host, req, realm): user, pw = self.passwd.find_user_password(realm, host) if pw is not None: raw = "%s:%s" % (user, pw) auth = 'Basic %s' % base64.b64encode(raw).strip() if req.headers.get(self.auth_header, None) == auth: return None newreq = copy.copy(req) newreq.add_header(self.auth_header, auth) newreq.visit = False return self.parent.open(newreq) else: return None class HTTPBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler): auth_header = 
'Authorization' def http_error_401(self, req, fp, code, msg, headers): url = req.get_full_url() return self.http_error_auth_reqed('www-authenticate', url, req, headers) class ProxyBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler): auth_header = 'Proxy-authorization' def http_error_407(self, req, fp, code, msg, headers): # http_error_auth_reqed requires that there is no userinfo component in # authority. Assume there isn't one, since urllib2 does not (and # should not, RFC 3986 s. 3.2.1) support requests for URLs containing # userinfo. authority = req.get_host() return self.http_error_auth_reqed('proxy-authenticate', authority, req, headers) def randombytes(n): """Return n random bytes.""" # Use /dev/urandom if it is available. Fall back to random module # if not. It might be worthwhile to extend this function to use # other platform-specific mechanisms for getting random bytes. if os.path.exists("/dev/urandom"): f = open("/dev/urandom") s = f.read(n) f.close() return s else: L = [chr(random.randrange(0, 256)) for i in range(n)] return "".join(L) class AbstractDigestAuthHandler: # Digest authentication is specified in RFC 2617. # XXX The client does not inspect the Authentication-Info header # in a successful response. # XXX It should be possible to test this implementation against # a mock server that just generates a static set of challenges. # XXX qop="auth-int" support is shaky def __init__(self, passwd=None): if passwd is None: passwd = HTTPPasswordMgr() self.passwd = passwd self.add_password = self.passwd.add_password self.retried = 0 self.nonce_count = 0 self.last_nonce = None def reset_retry_count(self): self.retried = 0 def http_error_auth_reqed(self, auth_header, host, req, headers): authreq = headers.get(auth_header, None) if self.retried > 5: # Don't fail endlessly - if we failed once, we'll probably # fail a second time. Hm. Unless the Password Manager is # prompting for the information. Crap. 
This isn't great # but it's better than the current 'repeat until recursion # depth exceeded' approach <wink> raise HTTPError(req.get_full_url(), 401, "digest auth failed", headers, None) else: self.retried += 1 if authreq: scheme = authreq.split()[0] if scheme.lower() == 'digest': return self.retry_http_digest_auth(req, authreq) def retry_http_digest_auth(self, req, auth): token, challenge = auth.split(' ', 1) chal = parse_keqv_list(parse_http_list(challenge)) auth = self.get_authorization(req, chal) if auth: auth_val = 'Digest %s' % auth if req.headers.get(self.auth_header, None) == auth_val: return None newreq = copy.copy(req) newreq.add_unredirected_header(self.auth_header, auth_val) newreq.visit = False return self.parent.open(newreq) def get_cnonce(self, nonce): # The cnonce-value is an opaque # quoted string value provided by the client and used by both client # and server to avoid chosen plaintext attacks, to provide mutual # authentication, and to provide some message integrity protection. # This isn't a fabulous effort, but it's probably Good Enough. 
dig = sha1_digest("%s:%s:%s:%s" % (self.nonce_count, nonce, time.ctime(), randombytes(8))) return dig[:16] def get_authorization(self, req, chal): try: realm = chal['realm'] nonce = chal['nonce'] qop = chal.get('qop') algorithm = chal.get('algorithm', 'MD5') # mod_digest doesn't send an opaque, even though it isn't # supposed to be optional opaque = chal.get('opaque', None) except KeyError: return None H, KD = self.get_algorithm_impls(algorithm) if H is None: return None user, pw = self.passwd.find_user_password(realm, req.get_full_url()) if user is None: return None # XXX not implemented yet if req.has_data(): entdig = self.get_entity_digest(req.get_data(), chal) else: entdig = None A1 = "%s:%s:%s" % (user, realm, pw) A2 = "%s:%s" % (req.get_method(), # XXX selector: what about proxies and full urls req.get_selector()) if qop == 'auth': if nonce == self.last_nonce: self.nonce_count += 1 else: self.nonce_count = 1 self.last_nonce = nonce ncvalue = '%08x' % self.nonce_count cnonce = self.get_cnonce(nonce) noncebit = "%s:%s:%s:%s:%s" % (nonce, ncvalue, cnonce, qop, H(A2)) respdig = KD(H(A1), noncebit) elif qop is None: respdig = KD(H(A1), "%s:%s" % (nonce, H(A2))) else: # XXX handle auth-int. logger = logging.getLogger("mechanize.auth") logger.info("digest auth auth-int qop is not supported, not " "handling digest authentication") return None # XXX should the partial digests be encoded too? 
base = 'username="%s", realm="%s", nonce="%s", uri="%s", ' \ 'response="%s"' % (user, realm, nonce, req.get_selector(), respdig) if opaque: base += ', opaque="%s"' % opaque if entdig: base += ', digest="%s"' % entdig base += ', algorithm="%s"' % algorithm if qop: base += ', qop=auth, nc=%s, cnonce="%s"' % (ncvalue, cnonce) return base def get_algorithm_impls(self, algorithm): # algorithm should be case-insensitive according to RFC2617 algorithm = algorithm.upper() if algorithm == 'MD5': H = md5_digest elif algorithm == 'SHA': H = sha1_digest # XXX MD5-sess KD = lambda s, d: H("%s:%s" % (s, d)) return H, KD def get_entity_digest(self, data, chal): # XXX not implemented yet return None class HTTPDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler): """An authentication protocol defined by RFC 2069 Digest authentication improves on basic authentication because it does not transmit passwords in the clear. """ auth_header = 'Authorization' handler_order = 490 # before Basic auth def http_error_401(self, req, fp, code, msg, headers): host = urlparse.urlparse(req.get_full_url())[1] retry = self.http_error_auth_reqed('www-authenticate', host, req, headers) self.reset_retry_count() return retry class ProxyDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler): auth_header = 'Proxy-Authorization' handler_order = 490 # before Basic auth def http_error_407(self, req, fp, code, msg, headers): host = req.get_host() retry = self.http_error_auth_reqed('proxy-authenticate', host, req, headers) self.reset_retry_count() return retry class AbstractHTTPHandler(BaseHandler): def __init__(self, debuglevel=0): self._debuglevel = debuglevel def set_http_debuglevel(self, level): self._debuglevel = level def do_request_(self, request): host = request.get_host() if not host: raise URLError('no host given') if request.has_data(): # POST data = request.get_data() if not request.has_header('Content-type'): request.add_unredirected_header( 'Content-type', 
'application/x-www-form-urlencoded') if not request.has_header('Content-length'): request.add_unredirected_header( 'Content-length', '%d' % len(data)) sel_host = host if request.has_proxy(): scheme, sel = splittype(request.get_selector()) sel_host, sel_path = splithost(sel) if not request.has_header('Host'): request.add_unredirected_header('Host', sel_host) for name, value in self.parent.addheaders: name = name.capitalize() if not request.has_header(name): request.add_unredirected_header(name, value) return request def do_open(self, http_class, req): """Return an addinfourl object for the request, using http_class. http_class must implement the HTTPConnection API from httplib. The addinfourl return value is a file-like object. It also has methods and attributes including: - info(): return a mimetools.Message object for the headers - geturl(): return the original request URL - code: HTTP status code """ host_port = req.get_host() if not host_port: raise URLError('no host given') try: h = http_class(host_port, timeout=req.timeout) except TypeError: # Python < 2.6, no per-connection timeout support h = http_class(host_port) h.set_debuglevel(self._debuglevel) headers = dict(req.headers) headers.update(req.unredirected_hdrs) # We want to make an HTTP/1.1 request, but the addinfourl # class isn't prepared to deal with a persistent connection. # It will try to read all remaining data from the socket, # which will block while the server waits for the next request. # So make sure the connection gets closed after the (only) # request. 
headers["Connection"] = "close" headers = dict( (name.title(), val) for name, val in headers.items()) if req._tunnel_host: if not hasattr(h, "set_tunnel"): if not hasattr(h, "_set_tunnel"): raise URLError("HTTPS through proxy not supported " "(Python >= 2.6.4 required)") else: # python 2.6 set_tunnel = h._set_tunnel else: set_tunnel = h.set_tunnel set_tunnel(req._tunnel_host) try: h.request(req.get_method(), req.get_selector(), req.data, headers) r = h.getresponse() except socket.error, err: # XXX what error? raise URLError(err) # Pick apart the HTTPResponse object to get the addinfourl # object initialized properly. # Wrap the HTTPResponse object in socket's file object adapter # for Windows. That adapter calls recv(), so delegate recv() # to read(). This weird wrapping allows the returned object to # have readline() and readlines() methods. # XXX It might be better to extract the read buffering code # out of socket._fileobject() and into a base class. r.recv = r.read fp = create_readline_wrapper(r) resp = closeable_response(fp, r.msg, req.get_full_url(), r.status, r.reason) return resp class HTTPHandler(AbstractHTTPHandler): def http_open(self, req): return self.do_open(httplib.HTTPConnection, req) http_request = AbstractHTTPHandler.do_request_ if hasattr(httplib, 'HTTPS'): class HTTPSConnectionFactory: def __init__(self, key_file, cert_file): self._key_file = key_file self._cert_file = cert_file def __call__(self, hostport): return httplib.HTTPSConnection( hostport, key_file=self._key_file, cert_file=self._cert_file) class HTTPSHandler(AbstractHTTPHandler): def __init__(self, client_cert_manager=None): AbstractHTTPHandler.__init__(self) self.client_cert_manager = client_cert_manager def https_open(self, req): if self.client_cert_manager is not None: key_file, cert_file = self.client_cert_manager.find_key_cert( req.get_full_url()) conn_factory = HTTPSConnectionFactory(key_file, cert_file) else: conn_factory = httplib.HTTPSConnection return 
self.do_open(conn_factory, req) https_request = AbstractHTTPHandler.do_request_ class HTTPCookieProcessor(BaseHandler): """Handle HTTP cookies. Public attributes: cookiejar: CookieJar instance """ def __init__(self, cookiejar=None): if cookiejar is None: cookiejar = CookieJar() self.cookiejar = cookiejar def http_request(self, request): self.cookiejar.add_cookie_header(request) return request def http_response(self, request, response): self.cookiejar.extract_cookies(response, request) return response https_request = http_request https_response = http_response class UnknownHandler(BaseHandler): def unknown_open(self, req): type = req.get_type() raise URLError('unknown url type: %s' % type) def parse_keqv_list(l): """Parse list of key=value strings where keys are not duplicated.""" parsed = {} for elt in l: k, v = elt.split('=', 1) if v[0] == '"' and v[-1] == '"': v = v[1:-1] parsed[k] = v return parsed def parse_http_list(s): """Parse lists as described by RFC 2068 Section 2. In particular, parse comma-separated lists where the elements of the list may include quoted-strings. A quoted-string could contain a comma. A non-quoted string could have quotes in the middle. Neither commas nor quotes count if they are escaped. Only double-quotes count, not single-quotes. 
""" res = [] part = '' escape = quote = False for cur in s: if escape: part += cur escape = False continue if quote: if cur == '\\': escape = True continue elif cur == '"': quote = False part += cur continue if cur == ',': res.append(part) part = '' continue if cur == '"': quote = True part += cur # append last part if part: res.append(part) return [part.strip() for part in res] class FileHandler(BaseHandler): # Use local file or FTP depending on form of URL def file_open(self, req): url = req.get_selector() if url[:2] == '//' and url[2:3] != '/': req.type = 'ftp' return self.parent.open(req) else: return self.open_local_file(req) # names for the localhost names = None def get_names(self): if FileHandler.names is None: try: FileHandler.names = (socket.gethostbyname('localhost'), socket.gethostbyname(socket.gethostname())) except socket.gaierror: FileHandler.names = (socket.gethostbyname('localhost'),) return FileHandler.names # not entirely sure what the rules are here def open_local_file(self, req): try: import email.utils as emailutils except ImportError: # python 2.4 import email.Utils as emailutils import mimetypes host = req.get_host() file = req.get_selector() localfile = url2pathname(file) try: stats = os.stat(localfile) size = stats.st_size modified = emailutils.formatdate(stats.st_mtime, usegmt=True) mtype = mimetypes.guess_type(file)[0] headers = mimetools.Message(StringIO( 'Content-type: %s\nContent-length: %d\nLast-modified: %s\n' % (mtype or 'text/plain', size, modified))) if host: host, port = splitport(host) if not host or \ (not port and socket.gethostbyname(host) in self.get_names()): return addinfourl(open(localfile, 'rb'), headers, 'file:'+file) except OSError, msg: # urllib2 users shouldn't expect OSErrors coming from urlopen() raise URLError(msg) raise URLError('file not on local host') class FTPHandler(BaseHandler): def ftp_open(self, req): import ftplib import mimetypes host = req.get_host() if not host: raise URLError('ftp error: no host 
given') host, port = splitport(host) if port is None: port = ftplib.FTP_PORT else: port = int(port) # username/password handling user, host = splituser(host) if user: user, passwd = splitpasswd(user) else: passwd = None host = unquote(host) user = unquote(user or '') passwd = unquote(passwd or '') try: host = socket.gethostbyname(host) except socket.error, msg: raise URLError(msg) path, attrs = splitattr(req.get_selector()) dirs = path.split('/') dirs = map(unquote, dirs) dirs, file = dirs[:-1], dirs[-1] if dirs and not dirs[0]: dirs = dirs[1:] try: fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout) type = file and 'I' or 'D' for attr in attrs: attr, value = splitvalue(attr) if attr.lower() == 'type' and \ value in ('a', 'A', 'i', 'I', 'd', 'D'): type = value.upper() fp, retrlen = fw.retrfile(file, type) headers = "" mtype = mimetypes.guess_type(req.get_full_url())[0] if mtype: headers += "Content-type: %s\n" % mtype if retrlen is not None and retrlen >= 0: headers += "Content-length: %d\n" % retrlen sf = StringIO(headers) headers = mimetools.Message(sf) return addinfourl(fp, headers, req.get_full_url()) except ftplib.all_errors, msg: raise URLError, ('ftp error: %s' % msg), sys.exc_info()[2] def connect_ftp(self, user, passwd, host, port, dirs, timeout): try: fw = ftpwrapper(user, passwd, host, port, dirs, timeout) except TypeError: # Python < 2.6, no per-connection timeout support fw = ftpwrapper(user, passwd, host, port, dirs) ## fw.ftp.set_debuglevel(1) return fw class CacheFTPHandler(FTPHandler): # XXX would be nice to have pluggable cache strategies # XXX this stuff is definitely not thread safe def __init__(self): self.cache = {} self.timeout = {} self.soonest = 0 self.delay = 60 self.max_conns = 16 def setTimeout(self, t): self.delay = t def setMaxConns(self, m): self.max_conns = m def connect_ftp(self, user, passwd, host, port, dirs, timeout): key = user, host, port, '/'.join(dirs), timeout if key in self.cache: self.timeout[key] = 
time.time() + self.delay else: self.cache[key] = ftpwrapper(user, passwd, host, port, dirs, timeout) self.timeout[key] = time.time() + self.delay self.check_cache() return self.cache[key] def check_cache(self): # first check for old ones t = time.time() if self.soonest <= t: for k, v in self.timeout.items(): if v < t: self.cache[k].close() del self.cache[k] del self.timeout[k] self.soonest = min(self.timeout.values()) # then check the size if len(self.cache) == self.max_conns: for k, v in self.timeout.items(): if v == self.soonest: del self.cache[k] del self.timeout[k] break self.soonest = min(self.timeout.values())

mechanize-0.2.5/mechanize/_mozillacookiejar.py

"""Mozilla / Netscape cookie loading / saving. Copyright 2002-2006 John J Lee <jjl@pobox.com> Copyright 1997-1999 Gisle Aas (original libwww-perl code) This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). """ import re, time, logging from _clientcookie import reraise_unmasked_exceptions, FileCookieJar, Cookie, \ MISSING_FILENAME_TEXT, LoadError debug = logging.getLogger("ClientCookie").debug class MozillaCookieJar(FileCookieJar): """ WARNING: you may want to backup your browser's cookies file if you use this class to save cookies. I *think* it works, but there have been bugs in the past! 
This class differs from CookieJar only in the format it uses to save and load cookies to and from a file. This class uses the Mozilla/Netscape `cookies.txt' format. lynx uses this file format, too. Don't expect cookies saved while the browser is running to be noticed by the browser (in fact, Mozilla on unix will overwrite your saved cookies if you change them on disk while it's running; on Windows, you probably can't save at all while the browser is running). Note that the Mozilla/Netscape format will downgrade RFC2965 cookies to Netscape cookies on saving. In particular, the cookie version and port number information is lost, together with information about whether or not Path, Port and Discard were specified by the Set-Cookie2 (or Set-Cookie) header, and whether or not the domain as set in the HTTP header started with a dot (yes, I'm aware some domains in Netscape files start with a dot and some don't -- trust me, you really don't want to know any more about this). Note that though Mozilla and Netscape use the same format, they use slightly different headers. The class saves cookies using the Netscape header by default (Mozilla can cope with that). """ magic_re = "#( Netscape)? HTTP Cookie File" header = """\ # Netscape HTTP Cookie File # http://www.netscape.com/newsref/std/cookie_spec.html # This is a generated file! Do not edit. """ def _really_load(self, f, filename, ignore_discard, ignore_expires): now = time.time() magic = f.readline() if not re.search(self.magic_re, magic): f.close() raise LoadError( "%s does not look like a Netscape format cookies file" % filename) try: while 1: line = f.readline() if line == "": break # last field may be absent, so keep any trailing tab if line.endswith("\n"): line = line[:-1] # skip comments and blank lines XXX what is $ for? 
if (line.strip().startswith("#") or line.strip().startswith("$") or line.strip() == ""): continue domain, domain_specified, path, secure, expires, name, value = \ line.split("\t", 6) secure = (secure == "TRUE") domain_specified = (domain_specified == "TRUE") if name == "": name = value value = None initial_dot = domain.startswith(".") if domain_specified != initial_dot: raise LoadError("domain and domain specified flag don't " "match in %s: %s" % (filename, line)) discard = False if expires == "": expires = None discard = True # assume path_specified is false c = Cookie(0, name, value, None, False, domain, domain_specified, initial_dot, path, False, secure, expires, discard, None, None, {}) if not ignore_discard and c.discard: continue if not ignore_expires and c.is_expired(now): continue self.set_cookie(c) except: reraise_unmasked_exceptions((IOError, LoadError)) raise LoadError("invalid Netscape format file %s: %s" % (filename, line)) def save(self, filename=None, ignore_discard=False, ignore_expires=False): if filename is None: if self.filename is not None: filename = self.filename else: raise ValueError(MISSING_FILENAME_TEXT) f = open(filename, "w") try: debug("Saving Netscape cookies.txt file") f.write(self.header) now = time.time() for cookie in self: if not ignore_discard and cookie.discard: debug(" Not saving %s: marked for discard", cookie.name) continue if not ignore_expires and cookie.is_expired(now): debug(" Not saving %s: expired", cookie.name) continue if cookie.secure: secure = "TRUE" else: secure = "FALSE" if cookie.domain.startswith("."): initial_dot = "TRUE" else: initial_dot = "FALSE" if cookie.expires is not None: expires = str(cookie.expires) else: expires = "" if cookie.value is None: # cookies.txt regards 'Set-Cookie: foo' as a cookie # with no name, whereas cookielib regards it as a # cookie with no value. 
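The seven-field record that `_really_load` splits above can be illustrated standalone. The sample line below is invented for illustration, but the field order (domain, domain-specified flag, path, secure flag, expiry, name, value) and the tab-split follow the parsing code above:

```python
# Sketch of one Netscape cookies.txt record, split the way _really_load does.
# The sample values here are made up; only the layout mirrors the code above.
line = ".example.com\tTRUE\t/\tFALSE\t2145916800\tsession\tabc123"

domain, domain_specified, path, secure, expires, name, value = \
    line.split("\t", 6)

# The domain-specified flag must agree with a leading dot on the domain,
# which is exactly the consistency check _really_load performs.
assert (domain_specified == "TRUE") == domain.startswith(".")
assert secure == "FALSE" and name == "session" and value == "abc123"
```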
name = "" value = cookie.name else: name = cookie.name value = cookie.value f.write( "\t".join([cookie.domain, initial_dot, cookie.path, secure, expires, name, value])+ "\n") finally: f.close() �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������mechanize-0.2.5/mechanize/_form.py������������������������������������������������������������������0000644�0001750�0001750�00000354037�11545150644�015577� 0����������������������������������������������������������������������������������������������������ustar �john����������������������������john�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������"""HTML form handling for web clients. HTML form handling for web clients: useful for parsing HTML forms, filling them in and returning the completed forms to the server. This code developed from a port of Gisle Aas' Perl module HTML::Form, from the libwww-perl library, but the interface is not the same. The most useful docstring is the one for HTMLForm. RFC 1866: HTML 2.0 RFC 1867: Form-based File Upload in HTML RFC 2388: Returning Values from Forms: multipart/form-data HTML 3.2 Specification, W3C Recommendation 14 January 1997 (for ISINDEX) HTML 4.01 Specification, W3C Recommendation 24 December 1999 Copyright 2002-2007 John J. Lee <jjl@pobox.com> Copyright 2005 Gary Poster Copyright 2005 Zope Corporation Copyright 1998-2000 Gisle Aas. This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). 
""" # TODO: # Clean up post the merge into mechanize # * Remove code that was duplicated in ClientForm and mechanize # * Remove weird import stuff # * Remove pre-Python 2.4 compatibility cruft # * Clean up tests # * Later release: Remove the ClientForm 0.1 backwards-compatibility switch # Remove parser testing hack # Clean action URI # Switch to unicode throughout # See Wichert Akkerman's 2004-01-22 message to c.l.py. # Apply recommendations from google code project CURLIES # Apply recommendations from HTML 5 spec # Add charset parameter to Content-type headers? How to find value?? # Functional tests to add: # Single and multiple file upload # File upload with missing name (check standards) # mailto: submission & enctype text/plain?? # Replace by_label etc. with moniker / selector concept. Allows, e.g., a # choice between selection by value / id / label / element contents. Or # choice between matching labels exactly or by substring. etc. __all__ = ['AmbiguityError', 'CheckboxControl', 'Control', 'ControlNotFoundError', 'FileControl', 'FormParser', 'HTMLForm', 'HiddenControl', 'IgnoreControl', 'ImageControl', 'IsindexControl', 'Item', 'ItemCountError', 'ItemNotFoundError', 'Label', 'ListControl', 'LocateError', 'Missing', 'ParseError', 'ParseFile', 'ParseFileEx', 'ParseResponse', 'ParseResponseEx','PasswordControl', 'RadioControl', 'ScalarControl', 'SelectControl', 'SubmitButtonControl', 'SubmitControl', 'TextControl', 'TextareaControl', 'XHTMLCompatibleFormParser'] import HTMLParser from cStringIO import StringIO import inspect import logging import random import re import sys import urllib import urlparse import warnings import _beautifulsoup import _request # from Python itself, for backwards compatibility of raised exceptions import sgmllib # bundled copy of sgmllib import _sgmllib_copy VERSION = "0.2.11" CHUNK = 1024 # size of chunks fed to parser, in bytes DEFAULT_ENCODING = "latin-1" _logger = logging.getLogger("mechanize.forms") OPTIMIZATION_HACK = True def 
debug(msg, *args, **kwds): if OPTIMIZATION_HACK: return caller_name = inspect.stack()[1][3] extended_msg = '%%s %s' % msg extended_args = (caller_name,)+args _logger.debug(extended_msg, *extended_args, **kwds) def _show_debug_messages(): global OPTIMIZATION_HACK OPTIMIZATION_HACK = False _logger.setLevel(logging.DEBUG) handler = logging.StreamHandler(sys.stdout) handler.setLevel(logging.DEBUG) _logger.addHandler(handler) def deprecation(message, stack_offset=0): warnings.warn(message, DeprecationWarning, stacklevel=3+stack_offset) class Missing: pass _compress_re = re.compile(r"\s+") def compress_text(text): return _compress_re.sub(" ", text.strip()) def normalize_line_endings(text): return re.sub(r"(?:(?<!\r)\n)|(?:\r(?!\n))", "\r\n", text) def unescape(data, entities, encoding=DEFAULT_ENCODING): if data is None or "&" not in data: return data def replace_entities(match, entities=entities, encoding=encoding): ent = match.group() if ent[1] == "#": return unescape_charref(ent[2:-1], encoding) repl = entities.get(ent) if repl is not None: if type(repl) != type(""): try: repl = repl.encode(encoding) except UnicodeError: repl = ent else: repl = ent return repl return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data) def unescape_charref(data, encoding): name, base = data, 10 if name.startswith("x"): name, base= name[1:], 16 uc = unichr(int(name, base)) if encoding is None: return uc else: try: repl = uc.encode(encoding) except UnicodeError: repl = "&#%s;" % data return repl def get_entitydefs(): import htmlentitydefs from codecs import latin_1_decode entitydefs = {} try: htmlentitydefs.name2codepoint except AttributeError: entitydefs = {} for name, char in htmlentitydefs.entitydefs.items(): uc = latin_1_decode(char)[0] if uc.startswith("&#") and uc.endswith(";"): uc = unescape_charref(uc[2:-1], None) entitydefs["&%s;" % name] = uc else: for name, codepoint in htmlentitydefs.name2codepoint.items(): entitydefs["&%s;" % name] = unichr(codepoint) return entitydefs def 
issequence(x): try: x[0] except (TypeError, KeyError): return False except IndexError: pass return True def isstringlike(x): try: x+"" except: return False else: return True def choose_boundary(): """Return a string usable as a multipart boundary.""" # follow IE and firefox nonce = "".join([str(random.randint(0, sys.maxint-1)) for i in 0,1,2]) return "-"*27 + nonce # This cut-n-pasted MimeWriter from standard library is here so can add # to HTTP headers rather than message body when appropriate. It also uses # \r\n in place of \n. This is a bit nasty. class MimeWriter: """Generic MIME writer. Methods: __init__() addheader() flushheaders() startbody() startmultipartbody() nextpart() lastpart() A MIME writer is much more primitive than a MIME parser. It doesn't seek around on the output file, and it doesn't use large amounts of buffer space, so you have to write the parts in the order they should occur on the output file. It does buffer the headers you add, allowing you to rearrange their order. General usage is: f = <open the output file> w = MimeWriter(f) ...call w.addheader(key, value) 0 or more times... followed by either: f = w.startbody(content_type) ...call f.write(data) for body data... or: w.startmultipartbody(subtype) for each part: subwriter = w.nextpart() ...use the subwriter's methods to create the subpart... w.lastpart() The subwriter is another MimeWriter instance, and should be treated in the same way as the toplevel MimeWriter. This way, writing recursive body parts is easy. Warning: don't forget to call lastpart()! XXX There should be more state so calls made in the wrong order are detected. Some special cases: - startbody() just returns the file passed to the constructor; but don't use this knowledge, as it may be changed. - startmultipartbody() actually returns a file as well; this can be used to write the initial 'if you can read this your mailer is not MIME-aware' message. 
- If you call flushheaders(), the headers accumulated so far are written out (and forgotten); this is useful if you don't need a body part at all, e.g. for a subpart of type message/rfc822 that's (mis)used to store some header-like information. - Passing a keyword argument 'prefix=<flag>' to addheader(), start*body() affects where the header is inserted; 0 means append at the end, 1 means insert at the start; default is append for addheader(), but insert for start*body(), which use it to determine where the Content-type header goes. """ def __init__(self, fp, http_hdrs=None): self._http_hdrs = http_hdrs self._fp = fp self._headers = [] self._boundary = [] self._first_part = True def addheader(self, key, value, prefix=0, add_to_http_hdrs=0): """ prefix is ignored if add_to_http_hdrs is true. """ lines = value.split("\r\n") while lines and not lines[-1]: del lines[-1] while lines and not lines[0]: del lines[0] if add_to_http_hdrs: value = "".join(lines) # 2.2 urllib2 doesn't normalize header case self._http_hdrs.append((key.capitalize(), value)) else: for i in range(1, len(lines)): lines[i] = " " + lines[i].strip() value = "\r\n".join(lines) + "\r\n" line = key.title() + ": " + value if prefix: self._headers.insert(0, line) else: self._headers.append(line) def flushheaders(self): self._fp.writelines(self._headers) self._headers = [] def startbody(self, ctype=None, plist=[], prefix=1, add_to_http_hdrs=0, content_type=1): """ prefix is ignored if add_to_http_hdrs is true. 
""" if content_type and ctype: for name, value in plist: ctype = ctype + ';\r\n %s=%s' % (name, value) self.addheader("Content-Type", ctype, prefix=prefix, add_to_http_hdrs=add_to_http_hdrs) self.flushheaders() if not add_to_http_hdrs: self._fp.write("\r\n") self._first_part = True return self._fp def startmultipartbody(self, subtype, boundary=None, plist=[], prefix=1, add_to_http_hdrs=0, content_type=1): boundary = boundary or choose_boundary() self._boundary.append(boundary) return self.startbody("multipart/" + subtype, [("boundary", boundary)] + plist, prefix=prefix, add_to_http_hdrs=add_to_http_hdrs, content_type=content_type) def nextpart(self): boundary = self._boundary[-1] if self._first_part: self._first_part = False else: self._fp.write("\r\n") self._fp.write("--" + boundary + "\r\n") return self.__class__(self._fp) def lastpart(self): if self._first_part: self.nextpart() boundary = self._boundary.pop() self._fp.write("\r\n--" + boundary + "--\r\n") class LocateError(ValueError): pass class AmbiguityError(LocateError): pass class ControlNotFoundError(LocateError): pass class ItemNotFoundError(LocateError): pass class ItemCountError(ValueError): pass # for backwards compatibility, ParseError derives from exceptions that were # raised by versions of ClientForm <= 0.2.5 # TODO: move to _html class ParseError(sgmllib.SGMLParseError, HTMLParser.HTMLParseError): def __init__(self, *args, **kwds): Exception.__init__(self, *args, **kwds) def __str__(self): return Exception.__str__(self) class _AbstractFormParser: """forms attribute contains HTMLForm instances on completion.""" # thanks to Moshe Zadka for an example of sgmllib/htmllib usage def __init__(self, entitydefs=None, encoding=DEFAULT_ENCODING): if entitydefs is None: entitydefs = get_entitydefs() self._entitydefs = entitydefs self._encoding = encoding self.base = None self.forms = [] self.labels = [] self._current_label = None self._current_form = None self._select = None self._optgroup = None self._option 
= None self._textarea = None # forms[0] will contain all controls that are outside of any form # self._global_form is an alias for self.forms[0] self._global_form = None self.start_form([]) self.end_form() self._current_form = self._global_form = self.forms[0] def do_base(self, attrs): debug("%s", attrs) for key, value in attrs: if key == "href": self.base = self.unescape_attr_if_required(value) def end_body(self): debug("") if self._current_label is not None: self.end_label() if self._current_form is not self._global_form: self.end_form() def start_form(self, attrs): debug("%s", attrs) if self._current_form is not self._global_form: raise ParseError("nested FORMs") name = None action = None enctype = "application/x-www-form-urlencoded" method = "GET" d = {} for key, value in attrs: if key == "name": name = self.unescape_attr_if_required(value) elif key == "action": action = self.unescape_attr_if_required(value) elif key == "method": method = self.unescape_attr_if_required(value.upper()) elif key == "enctype": enctype = self.unescape_attr_if_required(value.lower()) d[key] = self.unescape_attr_if_required(value) controls = [] self._current_form = (name, action, method, enctype), d, controls def end_form(self): debug("") if self._current_label is not None: self.end_label() if self._current_form is self._global_form: raise ParseError("end of FORM before start") self.forms.append(self._current_form) self._current_form = self._global_form def start_select(self, attrs): debug("%s", attrs) if self._select is not None: raise ParseError("nested SELECTs") if self._textarea is not None: raise ParseError("SELECT inside TEXTAREA") d = {} for key, val in attrs: d[key] = self.unescape_attr_if_required(val) self._select = d self._add_label(d) self._append_select_control({"__select": d}) def end_select(self): debug("") if self._select is None: raise ParseError("end of SELECT before start") if self._option is not None: self._end_option() self._select = None def start_optgroup(self, 
attrs): debug("%s", attrs) if self._select is None: raise ParseError("OPTGROUP outside of SELECT") d = {} for key, val in attrs: d[key] = self.unescape_attr_if_required(val) self._optgroup = d def end_optgroup(self): debug("") if self._optgroup is None: raise ParseError("end of OPTGROUP before start") self._optgroup = None def _start_option(self, attrs): debug("%s", attrs) if self._select is None: raise ParseError("OPTION outside of SELECT") if self._option is not None: self._end_option() d = {} for key, val in attrs: d[key] = self.unescape_attr_if_required(val) self._option = {} self._option.update(d) if (self._optgroup and self._optgroup.has_key("disabled") and not self._option.has_key("disabled")): self._option["disabled"] = None def _end_option(self): debug("") if self._option is None: raise ParseError("end of OPTION before start") contents = self._option.get("contents", "").strip() self._option["contents"] = contents if not self._option.has_key("value"): self._option["value"] = contents if not self._option.has_key("label"): self._option["label"] = contents # stuff dict of SELECT HTML attrs into a special private key # (gets deleted again later) self._option["__select"] = self._select self._append_select_control(self._option) self._option = None def _append_select_control(self, attrs): debug("%s", attrs) controls = self._current_form[2] name = self._select.get("name") controls.append(("select", name, attrs)) def start_textarea(self, attrs): debug("%s", attrs) if self._textarea is not None: raise ParseError("nested TEXTAREAs") if self._select is not None: raise ParseError("TEXTAREA inside SELECT") d = {} for key, val in attrs: d[key] = self.unescape_attr_if_required(val) self._add_label(d) self._textarea = d def end_textarea(self): debug("") if self._textarea is None: raise ParseError("end of TEXTAREA before start") controls = self._current_form[2] name = self._textarea.get("name") controls.append(("textarea", name, self._textarea)) self._textarea = None def 
start_label(self, attrs): debug("%s", attrs) if self._current_label: self.end_label() d = {} for key, val in attrs: d[key] = self.unescape_attr_if_required(val) taken = bool(d.get("for")) # empty id is invalid d["__text"] = "" d["__taken"] = taken if taken: self.labels.append(d) self._current_label = d def end_label(self): debug("") label = self._current_label if label is None: # something is ugly in the HTML, but we're ignoring it return self._current_label = None # if it is staying around, it is True in all cases del label["__taken"] def _add_label(self, d): #debug("%s", d) if self._current_label is not None: if not self._current_label["__taken"]: self._current_label["__taken"] = True d["__label"] = self._current_label def handle_data(self, data): debug("%s", data) if self._option is not None: # self._option is a dictionary of the OPTION element's HTML # attributes, but it has two special keys, one of which is the # special "contents" key contains text between OPTION tags (the # other is the "__select" key: see the end_option method) map = self._option key = "contents" elif self._textarea is not None: map = self._textarea key = "value" data = normalize_line_endings(data) # not if within option or textarea elif self._current_label is not None: map = self._current_label key = "__text" else: return if data and not map.has_key(key): # according to # http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.1 line break # immediately after start tags or immediately before end tags must # be ignored, but real browsers only ignore a line break after a # start tag, so we'll do that. 
if data[0:2] == "\r\n": data = data[2:] elif data[0:1] in ["\n", "\r"]: data = data[1:] map[key] = data else: map[key] = map[key] + data def do_button(self, attrs): debug("%s", attrs) d = {} d["type"] = "submit" # default for key, val in attrs: d[key] = self.unescape_attr_if_required(val) controls = self._current_form[2] type = d["type"] name = d.get("name") # we don't want to lose information, so use a type string that # doesn't clash with INPUT TYPE={SUBMIT,RESET,BUTTON} # e.g. type for BUTTON/RESET is "resetbutton" # (type for INPUT/RESET is "reset") type = type+"button" self._add_label(d) controls.append((type, name, d)) def do_input(self, attrs): debug("%s", attrs) d = {} d["type"] = "text" # default for key, val in attrs: d[key] = self.unescape_attr_if_required(val) controls = self._current_form[2] type = d["type"] name = d.get("name") self._add_label(d) controls.append((type, name, d)) def do_isindex(self, attrs): debug("%s", attrs) d = {} for key, val in attrs: d[key] = self.unescape_attr_if_required(val) controls = self._current_form[2] self._add_label(d) # isindex doesn't have type or name HTML attributes controls.append(("isindex", None, d)) def handle_entityref(self, name): #debug("%s", name) self.handle_data(unescape( '&%s;' % name, self._entitydefs, self._encoding)) def handle_charref(self, name): #debug("%s", name) self.handle_data(unescape_charref(name, self._encoding)) def unescape_attr(self, name): #debug("%s", name) return unescape(name, self._entitydefs, self._encoding) def unescape_attrs(self, attrs): #debug("%s", attrs) escaped_attrs = {} for key, val in attrs.items(): try: val.items except AttributeError: escaped_attrs[key] = self.unescape_attr(val) else: # e.g. "__select" -- yuck! 
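The leading line-break rule applied above (drop a single "\r\n", "\n" or "\r" immediately after a start tag, per the HTML 4 note referenced in the comment) can be sketched as a standalone helper. This is an illustrative rewrite for clarity, not a function mechanize exports:

```python
def strip_leading_newline(data):
    # Real browsers ignore one line break immediately after a start tag
    # (cf. HTML 4 appendix B.3.1), so handle_data drops it the same way.
    if data[:2] == "\r\n":
        return data[2:]
    if data[:1] in ("\n", "\r"):
        return data[1:]
    return data

assert strip_leading_newline("\r\ntext") == "text"
assert strip_leading_newline("\ntext") == "text"
# Only a *leading* break is dropped; trailing ones are kept.
assert strip_leading_newline("text\n") == "text\n"
```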
escaped_attrs[key] = self.unescape_attrs(val) return escaped_attrs def unknown_entityref(self, ref): self.handle_data("&%s;" % ref) def unknown_charref(self, ref): self.handle_data("&#%s;" % ref) class XHTMLCompatibleFormParser(_AbstractFormParser, HTMLParser.HTMLParser): """Good for XHTML, bad for tolerance of incorrect HTML.""" # thanks to Michael Howitz for this! def __init__(self, entitydefs=None, encoding=DEFAULT_ENCODING): HTMLParser.HTMLParser.__init__(self) _AbstractFormParser.__init__(self, entitydefs, encoding) def feed(self, data): try: HTMLParser.HTMLParser.feed(self, data) except HTMLParser.HTMLParseError, exc: raise ParseError(exc) def start_option(self, attrs): _AbstractFormParser._start_option(self, attrs) def end_option(self): _AbstractFormParser._end_option(self) def handle_starttag(self, tag, attrs): try: method = getattr(self, "start_" + tag) except AttributeError: try: method = getattr(self, "do_" + tag) except AttributeError: pass # unknown tag else: method(attrs) else: method(attrs) def handle_endtag(self, tag): try: method = getattr(self, "end_" + tag) except AttributeError: pass # unknown tag else: method() def unescape(self, name): # Use the entitydefs passed into constructor, not # HTMLParser.HTMLParser's entitydefs. 
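The getattr-based start_<tag>/do_<tag> lookup in handle_starttag above can be reduced to a minimal standalone sketch; the class and tag handlers below are invented for illustration, but the dispatch order (prefer start_<tag>, fall back to do_<tag>, ignore unknown tags) matches the code above:

```python
class TagDispatcher:
    # Hypothetical handler; in the real parser do_input builds a control.
    def do_input(self, attrs):
        return ("input", attrs)

    def handle_starttag(self, tag, attrs):
        # Prefer a start_<tag> handler, fall back to do_<tag>, and
        # silently ignore unknown tags, as _AbstractFormParser does.
        for prefix in ("start_", "do_"):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                return method(attrs)
        return None

d = TagDispatcher()
assert d.handle_starttag("input", {"type": "text"}) == \
    ("input", {"type": "text"})
assert d.handle_starttag("blink", {}) is None
```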
return self.unescape_attr(name) def unescape_attr_if_required(self, name): return name # HTMLParser.HTMLParser already did it def unescape_attrs_if_required(self, attrs): return attrs # ditto def close(self): HTMLParser.HTMLParser.close(self) self.end_body() class _AbstractSgmllibParser(_AbstractFormParser): def do_option(self, attrs): _AbstractFormParser._start_option(self, attrs) # we override this attr to decode hex charrefs entity_or_charref = re.compile( '&(?:([a-zA-Z][-.a-zA-Z0-9]*)|#(x?[0-9a-fA-F]+))(;?)') def convert_entityref(self, name): return unescape("&%s;" % name, self._entitydefs, self._encoding) def convert_charref(self, name): return unescape_charref("%s" % name, self._encoding) def unescape_attr_if_required(self, name): return name # sgmllib already did it def unescape_attrs_if_required(self, attrs): return attrs # ditto class FormParser(_AbstractSgmllibParser, _sgmllib_copy.SGMLParser): """Good for tolerance of incorrect HTML, bad for XHTML.""" def __init__(self, entitydefs=None, encoding=DEFAULT_ENCODING): _sgmllib_copy.SGMLParser.__init__(self) _AbstractFormParser.__init__(self, entitydefs, encoding) def feed(self, data): try: _sgmllib_copy.SGMLParser.feed(self, data) except _sgmllib_copy.SGMLParseError, exc: raise ParseError(exc) def close(self): _sgmllib_copy.SGMLParser.close(self) self.end_body() class _AbstractBSFormParser(_AbstractSgmllibParser): bs_base_class = None def __init__(self, entitydefs=None, encoding=DEFAULT_ENCODING): _AbstractFormParser.__init__(self, entitydefs, encoding) self.bs_base_class.__init__(self) def handle_data(self, data): _AbstractFormParser.handle_data(self, data) self.bs_base_class.handle_data(self, data) def feed(self, data): try: self.bs_base_class.feed(self, data) except _sgmllib_copy.SGMLParseError, exc: raise ParseError(exc) def close(self): self.bs_base_class.close(self) self.end_body() class RobustFormParser(_AbstractBSFormParser, _beautifulsoup.BeautifulSoup): """Tries to be highly tolerant of incorrect 
HTML.""" bs_base_class = _beautifulsoup.BeautifulSoup class NestingRobustFormParser(_AbstractBSFormParser, _beautifulsoup.ICantBelieveItsBeautifulSoup): """Tries to be highly tolerant of incorrect HTML. Different from RobustFormParser in that it more often guesses nesting above missing end tags (see BeautifulSoup docs). """ bs_base_class = _beautifulsoup.ICantBelieveItsBeautifulSoup #FormParser = XHTMLCompatibleFormParser # testing hack #FormParser = RobustFormParser # testing hack def ParseResponseEx(response, select_default=False, form_parser_class=FormParser, request_class=_request.Request, entitydefs=None, encoding=DEFAULT_ENCODING, # private _urljoin=urlparse.urljoin, _urlparse=urlparse.urlparse, _urlunparse=urlparse.urlunparse, ): """Identical to ParseResponse, except that: 1. The returned list contains an extra item. The first form in the list contains all controls not contained in any FORM element. 2. The arguments ignore_errors and backwards_compat have been removed. 3. Backwards-compatibility mode (backwards_compat=True) is not available. """ return _ParseFileEx(response, response.geturl(), select_default, False, form_parser_class, request_class, entitydefs, False, encoding, _urljoin=_urljoin, _urlparse=_urlparse, _urlunparse=_urlunparse, ) def ParseFileEx(file, base_uri, select_default=False, form_parser_class=FormParser, request_class=_request.Request, entitydefs=None, encoding=DEFAULT_ENCODING, # private _urljoin=urlparse.urljoin, _urlparse=urlparse.urlparse, _urlunparse=urlparse.urlunparse, ): """Identical to ParseFile, except that: 1. The returned list contains an extra item. The first form in the list contains all controls not contained in any FORM element. 2. The arguments ignore_errors and backwards_compat have been removed. 3. Backwards-compatibility mode (backwards_compat=True) is not available. 
""" return _ParseFileEx(file, base_uri, select_default, False, form_parser_class, request_class, entitydefs, False, encoding, _urljoin=_urljoin, _urlparse=_urlparse, _urlunparse=_urlunparse, ) def ParseString(text, base_uri, *args, **kwds): fh = StringIO(text) return ParseFileEx(fh, base_uri, *args, **kwds) def ParseResponse(response, *args, **kwds): """Parse HTTP response and return a list of HTMLForm instances. The return value of mechanize.urlopen can be conveniently passed to this function as the response parameter. mechanize.ParseError is raised on parse errors. response: file-like object (supporting read() method) with a method geturl(), returning the URI of the HTTP response select_default: for multiple-selection SELECT controls and RADIO controls, pick the first item as the default if none are selected in the HTML form_parser_class: class to instantiate and use to pass request_class: class to return from .click() method (default is mechanize.Request) entitydefs: mapping like {"&": "&", ...} containing HTML entity definitions (a sensible default is used) encoding: character encoding used for encoding numeric character references when matching link text. mechanize does not attempt to find the encoding in a META HTTP-EQUIV attribute in the document itself (mechanize, for example, does do that and will pass the correct value to mechanize using this parameter). backwards_compat: boolean that determines whether the returned HTMLForm objects are backwards-compatible with old code. If backwards_compat is true: - ClientForm 0.1 code will continue to work as before. - Label searches that do not specify a nr (number or count) will always get the first match, even if other controls match. If backwards_compat is False, label searches that have ambiguous results will raise an AmbiguityError. - Item label matching is done by strict string comparison rather than substring matching. - De-selecting individual list items is allowed even if the Item is disabled. 
The backwards_compat argument will be removed in a future release. Pass a true value for select_default if you want the behaviour specified by RFC 1866 (the HTML 2.0 standard), which is to select the first item in a RADIO or multiple-selection SELECT control if none were selected in the HTML. Most browsers (including Microsoft Internet Explorer (IE) and Netscape Navigator) instead leave all items unselected in these cases. The W3C HTML 4.0 standard leaves this behaviour undefined in the case of multiple-selection SELECT controls, but insists that at least one RADIO button should be checked at all times, in contradiction to browser behaviour. There is a choice of parsers. mechanize.XHTMLCompatibleFormParser (uses HTMLParser.HTMLParser) works best for XHTML, mechanize.FormParser (uses bundled copy of sgmllib.SGMLParser) (the default) works better for ordinary grubby HTML. Note that HTMLParser is only available in Python 2.2 and later. You can pass your own class in here as a hack to work around bad HTML, but at your own risk: there is no well-defined interface. """ return _ParseFileEx(response, response.geturl(), *args, **kwds)[1:] def ParseFile(file, base_uri, *args, **kwds): """Parse HTML and return a list of HTMLForm instances. mechanize.ParseError is raised on parse errors. file: file-like object (supporting read() method) containing HTML with zero or more forms to be parsed base_uri: the URI of the document (note that the base URI used to submit the form will be that given in the BASE element if present, not that of the document) For the other arguments and further details, see ParseResponse.__doc__. 
""" return _ParseFileEx(file, base_uri, *args, **kwds)[1:] def _ParseFileEx(file, base_uri, select_default=False, ignore_errors=False, form_parser_class=FormParser, request_class=_request.Request, entitydefs=None, backwards_compat=True, encoding=DEFAULT_ENCODING, _urljoin=urlparse.urljoin, _urlparse=urlparse.urlparse, _urlunparse=urlparse.urlunparse, ): if backwards_compat: deprecation("operating in backwards-compatibility mode", 1) fp = form_parser_class(entitydefs, encoding) while 1: data = file.read(CHUNK) try: fp.feed(data) except ParseError, e: e.base_uri = base_uri raise if len(data) != CHUNK: break fp.close() if fp.base is not None: # HTML BASE element takes precedence over document URI base_uri = fp.base labels = [] # Label(label) for label in fp.labels] id_to_labels = {} for l in fp.labels: label = Label(l) labels.append(label) for_id = l["for"] coll = id_to_labels.get(for_id) if coll is None: id_to_labels[for_id] = [label] else: coll.append(label) forms = [] for (name, action, method, enctype), attrs, controls in fp.forms: if action is None: action = base_uri else: action = _urljoin(base_uri, action) # would be nice to make HTMLForm class (form builder) pluggable form = HTMLForm( action, method, enctype, name, attrs, request_class, forms, labels, id_to_labels, backwards_compat) form._urlparse = _urlparse form._urlunparse = _urlunparse for ii in range(len(controls)): type, name, attrs = controls[ii] # index=ii*10 allows ImageControl to return multiple ordered pairs form.new_control( type, name, attrs, select_default=select_default, index=ii*10) forms.append(form) for form in forms: form.fixup() return forms class Label: def __init__(self, attrs): self.id = attrs.get("for") self._text = attrs.get("__text").strip() self._ctext = compress_text(self._text) self.attrs = attrs self._backwards_compat = False # maintained by HTMLForm def __getattr__(self, name): if name == "text": if self._backwards_compat: return self._text else: return self._ctext return 
getattr(Label, name) def __setattr__(self, name, value): if name == "text": # don't see any need for this, so make it read-only raise AttributeError("text attribute is read-only") self.__dict__[name] = value def __str__(self): return "<Label(id=%r, text=%r)>" % (self.id, self.text) def _get_label(attrs): text = attrs.get("__label") if text is not None: return Label(text) else: return None class Control: """An HTML form control. An HTMLForm contains a sequence of Controls. The Controls in an HTMLForm are accessed using the HTMLForm.find_control method or the HTMLForm.controls attribute. Control instances are usually constructed using the ParseFile / ParseResponse functions. If you use those functions, you can ignore the rest of this paragraph. A Control is only properly initialised after the fixup method has been called. In fact, this is only strictly necessary for ListControl instances. This is necessary because ListControls are built up from ListControls each containing only a single item, and their initial value(s) can only be known after the sequence is complete. The types and values that are acceptable for assignment to the value attribute are defined by subclasses. If the disabled attribute is true, this represents the state typically represented by browsers by 'greying out' a control. If the disabled attribute is true, the Control will raise AttributeError if an attempt is made to change its value. In addition, the control will not be considered 'successful' as defined by the W3C HTML 4 standard -- ie. it will contribute no data to the return value of the HTMLForm.click* methods. To enable a control, set the disabled attribute to a false value. If the readonly attribute is true, the Control will raise AttributeError if an attempt is made to change its value. To make a control writable, set the readonly attribute to a false value. All controls have the disabled and readonly attributes, not only those that may have the HTML attributes of the same names. 
On assignment to the value attribute, the following exceptions are raised: TypeError, AttributeError (if the value attribute should not be assigned to, because the control is disabled, for example) and ValueError. If the name or value attributes are None, or the value is an empty list, or if the control is disabled, the control is not successful. Public attributes: type: string describing type of control (see the keys of the HTMLForm.type2class dictionary for the allowable values) (readonly) name: name of control (readonly) value: current value of control (subclasses may allow a single value, a sequence of values, or either) disabled: disabled state readonly: readonly state id: value of id HTML attribute """ def __init__(self, type, name, attrs, index=None): """ type: string describing type of control (see the keys of the HTMLForm.type2class dictionary for the allowable values) name: control name attrs: HTML attributes of control's HTML element """ raise NotImplementedError() def add_to_form(self, form): self._form = form form.controls.append(self) def fixup(self): pass def is_of_kind(self, kind): raise NotImplementedError() def clear(self): raise NotImplementedError() def __getattr__(self, name): raise NotImplementedError() def __setattr__(self, name, value): raise NotImplementedError() def pairs(self): """Return list of (key, value) pairs suitable for passing to urlencode. """ return [(k, v) for (i, k, v) in self._totally_ordered_pairs()] def _totally_ordered_pairs(self): """Return list of (key, value, index) tuples. Like pairs, but allows preserving correct ordering even where several controls are involved. 
""" raise NotImplementedError() def _write_mime_data(self, mw, name, value): """Write data for a subitem of this control to a MimeWriter.""" # called by HTMLForm mw2 = mw.nextpart() mw2.addheader("Content-Disposition", 'form-data; name="%s"' % name, 1) f = mw2.startbody(prefix=0) f.write(value) def __str__(self): raise NotImplementedError() def get_labels(self): """Return all labels (Label instances) for this control. If the control was surrounded by a <label> tag, that will be the first label; all other labels, connected by 'for' and 'id', are in the order that appear in the HTML. """ res = [] if self._label: res.append(self._label) if self.id: res.extend(self._form._id_to_labels.get(self.id, ())) return res #--------------------------------------------------- class ScalarControl(Control): """Control whose value is not restricted to one of a prescribed set. Some ScalarControls don't accept any value attribute. Otherwise, takes a single value, which must be string-like. Additional read-only public attribute: attrs: dictionary mapping the names of original HTML attributes of the control to their values """ def __init__(self, type, name, attrs, index=None): self._index = index self._label = _get_label(attrs) self.__dict__["type"] = type.lower() self.__dict__["name"] = name self._value = attrs.get("value") self.disabled = attrs.has_key("disabled") self.readonly = attrs.has_key("readonly") self.id = attrs.get("id") self.attrs = attrs.copy() self._clicked = False self._urlparse = urlparse.urlparse self._urlunparse = urlparse.urlunparse def __getattr__(self, name): if name == "value": return self.__dict__["_value"] else: raise AttributeError("%s instance has no attribute '%s'" % (self.__class__.__name__, name)) def __setattr__(self, name, value): if name == "value": if not isstringlike(value): raise TypeError("must assign a string") elif self.readonly: raise AttributeError("control '%s' is readonly" % self.name) elif self.disabled: raise AttributeError("control '%s' is 
disabled" % self.name) self.__dict__["_value"] = value elif name in ("name", "type"): raise AttributeError("%s attribute is readonly" % name) else: self.__dict__[name] = value def _totally_ordered_pairs(self): name = self.name value = self.value if name is None or value is None or self.disabled: return [] return [(self._index, name, value)] def clear(self): if self.readonly: raise AttributeError("control '%s' is readonly" % self.name) self.__dict__["_value"] = None def __str__(self): name = self.name value = self.value if name is None: name = "<None>" if value is None: value = "<None>" infos = [] if self.disabled: infos.append("disabled") if self.readonly: infos.append("readonly") info = ", ".join(infos) if info: info = " (%s)" % info return "<%s(%s=%s)%s>" % (self.__class__.__name__, name, value, info) #--------------------------------------------------- class TextControl(ScalarControl): """Textual input control. Covers: INPUT/TEXT INPUT/PASSWORD INPUT/HIDDEN TEXTAREA """ def __init__(self, type, name, attrs, index=None): ScalarControl.__init__(self, type, name, attrs, index) if self.type == "hidden": self.readonly = True if self._value is None: self._value = "" def is_of_kind(self, kind): return kind == "text" #--------------------------------------------------- class FileControl(ScalarControl): """File upload with INPUT TYPE=FILE. The value attribute of a FileControl is always None. Use add_file instead. 
Additional public method: add_file """ def __init__(self, type, name, attrs, index=None): ScalarControl.__init__(self, type, name, attrs, index) self._value = None self._upload_data = [] def is_of_kind(self, kind): return kind == "file" def clear(self): if self.readonly: raise AttributeError("control '%s' is readonly" % self.name) self._upload_data = [] def __setattr__(self, name, value): if name in ("value", "name", "type"): raise AttributeError("%s attribute is readonly" % name) else: self.__dict__[name] = value def add_file(self, file_object, content_type=None, filename=None): if not hasattr(file_object, "read"): raise TypeError("file-like object must have read method") if content_type is not None and not isstringlike(content_type): raise TypeError("content type must be None or string-like") if filename is not None and not isstringlike(filename): raise TypeError("filename must be None or string-like") if content_type is None: content_type = "application/octet-stream" self._upload_data.append((file_object, content_type, filename)) def _totally_ordered_pairs(self): # XXX should it be successful even if unnamed? if self.name is None or self.disabled: return [] return [(self._index, self.name, "")] # If enctype is application/x-www-form-urlencoded and there's a FILE # control present, what should be sent? Strictly, it should be 'name=data' # (see HTML 4.01 spec., section 17.13.2), but code sends "name=" ATM. What # about multiple file upload? 
def _write_mime_data(self, mw, _name, _value): # called by HTMLForm # assert _name == self.name and _value == '' if len(self._upload_data) < 2: if len(self._upload_data) == 0: file_object = StringIO() content_type = "application/octet-stream" filename = "" else: file_object, content_type, filename = self._upload_data[0] if filename is None: filename = "" mw2 = mw.nextpart() fn_part = '; filename="%s"' % filename disp = 'form-data; name="%s"%s' % (self.name, fn_part) mw2.addheader("Content-Disposition", disp, prefix=1) fh = mw2.startbody(content_type, prefix=0) fh.write(file_object.read()) else: # multiple files mw2 = mw.nextpart() disp = 'form-data; name="%s"' % self.name mw2.addheader("Content-Disposition", disp, prefix=1) fh = mw2.startmultipartbody("mixed", prefix=0) for file_object, content_type, filename in self._upload_data: mw3 = mw2.nextpart() if filename is None: filename = "" fn_part = '; filename="%s"' % filename disp = "file%s" % fn_part mw3.addheader("Content-Disposition", disp, prefix=1) fh2 = mw3.startbody(content_type, prefix=0) fh2.write(file_object.read()) mw2.lastpart() def __str__(self): name = self.name if name is None: name = "<None>" if not self._upload_data: value = "<No files added>" else: value = [] for file, ctype, filename in self._upload_data: if filename is None: value.append("<Unnamed file>") else: value.append(filename) value = ", ".join(value) info = [] if self.disabled: info.append("disabled") if self.readonly: info.append("readonly") info = ", ".join(info) if info: info = " (%s)" % info return "<%s(%s=%s)%s>" % (self.__class__.__name__, name, value, info) #--------------------------------------------------- class IsindexControl(ScalarControl): """ISINDEX control. ISINDEX is the odd-one-out of HTML form controls. In fact, it isn't really part of regular HTML forms at all, and predates it. You're only allowed one ISINDEX per HTML document. 
ISINDEX and regular form submission are mutually exclusive -- either submit a form, or the ISINDEX. Having said this, since ISINDEX controls may appear in forms (which is probably bad HTML), ParseFile / ParseResponse will include them in the HTMLForm instances it returns. You can set the ISINDEX's value, as with any other control (but note that ISINDEX controls have no name, so you'll need to use the type argument of set_value!). When you submit the form, the ISINDEX will not be successful (ie., no data will get returned to the server as a result of its presence), unless you click on the ISINDEX control, in which case the ISINDEX gets submitted instead of the form: form.set_value("my isindex value", type="isindex") mechanize.urlopen(form.click(type="isindex")) ISINDEX elements outside of FORMs are ignored. If you want to submit one by hand, do it like so: url = urlparse.urljoin(page_uri, "?"+urllib.quote_plus("my isindex value")) result = mechanize.urlopen(url) """ def __init__(self, type, name, attrs, index=None): ScalarControl.__init__(self, type, name, attrs, index) if self._value is None: self._value = "" def is_of_kind(self, kind): return kind in ["text", "clickable"] def _totally_ordered_pairs(self): return [] def _click(self, form, coord, return_type, request_class=_request.Request): # Relative URL for ISINDEX submission: instead of "foo=bar+baz", # want "bar+baz". # This doesn't seem to be specified in HTML 4.01 spec. (ISINDEX is # deprecated in 4.01, but it should still say how to submit it). # Submission of ISINDEX is explained in the HTML 3.2 spec, though. 
parts = self._urlparse(form.action) rest, (query, frag) = parts[:-2], parts[-2:] parts = rest + (urllib.quote_plus(self.value), None) url = self._urlunparse(parts) req_data = url, None, [] if return_type == "pairs": return [] elif return_type == "request_data": return req_data else: return request_class(url) def __str__(self): value = self.value if value is None: value = "<None>" infos = [] if self.disabled: infos.append("disabled") if self.readonly: infos.append("readonly") info = ", ".join(infos) if info: info = " (%s)" % info return "<%s(%s)%s>" % (self.__class__.__name__, value, info) #--------------------------------------------------- class IgnoreControl(ScalarControl): """Control that we're not interested in. Covers: INPUT/RESET BUTTON/RESET INPUT/BUTTON BUTTON/BUTTON These controls are always unsuccessful, in the terminology of HTML 4 (ie. they never require any information to be returned to the server). BUTTON/BUTTON is used to generate events for script embedded in HTML. The value attribute of IgnoreControl is always None. """ def __init__(self, type, name, attrs, index=None): ScalarControl.__init__(self, type, name, attrs, index) self._value = None def is_of_kind(self, kind): return False def __setattr__(self, name, value): if name == "value": raise AttributeError( "control '%s' is ignored, hence read-only" % self.name) elif name in ("name", "type"): raise AttributeError("%s attribute is readonly" % name) else: self.__dict__[name] = value #--------------------------------------------------- # ListControls # helpers and subsidiary classes class Item: def __init__(self, control, attrs, index=None): label = _get_label(attrs) self.__dict__.update({ "name": attrs["value"], "_labels": label and [label] or [], "attrs": attrs, "_control": control, "disabled": attrs.has_key("disabled"), "_selected": False, "id": attrs.get("id"), "_index": index, }) control.items.append(self) def get_labels(self): """Return all labels (Label instances) for this item. 
For items that represent radio buttons or checkboxes, if the item was surrounded by a <label> tag, that will be the first label; all other labels, connected by 'for' and 'id', are in the order that appear in the HTML. For items that represent select options, if the option had a label attribute, that will be the first label. If the option has contents (text within the option tags) and it is not the same as the label attribute (if any), that will be a label. There is nothing in the spec to my knowledge that makes an option with an id unable to be the target of a label's for attribute, so those are included, if any, for the sake of consistency and completeness. """ res = [] res.extend(self._labels) if self.id: res.extend(self._control._form._id_to_labels.get(self.id, ())) return res def __getattr__(self, name): if name=="selected": return self._selected raise AttributeError(name) def __setattr__(self, name, value): if name == "selected": self._control._set_selected_state(self, value) elif name == "disabled": self.__dict__["disabled"] = bool(value) else: raise AttributeError(name) def __str__(self): res = self.name if self.selected: res = "*" + res if self.disabled: res = "(%s)" % res return res def __repr__(self): # XXX appending the attrs without distinguishing them from name and id # is silly attrs = [("name", self.name), ("id", self.id)]+self.attrs.items() return "<%s %s>" % ( self.__class__.__name__, " ".join(["%s=%r" % (k, v) for k, v in attrs]) ) def disambiguate(items, nr, **kwds): msgs = [] for key, value in kwds.items(): msgs.append("%s=%r" % (key, value)) msg = " ".join(msgs) if not items: raise ItemNotFoundError(msg) if nr is None: if len(items) > 1: raise AmbiguityError(msg) nr = 0 if len(items) <= nr: raise ItemNotFoundError(msg) return items[nr] class ListControl(Control): """Control representing a sequence of items. The value attribute of a ListControl represents the successful list items in the control. 
The successful list items are those that are selected and not disabled. ListControl implements both list controls that take a length-1 value (single-selection) and those that take length >1 values (multiple-selection). ListControls accept sequence values only. Some controls only accept sequences of length 0 or 1 (RADIO, and single-selection SELECT). In those cases, ItemCountError is raised if len(sequence) > 1. CHECKBOXes and multiple-selection SELECTs (those having the "multiple" HTML attribute) accept sequences of any length. Note the following mistake: control.value = some_value assert control.value == some_value # not necessarily true The reason for this is that the value attribute always gives the list items in the order they were listed in the HTML. ListControl items can also be referred to by their labels instead of names. Use the label argument to .get(), and the .set_value_by_label(), .get_value_by_label() methods. Note that, rather confusingly, though SELECT controls are represented in HTML by SELECT elements (which contain OPTION elements, representing individual list items), CHECKBOXes and RADIOs are not represented by *any* element. Instead, those controls are represented by a collection of INPUT elements. For example, this is a SELECT control, named "control1": <select name="control1"> <option>foo</option> <option value="1">bar</option> </select> and this is a CHECKBOX control, named "control2": <input type="checkbox" name="control2" value="foo" id="cbe1"> <input type="checkbox" name="control2" value="bar" id="cbe2"> The id attribute of a CHECKBOX or RADIO ListControl is always that of its first element (for example, "cbe1" above). Additional read-only public attribute: multiple. """ # ListControls are built up by the parser from their component items by # creating one ListControl per item, consolidating them into a single # master ListControl held by the HTMLForm: # -User calls form.new_control(...) 
# -Form creates Control, and calls control.add_to_form(self). # -Control looks for a Control with the same name and type in the form, # and if it finds one, merges itself with that control by calling # control.merge_control(self). The first Control added to the form, of # a particular name and type, is the only one that survives in the # form. # -Form calls control.fixup for all its controls. ListControls in the # form know they can now safely pick their default values. # To create a ListControl without an HTMLForm, use: # control.merge_control(new_control) # (actually, it's much easier just to use ParseFile) _label = None def __init__(self, type, name, attrs={}, select_default=False, called_as_base_class=False, index=None): """ select_default: for RADIO and multiple-selection SELECT controls, pick the first item as the default if no 'selected' HTML attribute is present """ if not called_as_base_class: raise NotImplementedError() self.__dict__["type"] = type.lower() self.__dict__["name"] = name self._value = attrs.get("value") self.disabled = False self.readonly = False self.id = attrs.get("id") self._closed = False # As Controls are merged in with .merge_control(), self.attrs will # refer to each Control in turn -- always the most recently merged # control. Each merged-in Control instance corresponds to a single # list item: see ListControl.__doc__. self.items = [] self._form = None self._select_default = select_default self._clicked = False def clear(self): self.value = [] def is_of_kind(self, kind): if kind == "list": return True elif kind == "multilist": return bool(self.multiple) elif kind == "singlelist": return not self.multiple else: return False def get_items(self, name=None, label=None, id=None, exclude_disabled=False): """Return matching items by name or label. 
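        For example, assuming a CHECKBOX control whose items are named
        "cheese" and "crackers" (the names and label are illustrative only):

         control.get_items(name="cheese")
         control.get_items(label="please choose", exclude_disabled=True)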
        For argument docs, see the docstring for .get()

        """
        if name is not None and not isstringlike(name):
            raise TypeError("item name must be string-like")
        if label is not None and not isstringlike(label):
            raise TypeError("item label must be string-like")
        if id is not None and not isstringlike(id):
            raise TypeError("item id must be string-like")
        items = []  # order is important
        compat = self._form.backwards_compat
        for o in self.items:
            if exclude_disabled and o.disabled:
                continue
            if name is not None and o.name != name:
                continue
            if label is not None:
                for l in o.get_labels():
                    if ((compat and l.text == label) or
                        (not compat and l.text.find(label) > -1)):
                        break
                else:
                    continue
            if id is not None and o.id != id:
                continue
            items.append(o)
        return items

    def get(self, name=None, label=None, id=None, nr=None,
            exclude_disabled=False):
        """Return item by name or label, disambiguating if necessary with nr.

        All arguments must be passed by name, with the exception of 'name',
        which may be used as a positional argument.

        If name is specified, then the item must have the indicated name.

        If label is specified, then the item must have a label whose
        whitespace-compressed, stripped, text substring-matches the indicated
        label string (e.g. label="please choose" will match
        "  Do please choose an item ").

        If id is specified, then the item must have the indicated id.

        nr is an optional 0-based index of the items matching the query.

        If nr is the default None value and more than one item is found,
        raises AmbiguityError (unless the HTMLForm instance's backwards_compat
        attribute is true).

        If no item is found, or if items are found but nr is specified and not
        found, raises ItemNotFoundError.

        Optionally excludes disabled items.

        """
        if nr is None and self._form.backwards_compat:
            nr = 0  # :-/
        items = self.get_items(name, label, id, exclude_disabled)
        return disambiguate(items, nr, name=name, label=label, id=id)

    def _get(self, name, by_label=False, nr=None, exclude_disabled=False):
        # strictly for use by deprecated methods
        if by_label:
            name, label = None, name
        else:
            name, label = name, None
        # nr and exclude_disabled must be passed by keyword, since get()
        # takes id before nr
        return self.get(name, label, nr=nr, exclude_disabled=exclude_disabled)

    def toggle(self, name, by_label=False, nr=None):
        """Deprecated: given a name or label and optional disambiguating index
        nr, toggle the matching item's selection.

        Selecting items follows the behavior described in the docstring of the
        'get' method.

        If the item is disabled, or this control is disabled or readonly,
        raise AttributeError.

        """
        deprecation(
            "item = control.get(...); item.selected = not item.selected")
        o = self._get(name, by_label, nr)
        self._set_selected_state(o, not o.selected)

    def set(self, selected, name, by_label=False, nr=None):
        """Deprecated: given a name or label and optional disambiguating index
        nr, set the matching item's selection to the bool value of selected.

        Selecting items follows the behavior described in the docstring of the
        'get' method.

        If the item is disabled, or this control is disabled or readonly,
        raise AttributeError.

        """
        deprecation(
            "control.get(...).selected = <boolean>")
        self._set_selected_state(self._get(name, by_label, nr), selected)

    def _set_selected_state(self, item, action):
        # action:
        # bool False: off
        # bool True: on
        if self.disabled:
            raise AttributeError("control '%s' is disabled" % self.name)
        if self.readonly:
            raise AttributeError("control '%s' is readonly" % self.name)
        action = bool(action)
        compat = self._form.backwards_compat
        if not compat and item.disabled:
            raise AttributeError("item is disabled")
        else:
            if compat and item.disabled and action:
                raise AttributeError("item is disabled")
            if self.multiple:
                item.__dict__["_selected"] = action
            else:
                if not action:
                    item.__dict__["_selected"] = False
                else:
                    for o in self.items:
                        o.__dict__["_selected"] = False
                    item.__dict__["_selected"] = True

    def toggle_single(self, by_label=None):
        """Deprecated: toggle the selection of the single item in this
        control.

        Raises ItemCountError if the control does not contain only one item.

        by_label argument is ignored, and included only for backwards
        compatibility.

        """
        deprecation(
            "control.items[0].selected = not control.items[0].selected")
        if len(self.items) != 1:
            raise ItemCountError(
                "'%s' is not a single-item control" % self.name)
        item = self.items[0]
        self._set_selected_state(item, not item.selected)

    def set_single(self, selected, by_label=None):
        """Deprecated: set the selection of the single item in this control.

        Raises ItemCountError if the control does not contain only one item.

        by_label argument is ignored, and included only for backwards
        compatibility.
""" deprecation( "control.items[0].selected = <boolean>") if len(self.items) != 1: raise ItemCountError( "'%s' is not a single-item control" % self.name) self._set_selected_state(self.items[0], selected) def get_item_disabled(self, name, by_label=False, nr=None): """Get disabled state of named list item in a ListControl.""" deprecation( "control.get(...).disabled") return self._get(name, by_label, nr).disabled def set_item_disabled(self, disabled, name, by_label=False, nr=None): """Set disabled state of named list item in a ListControl. disabled: boolean disabled state """ deprecation( "control.get(...).disabled = <boolean>") self._get(name, by_label, nr).disabled = disabled def set_all_items_disabled(self, disabled): """Set disabled state of all list items in a ListControl. disabled: boolean disabled state """ for o in self.items: o.disabled = disabled def get_item_attrs(self, name, by_label=False, nr=None): """Return dictionary of HTML attributes for a single ListControl item. The HTML element types that describe list items are: OPTION for SELECT controls, INPUT for the rest. These elements have HTML attributes that you may occasionally want to know about -- for example, the "alt" HTML attribute gives a text string describing the item (graphical browsers usually display this as a tooltip). The returned dictionary maps HTML attribute names to values. The names and values are taken from the original HTML. 
""" deprecation( "control.get(...).attrs") return self._get(name, by_label, nr).attrs def close_control(self): self._closed = True def add_to_form(self, form): assert self._form is None or form == self._form, ( "can't add control to more than one form") self._form = form if self.name is None: # always count nameless elements as separate controls Control.add_to_form(self, form) else: for ii in range(len(form.controls)-1, -1, -1): control = form.controls[ii] if control.name == self.name and control.type == self.type: if control._closed: Control.add_to_form(self, form) else: control.merge_control(self) break else: Control.add_to_form(self, form) def merge_control(self, control): assert bool(control.multiple) == bool(self.multiple) # usually, isinstance(control, self.__class__) self.items.extend(control.items) def fixup(self): """ ListControls are built up from component list items (which are also ListControls) during parsing. This method should be called after all items have been added. See ListControl.__doc__ for the reason this is required. """ # Need to set default selection where no item was indicated as being # selected by the HTML: # CHECKBOX: # Nothing should be selected. # SELECT/single, SELECT/multiple and RADIO: # RFC 1866 (HTML 2.0): says first item should be selected. # W3C HTML 4.01 Specification: says that client behaviour is # undefined in this case. For RADIO, exactly one must be selected, # though which one is undefined. # Both Netscape and Microsoft Internet Explorer (IE) choose first # item for SELECT/single. However, both IE5 and Mozilla (both 1.0 # and Firebird 0.6) leave all items unselected for RADIO and # SELECT/multiple. # Since both Netscape and IE all choose the first item for # SELECT/single, we do the same. OTOH, both Netscape and IE # leave SELECT/multiple with nothing selected, in violation of RFC 1866 # (but not in violation of the W3C HTML 4 standard); the same is true # of RADIO (which *is* in violation of the HTML 4 standard). 
        # We follow RFC 1866 if the _select_default attribute is set, and
        # Netscape and IE otherwise.  RFC 1866 and HTML 4 are always violated
        # insofar as you can deselect all items in a RadioControl.

        for o in self.items:
            # set items' controls to self, now that we've merged
            o.__dict__["_control"] = self

    def __getattr__(self, name):
        if name == "value":
            compat = self._form.backwards_compat
            if self.name is None:
                return []
            return [o.name for o in self.items if o.selected and
                    (not o.disabled or compat)]
        else:
            raise AttributeError("%s instance has no attribute '%s'" %
                                 (self.__class__.__name__, name))

    def __setattr__(self, name, value):
        if name == "value":
            if self.disabled:
                raise AttributeError("control '%s' is disabled" % self.name)
            if self.readonly:
                raise AttributeError("control '%s' is readonly" % self.name)
            self._set_value(value)
        elif name in ("name", "type", "multiple"):
            raise AttributeError("%s attribute is readonly" % name)
        else:
            self.__dict__[name] = value

    def _set_value(self, value):
        if value is None or isstringlike(value):
            raise TypeError("ListControl, must set a sequence")
        if not value:
            compat = self._form.backwards_compat
            for o in self.items:
                if not o.disabled or compat:
                    o.selected = False
        elif self.multiple:
            self._multiple_set_value(value)
        elif len(value) > 1:
            raise ItemCountError(
                "single selection list, must set sequence of "
                "length 0 or 1")
        else:
            self._single_set_value(value)

    def _get_items(self, name, target=1):
        all_items = self.get_items(name)
        items = [o for o in all_items if not o.disabled]
        if len(items) < target:
            if len(all_items) < target:
                raise ItemNotFoundError(
                    "insufficient items with name %r" % name)
            else:
                raise AttributeError(
                    "insufficient non-disabled items with name %s" % name)
        on = []
        off = []
        for o in items:
            if o.selected:
                on.append(o)
            else:
                off.append(o)
        return on, off

    def _single_set_value(self, value):
        assert len(value) == 1
        on, off = self._get_items(value[0])
        assert len(on) <= 1
        if not on:
            off[0].selected = True

    def _multiple_set_value(self, value):
        compat = self._form.backwards_compat
        turn_on = []  # transactional-ish
        turn_off = [item for item in self.items if
                    item.selected and (not item.disabled or compat)]
        names = {}
        for nn in value:
            if nn in names.keys():
                names[nn] += 1
            else:
                names[nn] = 1
        for name, count in names.items():
            on, off = self._get_items(name, count)
            for i in range(count):
                if on:
                    item = on[0]
                    del on[0]
                    del turn_off[turn_off.index(item)]
                else:
                    item = off[0]
                    del off[0]
                    turn_on.append(item)
        for item in turn_off:
            item.selected = False
        for item in turn_on:
            item.selected = True

    def set_value_by_label(self, value):
        """Set the value of control by item labels.

        value is expected to be an iterable of strings that are substrings of
        the item labels that should be selected.  Before substring matching is
        performed, the original label text is whitespace-compressed
        (consecutive whitespace characters are converted to a single space
        character) and leading and trailing whitespace is stripped.  Ambiguous
        labels are accepted without complaint if the form's backwards_compat
        is True; otherwise, it will not complain as long as all ambiguous
        labels share the same item name (e.g. OPTION value).

        """
        if isstringlike(value):
            raise TypeError(value)
        if not self.multiple and len(value) > 1:
            raise ItemCountError(
                "single selection list, must set sequence of "
                "length 0 or 1")
        items = []
        for nn in value:
            found = self.get_items(label=nn)
            if len(found) > 1:
                if not self._form.backwards_compat:
                    # ambiguous labels are fine as long as item names (e.g.
                    # OPTION values) are same
                    opt_name = found[0].name
                    if [o for o in found[1:] if o.name != opt_name]:
                        raise AmbiguityError(nn)
                else:
                    # OK, we'll guess :-( Assume first available item.
                    found = found[:1]
            for o in found:
                # For the multiple-item case, we could try to be smarter,
                # saving them up and trying to resolve, but that's too much.
if self._form.backwards_compat or o not in items: items.append(o) break else: # all of them are used raise ItemNotFoundError(nn) # now we have all the items that should be on # let's just turn everything off and then back on. self.value = [] for o in items: o.selected = True def get_value_by_label(self): """Return the value of the control as given by normalized labels.""" res = [] compat = self._form.backwards_compat for o in self.items: if (not o.disabled or compat) and o.selected: for l in o.get_labels(): if l.text: res.append(l.text) break else: res.append(None) return res def possible_items(self, by_label=False): """Deprecated: return the names or labels of all possible items. Includes disabled items, which may be misleading for some use cases. """ deprecation( "[item.name for item in self.items]") if by_label: res = [] for o in self.items: for l in o.get_labels(): if l.text: res.append(l.text) break else: res.append(None) return res return [o.name for o in self.items] def _totally_ordered_pairs(self): if self.disabled or self.name is None: return [] else: return [(o._index, self.name, o.name) for o in self.items if o.selected and not o.disabled] def __str__(self): name = self.name if name is None: name = "<None>" display = [str(o) for o in self.items] infos = [] if self.disabled: infos.append("disabled") if self.readonly: infos.append("readonly") info = ", ".join(infos) if info: info = " (%s)" % info return "<%s(%s=[%s])%s>" % (self.__class__.__name__, name, ", ".join(display), info) class RadioControl(ListControl): """ Covers: INPUT/RADIO """ def __init__(self, type, name, attrs, select_default=False, index=None): attrs.setdefault("value", "on") ListControl.__init__(self, type, name, attrs, select_default, called_as_base_class=True, index=index) self.__dict__["multiple"] = False o = Item(self, attrs, index) o.__dict__["_selected"] = attrs.has_key("checked") def fixup(self): ListControl.fixup(self) found = [o for o in self.items if o.selected and not 
o.disabled] if not found: if self._select_default: for o in self.items: if not o.disabled: o.selected = True break else: # Ensure only one item selected. Choose the last one, # following IE and Firefox. for o in found[:-1]: o.selected = False def get_labels(self): return [] class CheckboxControl(ListControl): """ Covers: INPUT/CHECKBOX """ def __init__(self, type, name, attrs, select_default=False, index=None): attrs.setdefault("value", "on") ListControl.__init__(self, type, name, attrs, select_default, called_as_base_class=True, index=index) self.__dict__["multiple"] = True o = Item(self, attrs, index) o.__dict__["_selected"] = attrs.has_key("checked") def get_labels(self): return [] class SelectControl(ListControl): """ Covers: SELECT (and OPTION) OPTION 'values', in HTML parlance, are Item 'names' in mechanize parlance. SELECT control values and labels are subject to some messy defaulting rules. For example, if the HTML representation of the control is: <SELECT name=year> <OPTION value=0 label="2002">current year</OPTION> <OPTION value=1>2001</OPTION> <OPTION>2000</OPTION> </SELECT> The items, in order, have labels "2002", "2001" and "2000", whereas their names (the OPTION values) are "0", "1" and "2000" respectively. Note that the value of the last OPTION in this example defaults to its contents, as specified by RFC 1866, as do the labels of the second and third OPTIONs. The OPTION labels are sometimes more meaningful than the OPTION values, which can make for more maintainable code. Additional read-only public attribute: attrs The attrs attribute is a dictionary of the original HTML attributes of the SELECT element. Other ListControls do not have this attribute, because in other cases the control as a whole does not correspond to any single HTML element. control.get(...).attrs may be used as usual to get at the HTML attributes of the HTML elements corresponding to individual list items (for SELECT controls, these are OPTION elements). 
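    For instance, with the SELECT control named "year" above (a sketch only;
    the form object is assumed to have come from ParseResponse or similar):

     control = form.find_control("year")
     control.get("0").selected = True      # select by OPTION value (item name)
     control.set_value_by_label(["2002"])  # or select by label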
Another special case is that the Item.attrs dictionaries have a special key "contents" which does not correspond to any real HTML attribute, but rather contains the contents of the OPTION element: <OPTION>this bit</OPTION> """ # HTML attributes here are treated slightly differently from other list # controls: # -The SELECT HTML attributes dictionary is stuffed into the OPTION # HTML attributes dictionary under the "__select" key. # -The content of each OPTION element is stored under the special # "contents" key of the dictionary. # After all this, the dictionary is passed to the SelectControl constructor # as the attrs argument, as usual. However: # -The first SelectControl constructed when building up a SELECT control # has a constructor attrs argument containing only the __select key -- so # this SelectControl represents an empty SELECT control. # -Subsequent SelectControls have both OPTION HTML-attribute in attrs and # the __select dictionary containing the SELECT HTML-attributes. def __init__(self, type, name, attrs, select_default=False, index=None): # fish out the SELECT HTML attributes from the OPTION HTML attributes # dictionary self.attrs = attrs["__select"].copy() self.__dict__["_label"] = _get_label(self.attrs) self.__dict__["id"] = self.attrs.get("id") self.__dict__["multiple"] = self.attrs.has_key("multiple") # the majority of the contents, label, and value dance already happened contents = attrs.get("contents") attrs = attrs.copy() del attrs["__select"] ListControl.__init__(self, type, name, self.attrs, select_default, called_as_base_class=True, index=index) self.disabled = self.attrs.has_key("disabled") self.readonly = self.attrs.has_key("readonly") if attrs.has_key("value"): # otherwise it is a marker 'select started' token o = Item(self, attrs, index) o.__dict__["_selected"] = attrs.has_key("selected") # add 'label' label and contents label, if different. 
If both are # provided, the 'label' label is used for display in HTML # 4.0-compliant browsers (and any lower spec? not sure) while the # contents are used for display in older or less-compliant # browsers. We make label objects for both, if the values are # different. label = attrs.get("label") if label: o._labels.append(Label({"__text": label})) if contents and contents != label: o._labels.append(Label({"__text": contents})) elif contents: o._labels.append(Label({"__text": contents})) def fixup(self): ListControl.fixup(self) # Firefox doesn't exclude disabled items from those considered here # (i.e. from 'found', for both branches of the if below). Note that # IE6 doesn't support the disabled attribute on OPTIONs at all. found = [o for o in self.items if o.selected] if not found: if not self.multiple or self._select_default: for o in self.items: if not o.disabled: was_disabled = self.disabled self.disabled = False try: o.selected = True finally: o.disabled = was_disabled break elif not self.multiple: # Ensure only one item selected. Choose the last one, # following IE and Firefox. for o in found[:-1]: o.selected = False #--------------------------------------------------- class SubmitControl(ScalarControl): """ Covers: INPUT/SUBMIT BUTTON/SUBMIT """ def __init__(self, type, name, attrs, index=None): ScalarControl.__init__(self, type, name, attrs, index) # IE5 defaults SUBMIT value to "Submit Query"; Firebird 0.6 leaves it # blank, Konqueror 3.1 defaults to "Submit". HTML spec. doesn't seem # to define this. 
if self.value is None: self.value = "" self.readonly = True def get_labels(self): res = [] if self.value: res.append(Label({"__text": self.value})) res.extend(ScalarControl.get_labels(self)) return res def is_of_kind(self, kind): return kind == "clickable" def _click(self, form, coord, return_type, request_class=_request.Request): self._clicked = coord r = form._switch_click(return_type, request_class) self._clicked = False return r def _totally_ordered_pairs(self): if not self._clicked: return [] return ScalarControl._totally_ordered_pairs(self) #--------------------------------------------------- class ImageControl(SubmitControl): """ Covers: INPUT/IMAGE Coordinates are specified using one of the HTMLForm.click* methods. """ def __init__(self, type, name, attrs, index=None): SubmitControl.__init__(self, type, name, attrs, index) self.readonly = False def _totally_ordered_pairs(self): clicked = self._clicked if self.disabled or not clicked: return [] name = self.name if name is None: return [] pairs = [ (self._index, "%s.x" % name, str(clicked[0])), (self._index+1, "%s.y" % name, str(clicked[1])), ] value = self._value if value: pairs.append((self._index+2, name, value)) return pairs get_labels = ScalarControl.get_labels # aliases, just to make str(control) and str(form) clearer class PasswordControl(TextControl): pass class HiddenControl(TextControl): pass class TextareaControl(TextControl): pass class SubmitButtonControl(SubmitControl): pass def is_listcontrol(control): return control.is_of_kind("list") class HTMLForm: """Represents a single HTML <form> ... </form> element. A form consists of a sequence of controls that usually have names, and which can take on various values. The values of the various types of controls represent variously: text, zero-or-one-of-many or many-of-many choices, and files to be uploaded. Some controls can be clicked on to submit the form, and clickable controls' values sometimes include the coordinates of the click. 
Forms can be filled in with data to be returned to the server, and then submitted, using the click method to generate a request object suitable for passing to mechanize.urlopen (or the click_request_data or click_pairs methods for integration with third-party code). import mechanize forms = mechanize.ParseFile(html, base_uri) form = forms[0] form["query"] = "Python" form.find_control("nr_results").get("lots").selected = True response = mechanize.urlopen(form.click()) Usually, HTMLForm instances are not created directly. Instead, the ParseFile or ParseResponse factory functions are used. If you do construct HTMLForm objects yourself, however, note that an HTMLForm instance is only properly initialised after the fixup method has been called (ParseFile and ParseResponse do this for you). See ListControl.__doc__ for the reason this is required. Indexing a form (form["control_name"]) returns the named Control's value attribute. Assignment to a form index (form["control_name"] = something) is equivalent to assignment to the named Control's value attribute. If you need to be more specific than just supplying the control's name, use the set_value and get_value methods. ListControl values are lists of item names (specifically, the names of the items that are selected and not disabled, and hence are "successful" -- i.e. cause data to be returned to the server). The list item's name is the value of the corresponding HTML element's "value" attribute. Example: <INPUT type="CHECKBOX" name="cheeses" value="leicester"></INPUT> <INPUT type="CHECKBOX" name="cheeses" value="cheddar"></INPUT> defines a CHECKBOX control with name "cheeses" which has two items, named "leicester" and "cheddar".
Another example: <SELECT name="more_cheeses"> <OPTION>1</OPTION> <OPTION value="2" label="CHEDDAR">cheddar</OPTION> </SELECT> defines a SELECT control with name "more_cheeses" which has two items, named "1" and "2" (because the OPTION element's value HTML attribute defaults to the element contents -- see SelectControl.__doc__ for more on these defaulting rules). To select, deselect or otherwise manipulate individual list items, use the HTMLForm.find_control() and ListControl.get() methods. To set the whole value, do as for any other control: use indexing or the set_/get_value methods. Example: # select *only* the item named "cheddar" form["cheeses"] = ["cheddar"] # select "cheddar", leave other items unaffected form.find_control("cheeses").get("cheddar").selected = True Some controls (RADIO and SELECT without the multiple attribute) can only have zero or one items selected at a time. Some controls (CHECKBOX and SELECT with the multiple attribute) can have multiple items selected at a time. To set the whole value of a ListControl, assign a sequence to a form index: form["cheeses"] = ["cheddar", "leicester"] If the ListControl is not multiple-selection, the assigned list must be of length one. To check if a control has an item, if an item is selected, or if an item is successful (selected and not disabled), respectively: "cheddar" in [item.name for item in form.find_control("cheeses").items] "cheddar" in [item.name for item in form.find_control("cheeses").items if item.selected] "cheddar" in form["cheeses"] # (or "cheddar" in form.get_value("cheeses")) Note that some list items may be disabled (see below). Note the following mistake: form[control_name] = control_value assert form[control_name] == control_value # not necessarily true The reason for this is that form[control_name] always gives the list items in the order they were listed in the HTML.
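The OPTION value/label defaulting rules just described can be sketched in a few lines (an illustrative stand-alone helper, not mechanize's actual implementation; `option_fields` is a made-up name):

```python
def option_fields(attrs, contents):
    """Return (item_name, label) for an OPTION element.

    Illustrative sketch of the defaulting rules described above:
    the item name is the OPTION's "value" attribute, falling back to
    the element contents; the display label falls back the same way.
    """
    name = attrs.get("value", contents)
    label = attrs.get("label", contents)
    return name, label

# <OPTION value=0 label="2002">current year</OPTION>
print(option_fields({"value": "0", "label": "2002"}, "current year"))  # ('0', '2002')
# <OPTION value=1>2001</OPTION>
print(option_fields({"value": "1"}, "2001"))  # ('1', '2001')
# <OPTION>2000</OPTION> -- both value and label default to the contents
print(option_fields({}, "2000"))  # ('2000', '2000')
```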
List items (hence list values, too) can be referred to in terms of list item labels rather than list item names using the appropriate label arguments. Note that each item may have several labels. The question of default values of OPTION contents, labels and values is somewhat complicated: see SelectControl.__doc__ and ListControl.get_item_attrs.__doc__ if you think you need to know. Controls can be disabled or readonly. In either case, the control's value cannot be changed until you clear those flags (see example below). Disabled is the state typically represented by browsers by 'greying out' a control. Disabled controls are not 'successful' -- they don't cause data to get returned to the server. Readonly controls usually appear in browsers as read-only text boxes. Readonly controls are successful. List items can also be disabled. Attempts to select or deselect disabled items fail with AttributeError. If a lot of controls are readonly, it can be useful to do this: form.set_all_readonly(False) To clear a control's value attribute, so that it is not successful (until a value is subsequently set): form.clear("cheeses") More examples: control = form.find_control("cheeses") control.disabled = False control.readonly = False control.get("gruyere").disabled = True control.items[0].selected = True See the various Control classes for further documentation. Many methods take name, type, kind, id, label and nr arguments to specify the control to be operated on: see HTMLForm.find_control.__doc__. ControlNotFoundError (subclass of ValueError) is raised if the specified control can't be found. This includes occasions where a non-ListControl is found, but the method (set, for example) requires a ListControl. ItemNotFoundError (subclass of ValueError) is raised if a list item can't be found. 
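The "successful" rule above -- a disabled control contributes nothing, and within an enabled list control only items that are both selected and not disabled contribute data -- can be sketched with plain dicts (illustration only; `successful_pairs` is a made-up helper, and real mechanize controls are objects, not dicts):

```python
def successful_pairs(control):
    """Return the (control name, item name) pairs a list control submits.

    Sketch of the "successful" rule described above, not mechanize's
    implementation: a disabled or nameless control submits nothing;
    otherwise only selected, non-disabled items contribute.
    """
    if control.get("disabled") or control.get("name") is None:
        return []
    return [(control["name"], item["name"])
            for item in control["items"]
            if item.get("selected") and not item.get("disabled")]

cheeses = {
    "name": "cheeses",
    "items": [
        {"name": "cheddar", "selected": True},
        {"name": "gruyere", "selected": True, "disabled": True},  # not successful
        {"name": "leicester"},                                    # not selected
    ],
}
print(successful_pairs(cheeses))  # [('cheeses', 'cheddar')]
```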
ItemCountError (subclass of ValueError) is raised if an attempt is made to select more than one item and the control doesn't allow that, or set/get_single are called and the control contains more than one item. AttributeError is raised if a control or item is readonly or disabled and an attempt is made to alter its value. Security note: Remember that any passwords you store in HTMLForm instances will be saved to disk in the clear if you pickle them (directly or indirectly). The simplest solution to this is to avoid pickling HTMLForm objects. You could also pickle before filling in any password, or just set the password to "" before pickling. Public attributes: action: full (absolute URI) form action method: "GET" or "POST" enctype: form transfer encoding MIME type name: name of form (None if no name was specified) attrs: dictionary mapping original HTML form attributes to their values controls: list of Control instances; do not alter this list (instead, call form.new_control to make a Control and add it to the form, or control.add_to_form if you already have a Control instance) Methods for form filling: ------------------------- Most of these methods have very similar arguments. See HTMLForm.find_control.__doc__ for details of the name, type, kind, label and nr arguments.
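The shared name/type/nr selection behaviour these methods rely on can be sketched as follows (a simplified stand-in for find_control using plain dicts; the AmbiguityError/ControlNotFoundError distinctions are collapsed into ValueError here):

```python
def find_control(controls, name=None, type=None, nr=None):
    """Pick a control by name/type, optionally by sequence number nr.

    Sketch of the matching behaviour described in the docs: nr counts
    matches (not positions in the form), and without nr more than one
    match is ambiguous and raises an error.
    """
    matches = [c for c in controls
               if (name is None or c["name"] == name)
               and (type is None or c["type"] == type)]
    if nr is not None:
        if nr >= len(matches):
            raise ValueError("no control matching nr %d" % nr)
        return matches[nr]
    if not matches:
        raise ValueError("no matching control")
    if len(matches) > 1:
        raise ValueError("more than one matching control")
    return matches[0]

controls = [{"name": "q", "type": "text"},
            {"name": "go", "type": "submit"},
            {"name": "go", "type": "image"}]
print(find_control(controls, name="q")["type"])         # text
print(find_control(controls, name="go", nr=1)["type"])  # image
```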
def find_control(self, name=None, type=None, kind=None, id=None, predicate=None, nr=None, label=None) get_value(name=None, type=None, kind=None, id=None, nr=None, by_label=False, # by_label is deprecated label=None) set_value(value, name=None, type=None, kind=None, id=None, nr=None, by_label=False, # by_label is deprecated label=None) clear_all() clear(name=None, type=None, kind=None, id=None, nr=None, label=None) set_all_readonly(readonly) Method applying only to FileControls: add_file(file_object, content_type="application/octet-stream", filename=None, name=None, id=None, nr=None, label=None) Methods applying only to clickable controls: click(name=None, type=None, id=None, nr=0, coord=(1,1), label=None) click_request_data(name=None, type=None, id=None, nr=0, coord=(1,1), label=None) click_pairs(name=None, type=None, id=None, nr=0, coord=(1,1), label=None) """ type2class = { "text": TextControl, "password": PasswordControl, "hidden": HiddenControl, "textarea": TextareaControl, "isindex": IsindexControl, "file": FileControl, "button": IgnoreControl, "buttonbutton": IgnoreControl, "reset": IgnoreControl, "resetbutton": IgnoreControl, "submit": SubmitControl, "submitbutton": SubmitButtonControl, "image": ImageControl, "radio": RadioControl, "checkbox": CheckboxControl, "select": SelectControl, } #--------------------------------------------------- # Initialisation. Use ParseResponse / ParseFile instead. def __init__(self, action, method="GET", enctype="application/x-www-form-urlencoded", name=None, attrs=None, request_class=_request.Request, forms=None, labels=None, id_to_labels=None, backwards_compat=True): """ In the usual case, use ParseResponse (or ParseFile) to create new HTMLForm objects. 
action: full (absolute URI) form action method: "GET" or "POST" enctype: form transfer encoding MIME type name: name of form attrs: dictionary mapping original HTML form attributes to their values """ self.action = action self.method = method self.enctype = enctype self.name = name if attrs is not None: self.attrs = attrs.copy() else: self.attrs = {} self.controls = [] self._request_class = request_class # these attributes are used by zope.testbrowser self._forms = forms # this is a semi-public API! self._labels = labels # this is a semi-public API! self._id_to_labels = id_to_labels # this is a semi-public API! self.backwards_compat = backwards_compat # note __setattr__ self._urlunparse = urlparse.urlunparse self._urlparse = urlparse.urlparse def __getattr__(self, name): if name == "backwards_compat": return self._backwards_compat return getattr(HTMLForm, name) def __setattr__(self, name, value): # yuck if name == "backwards_compat": name = "_backwards_compat" value = bool(value) for cc in self.controls: try: items = cc.items except AttributeError: continue else: for ii in items: for ll in ii.get_labels(): ll._backwards_compat = value self.__dict__[name] = value def new_control(self, type, name, attrs, ignore_unknown=False, select_default=False, index=None): """Adds a new control to the form. This is usually called by ParseFile and ParseResponse. Don't call it yourself unless you're building your own Control instances. Note that controls representing lists of items are built up from controls holding only a single list item. See ListControl.__doc__ for further information.
type: type of control (see Control.__doc__ for a list) attrs: HTML attributes of control ignore_unknown: if true, use a dummy Control instance for controls of unknown type; otherwise, use a TextControl select_default: for RADIO and multiple-selection SELECT controls, pick the first item as the default if no 'selected' HTML attribute is present (this defaulting happens when the HTMLForm.fixup method is called) index: index of corresponding element in HTML (see MoreFormTests.test_interspersed_controls for motivation) """ type = type.lower() klass = self.type2class.get(type) if klass is None: if ignore_unknown: klass = IgnoreControl else: klass = TextControl a = attrs.copy() if issubclass(klass, ListControl): control = klass(type, name, a, select_default, index) else: control = klass(type, name, a, index) if type == "select" and len(attrs) == 1: for ii in range(len(self.controls)-1, -1, -1): ctl = self.controls[ii] if ctl.type == "select": ctl.close_control() break control.add_to_form(self) control._urlparse = self._urlparse control._urlunparse = self._urlunparse def fixup(self): """Normalise form after all controls have been added. This is usually called by ParseFile and ParseResponse. Don't call it yourself unless you're building your own Control instances. This method should only be called once, after all controls have been added to the form. """ for control in self.controls: control.fixup() self.backwards_compat = self._backwards_compat #--------------------------------------------------- def __str__(self): header = "%s%s %s %s" % ( (self.name and self.name+" " or ""), self.method, self.action, self.enctype) rep = [header] for control in self.controls: rep.append(" %s" % str(control)) return "<%s>" % "\n".join(rep) #--------------------------------------------------- # Form-filling methods.
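The form-filling methods that follow ultimately feed the pairs that click() encodes into a request. For a GET form, the successful pairs are urlencoded and replace the action URL's query, and the fragment is dropped. A sketch of that URL construction using the Python 3 stdlib (mechanize itself uses the Python 2 urllib/urlparse modules; `get_request_url` is a made-up helper):

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def get_request_url(action, pairs):
    """Build the URL a GET form submission would request (sketch only).

    The form data replaces any existing query string on the action URL,
    and the fragment is not sent to the server.
    """
    scheme, netloc, path, _query, _frag = urlsplit(action)
    return urlunsplit((scheme, netloc, path, urlencode(pairs), ""))

url = get_request_url("http://example.com/search?old=1#top",
                      [("q", "python mechanize"), ("nr", "10")])
print(url)  # http://example.com/search?q=python+mechanize&nr=10
```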
def __getitem__(self, name): return self.find_control(name).value def __contains__(self, name): return bool(self.find_control(name)) def __setitem__(self, name, value): control = self.find_control(name) try: control.value = value except AttributeError, e: raise ValueError(str(e)) def get_value(self, name=None, type=None, kind=None, id=None, nr=None, by_label=False, # by_label is deprecated label=None): """Return value of control. If only the name argument is supplied, equivalent to form[name] """ if by_label: deprecation("form.get_value_by_label(...)") c = self.find_control(name, type, kind, id, label=label, nr=nr) if by_label: try: meth = c.get_value_by_label except AttributeError: raise NotImplementedError( "control '%s' does not yet support by_label" % c.name) else: return meth() else: return c.value def set_value(self, value, name=None, type=None, kind=None, id=None, nr=None, by_label=False, # by_label is deprecated label=None): """Set value of control. If only name and value arguments are supplied, equivalent to form[name] = value """ if by_label: deprecation("form.set_value_by_label(...)") c = self.find_control(name, type, kind, id, label=label, nr=nr) if by_label: try: meth = c.set_value_by_label except AttributeError: raise NotImplementedError( "control '%s' does not yet support by_label" % c.name) else: meth(value) else: c.value = value def get_value_by_label( self, name=None, type=None, kind=None, id=None, label=None, nr=None): """ All arguments should be passed by name. """ c = self.find_control(name, type, kind, id, label=label, nr=nr) return c.get_value_by_label() def set_value_by_label( self, value, name=None, type=None, kind=None, id=None, label=None, nr=None): """ All arguments should be passed by name.
""" c = self.find_control(name, type, kind, id, label=label, nr=nr) c.set_value_by_label(value) def set_all_readonly(self, readonly): for control in self.controls: control.readonly = bool(readonly) def clear_all(self): """Clear the value attributes of all controls in the form. See HTMLForm.clear.__doc__. """ for control in self.controls: control.clear() def clear(self, name=None, type=None, kind=None, id=None, nr=None, label=None): """Clear the value attribute of a control. As a result, the affected control will not be successful until a value is subsequently set. AttributeError is raised on readonly controls. """ c = self.find_control(name, type, kind, id, label=label, nr=nr) c.clear() #--------------------------------------------------- # Form-filling methods applying only to ListControls. def possible_items(self, # deprecated name=None, type=None, kind=None, id=None, nr=None, by_label=False, label=None): """Return a list of all values that the specified control can take.""" c = self._find_list_control(name, type, kind, id, label, nr) return c.possible_items(by_label) def set(self, selected, item_name, # deprecated name=None, type=None, kind=None, id=None, nr=None, by_label=False, label=None): """Select / deselect named list item. selected: boolean selected state """ self._find_list_control(name, type, kind, id, label, nr).set( selected, item_name, by_label) def toggle(self, item_name, # deprecated name=None, type=None, kind=None, id=None, nr=None, by_label=False, label=None): """Toggle selected state of named list item.""" self._find_list_control(name, type, kind, id, label, nr).toggle( item_name, by_label) def set_single(self, selected, # deprecated name=None, type=None, kind=None, id=None, nr=None, by_label=None, label=None): """Select / deselect list item in a control having only one item. If the control has multiple list items, ItemCountError is raised. 
This is just a convenience method, so you don't need to know the item's name -- the item name in these single-item controls is usually something meaningless like "1" or "on". For example, if a checkbox has a single item named "on", the following two calls are equivalent: control.toggle("on") control.toggle_single() """ # by_label ignored and deprecated self._find_list_control( name, type, kind, id, label, nr).set_single(selected) def toggle_single(self, name=None, type=None, kind=None, id=None, nr=None, by_label=None, label=None): # deprecated """Toggle selected state of list item in control having only one item. The rest is as for HTMLForm.set_single.__doc__. """ # by_label ignored and deprecated self._find_list_control(name, type, kind, id, label, nr).toggle_single() #--------------------------------------------------- # Form-filling method applying only to FileControls. def add_file(self, file_object, content_type=None, filename=None, name=None, id=None, nr=None, label=None): """Add a file to be uploaded. file_object: file-like object (with read method) from which to read data to upload content_type: MIME content type of data to upload filename: filename to pass to server If filename is None, no filename is sent to the server. If content_type is None, the content type is guessed based on the filename and the data read from the file object. XXX At the moment, guessed content type is always application/octet-stream. Use sndhdr, imghdr modules. Should also try to guess HTML, XML, and plain text. Note the following useful HTML attributes of file upload controls (see HTML 4.01 spec, section 17): accept: comma-separated list of content types that the server will handle correctly; you can use this to filter out non-conforming files size: XXX IIRC, this is indicative of whether form wants multiple or single files maxlength: XXX hint of max content length in bytes?
""" self.find_control(name, "file", id=id, label=label, nr=nr).add_file( file_object, content_type, filename) #--------------------------------------------------- # Form submission methods, applying only to clickable controls. def click(self, name=None, type=None, id=None, nr=0, coord=(1,1), request_class=_request.Request, label=None): """Return request that would result from clicking on a control. The request object is a mechanize.Request instance, which you can pass to mechanize.urlopen. Only some control types (INPUT/SUBMIT & BUTTON/SUBMIT buttons and IMAGEs) can be clicked. Will click on the first clickable control, subject to the name, type and nr arguments (as for find_control). If no name, type, id or number is specified and there are no clickable controls, a request will be returned for the form in its current, un-clicked, state. IndexError is raised if any of name, type, id or nr is specified but no matching control is found. ValueError is raised if the HTMLForm has an enctype attribute that is not recognised. You can optionally specify a coordinate to click at, which only makes a difference if you clicked on an image. """ return self._click(name, type, id, label, nr, coord, "request", self._request_class) def click_request_data(self, name=None, type=None, id=None, nr=0, coord=(1,1), request_class=_request.Request, label=None): """As for click method, but return a tuple (url, data, headers). You can use this data to send a request to the server. This is useful if you're using httplib or urllib rather than mechanize. Otherwise, use the click method. # Untested. Have to subclass to add headers, I think -- so use # mechanize instead! import urllib url, data, hdrs = form.click_request_data() r = urllib.urlopen(url, data) # Untested. I don't know of any reason to use httplib -- you can get # just as much control with mechanize. 
import httplib, urlparse url, data, hdrs = form.click_request_data() tup = urlparse.urlparse(url) host, path = tup[1], urlparse.urlunparse((None, None)+tup[2:]) conn = httplib.HTTPConnection(host) if data: conn.request("POST", path, data, hdrs) else: conn.request("GET", path, headers=hdrs) r = conn.getresponse() """ return self._click(name, type, id, label, nr, coord, "request_data", self._request_class) def click_pairs(self, name=None, type=None, id=None, nr=0, coord=(1,1), label=None): """As for click_request_data, but returns a list of (key, value) pairs. You can use this list as an argument to urllib.urlencode. This is usually only useful if you're using httplib or urllib rather than mechanize. It may also be useful if you want to manually tweak the keys and/or values, but this should not be necessary. Otherwise, use the click method. Note that this method is only useful for forms of MIME type x-www-form-urlencoded. In particular, it does not return the information required for file upload. If you need file upload and are not using mechanize, use click_request_data. """ return self._click(name, type, id, label, nr, coord, "pairs", self._request_class) #--------------------------------------------------- def find_control(self, name=None, type=None, kind=None, id=None, predicate=None, nr=None, label=None): """Locate and return some specific control within the form. At least one of the name, type, kind, id, label, predicate and nr arguments must be supplied. If no matching control is found, ControlNotFoundError is raised. If name is specified, then the control must have the indicated name. If type is specified then the control must have the specified type (in addition to the types possible for <input> HTML tags: "text", "password", "hidden", "submit", "image", "button", "radio", "checkbox", "file" we also have "reset", "buttonbutton", "submitbutton", "resetbutton", "textarea", "select" and "isindex").
If kind is specified, then the control must fall into the specified group, each of which satisfies a particular interface. The kinds are "text", "list", "multilist", "singlelist", "clickable" and "file". If id is specified, then the control must have the indicated id. If predicate is specified, then the control must match that function. The predicate function is passed the control as its single argument, and should return a boolean value indicating whether the control matched. nr, if supplied, is the sequence number of the control (where 0 is the first). Note that control 0 is the first control matching all the other arguments (if supplied); it is not necessarily the first control in the form. If no nr is supplied, AmbiguityError is raised if multiple controls match the other arguments (unless the .backwards_compat attribute is true). If label is specified, then the control must have this label. Note that radio controls and checkboxes never have labels: their items do. """ if ((name is None) and (type is None) and (kind is None) and (id is None) and (label is None) and (predicate is None) and (nr is None)): raise ValueError( "at least one argument must be supplied to specify control") return self._find_control(name, type, kind, id, label, predicate, nr) #--------------------------------------------------- # Private methods.
def _find_list_control(self, name=None, type=None, kind=None, id=None, label=None, nr=None): if ((name is None) and (type is None) and (kind is None) and (id is None) and (label is None) and (nr is None)): raise ValueError( "at least one argument must be supplied to specify control") return self._find_control(name, type, kind, id, label, is_listcontrol, nr) def _find_control(self, name, type, kind, id, label, predicate, nr): if ((name is not None) and (name is not Missing) and not isstringlike(name)): raise TypeError("control name must be string-like") if (type is not None) and not isstringlike(type): raise TypeError("control type must be string-like") if (kind is not None) and not isstringlike(kind): raise TypeError("control kind must be string-like") if (id is not None) and not isstringlike(id): raise TypeError("control id must be string-like") if (label is not None) and not isstringlike(label): raise TypeError("control label must be string-like") if (predicate is not None) and not callable(predicate): raise TypeError("control predicate must be callable") if (nr is not None) and nr < 0: raise ValueError("control number must be a non-negative integer") orig_nr = nr found = None ambiguous = False if nr is None and self.backwards_compat: nr = 0 for control in self.controls: if ((name is not None and name != control.name) and (name is not Missing or control.name is not None)): continue if type is not None and type != control.type: continue if kind is not None and not control.is_of_kind(kind): continue if id is not None and id != control.id: continue if predicate and not predicate(control): continue if label: for l in control.get_labels(): if l.text.find(label) > -1: break else: continue if nr is not None: if nr == 0: return control # early exit: unambiguous due to nr nr -= 1 continue if found: ambiguous = True break found = control if found and not ambiguous: return found description = [] if name is not None: description.append("name %s" % repr(name)) if type is not
None: description.append("type '%s'" % type) if kind is not None: description.append("kind '%s'" % kind) if id is not None: description.append("id '%s'" % id) if label is not None: description.append("label '%s'" % label) if predicate is not None: description.append("predicate %s" % predicate) if orig_nr: description.append("nr %d" % orig_nr) description = ", ".join(description) if ambiguous: raise AmbiguityError("more than one control matching "+description) elif not found: raise ControlNotFoundError("no control matching "+description) assert False def _click(self, name, type, id, label, nr, coord, return_type, request_class=_request.Request): try: control = self._find_control( name, type, "clickable", id, label, None, nr) except ControlNotFoundError: if ((name is not None) or (type is not None) or (id is not None) or (label is not None) or (nr != 0)): raise # no clickable controls, but no control was explicitly requested, # so return state without clicking any control return self._switch_click(return_type, request_class) else: return control._click(self, coord, return_type, request_class) def _pairs(self): """Return sequence of (key, value) pairs suitable for urlencoding.""" return [(k, v) for (i, k, v, c_i) in self._pairs_and_controls()] def _pairs_and_controls(self): """Return sequence of (index, key, value, control_index) of totally ordered pairs suitable for urlencoding. 
control_index is the index of the control in self.controls """ pairs = [] for control_index in range(len(self.controls)): control = self.controls[control_index] for ii, key, val in control._totally_ordered_pairs(): pairs.append((ii, key, val, control_index)) # stable sort by ONLY first item in tuple pairs.sort() return pairs def _request_data(self): """Return a tuple (url, data, headers).""" method = self.method.upper() #scheme, netloc, path, parameters, query, frag = urlparse.urlparse(self.action) parts = self._urlparse(self.action) rest, (query, frag) = parts[:-2], parts[-2:] if method == "GET": if self.enctype != "application/x-www-form-urlencoded": raise ValueError( "unknown GET form encoding type '%s'" % self.enctype) parts = rest + (urllib.urlencode(self._pairs()), None) uri = self._urlunparse(parts) return uri, None, [] elif method == "POST": parts = rest + (query, None) uri = self._urlunparse(parts) if self.enctype == "application/x-www-form-urlencoded": return (uri, urllib.urlencode(self._pairs()), [("Content-Type", self.enctype)]) elif self.enctype == "multipart/form-data": data = StringIO() http_hdrs = [] mw = MimeWriter(data, http_hdrs) mw.startmultipartbody("form-data", add_to_http_hdrs=True, prefix=0) for ii, k, v, control_index in self._pairs_and_controls(): self.controls[control_index]._write_mime_data(mw, k, v) mw.lastpart() return uri, data.getvalue(), http_hdrs else: raise ValueError( "unknown POST form encoding type '%s'" % self.enctype) else: raise ValueError("Unknown method '%s'" % method) def _switch_click(self, return_type, request_class=_request.Request): # This is called by HTMLForm and clickable Controls to hide switching # on return_type. 
        if return_type == "pairs":
            return self._pairs()
        elif return_type == "request_data":
            return self._request_data()
        else:
            req_data = self._request_data()
            req = request_class(req_data[0], req_data[1])
            for key, val in req_data[2]:
                add_hdr = req.add_header
                if key.lower() == "content-type":
                    try:
                        add_hdr = req.add_unredirected_header
                    except AttributeError:
                        # pre-2.4 and not using ClientCookie
                        pass
                add_hdr(key, val)
            return req

mechanize-0.2.5/mechanize/_firefox3cookiejar.py

"""Firefox 3 "cookies.sqlite" cookie persistence.

Copyright 2008 John J Lee <jjl@pobox.com>

This code is free software; you can redistribute it and/or modify it
under the terms of the BSD or ZPL 2.1 licenses (see the file
COPYING.txt included with the distribution).

"""

import logging
import time

from _clientcookie import CookieJar, Cookie, MappingIterator
from _util import isstringlike, experimental


debug = logging.getLogger("mechanize.cookies").debug


class Firefox3CookieJar(CookieJar):

    """Firefox 3 cookie jar.

    The cookies are stored in Firefox 3's "cookies.sqlite" format.
Constructor arguments: filename: filename of cookies.sqlite (typically found at the top level of a firefox profile directory) autoconnect: as a convenience, connect to the SQLite cookies database at Firefox3CookieJar construction time (default True) policy: an object satisfying the mechanize.CookiePolicy interface Note that this is NOT a FileCookieJar, and there are no .load(), .save() or .restore() methods. The database is in sync with the cookiejar object's state after each public method call. Following Firefox's own behaviour, session cookies are never saved to the database. The file is created, and an sqlite database written to it, if it does not already exist. The moz_cookies database table is created if it does not already exist. """ # XXX # handle DatabaseError exceptions # add a FileCookieJar (explicit .save() / .revert() / .load() methods) def __init__(self, filename, autoconnect=True, policy=None): experimental("Firefox3CookieJar is experimental code") CookieJar.__init__(self, policy) if filename is not None and not isstringlike(filename): raise ValueError("filename must be string-like") self.filename = filename self._conn = None if autoconnect: self.connect() def connect(self): import sqlite3 # not available in Python 2.4 stdlib self._conn = sqlite3.connect(self.filename) self._conn.isolation_level = "DEFERRED" self._create_table_if_necessary() def close(self): self._conn.close() def _transaction(self, func): try: cur = self._conn.cursor() try: result = func(cur) finally: cur.close() except: self._conn.rollback() raise else: self._conn.commit() return result def _execute(self, query, params=()): return self._transaction(lambda cur: cur.execute(query, params)) def _query(self, query, params=()): # XXX should we bother with a transaction? 
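For readers unfamiliar with Firefox 3's on-disk format, here is a minimal standalone sketch of the `moz_cookies` table that this class reads and writes, using the same DDL as the class does. The in-memory database and the sample row are illustrative only (not mechanize code).

```python
# Illustrative sketch of the Firefox 3 cookies.sqlite schema (Python 3).
# The CREATE TABLE statement matches the one Firefox3CookieJar uses;
# the sample 'sid' cookie row is a made-up example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""\
CREATE TABLE IF NOT EXISTS moz_cookies
  (id INTEGER PRIMARY KEY, name TEXT, value TEXT, host TEXT, path TEXT,
   expiry INTEGER, lastAccessed INTEGER, isSecure INTEGER, isHttpOnly INTEGER)""")
conn.execute("INSERT INTO moz_cookies VALUES "
             "(1, 'sid', 'abc', '.example.com', '/', 2000000000, 0, 0, 1)")
# same ordering the jar's __iter__ uses
rows = conn.execute("SELECT name, value, host FROM moz_cookies "
                    "ORDER BY name, path, host").fetchall()
```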
cur = self._conn.cursor() try: cur.execute(query, params) return cur.fetchall() finally: cur.close() def _create_table_if_necessary(self): self._execute("""\ CREATE TABLE IF NOT EXISTS moz_cookies (id INTEGER PRIMARY KEY, name TEXT, value TEXT, host TEXT, path TEXT,expiry INTEGER, lastAccessed INTEGER, isSecure INTEGER, isHttpOnly INTEGER)""") def _cookie_from_row(self, row): (pk, name, value, domain, path, expires, last_accessed, secure, http_only) = row version = 0 domain = domain.encode("ascii", "ignore") path = path.encode("ascii", "ignore") name = name.encode("ascii", "ignore") value = value.encode("ascii", "ignore") secure = bool(secure) # last_accessed isn't a cookie attribute, so isn't added to rest rest = {} if http_only: rest["HttpOnly"] = None if name == "": name = value value = None initial_dot = domain.startswith(".") domain_specified = initial_dot discard = False if expires == "": expires = None discard = True return Cookie(version, name, value, None, False, domain, domain_specified, initial_dot, path, False, secure, expires, discard, None, None, rest) def clear(self, domain=None, path=None, name=None): CookieJar.clear(self, domain, path, name) where_parts = [] sql_params = [] if domain is not None: where_parts.append("host = ?") sql_params.append(domain) if path is not None: where_parts.append("path = ?") sql_params.append(path) if name is not None: where_parts.append("name = ?") sql_params.append(name) where = " AND ".join(where_parts) if where: where = " WHERE " + where def clear(cur): cur.execute("DELETE FROM moz_cookies%s" % where, tuple(sql_params)) self._transaction(clear) def _row_from_cookie(self, cookie, cur): expires = cookie.expires if cookie.discard: expires = "" domain = unicode(cookie.domain) path = unicode(cookie.path) name = unicode(cookie.name) value = unicode(cookie.value) secure = bool(int(cookie.secure)) if value is None: value = name name = "" last_accessed = int(time.time()) http_only = cookie.has_nonstandard_attr("HttpOnly") 
        query = cur.execute("""SELECT MAX(id) + 1 from moz_cookies""")
        pk = query.fetchone()[0]
        if pk is None:
            pk = 1

        return (pk, name, value, domain, path, expires,
                last_accessed, secure, http_only)

    def set_cookie(self, cookie):
        if cookie.discard:
            CookieJar.set_cookie(self, cookie)
            return

        def set_cookie(cur):
            # XXX
            # is this RFC 2965-correct?
            # could this do an UPDATE instead?
            row = self._row_from_cookie(cookie, cur)
            name, unused, domain, path = row[1:5]
            cur.execute("""\
DELETE FROM moz_cookies WHERE host = ? AND path = ? AND name = ?""",
                        (domain, path, name))
            cur.execute("""\
INSERT INTO moz_cookies VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", row)
        self._transaction(set_cookie)

    def __iter__(self):
        # session (non-persistent) cookies
        for cookie in MappingIterator(self._cookies):
            yield cookie
        # persistent cookies
        for row in self._query("""\
SELECT * FROM moz_cookies ORDER BY name, path, host"""):
            yield self._cookie_from_row(row)

    def _cookies_for_request(self, request):
        session_cookies = CookieJar._cookies_for_request(self, request)
        def get_cookies(cur):
            query = cur.execute("SELECT host from moz_cookies")
            domains = [row[0] for row in query.fetchall()]
            cookies = []
            for domain in domains:
                cookies += self._persistent_cookies_for_domain(domain,
                                                               request, cur)
            return cookies
        persistent_cookies = self._transaction(get_cookies)
        return session_cookies + persistent_cookies

    def _persistent_cookies_for_domain(self, domain, request, cur):
        cookies = []
        if not self._policy.domain_return_ok(domain, request):
            return []
        debug("Checking %s for cookies to return", domain)
        query = cur.execute("""\
SELECT * from moz_cookies WHERE host = ?
ORDER BY path""", (domain,))
        cookies = [self._cookie_from_row(row) for row in query.fetchall()]
        last_path = None
        r = []
        for cookie in cookies:
            if (cookie.path != last_path and
                not self._policy.path_return_ok(cookie.path, request)):
                last_path = cookie.path
                continue
            if not self._policy.return_ok(cookie, request):
                debug(" not returning cookie")
                continue
            debug(" it's a match")
            r.append(cookie)
        return r

mechanize-0.2.5/mechanize/_http.py

"""HTTP related handlers.

Note that some other HTTP handlers live in more specific modules: _auth.py,
_gzip.py, etc.

Copyright 2002-2006 John J Lee <jjl@pobox.com>

This code is free software; you can redistribute it and/or modify it
under the terms of the BSD or ZPL 2.1 licenses (see the file
COPYING.txt included with the distribution).
""" import HTMLParser from cStringIO import StringIO import htmlentitydefs import logging import robotparser import socket import time import _sgmllib_copy as sgmllib from _urllib2_fork import HTTPError, BaseHandler from _headersutil import is_html from _html import unescape, unescape_charref from _request import Request from _response import response_seek_wrapper import _rfc3986 import _sockettimeout debug = logging.getLogger("mechanize").debug debug_robots = logging.getLogger("mechanize.robots").debug # monkeypatch urllib2.HTTPError to show URL ## import urllib2 ## def urllib2_str(self): ## return 'HTTP Error %s: %s (%s)' % ( ## self.code, self.msg, self.geturl()) ## urllib2.HTTPError.__str__ = urllib2_str CHUNK = 1024 # size of chunks fed to HTML HEAD parser, in bytes DEFAULT_ENCODING = 'latin-1' # XXX would self.reset() work, instead of raising this exception? class EndOfHeadError(Exception): pass class AbstractHeadParser: # only these elements are allowed in or before HEAD of document head_elems = ("html", "head", "title", "base", "script", "style", "meta", "link", "object") _entitydefs = htmlentitydefs.name2codepoint _encoding = DEFAULT_ENCODING def __init__(self): self.http_equiv = [] def start_meta(self, attrs): http_equiv = content = None for key, value in attrs: if key == "http-equiv": http_equiv = self.unescape_attr_if_required(value) elif key == "content": content = self.unescape_attr_if_required(value) if http_equiv is not None and content is not None: self.http_equiv.append((http_equiv, content)) def end_head(self): raise EndOfHeadError() def handle_entityref(self, name): #debug("%s", name) self.handle_data(unescape( '&%s;' % name, self._entitydefs, self._encoding)) def handle_charref(self, name): #debug("%s", name) self.handle_data(unescape_charref(name, self._encoding)) def unescape_attr(self, name): #debug("%s", name) return unescape(name, self._entitydefs, self._encoding) def unescape_attrs(self, attrs): #debug("%s", attrs) escaped_attrs = {} for 
key, val in attrs.items(): escaped_attrs[key] = self.unescape_attr(val) return escaped_attrs def unknown_entityref(self, ref): self.handle_data("&%s;" % ref) def unknown_charref(self, ref): self.handle_data("&#%s;" % ref) class XHTMLCompatibleHeadParser(AbstractHeadParser, HTMLParser.HTMLParser): def __init__(self): HTMLParser.HTMLParser.__init__(self) AbstractHeadParser.__init__(self) def handle_starttag(self, tag, attrs): if tag not in self.head_elems: raise EndOfHeadError() try: method = getattr(self, 'start_' + tag) except AttributeError: try: method = getattr(self, 'do_' + tag) except AttributeError: pass # unknown tag else: method(attrs) else: method(attrs) def handle_endtag(self, tag): if tag not in self.head_elems: raise EndOfHeadError() try: method = getattr(self, 'end_' + tag) except AttributeError: pass # unknown tag else: method() def unescape(self, name): # Use the entitydefs passed into constructor, not # HTMLParser.HTMLParser's entitydefs. return self.unescape_attr(name) def unescape_attr_if_required(self, name): return name # HTMLParser.HTMLParser already did it class HeadParser(AbstractHeadParser, sgmllib.SGMLParser): def _not_called(self): assert False def __init__(self): sgmllib.SGMLParser.__init__(self) AbstractHeadParser.__init__(self) def handle_starttag(self, tag, method, attrs): if tag not in self.head_elems: raise EndOfHeadError() if tag == "meta": method(attrs) def unknown_starttag(self, tag, attrs): self.handle_starttag(tag, self._not_called, attrs) def handle_endtag(self, tag, method): if tag in self.head_elems: method() else: raise EndOfHeadError() def unescape_attr_if_required(self, name): return self.unescape_attr(name) def parse_head(fileobj, parser): """Return a list of key, value pairs.""" while 1: data = fileobj.read(CHUNK) try: parser.feed(data) except EndOfHeadError: break if len(data) != CHUNK: # this should only happen if there is no HTML body, or if # CHUNK is big break return parser.http_equiv class 
HTTPEquivProcessor(BaseHandler): """Append META HTTP-EQUIV headers to regular HTTP headers.""" handler_order = 300 # before handlers that look at HTTP headers def __init__(self, head_parser_class=HeadParser, i_want_broken_xhtml_support=False, ): self.head_parser_class = head_parser_class self._allow_xhtml = i_want_broken_xhtml_support def http_response(self, request, response): if not hasattr(response, "seek"): response = response_seek_wrapper(response) http_message = response.info() url = response.geturl() ct_hdrs = http_message.getheaders("content-type") if is_html(ct_hdrs, url, self._allow_xhtml): try: try: html_headers = parse_head(response, self.head_parser_class()) finally: response.seek(0) except (HTMLParser.HTMLParseError, sgmllib.SGMLParseError): pass else: for hdr, val in html_headers: # add a header http_message.dict[hdr.lower()] = val text = hdr + ": " + val for line in text.split("\n"): http_message.headers.append(line + "\n") return response https_response = http_response class MechanizeRobotFileParser(robotparser.RobotFileParser): def __init__(self, url='', opener=None): robotparser.RobotFileParser.__init__(self, url) self._opener = opener self._timeout = _sockettimeout._GLOBAL_DEFAULT_TIMEOUT def set_opener(self, opener=None): import _opener if opener is None: opener = _opener.OpenerDirector() self._opener = opener def set_timeout(self, timeout): self._timeout = timeout def read(self): """Reads the robots.txt URL and feeds it to the parser.""" if self._opener is None: self.set_opener() req = Request(self.url, unverifiable=True, visit=False, timeout=self._timeout) try: f = self._opener.open(req) except HTTPError, f: pass except (IOError, socket.error, OSError), exc: debug_robots("ignoring error opening %r: %s" % (self.url, exc)) return lines = [] line = f.readline() while line: lines.append(line.strip()) line = f.readline() status = f.code if status == 401 or status == 403: self.disallow_all = True debug_robots("disallow all") elif status >= 400: 
self.allow_all = True debug_robots("allow all") elif status == 200 and lines: debug_robots("parse lines") self.parse(lines) class RobotExclusionError(HTTPError): def __init__(self, request, *args): apply(HTTPError.__init__, (self,)+args) self.request = request class HTTPRobotRulesProcessor(BaseHandler): # before redirections, after everything else handler_order = 800 try: from httplib import HTTPMessage except: from mimetools import Message http_response_class = Message else: http_response_class = HTTPMessage def __init__(self, rfp_class=MechanizeRobotFileParser): self.rfp_class = rfp_class self.rfp = None self._host = None def http_request(self, request): scheme = request.get_type() if scheme not in ["http", "https"]: # robots exclusion only applies to HTTP return request if request.get_selector() == "/robots.txt": # /robots.txt is always OK to fetch return request host = request.get_host() # robots.txt requests don't need to be allowed by robots.txt :-) origin_req = getattr(request, "_origin_req", None) if (origin_req is not None and origin_req.get_selector() == "/robots.txt" and origin_req.get_host() == host ): return request if host != self._host: self.rfp = self.rfp_class() try: self.rfp.set_opener(self.parent) except AttributeError: debug("%r instance does not support set_opener" % self.rfp.__class__) self.rfp.set_url(scheme+"://"+host+"/robots.txt") self.rfp.set_timeout(request.timeout) self.rfp.read() self._host = host ua = request.get_header("User-agent", "") if self.rfp.can_fetch(ua, request.get_full_url()): return request else: # XXX This should really have raised URLError. Too late now... msg = "request disallowed by robots.txt" raise RobotExclusionError( request, request.get_full_url(), 403, msg, self.http_response_class(StringIO()), StringIO(msg)) https_request = http_request class HTTPRefererProcessor(BaseHandler): """Add Referer header to requests. 
This only makes sense if you use each RefererProcessor for a single chain of requests only (so, for example, if you use a single HTTPRefererProcessor to fetch a series of URLs extracted from a single page, this will break). There's a proper implementation of this in mechanize.Browser. """ def __init__(self): self.referer = None def http_request(self, request): if ((self.referer is not None) and not request.has_header("Referer")): request.add_unredirected_header("Referer", self.referer) return request def http_response(self, request, response): self.referer = response.geturl() return response https_request = http_request https_response = http_response def clean_refresh_url(url): # e.g. Firefox 1.5 does (something like) this if ((url.startswith('"') and url.endswith('"')) or (url.startswith("'") and url.endswith("'"))): url = url[1:-1] return _rfc3986.clean_url(url, "latin-1") # XXX encoding def parse_refresh_header(refresh): """ >>> parse_refresh_header("1; url=http://example.com/") (1.0, 'http://example.com/') >>> parse_refresh_header("1; url='http://example.com/'") (1.0, 'http://example.com/') >>> parse_refresh_header("1") (1.0, None) >>> parse_refresh_header("blah") # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): ValueError: invalid literal for float(): blah """ ii = refresh.find(";") if ii != -1: pause, newurl_spec = float(refresh[:ii]), refresh[ii+1:] jj = newurl_spec.find("=") key = None if jj != -1: key, newurl = newurl_spec[:jj], newurl_spec[jj+1:] newurl = clean_refresh_url(newurl) if key is None or key.strip().lower() != "url": raise ValueError() else: pause, newurl = float(refresh), None return pause, newurl class HTTPRefreshProcessor(BaseHandler): """Perform HTTP Refresh redirections. Note that if a non-200 HTTP code has occurred (for example, a 30x redirect), this processor will do nothing. By default, only zero-time Refresh headers are redirected. 
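The Refresh-header grammar shown in the `parse_refresh_header` doctests above can be re-implemented in modern Python as follows. This is a hedged sketch only: the real code delegates quote-stripping and RFC 3986 cleanup to `clean_refresh_url`, which is inlined here in simplified form.

```python
# Hypothetical Python 3 re-implementation of parse_refresh_header();
# returns (pause_seconds, url_or_None), raising ValueError on bad input.
def parse_refresh(refresh):
    ii = refresh.find(";")
    if ii == -1:
        # bare pause, e.g. "1" -> (1.0, None); non-numeric raises ValueError
        return float(refresh), None
    pause, spec = float(refresh[:ii]), refresh[ii + 1:]
    jj = spec.find("=")
    if jj == -1:
        raise ValueError("no url= part in %r" % refresh)
    key, newurl = spec[:jj], spec[jj + 1:]
    if key.strip().lower() != "url":
        raise ValueError("expected url=, got %r" % key)
    # strip optional surrounding quotes, as clean_refresh_url() does
    if newurl[:1] in "\"'" and newurl[:1] == newurl[-1:]:
        newurl = newurl[1:-1]
    return pause, newurl.strip()
```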
    Use the max_time attribute / constructor argument to allow Refresh with
    longer pauses.  Use the honor_time attribute / constructor argument to
    control whether the requested pause is honoured (with a time.sleep()) or
    skipped in favour of immediate redirection.

    Public attributes:

    max_time: see above
    honor_time: see above

    """
    handler_order = 1000

    def __init__(self, max_time=0, honor_time=True):
        self.max_time = max_time
        self.honor_time = honor_time
        self._sleep = time.sleep

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        if code == 200 and hdrs.has_key("refresh"):
            refresh = hdrs.getheaders("refresh")[0]
            try:
                pause, newurl = parse_refresh_header(refresh)
            except ValueError:
                debug("bad Refresh header: %r" % refresh)
                return response

            if newurl is None:
                newurl = response.geturl()
            if (self.max_time is None) or (pause <= self.max_time):
                if pause > 1E-3 and self.honor_time:
                    self._sleep(pause)
                hdrs["location"] = newurl
                # hardcoded http is NOT a bug
                response = self.parent.error(
                    "http", request, response,
                    "refresh", msg, hdrs)
            else:
                debug("Refresh header ignored: %r" % refresh)

        return response

    https_response = http_response

mechanize-0.2.5/mechanize/_gzip.py
from cStringIO import StringIO

import _response
import _urllib2_fork


# GzipConsumer was taken from Fredrik Lundh's effbot.org-0.1-20041009 library

class GzipConsumer:

    def __init__(self, consumer):
        self.__consumer = consumer
        self.__decoder = None
        self.__data = ""

    def __getattr__(self, key):
        return getattr(self.__consumer, key)

    def feed(self, data):
        if self.__decoder is None:
            # check if we have a full gzip header
            data = self.__data + data
            try:
                i = 10
                flag = ord(data[3])
                if flag & 4: # extra
                    x = ord(data[i]) + 256*ord(data[i+1])
                    i = i + 2 + x
                if flag & 8: # filename
                    while ord(data[i]):
                        i = i + 1
                    i = i + 1
                if flag & 16: # comment
                    while ord(data[i]):
                        i = i + 1
                    i = i + 1
                if flag & 2: # crc
                    i = i + 2
                if len(data) < i:
                    raise IndexError("not enough data")
                if data[:3] != "\x1f\x8b\x08":
                    raise IOError("invalid gzip data")
                data = data[i:]
            except IndexError:
                self.__data = data
                return # need more data
            import zlib
            self.__data = ""
            self.__decoder = zlib.decompressobj(-zlib.MAX_WBITS)
        data = self.__decoder.decompress(data)
        if data:
            self.__consumer.feed(data)

    def close(self):
        if self.__decoder:
            data = self.__decoder.flush()
            if data:
                self.__consumer.feed(data)
        self.__consumer.close()


# --------------------------------------------------------------------
# the rest of this module is John Lee's stupid code, not
# Fredrik's nice code :-)

class stupid_gzip_consumer:
    def __init__(self):
        self.data = []
    def feed(self, data):
        self.data.append(data)

class stupid_gzip_wrapper(_response.closeable_response):
    def __init__(self, response):
        self._response = response

        c = stupid_gzip_consumer()
        gzc = GzipConsumer(c)
        gzc.feed(response.read())
        self.__data = StringIO("".join(c.data))

    def read(self, size=-1):
        return self.__data.read(size)

    def readline(self,
size=-1):
        return self.__data.readline(size)

    def readlines(self, sizehint=-1):
        return self.__data.readlines(sizehint)

    def __getattr__(self, name):
        # delegate unknown methods/attributes
        return getattr(self._response, name)

class HTTPGzipProcessor(_urllib2_fork.BaseHandler):
    handler_order = 200  # response processing before HTTPEquivProcessor

    def http_request(self, request):
        request.add_header("Accept-Encoding", "gzip")
        return request

    def http_response(self, request, response):
        # post-process response
        enc_hdrs = response.info().getheaders("Content-encoding")
        for enc_hdr in enc_hdrs:
            if ("gzip" in enc_hdr) or ("compress" in enc_hdr):
                return stupid_gzip_wrapper(response)
        return response

    https_response = http_response

mechanize-0.2.5/mechanize/_html.py

"""HTML handling.

Copyright 2003-2006 John J. Lee <jjl@pobox.com>

This code is free software; you can redistribute it and/or modify it
under the terms of the BSD or ZPL 2.1 licenses (see the file
COPYING.txt included with the distribution).
""" import codecs import copy import htmlentitydefs import re import _sgmllib_copy as sgmllib import _beautifulsoup import _form from _headersutil import split_header_words, is_html as _is_html import _request import _rfc3986 DEFAULT_ENCODING = "latin-1" COMPRESS_RE = re.compile(r"\s+") class CachingGeneratorFunction(object): """Caching wrapper around a no-arguments iterable.""" def __init__(self, iterable): self._cache = [] # wrap iterable to make it non-restartable (otherwise, repeated # __call__ would give incorrect results) self._iterator = iter(iterable) def __call__(self): cache = self._cache for item in cache: yield item for item in self._iterator: cache.append(item) yield item class EncodingFinder: def __init__(self, default_encoding): self._default_encoding = default_encoding def encoding(self, response): # HTTPEquivProcessor may be in use, so both HTTP and HTTP-EQUIV # headers may be in the response. HTTP-EQUIV headers come last, # so try in order from first to last. for ct in response.info().getheaders("content-type"): for k, v in split_header_words([ct])[0]: if k == "charset": encoding = v try: codecs.lookup(v) except LookupError: continue else: return encoding return self._default_encoding class ResponseTypeFinder: def __init__(self, allow_xhtml): self._allow_xhtml = allow_xhtml def is_html(self, response, encoding): ct_hdrs = response.info().getheaders("content-type") url = response.geturl() # XXX encoding return _is_html(ct_hdrs, url, self._allow_xhtml) class Args(object): # idea for this argument-processing trick is from Peter Otten def __init__(self, args_map): self.__dict__["dictionary"] = dict(args_map) def __getattr__(self, key): try: return self.dictionary[key] except KeyError: return getattr(self.__class__, key) def __setattr__(self, key, value): if key == "dictionary": raise AttributeError() self.dictionary[key] = value def form_parser_args( select_default=False, form_parser_class=None, request_class=None, backwards_compat=False, ): return 
Args(locals()) class Link: def __init__(self, base_url, url, text, tag, attrs): assert None not in [url, tag, attrs] self.base_url = base_url self.absolute_url = _rfc3986.urljoin(base_url, url) self.url, self.text, self.tag, self.attrs = url, text, tag, attrs def __cmp__(self, other): try: for name in "url", "text", "tag", "attrs": if getattr(self, name) != getattr(other, name): return -1 except AttributeError: return -1 return 0 def __repr__(self): return "Link(base_url=%r, url=%r, text=%r, tag=%r, attrs=%r)" % ( self.base_url, self.url, self.text, self.tag, self.attrs) class LinksFactory: def __init__(self, link_parser_class=None, link_class=Link, urltags=None, ): import _pullparser if link_parser_class is None: link_parser_class = _pullparser.TolerantPullParser self.link_parser_class = link_parser_class self.link_class = link_class if urltags is None: urltags = { "a": "href", "area": "href", "frame": "src", "iframe": "src", } self.urltags = urltags self._response = None self._encoding = None def set_response(self, response, base_url, encoding): self._response = response self._encoding = encoding self._base_url = base_url def links(self): """Return an iterator that provides links of the document.""" response = self._response encoding = self._encoding base_url = self._base_url p = self.link_parser_class(response, encoding=encoding) try: for token in p.tags(*(self.urltags.keys()+["base"])): if token.type == "endtag": continue if token.data == "base": base_href = dict(token.attrs).get("href") if base_href is not None: base_url = base_href continue attrs = dict(token.attrs) tag = token.data text = None # XXX use attr_encoding for ref'd doc if that doc does not # provide one by other means #attr_encoding = attrs.get("charset") url = attrs.get(self.urltags[tag]) # XXX is "" a valid URL? if not url: # Probably an <A NAME="blah"> link or <AREA NOHREF...>. # For our purposes a link is something with a URL, so # ignore this. 
continue url = _rfc3986.clean_url(url, encoding) if tag == "a": if token.type != "startendtag": # hmm, this'd break if end tag is missing text = p.get_compressed_text(("endtag", tag)) # but this doesn't work for e.g. # <a href="blah"><b>Andy</b></a> #text = p.get_compressed_text() yield Link(base_url, url, text, tag, token.attrs) except sgmllib.SGMLParseError, exc: raise _form.ParseError(exc) class FormsFactory: """Makes a sequence of objects satisfying HTMLForm interface. After calling .forms(), the .global_form attribute is a form object containing all controls not a descendant of any FORM element. For constructor argument docs, see ParseResponse argument docs. """ def __init__(self, select_default=False, form_parser_class=None, request_class=None, backwards_compat=False, ): self.select_default = select_default if form_parser_class is None: form_parser_class = _form.FormParser self.form_parser_class = form_parser_class if request_class is None: request_class = _request.Request self.request_class = request_class self.backwards_compat = backwards_compat self._response = None self.encoding = None self.global_form = None def set_response(self, response, encoding): self._response = response self.encoding = encoding self.global_form = None def forms(self): encoding = self.encoding forms = _form.ParseResponseEx( self._response, select_default=self.select_default, form_parser_class=self.form_parser_class, request_class=self.request_class, encoding=encoding, _urljoin=_rfc3986.urljoin, _urlparse=_rfc3986.urlsplit, _urlunparse=_rfc3986.urlunsplit, ) self.global_form = forms[0] return forms[1:] class TitleFactory: def __init__(self): self._response = self._encoding = None def set_response(self, response, encoding): self._response = response self._encoding = encoding def _get_title_text(self, parser): import _pullparser text = [] tok = None while 1: try: tok = parser.get_token() except _pullparser.NoMoreTokensError: break if tok.type == "data": text.append(str(tok)) elif 
tok.type == "entityref": t = unescape("&%s;" % tok.data, parser._entitydefs, parser.encoding) text.append(t) elif tok.type == "charref": t = unescape_charref(tok.data, parser.encoding) text.append(t) elif tok.type in ["starttag", "endtag", "startendtag"]: tag_name = tok.data if tok.type == "endtag" and tag_name == "title": break text.append(str(tok)) return COMPRESS_RE.sub(" ", "".join(text).strip()) def title(self): import _pullparser p = _pullparser.TolerantPullParser( self._response, encoding=self._encoding) try: try: p.get_tag("title") except _pullparser.NoMoreTokensError: return None else: return self._get_title_text(p) except sgmllib.SGMLParseError, exc: raise _form.ParseError(exc) def unescape(data, entities, encoding): if data is None or "&" not in data: return data def replace_entities(match): ent = match.group() if ent[1] == "#": return unescape_charref(ent[2:-1], encoding) repl = entities.get(ent[1:-1]) if repl is not None: repl = unichr(repl) if type(repl) != type(""): try: repl = repl.encode(encoding) except UnicodeError: repl = ent else: repl = ent return repl return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data) def unescape_charref(data, encoding): name, base = data, 10 if name.startswith("x"): name, base= name[1:], 16 uc = unichr(int(name, base)) if encoding is None: return uc else: try: repl = uc.encode(encoding) except UnicodeError: repl = "&#%s;" % data return repl class MechanizeBs(_beautifulsoup.BeautifulSoup): _entitydefs = htmlentitydefs.name2codepoint # don't want the magic Microsoft-char workaround PARSER_MASSAGE = [(re.compile('(<[^<>]*)/>'), lambda(x):x.group(1) + ' />'), (re.compile('<!\s+([^<>]*)>'), lambda(x):'<!' 
+ x.group(1) + '>') ] def __init__(self, encoding, text=None, avoidParserProblems=True, initialTextIsEverything=True): self._encoding = encoding _beautifulsoup.BeautifulSoup.__init__( self, text, avoidParserProblems, initialTextIsEverything) def handle_charref(self, ref): t = unescape("&#%s;"%ref, self._entitydefs, self._encoding) self.handle_data(t) def handle_entityref(self, ref): t = unescape("&%s;"%ref, self._entitydefs, self._encoding) self.handle_data(t) def unescape_attrs(self, attrs): escaped_attrs = [] for key, val in attrs: val = unescape(val, self._entitydefs, self._encoding) escaped_attrs.append((key, val)) return escaped_attrs class RobustLinksFactory: compress_re = COMPRESS_RE def __init__(self, link_parser_class=None, link_class=Link, urltags=None, ): if link_parser_class is None: link_parser_class = MechanizeBs self.link_parser_class = link_parser_class self.link_class = link_class if urltags is None: urltags = { "a": "href", "area": "href", "frame": "src", "iframe": "src", } self.urltags = urltags self._bs = None self._encoding = None self._base_url = None def set_soup(self, soup, base_url, encoding): self._bs = soup self._base_url = base_url self._encoding = encoding def links(self): bs = self._bs base_url = self._base_url encoding = self._encoding for ch in bs.recursiveChildGenerator(): if (isinstance(ch, _beautifulsoup.Tag) and ch.name in self.urltags.keys()+["base"]): link = ch attrs = bs.unescape_attrs(link.attrs) attrs_dict = dict(attrs) if link.name == "base": base_href = attrs_dict.get("href") if base_href is not None: base_url = base_href continue url_attr = self.urltags[link.name] url = attrs_dict.get(url_attr) if not url: continue url = _rfc3986.clean_url(url, encoding) text = link.fetchText(lambda t: True) if not text: # follow _pullparser's weird behaviour rigidly if link.name == "a": text = "" else: text = None else: text = self.compress_re.sub(" ", " ".join(text).strip()) yield Link(base_url, url, text, link.name, attrs) class 
RobustFormsFactory(FormsFactory): def __init__(self, *args, **kwds): args = form_parser_args(*args, **kwds) if args.form_parser_class is None: args.form_parser_class = _form.RobustFormParser FormsFactory.__init__(self, **args.dictionary) def set_response(self, response, encoding): self._response = response self.encoding = encoding class RobustTitleFactory: def __init__(self): self._bs = self._encoding = None def set_soup(self, soup, encoding): self._bs = soup self._encoding = encoding def title(self): title = self._bs.first("title") if title == _beautifulsoup.Null: return None else: inner_html = "".join([str(node) for node in title.contents]) return COMPRESS_RE.sub(" ", inner_html.strip()) class Factory: """Factory for forms, links, etc. This interface may expand in future. Public methods: set_request_class(request_class) set_response(response) forms() links() Public attributes: Note that accessing these attributes may raise ParseError. encoding: string specifying the encoding of response if it contains a text document (this value is left unspecified for documents that do not have an encoding, e.g. an image file) is_html: true if response contains an HTML document (XHTML may be regarded as HTML too) title: page title, or None if no title or not HTML global_form: form object containing all controls that are not descendants of any FORM element, or None if the forms_factory does not support supplying a global form """ LAZY_ATTRS = ["encoding", "is_html", "title", "global_form"] def __init__(self, forms_factory, links_factory, title_factory, encoding_finder=EncodingFinder(DEFAULT_ENCODING), response_type_finder=ResponseTypeFinder(allow_xhtml=False), ): """ Pass keyword arguments only. default_encoding: character encoding to use if encoding cannot be determined (or guessed) from the response. You should turn on HTTP-EQUIV handling if you want the best chance of getting this right without resorting to this default. 
The default value of this parameter (currently latin-1) may change in future. """ self._forms_factory = forms_factory self._links_factory = links_factory self._title_factory = title_factory self._encoding_finder = encoding_finder self._response_type_finder = response_type_finder self.set_response(None) def set_request_class(self, request_class): """Set request class (mechanize.Request by default). HTMLForm instances returned by .forms() will return instances of this class when .click()ed. """ self._forms_factory.request_class = request_class def set_response(self, response): """Set response. The response must either be None or implement the same interface as objects returned by mechanize.urlopen(). """ self._response = response self._forms_genf = self._links_genf = None self._get_title = None for name in self.LAZY_ATTRS: try: delattr(self, name) except AttributeError: pass def __getattr__(self, name): if name not in self.LAZY_ATTRS: return getattr(self.__class__, name) if name == "encoding": self.encoding = self._encoding_finder.encoding( copy.copy(self._response)) return self.encoding elif name == "is_html": self.is_html = self._response_type_finder.is_html( copy.copy(self._response), self.encoding) return self.is_html elif name == "title": if self.is_html: self.title = self._title_factory.title() else: self.title = None return self.title elif name == "global_form": self.forms() return self.global_form def forms(self): """Return iterable over HTMLForm-like objects. Raises mechanize.ParseError on failure. """ # this implementation sets .global_form as a side-effect, for benefit # of __getattr__ impl if self._forms_genf is None: try: self._forms_genf = CachingGeneratorFunction( self._forms_factory.forms()) except: # XXXX define exception! self.set_response(self._response) raise self.global_form = getattr( self._forms_factory, "global_form", None) return self._forms_genf() def links(self): """Return iterable over mechanize.Link-like objects. 
Raises mechanize.ParseError on failure. """ if self._links_genf is None: try: self._links_genf = CachingGeneratorFunction( self._links_factory.links()) except: # XXXX define exception! self.set_response(self._response) raise return self._links_genf() class DefaultFactory(Factory): """Based on sgmllib.""" def __init__(self, i_want_broken_xhtml_support=False): Factory.__init__( self, forms_factory=FormsFactory(), links_factory=LinksFactory(), title_factory=TitleFactory(), response_type_finder=ResponseTypeFinder( allow_xhtml=i_want_broken_xhtml_support), ) def set_response(self, response): Factory.set_response(self, response) if response is not None: self._forms_factory.set_response( copy.copy(response), self.encoding) self._links_factory.set_response( copy.copy(response), response.geturl(), self.encoding) self._title_factory.set_response( copy.copy(response), self.encoding) class RobustFactory(Factory): """Based on BeautifulSoup, hopefully a bit more robust to bad HTML than is DefaultFactory. 
""" def __init__(self, i_want_broken_xhtml_support=False, soup_class=None): Factory.__init__( self, forms_factory=RobustFormsFactory(), links_factory=RobustLinksFactory(), title_factory=RobustTitleFactory(), response_type_finder=ResponseTypeFinder( allow_xhtml=i_want_broken_xhtml_support), ) if soup_class is None: soup_class = MechanizeBs self._soup_class = soup_class def set_response(self, response): Factory.set_response(self, response) if response is not None: data = response.read() soup = self._soup_class(self.encoding, data) self._forms_factory.set_response( copy.copy(response), self.encoding) self._links_factory.set_soup( soup, response.geturl(), self.encoding) self._title_factory.set_soup(soup, self.encoding) ��������������������������������������������������������������������������������������������������������mechanize-0.2.5/mechanize/_clientcookie.py����������������������������������������������������������0000644�0001750�0001750�00000177736�11545150644�017315� 0����������������������������������������������������������������������������������������������������ustar �john����������������������������john�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������"""HTTP cookie handling for web clients. This module originally developed from my port of Gisle Aas' Perl module HTTP::Cookies, from the libwww-perl library. Docstrings, comments and debug strings in this code refer to the attributes of the HTTP cookie system as cookie-attributes, to distinguish them clearly from Python attributes. CookieJar____ / \ \ FileCookieJar \ \ / | \ \ \ MozillaCookieJar | LWPCookieJar \ \ | | \ | ---MSIEBase | \ | / | | \ | / MSIEDBCookieJar BSDDBCookieJar |/ MSIECookieJar Comments to John J Lee <jjl@pobox.com>. 
Copyright 2002-2006 John J Lee <jjl@pobox.com>
Copyright 1997-1999 Gisle Aas (original libwww-perl code)
Copyright 2002-2003 Johnny Lee (original MSIE Perl code)

This code is free software; you can redistribute it and/or modify it
under the terms of the BSD or ZPL 2.1 licenses (see the file
COPYING.txt included with the distribution).

"""

import sys, re, copy, time, urllib, types, logging
try:
    import threading
    _threading = threading; del threading
except ImportError:
    import dummy_threading
    _threading = dummy_threading; del dummy_threading

MISSING_FILENAME_TEXT = ("a filename was not supplied (nor was the CookieJar "
                         "instance initialised with one)")
DEFAULT_HTTP_PORT = "80"

from _headersutil import split_header_words, parse_ns_headers
from _util import isstringlike
import _rfc3986

debug = logging.getLogger("mechanize.cookies").debug


def reraise_unmasked_exceptions(unmasked=()):
    # There are a few catch-all except: statements in this module, for
    # catching input that's bad in unexpected ways.
    # This function re-raises some exceptions we don't want to trap.
    import mechanize, warnings
    if not mechanize.USE_BARE_EXCEPT:
        raise
    unmasked = unmasked + (KeyboardInterrupt, SystemExit, MemoryError)
    etype = sys.exc_info()[0]
    if issubclass(etype, unmasked):
        raise
    # swallowed an exception
    import traceback, StringIO
    f = StringIO.StringIO()
    traceback.print_exc(None, f)
    msg = f.getvalue()
    warnings.warn("mechanize bug!\n%s" % msg, stacklevel=2)


IPV4_RE = re.compile(r"\.\d+$")
def is_HDN(text):
    """Return True if text is a host domain name."""
    # XXX
    # This may well be wrong.  Which RFC is HDN defined in, if any (for
    #  the purposes of RFC 2965)?
    # For the current implementation, what about IPv6?  Remember to look
    #  at other uses of IPV4_RE also, if change this.
    return not (IPV4_RE.search(text) or
                text == "" or
                text[0] == "." or text[-1] == ".")

def domain_match(A, B):
    """Return True if domain A domain-matches domain B, according to RFC 2965.
    A and B may be host domain names or IP addresses.

    RFC 2965, section 1:

    Host names can be specified either as an IP address or a HDN string.
    Sometimes we compare one host name with another.  (Such comparisons SHALL
    be case-insensitive.)  Host A's name domain-matches host B's if

         *  their host name strings string-compare equal; or

         * A is a HDN string and has the form NB, where N is a non-empty
            name string, B has the form .B', and B' is a HDN string.  (So,
            x.y.com domain-matches .Y.com but not Y.com.)

    Note that domain-match is not a commutative operation: a.b.c.com
    domain-matches .c.com, but not the reverse.

    """
    # Note that, if A or B are IP addresses, the only relevant part of the
    # definition of the domain-match algorithm is the direct string-compare.
    A = A.lower()
    B = B.lower()
    if A == B:
        return True
    if not is_HDN(A):
        return False
    i = A.rfind(B)
    has_form_nb = not (i == -1 or i == 0)
    return (
        has_form_nb and
        B.startswith(".") and
        is_HDN(B[1:])
        )

def liberal_is_HDN(text):
    """Return True if text is sort of like a host domain name.

    For accepting/blocking domains.

    """
    return not IPV4_RE.search(text)

def user_domain_match(A, B):
    """For blocking/accepting domains.

    A and B may be host domain names or IP addresses.

    """
    A = A.lower()
    B = B.lower()
    if not (liberal_is_HDN(A) and liberal_is_HDN(B)):
        if A == B:
            # equal IP addresses
            return True
        return False
    initial_dot = B.startswith(".")
    if initial_dot and A.endswith(B):
        return True
    if not initial_dot and A == B:
        return True
    return False

cut_port_re = re.compile(r":\d+$")
def request_host(request):
    """Return request-host, as defined by RFC 2965.

    Variation from RFC: returned value is lowercased, for convenient
    comparison.
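As a quick illustration of the domain-match rules quoted above, here is a
self-contained sketch of the same algorithm (the lowercase names `is_hdn`
and `domain_match` here are re-implementations for the example, not imports
from mechanize itself):

```python
import re

IPV4_RE = re.compile(r"\.\d+$")

def is_hdn(text):
    # Host domain name: non-empty, no leading/trailing dot, and not an
    # IPv4-style name ending in a numeric label.
    return not (IPV4_RE.search(text) or text == "" or
                text[0] == "." or text[-1] == ".")

def domain_match(a, b):
    # RFC 2965 domain-match: case-insensitive; equal strings match, or
    # a has the form N + b, where b starts with a dot and b[1:] is an HDN.
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    if not is_hdn(a):
        return False
    i = a.rfind(b)
    has_form_nb = not (i == -1 or i == 0)
    return has_form_nb and b.startswith(".") and is_hdn(b[1:])

print(domain_match("x.y.com", ".Y.com"))    # True (case-insensitive)
print(domain_match("x.y.com", "y.com"))     # False: B has no leading dot
print(domain_match("a.b.c.com", ".c.com"))  # True
print(domain_match(".c.com", "a.b.c.com"))  # False: not commutative
```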
""" url = request.get_full_url() host = _rfc3986.urlsplit(url)[1] if host is None: host = request.get_header("Host", "") # remove port, if present return cut_port_re.sub("", host, 1) def request_host_lc(request): return request_host(request).lower() def eff_request_host(request): """Return a tuple (request-host, effective request-host name).""" erhn = req_host = request_host(request) if req_host.find(".") == -1 and not IPV4_RE.search(req_host): erhn = req_host + ".local" return req_host, erhn def eff_request_host_lc(request): req_host, erhn = eff_request_host(request) return req_host.lower(), erhn.lower() def effective_request_host(request): """Return the effective request-host, as defined by RFC 2965.""" return eff_request_host(request)[1] def request_path(request): """Return path component of request-URI, as defined by RFC 2965.""" url = request.get_full_url() path = escape_path(_rfc3986.urlsplit(url)[2]) if not path.startswith("/"): path = "/" + path return path def request_port(request): host = request.get_host() i = host.find(':') if i >= 0: port = host[i+1:] try: int(port) except ValueError: debug("nonnumeric port: '%s'", port) return None else: port = DEFAULT_HTTP_PORT return port def request_is_unverifiable(request): try: return request.is_unverifiable() except AttributeError: if hasattr(request, "unverifiable"): return request.unverifiable else: raise # Characters in addition to A-Z, a-z, 0-9, '_', '.', and '-' that don't # need to be escaped to form a valid HTTP URL (RFCs 2396 and 1738). 
HTTP_PATH_SAFE = "%/;:@&=+$,!~*'()"
ESCAPED_CHAR_RE = re.compile(r"%([0-9a-fA-F][0-9a-fA-F])")
def uppercase_escaped_char(match):
    return "%%%s" % match.group(1).upper()
def escape_path(path):
    """Escape any invalid characters in HTTP URL, and uppercase all escapes."""
    # There's no knowing what character encoding was used to create URLs
    # containing %-escapes, but since we have to pick one to escape invalid
    # path characters, we pick UTF-8, as recommended in the HTML 4.0
    # specification:
    # http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
    # And here, kind of: draft-fielding-uri-rfc2396bis-03
    # (And in draft IRI specification: draft-duerst-iri-05)
    # (And here, for new URI schemes: RFC 2718)
    if isinstance(path, types.UnicodeType):
        path = path.encode("utf-8")
    path = urllib.quote(path, HTTP_PATH_SAFE)
    path = ESCAPED_CHAR_RE.sub(uppercase_escaped_char, path)
    return path

def reach(h):
    """Return reach of host h, as defined by RFC 2965, section 1.

    The reach R of a host name H is defined as follows:

       *  If

          -  H is the host domain name of a host; and,

          -  H has the form A.B; and

          -  A has no embedded (that is, interior) dots; and

          -  B has at least one embedded dot, or B is the string "local".
             then the reach of H is .B.

       *  Otherwise, the reach of H is H.

    >>> reach("www.acme.com")
    '.acme.com'
    >>> reach("acme.com")
    'acme.com'
    >>> reach("acme.local")
    '.local'

    """
    i = h.find(".")
    if i >= 0:
        #a = h[:i]  # this line is only here to show what a is
        b = h[i+1:]
        i = b.find(".")
        if is_HDN(h) and (i >= 0 or b == "local"):
            return "."+b
    return h

def is_third_party(request):
    """

    RFC 2965, section 3.3.6:

        An unverifiable transaction is to a third-party host if its request-
        host U does not domain-match the reach R of the request-host O in the
        origin transaction.
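Combining the two pieces, the third-party test quoted above reduces to a
single domain-match against the reach of the origin request-host. A
self-contained sketch, using hypothetical lowercase helper names rather than
mechanize's own functions:

```python
import re

IPV4_RE = re.compile(r"\.\d+$")

def is_hdn(text):
    return not (IPV4_RE.search(text) or text == "" or
                text[0] == "." or text[-1] == ".")

def domain_match(a, b):
    # RFC 2965 domain-match, as in the sketch further above.
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    if not is_hdn(a):
        return False
    return a.rfind(b) > 0 and b.startswith(".") and is_hdn(b[1:])

def reach(h):
    # For A.B with dotless A and dotted (or "local") B, the reach is ".B";
    # otherwise the reach is h itself.
    i = h.find(".")
    if i >= 0:
        b = h[i + 1:]
        if is_hdn(h) and (b.find(".") >= 0 or b == "local"):
            return "." + b
    return h

def is_third_party_host(request_host_u, origin_host_o):
    # RFC 2965 3.3.6: third-party if U does not domain-match reach(O).
    return not domain_match(request_host_u, reach(origin_host_o))

print(reach("www.acme.com"))                                   # .acme.com
print(is_third_party_host("images.acme.com", "www.acme.com"))  # False
print(is_third_party_host("ads.tracker.com", "www.acme.com"))  # True
```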
""" req_host = request_host_lc(request) # the origin request's request-host was stuffed into request by # _urllib2_support.AbstractHTTPHandler return not domain_match(req_host, reach(request.origin_req_host)) try: all except NameError: # python 2.4 def all(iterable): for x in iterable: if not x: return False return True class Cookie: """HTTP Cookie. This class represents both Netscape and RFC 2965 cookies. This is deliberately a very simple class. It just holds attributes. It's possible to construct Cookie instances that don't comply with the cookie standards. CookieJar.make_cookies is the factory function for Cookie objects -- it deals with cookie parsing, supplying defaults, and normalising to the representation used in this class. CookiePolicy is responsible for checking them to see whether they should be accepted from and returned to the server. version: integer; name: string; value: string (may be None); port: string; None indicates no attribute was supplied (e.g. "Port", rather than eg. "Port=80"); otherwise, a port string (eg. "80") or a port list string (e.g. "80,8080") port_specified: boolean; true if a value was supplied with the Port cookie-attribute domain: string; domain_specified: boolean; true if Domain was explicitly set domain_initial_dot: boolean; true if Domain as set in HTTP header by server started with a dot (yes, this really is necessary!) 
path: string; path_specified: boolean; true if Path was explicitly set secure: boolean; true if should only be returned over secure connection expires: integer; seconds since epoch (RFC 2965 cookies should calculate this value from the Max-Age attribute) discard: boolean, true if this is a session cookie; (if no expires value, this should be true) comment: string; comment_url: string; rfc2109: boolean; true if cookie arrived in a Set-Cookie: (not Set-Cookie2:) header, but had a version cookie-attribute of 1 rest: mapping of other cookie-attributes Note that the port may be present in the headers, but unspecified ("Port" rather than"Port=80", for example); if this is the case, port is None. """ _attrs = ("version", "name", "value", "port", "port_specified", "domain", "domain_specified", "domain_initial_dot", "path", "path_specified", "secure", "expires", "discard", "comment", "comment_url", "rfc2109", "_rest") def __init__(self, version, name, value, port, port_specified, domain, domain_specified, domain_initial_dot, path, path_specified, secure, expires, discard, comment, comment_url, rest, rfc2109=False, ): if version is not None: version = int(version) if expires is not None: expires = int(expires) if port is None and port_specified is True: raise ValueError("if port is None, port_specified must be false") self.version = version self.name = name self.value = value self.port = port self.port_specified = port_specified # normalise case, as per RFC 2965 section 3.3.3 self.domain = domain.lower() self.domain_specified = domain_specified # Sigh. We need to know whether the domain given in the # cookie-attribute had an initial dot, in order to follow RFC 2965 # (as clarified in draft errata). Needed for the returned $Domain # value. 
self.domain_initial_dot = domain_initial_dot self.path = path self.path_specified = path_specified self.secure = secure self.expires = expires self.discard = discard self.comment = comment self.comment_url = comment_url self.rfc2109 = rfc2109 self._rest = copy.copy(rest) def has_nonstandard_attr(self, name): return self._rest.has_key(name) def get_nonstandard_attr(self, name, default=None): return self._rest.get(name, default) def set_nonstandard_attr(self, name, value): self._rest[name] = value def nonstandard_attr_keys(self): return self._rest.keys() def is_expired(self, now=None): if now is None: now = time.time() return (self.expires is not None) and (self.expires <= now) def __eq__(self, other): return all(getattr(self, a) == getattr(other, a) for a in self._attrs) def __ne__(self, other): return not (self == other) def __str__(self): if self.port is None: p = "" else: p = ":"+self.port limit = self.domain + p + self.path if self.value is not None: namevalue = "%s=%s" % (self.name, self.value) else: namevalue = self.name return "<Cookie %s for %s>" % (namevalue, limit) def __repr__(self): args = [] for name in ["version", "name", "value", "port", "port_specified", "domain", "domain_specified", "domain_initial_dot", "path", "path_specified", "secure", "expires", "discard", "comment", "comment_url", ]: attr = getattr(self, name) args.append("%s=%s" % (name, repr(attr))) args.append("rest=%s" % repr(self._rest)) args.append("rfc2109=%s" % repr(self.rfc2109)) return "Cookie(%s)" % ", ".join(args) class CookiePolicy: """Defines which cookies get accepted from and returned to server. May also modify cookies. The subclass DefaultCookiePolicy defines the standard rules for Netscape and RFC 2965 cookies -- override that if you want a customised policy. As well as implementing set_ok and return_ok, implementations of this interface must also supply the following attributes, indicating which protocols should be used, and how. 
These can be read and set at any time, though whether that makes complete sense from the protocol point of view is doubtful. Public attributes: netscape: implement netscape protocol rfc2965: implement RFC 2965 protocol rfc2109_as_netscape: WARNING: This argument will change or go away if is not accepted into the Python standard library in this form! If true, treat RFC 2109 cookies as though they were Netscape cookies. The default is for this attribute to be None, which means treat 2109 cookies as RFC 2965 cookies unless RFC 2965 handling is switched off (which it is, by default), and as Netscape cookies otherwise. hide_cookie2: don't add Cookie2 header to requests (the presence of this header indicates to the server that we understand RFC 2965 cookies) """ def set_ok(self, cookie, request): """Return true if (and only if) cookie should be accepted from server. Currently, pre-expired cookies never get this far -- the CookieJar class deletes such cookies itself. cookie: mechanize.Cookie object request: object implementing the interface defined by CookieJar.extract_cookies.__doc__ """ raise NotImplementedError() def return_ok(self, cookie, request): """Return true if (and only if) cookie should be returned to server. cookie: mechanize.Cookie object request: object implementing the interface defined by CookieJar.add_cookie_header.__doc__ """ raise NotImplementedError() def domain_return_ok(self, domain, request): """Return false if cookies should not be returned, given cookie domain. This is here as an optimization, to remove the need for checking every cookie with a particular domain (which may involve reading many files). The default implementations of domain_return_ok and path_return_ok (return True) leave all the work to return_ok. If domain_return_ok returns true for the cookie domain, path_return_ok is called for the cookie path. Otherwise, path_return_ok and return_ok are never called for that cookie domain. 
If path_return_ok returns true, return_ok is called with the Cookie object itself for a full check. Otherwise, return_ok is never called for that cookie path. Note that domain_return_ok is called for every *cookie* domain, not just for the *request* domain. For example, the function might be called with both ".acme.com" and "www.acme.com" if the request domain is "www.acme.com". The same goes for path_return_ok. For argument documentation, see the docstring for return_ok. """ return True def path_return_ok(self, path, request): """Return false if cookies should not be returned, given cookie path. See the docstring for domain_return_ok. """ return True class DefaultCookiePolicy(CookiePolicy): """Implements the standard rules for accepting and returning cookies. Both RFC 2965 and Netscape cookies are covered. RFC 2965 handling is switched off by default. The easiest way to provide your own policy is to override this class and call its methods in your overriden implementations before adding your own additional checks. import mechanize class MyCookiePolicy(mechanize.DefaultCookiePolicy): def set_ok(self, cookie, request): if not mechanize.DefaultCookiePolicy.set_ok( self, cookie, request): return False if i_dont_want_to_store_this_cookie(): return False return True In addition to the features required to implement the CookiePolicy interface, this class allows you to block and allow domains from setting and receiving cookies. There are also some strictness switches that allow you to tighten up the rather loose Netscape protocol rules a little bit (at the cost of blocking some benign cookies). A domain blacklist and whitelist is provided (both off by default). Only domains not in the blacklist and present in the whitelist (if the whitelist is active) participate in cookie setting and returning. Use the blocked_domains constructor argument, and blocked_domains and set_blocked_domains methods (and the corresponding argument and methods for allowed_domains). 
If you set a whitelist, you can turn it off again by setting it to None. Domains in block or allow lists that do not start with a dot must string-compare equal. For example, "acme.com" matches a blacklist entry of "acme.com", but "www.acme.com" does not. Domains that do start with a dot are matched by more specific domains too. For example, both "www.acme.com" and "www.munitions.acme.com" match ".acme.com" (but "acme.com" itself does not). IP addresses are an exception, and must match exactly. For example, if blocked_domains contains "192.168.1.2" and ".168.1.2" 192.168.1.2 is blocked, but 193.168.1.2 is not. Additional Public Attributes: General strictness switches strict_domain: don't allow sites to set two-component domains with country-code top-level domains like .co.uk, .gov.uk, .co.nz. etc. This is far from perfect and isn't guaranteed to work! RFC 2965 protocol strictness switches strict_rfc2965_unverifiable: follow RFC 2965 rules on unverifiable transactions (usually, an unverifiable transaction is one resulting from a redirect or an image hosted on another site); if this is false, cookies are NEVER blocked on the basis of verifiability Netscape protocol strictness switches strict_ns_unverifiable: apply RFC 2965 rules on unverifiable transactions even to Netscape cookies strict_ns_domain: flags indicating how strict to be with domain-matching rules for Netscape cookies: DomainStrictNoDots: when setting cookies, host prefix must not contain a dot (e.g. www.foo.bar.com can't set a cookie for .bar.com, because www.foo contains a dot) DomainStrictNonDomain: cookies that did not explicitly specify a Domain cookie-attribute can only be returned to a domain that string-compares equal to the domain that set the cookie (e.g. 
       rockets.acme.com won't be returned cookies from acme.com that had no
       Domain cookie-attribute)

      DomainRFC2965Match: when setting cookies, require a full RFC 2965
       domain-match

      DomainLiberal and DomainStrict are the most useful combinations of
       the above flags, for convenience

     strict_ns_set_initial_dollar: ignore cookies in Set-Cookie: headers
      that have names starting with '$'

     strict_ns_set_path: don't allow setting cookies whose path doesn't
      path-match request URI

    """

    DomainStrictNoDots = 1
    DomainStrictNonDomain = 2
    DomainRFC2965Match = 4

    DomainLiberal = 0
    DomainStrict = DomainStrictNoDots|DomainStrictNonDomain

    def __init__(self,
                 blocked_domains=None, allowed_domains=None,
                 netscape=True, rfc2965=False,
                 # WARNING: this argument will change or go away if is not
                 # accepted into the Python standard library in this form!
                 # default, ie. treat 2109 as netscape iff not rfc2965
                 rfc2109_as_netscape=None,
                 hide_cookie2=False,
                 strict_domain=False,
                 strict_rfc2965_unverifiable=True,
                 strict_ns_unverifiable=False,
                 strict_ns_domain=DomainLiberal,
                 strict_ns_set_initial_dollar=False,
                 strict_ns_set_path=False,
                 ):
        """
        Constructor arguments should be used as keyword arguments only.

        blocked_domains: sequence of domain names that we never accept cookies
         from, nor return cookies to

        allowed_domains: if not None, this is a sequence of the only domains
         for which we accept and return cookies

        For other arguments, see CookiePolicy.__doc__ and
        DefaultCookiePolicy.__doc__.
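The block/allow-list matching rules described in the class docstring
(dot-prefixed entries match any more-specific domain; other entries, and IP
addresses, must compare equal) can be sketched as follows.
`user_domain_match_sketch` is a hypothetical name for this example, mirroring
the module-level user_domain_match:

```python
import re

IPV4_RE = re.compile(r"\.\d+$")

def user_domain_match_sketch(a, b):
    # Liberal, case-insensitive match used for block/allow lists.
    a, b = a.lower(), b.lower()
    if IPV4_RE.search(a) or IPV4_RE.search(b):
        # IP addresses only ever match exactly.
        return a == b
    if b.startswith("."):
        # ".acme.com" matches www.acme.com, www.munitions.acme.com, ...
        return a.endswith(b)
    return a == b

print(user_domain_match_sketch("www.acme.com", ".acme.com"))  # True
print(user_domain_match_sketch("acme.com", ".acme.com"))      # False
print(user_domain_match_sketch("192.168.1.2", ".168.1.2"))    # False: exact only
```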
""" self.netscape = netscape self.rfc2965 = rfc2965 self.rfc2109_as_netscape = rfc2109_as_netscape self.hide_cookie2 = hide_cookie2 self.strict_domain = strict_domain self.strict_rfc2965_unverifiable = strict_rfc2965_unverifiable self.strict_ns_unverifiable = strict_ns_unverifiable self.strict_ns_domain = strict_ns_domain self.strict_ns_set_initial_dollar = strict_ns_set_initial_dollar self.strict_ns_set_path = strict_ns_set_path if blocked_domains is not None: self._blocked_domains = tuple(blocked_domains) else: self._blocked_domains = () if allowed_domains is not None: allowed_domains = tuple(allowed_domains) self._allowed_domains = allowed_domains def blocked_domains(self): """Return the sequence of blocked domains (as a tuple).""" return self._blocked_domains def set_blocked_domains(self, blocked_domains): """Set the sequence of blocked domains.""" self._blocked_domains = tuple(blocked_domains) def is_blocked(self, domain): for blocked_domain in self._blocked_domains: if user_domain_match(domain, blocked_domain): return True return False def allowed_domains(self): """Return None, or the sequence of allowed domains (as a tuple).""" return self._allowed_domains def set_allowed_domains(self, allowed_domains): """Set the sequence of allowed domains, or None.""" if allowed_domains is not None: allowed_domains = tuple(allowed_domains) self._allowed_domains = allowed_domains def is_not_allowed(self, domain): if self._allowed_domains is None: return False for allowed_domain in self._allowed_domains: if user_domain_match(domain, allowed_domain): return False return True def set_ok(self, cookie, request): """ If you override set_ok, be sure to call this method. If it returns false, so should your subclass (assuming your subclass wants to be more strict about which cookies to accept). 
""" debug(" - checking cookie %s", cookie) assert cookie.name is not None for n in "version", "verifiability", "name", "path", "domain", "port": fn_name = "set_ok_"+n fn = getattr(self, fn_name) if not fn(cookie, request): return False return True def set_ok_version(self, cookie, request): if cookie.version is None: # Version is always set to 0 by parse_ns_headers if it's a Netscape # cookie, so this must be an invalid RFC 2965 cookie. debug(" Set-Cookie2 without version attribute (%s)", cookie) return False if cookie.version > 0 and not self.rfc2965: debug(" RFC 2965 cookies are switched off") return False elif cookie.version == 0 and not self.netscape: debug(" Netscape cookies are switched off") return False return True def set_ok_verifiability(self, cookie, request): if request_is_unverifiable(request) and is_third_party(request): if cookie.version > 0 and self.strict_rfc2965_unverifiable: debug(" third-party RFC 2965 cookie during " "unverifiable transaction") return False elif cookie.version == 0 and self.strict_ns_unverifiable: debug(" third-party Netscape cookie during " "unverifiable transaction") return False return True def set_ok_name(self, cookie, request): # Try and stop servers setting V0 cookies designed to hack other # servers that know both V0 and V1 protocols. if (cookie.version == 0 and self.strict_ns_set_initial_dollar and cookie.name.startswith("$")): debug(" illegal name (starts with '$'): '%s'", cookie.name) return False return True def set_ok_path(self, cookie, request): if cookie.path_specified: req_path = request_path(request) if ((cookie.version > 0 or (cookie.version == 0 and self.strict_ns_set_path)) and not req_path.startswith(cookie.path)): debug(" path attribute %s is not a prefix of request " "path %s", cookie.path, req_path) return False return True def set_ok_countrycode_domain(self, cookie, request): """Return False if explicit cookie domain is not acceptable. Called by set_ok_domain, for convenience of overriding by subclasses. 
""" if cookie.domain_specified and self.strict_domain: domain = cookie.domain # since domain was specified, we know that: assert domain.startswith(".") if domain.count(".") == 2: # domain like .foo.bar i = domain.rfind(".") tld = domain[i+1:] sld = domain[1:i] if (sld.lower() in [ "co", "ac", "com", "edu", "org", "net", "gov", "mil", "int", "aero", "biz", "cat", "coop", "info", "jobs", "mobi", "museum", "name", "pro", "travel", ] and len(tld) == 2): # domain like .co.uk return False return True def set_ok_domain(self, cookie, request): if self.is_blocked(cookie.domain): debug(" domain %s is in user block-list", cookie.domain) return False if self.is_not_allowed(cookie.domain): debug(" domain %s is not in user allow-list", cookie.domain) return False if not self.set_ok_countrycode_domain(cookie, request): debug(" country-code second level domain %s", cookie.domain) return False if cookie.domain_specified: req_host, erhn = eff_request_host_lc(request) domain = cookie.domain if domain.startswith("."): undotted_domain = domain[1:] else: undotted_domain = domain embedded_dots = (undotted_domain.find(".") >= 0) if not embedded_dots and domain != ".local": debug(" non-local domain %s contains no embedded dot", domain) return False if cookie.version == 0: if (not erhn.endswith(domain) and (not erhn.startswith(".") and not ("."+erhn).endswith(domain))): debug(" effective request-host %s (even with added " "initial dot) does not end end with %s", erhn, domain) return False if (cookie.version > 0 or (self.strict_ns_domain & self.DomainRFC2965Match)): if not domain_match(erhn, domain): debug(" effective request-host %s does not domain-match " "%s", erhn, domain) return False if (cookie.version > 0 or (self.strict_ns_domain & self.DomainStrictNoDots)): host_prefix = req_host[:-len(domain)] if (host_prefix.find(".") >= 0 and not IPV4_RE.search(req_host)): debug(" host prefix %s for domain %s contains a dot", host_prefix, domain) return False return True def set_ok_port(self, 
cookie, request): if cookie.port_specified: req_port = request_port(request) if req_port is None: req_port = "80" else: req_port = str(req_port) for p in cookie.port.split(","): try: int(p) except ValueError: debug(" bad port %s (not numeric)", p) return False if p == req_port: break else: debug(" request port (%s) not found in %s", req_port, cookie.port) return False return True def return_ok(self, cookie, request): """ If you override return_ok, be sure to call this method. If it returns false, so should your subclass (assuming your subclass wants to be more strict about which cookies to return). """ # Path has already been checked by path_return_ok, and domain blocking # done by domain_return_ok. debug(" - checking cookie %s", cookie) for n in ("version", "verifiability", "secure", "expires", "port", "domain"): fn_name = "return_ok_"+n fn = getattr(self, fn_name) if not fn(cookie, request): return False return True def return_ok_version(self, cookie, request): if cookie.version > 0 and not self.rfc2965: debug(" RFC 2965 cookies are switched off") return False elif cookie.version == 0 and not self.netscape: debug(" Netscape cookies are switched off") return False return True def return_ok_verifiability(self, cookie, request): if request_is_unverifiable(request) and is_third_party(request): if cookie.version > 0 and self.strict_rfc2965_unverifiable: debug(" third-party RFC 2965 cookie during unverifiable " "transaction") return False elif cookie.version == 0 and self.strict_ns_unverifiable: debug(" third-party Netscape cookie during unverifiable " "transaction") return False return True def return_ok_secure(self, cookie, request): if cookie.secure and request.get_type() != "https": debug(" secure cookie with non-secure request") return False return True def return_ok_expires(self, cookie, request): if cookie.is_expired(self._now): debug(" cookie expired") return False return True def return_ok_port(self, cookie, request): if cookie.port: req_port = 
request_port(request) if req_port is None: req_port = "80" for p in cookie.port.split(","): if p == req_port: break else: debug(" request port %s does not match cookie port %s", req_port, cookie.port) return False return True def return_ok_domain(self, cookie, request): req_host, erhn = eff_request_host_lc(request) domain = cookie.domain # strict check of non-domain cookies: Mozilla does this, MSIE5 doesn't if (cookie.version == 0 and (self.strict_ns_domain & self.DomainStrictNonDomain) and not cookie.domain_specified and domain != erhn): debug(" cookie with unspecified domain does not string-compare " "equal to request domain") return False if cookie.version > 0 and not domain_match(erhn, domain): debug(" effective request-host name %s does not domain-match " "RFC 2965 cookie domain %s", erhn, domain) return False if cookie.version == 0 and not ("."+erhn).endswith(domain): debug(" request-host %s does not match Netscape cookie domain " "%s", req_host, domain) return False return True def domain_return_ok(self, domain, request): # Liberal check of domain. This is here as an optimization to avoid # having to load lots of MSIE cookie files unless necessary. # Munge req_host and erhn to always start with a dot, so as to err on # the side of letting cookies through. 
        dotted_req_host, dotted_erhn = eff_request_host_lc(request)
        if not dotted_req_host.startswith("."):
            dotted_req_host = "."+dotted_req_host
        if not dotted_erhn.startswith("."):
            dotted_erhn = "."+dotted_erhn
        if not (dotted_req_host.endswith(domain) or
                dotted_erhn.endswith(domain)):
            #debug("   request domain %s does not match cookie domain %s",
            #      req_host, domain)
            return False

        if self.is_blocked(domain):
            debug("   domain %s is in user block-list", domain)
            return False
        if self.is_not_allowed(domain):
            debug("   domain %s is not in user allow-list", domain)
            return False

        return True

    def path_return_ok(self, path, request):
        debug("- checking cookie path=%s", path)
        req_path = request_path(request)
        if not req_path.startswith(path):
            debug("  %s does not path-match %s", req_path, path)
            return False
        return True


def vals_sorted_by_key(adict):
    keys = adict.keys()
    keys.sort()
    return map(adict.get, keys)


class MappingIterator:
    """Iterates over nested mapping, depth-first, in sorted order by key."""
    def __init__(self, mapping):
        self._s = [(vals_sorted_by_key(mapping), 0, None)]  # LIFO stack

    def __iter__(self): return self

    def next(self):
        # this is hairy because of lack of generators
        while 1:
            try:
                vals, i, prev_item = self._s.pop()
            except IndexError:
                raise StopIteration()
            if i < len(vals):
                item = vals[i]
                i = i + 1
                self._s.append((vals, i, prev_item))
                try:
                    item.items
                except AttributeError:
                    # non-mapping
                    break
                else:
                    # mapping
                    self._s.append((vals_sorted_by_key(item), 0, item))
                    continue
        return item


# Used as second parameter to dict.get method, to distinguish absent
# dict key from one with a None value.
class Absent: pass


class CookieJar:
    """Collection of HTTP cookies.

    You may not need to know about this class: try mechanize.urlopen().

    The major methods are extract_cookies and add_cookie_header; these are
    all you are likely to need.

    CookieJar supports the iterator protocol:

    for cookie in cookiejar:
        # do something with cookie

    Methods:

    add_cookie_header(request)
    extract_cookies(response, request)
    get_policy()
    set_policy(policy)
    cookies_for_request(request)
    make_cookies(response, request)
    set_cookie_if_ok(cookie, request)
    set_cookie(cookie)
    clear_session_cookies()
    clear_expired_cookies()
    clear(domain=None, path=None, name=None)

    Public attributes

    policy: CookiePolicy object

    """

    non_word_re = re.compile(r"\W")
    quote_re = re.compile(r"([\"\\])")
    strict_domain_re = re.compile(r"\.?[^.]*")
    domain_re = re.compile(r"[^.]*")
    dots_re = re.compile(r"^\.+")

    def __init__(self, policy=None):
        """
        See CookieJar.__doc__ for argument documentation.

        """
        if policy is None:
            policy = DefaultCookiePolicy()
        self._policy = policy

        self._cookies = {}

        # for __getitem__ iteration in pre-2.2 Pythons
        self._prev_getitem_index = 0

    def get_policy(self):
        return self._policy

    def set_policy(self, policy):
        self._policy = policy

    def _cookies_for_domain(self, domain, request):
        cookies = []
        if not self._policy.domain_return_ok(domain, request):
            return []
        debug("Checking %s for cookies to return", domain)
        cookies_by_path = self._cookies[domain]
        for path in cookies_by_path.keys():
            if not self._policy.path_return_ok(path, request):
                continue
            cookies_by_name = cookies_by_path[path]
            for cookie in cookies_by_name.values():
                if not self._policy.return_ok(cookie, request):
                    debug("   not returning cookie")
                    continue
                debug("   it's a match")
                cookies.append(cookie)
        return cookies

    def cookies_for_request(self, request):
        """Return a list of cookies to be returned to server.

        The returned list of cookie instances is sorted in the order they
        should appear in the Cookie: header for return to the server.

        See add_cookie_header.__doc__ for the interface required of the
        request argument.

        New in version 0.1.10

        """
        self._policy._now = self._now = int(time.time())
        cookies = self._cookies_for_request(request)
        # add cookies in order of most specific (i.e. longest) path first
        def decreasing_size(a, b): return cmp(len(b.path), len(a.path))
        cookies.sort(decreasing_size)
        return cookies

    def _cookies_for_request(self, request):
        """Return a list of cookies to be returned to server."""
        # this method still exists (alongside cookies_for_request) because it
        # is part of an implied protected interface for subclasses of
        # cookiejar
        # XXX document that implied interface, or provide another way of
        # implementing cookiejars than subclassing
        cookies = []
        for domain in self._cookies.keys():
            cookies.extend(self._cookies_for_domain(domain, request))
        return cookies

    def _cookie_attrs(self, cookies):
        """Return a list of cookie-attributes to be returned to server.

        The $Version attribute is also added when appropriate (currently only
        once per request).

        >>> jar = CookieJar()
        >>> ns_cookie = Cookie(0, "foo", '"bar"', None, False,
        ...                    "example.com", False, False,
        ...                    "/", False, False, None, True,
        ...                    None, None, {})
        >>> jar._cookie_attrs([ns_cookie])
        ['foo="bar"']
        >>> rfc2965_cookie = Cookie(1, "foo", "bar", None, False,
        ...                         ".example.com", True, False,
        ...                         "/", False, False, None, True,
        ...                         None, None, {})
        >>> jar._cookie_attrs([rfc2965_cookie])
        ['$Version=1', 'foo=bar', '$Domain="example.com"']

        """
        version_set = False

        attrs = []
        for cookie in cookies:
            # set version of Cookie header
            # XXX
            # What should it be if multiple matching Set-Cookie headers have
            #  different versions themselves?
            # Answer: there is no answer; was supposed to be settled by
            #  RFC 2965 errata, but that may never appear...
            version = cookie.version
            if not version_set:
                version_set = True
                if version > 0:
                    attrs.append("$Version=%s" % version)

            # quote cookie value if necessary
            # (not for Netscape protocol, which already has any quotes
            #  intact, due to the poorly-specified Netscape Cookie: syntax)
            if ((cookie.value is not None) and
                self.non_word_re.search(cookie.value) and version > 0):
                value = self.quote_re.sub(r"\\\1", cookie.value)
            else:
                value = cookie.value

            # add cookie-attributes to be returned in Cookie header
            if cookie.value is None:
                attrs.append(cookie.name)
            else:
                attrs.append("%s=%s" % (cookie.name, value))
            if version > 0:
                if cookie.path_specified:
                    attrs.append('$Path="%s"' % cookie.path)
                if cookie.domain.startswith("."):
                    domain = cookie.domain
                    if (not cookie.domain_initial_dot and
                        domain.startswith(".")):
                        domain = domain[1:]
                    attrs.append('$Domain="%s"' % domain)
                if cookie.port is not None:
                    p = "$Port"
                    if cookie.port_specified:
                        p = p + ('="%s"' % cookie.port)
                    attrs.append(p)

        return attrs

    def add_cookie_header(self, request):
        """Add correct Cookie: header to request (mechanize.Request object).

        The Cookie2 header is also added unless policy.hide_cookie2 is true.

        The request object (usually a mechanize.Request instance) must support
        the methods get_full_url, get_host, is_unverifiable, get_type,
        has_header, get_header, header_items and add_unredirected_header, as
        documented by urllib2.

        """
        debug("add_cookie_header")
        cookies = self.cookies_for_request(request)

        attrs = self._cookie_attrs(cookies)
        if attrs:
            if not request.has_header("Cookie"):
                request.add_unredirected_header("Cookie", "; ".join(attrs))

        # if necessary, advertise that we know RFC 2965
        if self._policy.rfc2965 and not self._policy.hide_cookie2:
            for cookie in cookies:
                if cookie.version != 1 and not request.has_header("Cookie2"):
                    request.add_unredirected_header("Cookie2", '$Version="1"')
                    break

        self.clear_expired_cookies()

    def _normalized_cookie_tuples(self, attrs_set):
        """Return list of tuples containing normalised cookie information.

        attrs_set is the list of lists of key,value pairs extracted from
        the Set-Cookie or Set-Cookie2 headers.

        Tuples are name, value, standard, rest, where name and value are the
        cookie name and value, standard is a dictionary containing the
        standard cookie-attributes (discard, secure, version, expires or
        max-age, domain, path and port) and rest is a dictionary containing
        the rest of the cookie-attributes.

        """
        cookie_tuples = []

        boolean_attrs = "discard", "secure"
        value_attrs = ("version",
                       "expires", "max-age",
                       "domain", "path", "port",
                       "comment", "commenturl")

        for cookie_attrs in attrs_set:
            name, value = cookie_attrs[0]

            # Build dictionary of standard cookie-attributes (standard) and
            # dictionary of other cookie-attributes (rest).

            # Note: expiry time is normalised to seconds since epoch.  V0
            # cookies should have the Expires cookie-attribute, and V1 cookies
            # should have Max-Age, but since V1 includes RFC 2109 cookies (and
            # since V0 cookies may be a mish-mash of Netscape and RFC 2109),
            # we accept either (but prefer Max-Age).
            max_age_set = False

            bad_cookie = False

            standard = {}
            rest = {}
            for k, v in cookie_attrs[1:]:
                lc = k.lower()
                # don't lose case distinction for unknown fields
                if lc in value_attrs or lc in boolean_attrs:
                    k = lc
                if k in boolean_attrs and v is None:
                    # boolean cookie-attribute is present, but has no value
                    # (like "discard", rather than "port=80")
                    v = True
                if standard.has_key(k):
                    # only first value is significant
                    continue
                if k == "domain":
                    if v is None:
                        debug("   missing value for domain attribute")
                        bad_cookie = True
                        break
                    # RFC 2965 section 3.3.3
                    v = v.lower()
                if k == "expires":
                    if max_age_set:
                        # Prefer max-age to expires (like Mozilla)
                        continue
                    if v is None:
                        debug("   missing or invalid value for expires "
                              "attribute: treating as session cookie")
                        continue
                if k == "max-age":
                    max_age_set = True
                    if v is None:
                        debug("   missing value for max-age attribute")
                        bad_cookie = True
                        break
                    try:
                        v = int(v)
                    except ValueError:
                        debug("   missing or invalid (non-numeric) value for "
                              "max-age attribute")
                        bad_cookie = True
                        break
                    # convert RFC 2965 Max-Age to seconds since epoch
                    # XXX Strictly you're supposed to follow RFC 2616
                    #   age-calculation rules.  Remember that zero Max-Age
                    #   is a request to discard (old and new) cookie, though.
                    k = "expires"
                    v = self._now + v
                if (k in value_attrs) or (k in boolean_attrs):
                    if (v is None and
                        k not in ["port", "comment", "commenturl"]):
                        debug("   missing value for %s attribute" % k)
                        bad_cookie = True
                        break
                    standard[k] = v
                else:
                    rest[k] = v

            if bad_cookie:
                continue

            cookie_tuples.append((name, value, standard, rest))

        return cookie_tuples

    def _cookie_from_cookie_tuple(self, tup, request):
        # standard is dict of standard cookie-attributes, rest is dict of the
        # rest of them
        name, value, standard, rest = tup

        domain = standard.get("domain", Absent)
        path = standard.get("path", Absent)
        port = standard.get("port", Absent)
        expires = standard.get("expires", Absent)

        # set the easy defaults
        version = standard.get("version", None)
        if version is not None:
            try:
                version = int(version)
            except ValueError:
                return None  # invalid version, ignore cookie
        secure = standard.get("secure", False)
        # (discard is also set if expires is Absent)
        discard = standard.get("discard", False)
        comment = standard.get("comment", None)
        comment_url = standard.get("commenturl", None)

        # set default path
        if path is not Absent and path != "":
            path_specified = True
            path = escape_path(path)
        else:
            path_specified = False
            path = request_path(request)
            i = path.rfind("/")
            if i != -1:
                if version == 0:
                    # Netscape spec parts company from reality here
                    path = path[:i]
                else:
                    path = path[:i+1]
            if len(path) == 0: path = "/"

        # set default domain
        domain_specified = domain is not Absent
        # but first we have to remember whether it starts with a dot
        domain_initial_dot = False
        if domain_specified:
            domain_initial_dot = bool(domain.startswith("."))
        if domain is Absent:
            req_host, erhn = eff_request_host_lc(request)
            domain = erhn
        elif not domain.startswith("."):
            domain = "."+domain

        # set default port
        port_specified = False
        if port is not Absent:
            if port is None:
                # Port attr present, but has no value: default to request
                # port.  Cookie should then only be sent back on that port.
                port = request_port(request)
            else:
                port_specified = True
                port = re.sub(r"\s+", "", port)
        else:
            # No port attr present.  Cookie can be sent back on any port.
            port = None

        # set default expires and discard
        if expires is Absent:
            expires = None
            discard = True

        return Cookie(version,
                      name, value,
                      port, port_specified,
                      domain, domain_specified, domain_initial_dot,
                      path, path_specified,
                      secure,
                      expires,
                      discard,
                      comment,
                      comment_url,
                      rest)

    def _cookies_from_attrs_set(self, attrs_set, request):
        cookie_tuples = self._normalized_cookie_tuples(attrs_set)

        cookies = []
        for tup in cookie_tuples:
            cookie = self._cookie_from_cookie_tuple(tup, request)
            if cookie: cookies.append(cookie)
        return cookies

    def _process_rfc2109_cookies(self, cookies):
        if self._policy.rfc2109_as_netscape is None:
            rfc2109_as_netscape = not self._policy.rfc2965
        else:
            rfc2109_as_netscape = self._policy.rfc2109_as_netscape
        for cookie in cookies:
            if cookie.version == 1:
                cookie.rfc2109 = True
                if rfc2109_as_netscape:
                    # treat 2109 cookies as Netscape cookies rather than
                    # as RFC2965 cookies
                    cookie.version = 0

    def _make_cookies(self, response, request):
        # get cookie-attributes for RFC 2965 and Netscape protocols
        headers = response.info()
        rfc2965_hdrs = headers.getheaders("Set-Cookie2")
        ns_hdrs = headers.getheaders("Set-Cookie")

        rfc2965 = self._policy.rfc2965
        netscape = self._policy.netscape

        if ((not rfc2965_hdrs and not ns_hdrs) or
            (not ns_hdrs and not rfc2965) or
            (not rfc2965_hdrs and not netscape) or
            (not netscape and not rfc2965)):
            return []  # no relevant cookie headers: quick exit

        try:
            cookies = self._cookies_from_attrs_set(
                split_header_words(rfc2965_hdrs), request)
        except:
            reraise_unmasked_exceptions()
            cookies = []

        if ns_hdrs and netscape:
            try:
                # RFC 2109 and Netscape cookies
                ns_cookies = self._cookies_from_attrs_set(
                    parse_ns_headers(ns_hdrs), request)
            except:
                reraise_unmasked_exceptions()
                ns_cookies = []
            self._process_rfc2109_cookies(ns_cookies)

            # Look for Netscape cookies (from Set-Cookie headers) that match
            # corresponding RFC 2965 cookies (from Set-Cookie2 headers).
            # For each match, keep the RFC 2965 cookie and ignore the Netscape
            # cookie (RFC 2965 section 9.1).  Actually, RFC 2109 cookies are
            # bundled in with the Netscape cookies for this purpose, which is
            # reasonable behaviour.
            if rfc2965:
                lookup = {}
                for cookie in cookies:
                    lookup[(cookie.domain, cookie.path, cookie.name)] = None

                def no_matching_rfc2965(ns_cookie, lookup=lookup):
                    key = ns_cookie.domain, ns_cookie.path, ns_cookie.name
                    return not lookup.has_key(key)
                ns_cookies = filter(no_matching_rfc2965, ns_cookies)

            if ns_cookies:
                cookies.extend(ns_cookies)

        return cookies

    def make_cookies(self, response, request):
        """Return sequence of Cookie objects extracted from response object.

        See extract_cookies.__doc__ for the interface required of the
        response and request arguments.

        """
        self._policy._now = self._now = int(time.time())
        return [cookie for cookie in self._make_cookies(response, request)
                if cookie.expires is None or not cookie.expires <= self._now]

    def set_cookie_if_ok(self, cookie, request):
        """Set a cookie if policy says it's OK to do so.

        cookie: mechanize.Cookie instance
        request: see extract_cookies.__doc__ for the required interface

        """
        self._policy._now = self._now = int(time.time())

        if self._policy.set_ok(cookie, request):
            self.set_cookie(cookie)

    def set_cookie(self, cookie):
        """Set a cookie, without checking whether or not it should be set.

        cookie: mechanize.Cookie instance
        """
        c = self._cookies
        if not c.has_key(cookie.domain): c[cookie.domain] = {}
        c2 = c[cookie.domain]
        if not c2.has_key(cookie.path): c2[cookie.path] = {}
        c3 = c2[cookie.path]
        c3[cookie.name] = cookie

    def extract_cookies(self, response, request):
        """Extract cookies from response, where allowable given the request.

        Look for allowable Set-Cookie: and Set-Cookie2: headers in the
        response object passed as argument.  Any of these headers that are
        found are used to update the state of the object (subject to the
        policy.set_ok method's approval).

        The response object (usually the result of a call to
        mechanize.urlopen, or similar) should support an info method, which
        returns a mimetools.Message object (in fact, the 'mimetools.Message
        object' may be any object that provides a getheaders method).

        The request object (usually a mechanize.Request instance) must
        support the methods get_full_url, get_type, get_host, and
        is_unverifiable, as documented by mechanize, and the port attribute
        (the port number).  The request is used to set default values for
        cookie-attributes as well as for checking that the cookie is OK to be
        set.

        """
        debug("extract_cookies: %s", response.info())
        self._policy._now = self._now = int(time.time())

        for cookie in self._make_cookies(response, request):
            if cookie.expires is not None and cookie.expires <= self._now:
                # Expiry date in past is request to delete cookie.  This
                # can't be in DefaultCookiePolicy, because it can't delete
                # cookies there.
                try:
                    self.clear(cookie.domain, cookie.path, cookie.name)
                except KeyError:
                    pass
                debug("Expiring cookie, domain='%s', path='%s', name='%s'",
                      cookie.domain, cookie.path, cookie.name)
            elif self._policy.set_ok(cookie, request):
                debug(" setting cookie: %s", cookie)
                self.set_cookie(cookie)

    def clear(self, domain=None, path=None, name=None):
        """Clear some cookies.

        Invoking this method without arguments will clear all cookies.  If
        given a single argument, only cookies belonging to that domain will
        be removed.  If given two arguments, cookies belonging to the
        specified path within that domain are removed.  If given three
        arguments, then the cookie with the specified name, path and domain
        is removed.

        Raises KeyError if no matching cookie exists.

        """
        if name is not None:
            if (domain is None) or (path is None):
                raise ValueError(
                    "domain and path must be given to remove a cookie by name")
            del self._cookies[domain][path][name]
        elif path is not None:
            if domain is None:
                raise ValueError(
                    "domain must be given to remove cookies by path")
            del self._cookies[domain][path]
        elif domain is not None:
            del self._cookies[domain]
        else:
            self._cookies = {}

    def clear_session_cookies(self):
        """Discard all session cookies.

        Discards all cookies held by object which had either no Max-Age or
        Expires cookie-attribute or an explicit Discard cookie-attribute, or
        which otherwise have ended up with a true discard attribute.  For
        interactive browsers, the end of a session usually corresponds to
        closing the browser window.

        Note that the save method won't save session cookies anyway, unless
        you ask otherwise by passing a true ignore_discard argument.

        """
        for cookie in self:
            if cookie.discard:
                self.clear(cookie.domain, cookie.path, cookie.name)

    def clear_expired_cookies(self):
        """Discard all expired cookies.

        You probably don't need to call this method: expired cookies are
        never sent back to the server (provided you're using
        DefaultCookiePolicy), this method is called by CookieJar itself every
        so often, and the save method won't save expired cookies anyway
        (unless you ask otherwise by passing a true ignore_expires argument).
""" now = time.time() for cookie in self: if cookie.is_expired(now): self.clear(cookie.domain, cookie.path, cookie.name) def __getitem__(self, i): if i == 0: self._getitem_iterator = self.__iter__() elif self._prev_getitem_index != i-1: raise IndexError( "CookieJar.__getitem__ only supports sequential iteration") self._prev_getitem_index = i try: return self._getitem_iterator.next() except StopIteration: raise IndexError() def __iter__(self): return MappingIterator(self._cookies) def __len__(self): """Return number of contained cookies.""" i = 0 for cookie in self: i = i + 1 return i def __repr__(self): r = [] for cookie in self: r.append(repr(cookie)) return "<%s[%s]>" % (self.__class__, ", ".join(r)) def __str__(self): r = [] for cookie in self: r.append(str(cookie)) return "<%s[%s]>" % (self.__class__, ", ".join(r)) class LoadError(Exception): pass class FileCookieJar(CookieJar): """CookieJar that can be loaded from and saved to a file. Additional methods save(filename=None, ignore_discard=False, ignore_expires=False) load(filename=None, ignore_discard=False, ignore_expires=False) revert(filename=None, ignore_discard=False, ignore_expires=False) Additional public attributes filename: filename for loading and saving cookies Additional public readable attributes delayload: request that cookies are lazily loaded from disk; this is only a hint since this only affects performance, not behaviour (unless the cookies on disk are changing); a CookieJar object may ignore it (in fact, only MSIECookieJar lazily loads cookies at the moment) """ def __init__(self, filename=None, delayload=False, policy=None): """ See FileCookieJar.__doc__ for argument documentation. Cookies are NOT loaded from the named file until either the load or revert method is called. 
""" CookieJar.__init__(self, policy) if filename is not None and not isstringlike(filename): raise ValueError("filename must be string-like") self.filename = filename self.delayload = bool(delayload) def save(self, filename=None, ignore_discard=False, ignore_expires=False): """Save cookies to a file. filename: name of file in which to save cookies ignore_discard: save even cookies set to be discarded ignore_expires: save even cookies that have expired The file is overwritten if it already exists, thus wiping all its cookies. Saved cookies can be restored later using the load or revert methods. If filename is not specified, self.filename is used; if self.filename is None, ValueError is raised. """ raise NotImplementedError() def load(self, filename=None, ignore_discard=False, ignore_expires=False): """Load cookies from a file. Old cookies are kept unless overwritten by newly loaded ones. Arguments are as for .save(). If filename is not specified, self.filename is used; if self.filename is None, ValueError is raised. The named file must be in the format understood by the class, or LoadError will be raised. This format will be identical to that written by the save method, unless the load format is not sufficiently well understood (as is the case for MSIECookieJar). """ if filename is None: if self.filename is not None: filename = self.filename else: raise ValueError(MISSING_FILENAME_TEXT) f = open(filename) try: self._really_load(f, filename, ignore_discard, ignore_expires) finally: f.close() def revert(self, filename=None, ignore_discard=False, ignore_expires=False): """Clear all cookies and reload cookies from a saved file. Raises LoadError (or IOError) if reversion is not successful; the object's state will not be altered if this happens. 
""" if filename is None: if self.filename is not None: filename = self.filename else: raise ValueError(MISSING_FILENAME_TEXT) old_state = copy.deepcopy(self._cookies) self._cookies = {} try: self.load(filename, ignore_discard, ignore_expires) except (LoadError, IOError): self._cookies = old_state raise ����������������������������������mechanize-0.2.5/mechanize/_lwpcookiejar.py����������������������������������������������������������0000644�0001750�0001750�00000015753�11545150644�017324� 0����������������������������������������������������������������������������������������������������ustar �john����������������������������john�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������"""Load / save to libwww-perl (LWP) format files. Actually, the format is slightly extended from that used by LWP's (libwww-perl's) HTTP::Cookies, to avoid losing some RFC 2965 information not recorded by LWP. It uses the version string "2.0", though really there isn't an LWP Cookies 2.0 format. This indicates that there is extra information in here (domain_dot and port_spec) while still being compatible with libwww-perl, I hope. Copyright 2002-2006 John J Lee <jjl@pobox.com> Copyright 1997-1999 Gisle Aas (original libwww-perl code) This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included with the distribution). """ import time, re, logging from _clientcookie import reraise_unmasked_exceptions, FileCookieJar, Cookie, \ MISSING_FILENAME_TEXT, LoadError from _headersutil import join_header_words, split_header_words from _util import iso2time, time2isoz debug = logging.getLogger("mechanize").debug def lwp_cookie_str(cookie): """Return string representation of Cookie in an the LWP cookie file format. 
    Actually, the format is extended a bit -- see module docstring.

    """
    h = [(cookie.name, cookie.value),
         ("path", cookie.path),
         ("domain", cookie.domain)]
    if cookie.port is not None: h.append(("port", cookie.port))
    if cookie.path_specified: h.append(("path_spec", None))
    if cookie.port_specified: h.append(("port_spec", None))
    if cookie.domain_initial_dot: h.append(("domain_dot", None))
    if cookie.secure: h.append(("secure", None))
    if cookie.expires: h.append(("expires",
                                 time2isoz(float(cookie.expires))))
    if cookie.discard: h.append(("discard", None))
    if cookie.comment: h.append(("comment", cookie.comment))
    if cookie.comment_url: h.append(("commenturl", cookie.comment_url))
    if cookie.rfc2109: h.append(("rfc2109", None))

    keys = cookie.nonstandard_attr_keys()
    keys.sort()
    for k in keys:
        h.append((k, str(cookie.get_nonstandard_attr(k))))

    h.append(("version", str(cookie.version)))

    return join_header_words([h])


class LWPCookieJar(FileCookieJar):
    """
    The LWPCookieJar saves a sequence of "Set-Cookie3" lines.
    "Set-Cookie3" is the format used by the libwww-perl library, not known
    to be compatible with any browser, but which is easy to read and
    doesn't lose information about RFC 2965 cookies.

    Additional methods

    as_lwp_str(ignore_discard=True, ignore_expired=True)

    """

    magic_re = r"^\#LWP-Cookies-(\d+\.\d+)"

    def as_lwp_str(self, ignore_discard=True, ignore_expires=True):
        """Return cookies as a string of "\n"-separated "Set-Cookie3" headers.

        ignore_discard and ignore_expires: see docstring for
        FileCookieJar.save

        """
        now = time.time()
        r = []
        for cookie in self:
            if not ignore_discard and cookie.discard:
                debug("   Not saving %s: marked for discard", cookie.name)
                continue
            if not ignore_expires and cookie.is_expired(now):
                debug("   Not saving %s: expired", cookie.name)
                continue
            r.append("Set-Cookie3: %s" % lwp_cookie_str(cookie))
        return "\n".join(r+[""])

    def save(self, filename=None, ignore_discard=False, ignore_expires=False):
        if filename is None:
            if self.filename is not None: filename = self.filename
            else: raise ValueError(MISSING_FILENAME_TEXT)

        f = open(filename, "w")
        try:
            debug("Saving LWP cookies file")
            # There really isn't an LWP Cookies 2.0 format, but this indicates
            # that there is extra information in here (domain_dot and
            # port_spec) while still being compatible with libwww-perl, I
            # hope.
            f.write("#LWP-Cookies-2.0\n")
            f.write(self.as_lwp_str(ignore_discard, ignore_expires))
        finally:
            f.close()

    def _really_load(self, f, filename, ignore_discard, ignore_expires):
        magic = f.readline()
        if not re.search(self.magic_re, magic):
            msg = "%s does not seem to contain cookies" % filename
            raise LoadError(msg)

        now = time.time()

        header = "Set-Cookie3:"
        boolean_attrs = ("port_spec", "path_spec", "domain_dot",
                         "secure", "discard", "rfc2109")
        value_attrs = ("version",
                       "port", "path", "domain",
                       "expires",
                       "comment", "commenturl")

        try:
            while 1:
                line = f.readline()
                if line == "": break
                if not line.startswith(header):
                    continue
                line = line[len(header):].strip()

                for data in split_header_words([line]):
                    name, value = data[0]
                    standard = {}
                    rest = {}
                    for k in boolean_attrs:
                        standard[k] = False
                    for k, v in data[1:]:
                        if k is not None:
                            lc = k.lower()
                        else:
                            lc = None
                        # don't lose case distinction for unknown fields
                        if (lc in value_attrs) or (lc in boolean_attrs):
                            k = lc
                        if k in boolean_attrs:
                            if v is None: v = True
                            standard[k] = v
                        elif k in value_attrs:
                            standard[k] = v
                        else:
                            rest[k] = v

                    h = standard.get
                    expires = h("expires")
                    discard = h("discard")
                    if expires is not None:
                        expires = iso2time(expires)
                    if expires is None:
                        discard = True
                    domain = h("domain")
                    domain_specified = domain.startswith(".")
                    c = Cookie(h("version"), name, value,
                               h("port"), h("port_spec"),
                               domain, domain_specified, h("domain_dot"),
                               h("path"), h("path_spec"),
                               h("secure"),
                               expires,
                               discard,
                               h("comment"),
                               h("commenturl"),
                               rest,
                               h("rfc2109"),
                               )
                    if not ignore_discard and c.discard:
                        continue
                    if not ignore_expires and c.is_expired(now):
                        continue
                    self.set_cookie(c)
        except:
            reraise_unmasked_exceptions((IOError,))
            raise LoadError("invalid Set-Cookie3 format file %s" % filename)


# ===== mechanize-0.2.5/mechanize/_mechanize.py =====

"""Stateful programmatic WWW navigation, after Perl's WWW::Mechanize.

Copyright 2003-2006 John J. Lee <jjl@pobox.com>
Copyright 2003 Andy Lester (original Perl code)

This code is free software; you can redistribute it and/or modify it under
the terms of the BSD or ZPL 2.1 licenses (see the file COPYING.txt included
with the distribution).

"""

import copy, re, os, urllib, urllib2

from _html import DefaultFactory
import _response
import _request
import _rfc3986
import _sockettimeout
import _urllib2_fork
from _useragent import UserAgentBase


class BrowserStateError(Exception): pass
class LinkNotFoundError(Exception): pass
class FormNotFoundError(Exception): pass


def sanepathname2url(path):
    urlpath = urllib.pathname2url(path)
    if os.name == "nt" and urlpath.startswith("///"):
        urlpath = urlpath[2:]
    # XXX don't ask me about the mac...
    return urlpath


class History:
    """

    Though this will become public, the implied interface is not yet stable.

    """
    def __init__(self):
        self._history = []  # LIFO

    def add(self, request, response):
        self._history.append((request, response))

    def back(self, n, _response):
        response = _response  # XXX move Browser._response into this class?
        while n > 0 or response is None:
            try:
                request, response = self._history.pop()
            except IndexError:
                raise BrowserStateError("already at start of history")
            n -= 1
        return request, response

    def clear(self):
        del self._history[:]

    def close(self):
        for request, response in self._history:
            if response is not None:
                response.close()
        del self._history[:]


class HTTPRefererProcessor(_urllib2_fork.BaseHandler):
    def http_request(self, request):
        # See RFC 2616 14.36.  The only times we know the source of the
        # request URI has a URI associated with it are redirect, and
        # Browser.click() / Browser.submit() / Browser.follow_link().
        # Otherwise, it's the user's job to add any Referer header before
        # .open()ing.
        if hasattr(request, "redirect_dict"):
            request = self.parent._add_referer_header(
                request, origin_request=False)
        return request

    https_request = http_request


class Browser(UserAgentBase):
    """Browser-like class with support for history, forms and links.

    BrowserStateError is raised whenever the browser is in the wrong state to
    complete the requested operation - e.g., when .back() is called when the
    browser history is empty, or when .follow_link() is called when the
    current response does not contain HTML data.

    Public attributes:

    request: current request (mechanize.Request)
    form: currently selected form (see .select_form())

    """

    handler_classes = copy.copy(UserAgentBase.handler_classes)
    handler_classes["_referer"] = HTTPRefererProcessor
    default_features = copy.copy(UserAgentBase.default_features)
    default_features.append("_referer")

    def __init__(self,
                 factory=None,
                 history=None,
                 request_class=None,
                 ):
        """

        Only named arguments should be passed to this constructor.
        factory: object implementing the mechanize.Factory interface.
        history: object implementing the mechanize.History interface.  Note
         this interface is still experimental and may change in future.
        request_class: Request class to use.  Defaults to mechanize.Request

        The Factory and History objects passed in are 'owned' by the Browser,
        so they should not be shared across Browsers.  In particular,
        factory.set_response() should not be called except by the owning
        Browser itself.

        Note that the supplied factory's request_class is overridden by this
        constructor, to ensure only one Request class is used.

        """
        self._handle_referer = True

        if history is None:
            history = History()
        self._history = history

        if request_class is None:
            request_class = _request.Request

        if factory is None:
            factory = DefaultFactory()
        factory.set_request_class(request_class)
        self._factory = factory
        self.request_class = request_class

        self.request = None
        self._set_response(None, False)

        # do this last to avoid __getattr__ problems
        UserAgentBase.__init__(self)

    def close(self):
        UserAgentBase.close(self)
        if self._response is not None:
            self._response.close()
        if self._history is not None:
            self._history.close()
            self._history = None

        # make use after .close easy to spot
        self.form = None
        self.request = self._response = None
        self.request = self.response = self.set_response = None
        self.geturl = self.reload = self.back = None
        self.clear_history = self.set_cookie = self.links = self.forms = None
        self.viewing_html = self.encoding = self.title = None
        self.select_form = self.click = self.submit = self.click_link = None
        self.follow_link = self.find_link = None

    def set_handle_referer(self, handle):
        """Set whether to add Referer header to each request."""
        self._set_handler("_referer", handle)
        self._handle_referer = bool(handle)

    def _add_referer_header(self, request, origin_request=True):
        if self.request is None:
            return request
        scheme = request.get_type()
        original_scheme = self.request.get_type()
        if scheme not in ["http", "https"]:
            return request
        if not origin_request and not self.request.has_header("Referer"):
            return request

        if (self._handle_referer and
            original_scheme in ["http", "https"] and
            not (original_scheme == "https" and scheme != "https")):
            # strip URL fragment (RFC 2616 14.36)
            parts = _rfc3986.urlsplit(self.request.get_full_url())
            parts = parts[:-1]+(None,)
            referer = _rfc3986.urlunsplit(parts)
            request.add_unredirected_header("Referer", referer)
        return request

    def open_novisit(self, url, data=None,
                     timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT):
        """Open a URL without visiting it.

        Browser state (including request, response, history, forms and links)
        is left unchanged by calling this function.

        The interface is the same as for .open().

        This is useful for things like fetching images.

        See also .retrieve().

        """
        return self._mech_open(url, data, visit=False, timeout=timeout)

    def open(self, url, data=None,
             timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT):
        return self._mech_open(url, data, timeout=timeout)

    def _mech_open(self, url, data=None, update_history=True, visit=None,
                   timeout=_sockettimeout._GLOBAL_DEFAULT_TIMEOUT):
        try:
            url.get_full_url
        except AttributeError:
            # string URL -- convert to absolute URL if required
            scheme, authority = _rfc3986.urlsplit(url)[:2]
            if scheme is None:
                # relative URL
                if self._response is None:
                    raise BrowserStateError(
                        "can't fetch relative reference: "
                        "not viewing any document")
                url = _rfc3986.urljoin(self._response.geturl(), url)

        request = self._request(url, data, visit, timeout)
        visit = request.visit
        if visit is None:
            visit = True

        if visit:
            self._visit_request(request, update_history)

        success = True
        try:
            response = UserAgentBase.open(self, request, data)
        except urllib2.HTTPError, error:
            success = False
            if error.fp is None:  # not a response
                raise
            response = error
##         except (IOError, socket.error, OSError), error:
##             # Yes, urllib2 really does raise all these :-((
##             # See test_urllib2.py for examples of socket.gaierror and OSError,
##             # plus note that FTPHandler raises IOError.
##             # XXX I don't seem to have an example of exactly socket.error being
##             # raised, only socket.gaierror...
##             # I don't want to start fixing these here, though, since this is a
##             # subclass of OpenerDirector, and it would break old code.  Even in
##             # Python core, a fix would need some backwards-compat. hack to be
##             # acceptable.
##             raise

        if visit:
            self._set_response(response, False)
            response = copy.copy(self._response)
        elif response is not None:
            response = _response.upgrade_response(response)

        if not success:
            raise response
        return response

    def __str__(self):
        text = []
        text.append("<%s " % self.__class__.__name__)
        if self._response:
            text.append("visiting %s" % self._response.geturl())
        else:
            text.append("(not visiting a URL)")
        if self.form:
            text.append("\n selected form:\n %s\n" % str(self.form))
        text.append(">")
        return "".join(text)

    def response(self):
        """Return a copy of the current response.

        The returned object has the same interface as the object returned by
        .open() (or mechanize.urlopen()).

        """
        return copy.copy(self._response)

    def open_local_file(self, filename):
        path = sanepathname2url(os.path.abspath(filename))
        url = 'file://'+path
        return self.open(url)

    def set_response(self, response):
        """Replace current response with (a copy of) response.

        response may be None.

        This is intended mostly for HTML-preprocessing.

        """
        self._set_response(response, True)

    def _set_response(self, response, close_current):
        # sanity check, necessary but far from sufficient
        if not (response is None or
                (hasattr(response, "info") and hasattr(response, "geturl") and
                 hasattr(response, "read")
                 )
                ):
            raise ValueError("not a response object")

        self.form = None
        if response is not None:
            response = _response.upgrade_response(response)

        if close_current and self._response is not None:
            self._response.close()
        self._response = response
        self._factory.set_response(response)

    def visit_response(self, response, request=None):
        """Visit the response, as if it had been .open()ed.
        Unlike .set_response(), this updates history rather than replacing the
        current response.

        """
        if request is None:
            request = _request.Request(response.geturl())
        self._visit_request(request, True)
        self._set_response(response, False)

    def _visit_request(self, request, update_history):
        if self._response is not None:
            self._response.close()
        if self.request is not None and update_history:
            self._history.add(self.request, self._response)
        self._response = None
        # we want self.request to be assigned even if UserAgentBase.open
        # fails
        self.request = request

    def geturl(self):
        """Get URL of current document."""
        if self._response is None:
            raise BrowserStateError("not viewing any document")
        return self._response.geturl()

    def reload(self):
        """Reload current document, and return response object."""
        if self.request is None:
            raise BrowserStateError("no URL has yet been .open()ed")
        if self._response is not None:
            self._response.close()
        return self._mech_open(self.request, update_history=False)

    def back(self, n=1):
        """Go back n steps in history, and return response object.

        n: go back this number of steps (default 1 step)

        """
        if self._response is not None:
            self._response.close()
        self.request, response = self._history.back(n, self._response)
        self.set_response(response)
        if not response.read_complete:
            return self.reload()
        return copy.copy(response)

    def clear_history(self):
        self._history.clear()

    def set_cookie(self, cookie_string):
        """Request to set a cookie.

        Note that it is NOT necessary to call this method under ordinary
        circumstances: cookie handling is normally entirely automatic.  The
        intended use case is rather to simulate the setting of a cookie by
        client script in a web page (e.g. JavaScript).  In that case, use of
        this method is necessary because mechanize currently does not support
        JavaScript, VBScript, etc.

        The cookie is added in the same way as if it had arrived with the
        current response, as a result of the current request.
        This means that, for example, if it is not appropriate to set the
        cookie based on the current request, no cookie will be set.

        The cookie will be returned automatically with subsequent responses
        made by the Browser instance whenever that's appropriate.

        cookie_string should be a valid value of the Set-Cookie header.

        For example:

        browser.set_cookie(
            "sid=abcdef; expires=Wednesday, 09-Nov-06 23:12:40 GMT")

        Currently, this method does not allow for adding RFC 2965 cookies.
        This limitation will be lifted if anybody requests it.

        """
        if self._response is None:
            raise BrowserStateError("not viewing any document")
        if self.request.get_type() not in ["http", "https"]:
            raise BrowserStateError("can't set cookie for non-HTTP/HTTPS "
                                    "transactions")
        cookiejar = self._ua_handlers["_cookies"].cookiejar
        response = self.response()  # copy
        headers = response.info()
        headers["Set-cookie"] = cookie_string
        cookiejar.extract_cookies(response, self.request)

    def links(self, **kwds):
        """Return iterable over links (mechanize.Link objects)."""
        if not self.viewing_html():
            raise BrowserStateError("not viewing HTML")
        links = self._factory.links()
        if kwds:
            return self._filter_links(links, **kwds)
        else:
            return links

    def forms(self):
        """Return iterable over forms.

        The returned form objects implement the mechanize.HTMLForm interface.

        """
        if not self.viewing_html():
            raise BrowserStateError("not viewing HTML")
        return self._factory.forms()

    def global_form(self):
        """Return the global form object, or None if the factory
        implementation did not supply one.

        The "global" form object contains all controls that are not
        descendants of any FORM element.

        The returned form object implements the mechanize.HTMLForm interface.

        This is a separate method since the global form is not regarded as
        part of the sequence of forms in the document -- mostly for
        backwards-compatibility.
""" if not self.viewing_html(): raise BrowserStateError("not viewing HTML") return self._factory.global_form def viewing_html(self): """Return whether the current response contains HTML data.""" if self._response is None: raise BrowserStateError("not viewing any document") return self._factory.is_html def encoding(self): if self._response is None: raise BrowserStateError("not viewing any document") return self._factory.encoding def title(self): r"""Return title, or None if there is no title element in the document. Treatment of any tag children of attempts to follow Firefox and IE (currently, tags are preserved). """ if not self.viewing_html(): raise BrowserStateError("not viewing HTML") return self._factory.title def select_form(self, name=None, predicate=None, nr=None): """Select an HTML form for input. This is a bit like giving a form the "input focus" in a browser. If a form is selected, the Browser object supports the HTMLForm interface, so you can call methods like .set_value(), .set(), and .click(). Another way to select a form is to assign to the .form attribute. The form assigned should be one of the objects returned by the .forms() method. At least one of the name, predicate and nr arguments must be supplied. If no matching form is found, mechanize.FormNotFoundError is raised. If name is specified, then the form must have the indicated name. If predicate is specified, then the form must match that function. The predicate function is passed the HTMLForm as its single argument, and should return a boolean value indicating whether the form matched. nr, if supplied, is the sequence number of the form (where 0 is the first). Note that control 0 is the first form matching all the other arguments (if supplied); it is not necessarily the first control in the form. 
The "global form" (consisting of all form controls not contained in any FORM element) is considered not to be part of this sequence and to have no name, so will not be matched unless both name and nr are None. """ if not self.viewing_html(): raise BrowserStateError("not viewing HTML") if (name is None) and (predicate is None) and (nr is None): raise ValueError( "at least one argument must be supplied to specify form") global_form = self._factory.global_form if nr is None and name is None and \ predicate is not None and predicate(global_form): self.form = global_form return orig_nr = nr for form in self.forms(): if name is not None and name != form.name: continue if predicate is not None and not predicate(form): continue if nr: nr -= 1 continue self.form = form break # success else: # failure description = [] if name is not None: description.append("name '%s'" % name) if predicate is not None: description.append("predicate %s" % predicate) if orig_nr is not None: description.append("nr %d" % orig_nr) description = ", ".join(description) raise FormNotFoundError("no form matching "+description) def click(self, *args, **kwds): """See mechanize.HTMLForm.click for documentation.""" if not self.viewing_html(): raise BrowserStateError("not viewing HTML") request = self.form.click(*args, **kwds) return self._add_referer_header(request) def submit(self, *args, **kwds): """Submit current form. Arguments are as for mechanize.HTMLForm.click(). Return value is same as for Browser.open(). """ return self.open(self.click(*args, **kwds)) def click_link(self, link=None, **kwds): """Find a link and return a Request object for it. Arguments are as for .find_link(), except that a link may be supplied as the first argument. 
""" if not self.viewing_html(): raise BrowserStateError("not viewing HTML") if not link: link = self.find_link(**kwds) else: if kwds: raise ValueError( "either pass a Link, or keyword arguments, not both") request = self.request_class(link.absolute_url) return self._add_referer_header(request) def follow_link(self, link=None, **kwds): """Find a link and .open() it. Arguments are as for .click_link(). Return value is same as for Browser.open(). """ return self.open(self.click_link(link, **kwds)) def find_link(self, **kwds): """Find a link in current page. Links are returned as mechanize.Link objects. # Return third link that .search()-matches the regexp "python" # (by ".search()-matches", I mean that the regular expression method # .search() is used, rather than .match()). find_link(text_regex=re.compile("python"), nr=2) # Return first http link in the current page that points to somewhere # on python.org whose link text (after tags have been removed) is # exactly "monty python". find_link(text="monty python", url_regex=re.compile("http.*python.org")) # Return first link with exactly three HTML attributes. find_link(predicate=lambda link: len(link.attrs) == 3) Links include anchors (<a>), image maps (<area>), and frames (<frame>, <iframe>). All arguments must be passed by keyword, not position. Zero or more arguments may be supplied. In order to find a link, all arguments supplied must match. If a matching link is not found, mechanize.LinkNotFoundError is raised. text: link text between link tags: e.g. <a href="blah">this bit</a> (as returned by pullparser.get_compressed_text(), ie. 
         without tags but with opening tags "textified" as per the pullparser
         docs) must compare equal to this argument, if supplied
        text_regex: link text between tag (as defined above) must match the
         regular expression object or regular expression string passed as
         this argument, if supplied
        name, name_regex: as for text and text_regex, but matched against the
         name HTML attribute of the link tag
        url, url_regex: as for text and text_regex, but matched against the
         URL of the link tag (note this matches against Link.url, which is a
         relative or absolute URL according to how it was written in the HTML)
        tag: element name of opening tag, e.g. "a"
        predicate: a function taking a Link object as its single argument,
         returning a boolean result, indicating whether the link matched
        nr: matches the nth link that matches all other criteria (default 0)

        """
        try:
            return self._filter_links(self._factory.links(), **kwds).next()
        except StopIteration:
            raise LinkNotFoundError()

    def __getattr__(self, name):
        # pass through _form.HTMLForm methods and attributes
        form = self.__dict__.get("form")
        if form is None:
            raise AttributeError(
                "%s instance has no attribute %s (perhaps you forgot to "
                ".select_form()?)" % (self.__class__, name))
        return getattr(form, name)

    def _filter_links(self, links,
                      text=None, text_regex=None,
                      name=None, name_regex=None,
                      url=None, url_regex=None,
                      tag=None,
                      predicate=None,
                      nr=0
                      ):
        if not self.viewing_html():
            raise BrowserStateError("not viewing HTML")

        orig_nr = nr

        for link in links:
            if url is not None and url != link.url:
                continue
            if url_regex is not None and not re.search(url_regex, link.url):
                continue
            if (text is not None and
                (link.text is None or text != link.text)):
                continue
            if (text_regex is not None and
                (link.text is None or not re.search(text_regex, link.text))):
                continue
            if name is not None and name != dict(link.attrs).get("name"):
                continue
            if name_regex is not None:
                link_name = dict(link.attrs).get("name")
                if link_name is None or not re.search(name_regex, link_name):
                    continue
            if tag is not None and tag != link.tag:
                continue
            if predicate is not None and not predicate(link):
                continue
            if nr:
                nr -= 1
                continue
            yield link
            nr = orig_nr


mechanize-0.2.5/setup.py

#!/usr/bin/env python
"""Stateful programmatic web browsing.

Stateful programmatic web browsing, after Andy Lester's Perl module
WWW::Mechanize.

mechanize.Browser implements the urllib2.OpenerDirector interface.  Browser
objects have state, including navigation history, HTML form state, cookies,
etc.  The set of features and URL schemes handled by Browser objects is
configurable.  The library also provides an API that is mostly compatible
with urllib2: your urllib2 program will likely still work if you replace
"urllib2" with "mechanize" everywhere.

Features include: ftp:, http: and file: URL schemes, browser history,
hyperlink and HTML form support, HTTP cookies, HTTP-EQUIV and Refresh,
Referer [sic] header, robots.txt, redirections, proxies, and Basic and
Digest HTTP authentication.

Much of the code originally derived from Perl code by Gisle Aas
(libwww-perl), Johnny Lee (MSIE Cookie support) and last but not least Andy
Lester (WWW::Mechanize).  urllib2 was written by Jeremy Hylton.
""" import os VERSION = open(os.path.join("mechanize", "_version.py")).\ readlines()[0].strip(' "\n') CLASSIFIERS = """\ Development Status :: 5 - Production/Stable Intended Audience :: Developers Intended Audience :: System Administrators License :: OSI Approved :: BSD License License :: OSI Approved :: Zope Public License Natural Language :: English Operating System :: OS Independent Programming Language :: Python Programming Language :: Python :: 2 Programming Language :: Python :: 2.4 Programming Language :: Python :: 2.5 Programming Language :: Python :: 2.6 Programming Language :: Python :: 2.7 Topic :: Internet Topic :: Internet :: File Transfer Protocol (FTP) Topic :: Internet :: WWW/HTTP Topic :: Internet :: WWW/HTTP :: Browsers Topic :: Internet :: WWW/HTTP :: Indexing/Search Topic :: Internet :: WWW/HTTP :: Site Management Topic :: Internet :: WWW/HTTP :: Site Management :: Link Checking Topic :: Software Development :: Libraries Topic :: Software Development :: Libraries :: Python Modules Topic :: Software Development :: Testing Topic :: Software Development :: Testing :: Traffic Generation Topic :: System :: Archiving :: Mirroring Topic :: System :: Networking :: Monitoring Topic :: System :: Systems Administration Topic :: Text Processing Topic :: Text Processing :: Markup Topic :: Text Processing :: Markup :: HTML Topic :: Text Processing :: Markup :: XML """ def main(): try: import setuptools except ImportError: import ez_setup ez_setup.use_setuptools() import setuptools setuptools.setup( name = "mechanize", version = VERSION, license = "BSD", # or ZPL 2.1 platforms = ["any"], classifiers = [c for c in CLASSIFIERS.split("\n") if c], install_requires = [], zip_safe = True, test_suite = "test", author = "John J. 
Lee", author_email = "jjl@pobox.com", description = __doc__.split("\n", 1)[0], long_description = __doc__.split("\n", 2)[-1], url = "http://wwwsearch.sourceforge.net/mechanize/", download_url = ("http://pypi.python.org/packages/source/m/mechanize/" "mechanize-%s.tar.gz" % VERSION), packages = ["mechanize"], ) if __name__ == "__main__": main() �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
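The Set-Cookie3 loader (`_really_load` in the cookies code earlier in this archive) sorts each parsed attribute into boolean flags, known valued attributes, or an unknown-attribute dict that keeps its original case. A self-contained sketch of that split, assuming a simplified `k=v; flag` grammar rather than the full `split_header_words` parser (`parse_set_cookie3_attrs` is a hypothetical helper, not a mechanize API):

```python
# Attribute names mirror the tuples in _really_load.
BOOLEAN_ATTRS = ("port_spec", "path_spec", "domain_dot",
                 "secure", "discard", "rfc2109")
VALUE_ATTRS = ("version", "port", "path", "domain",
               "expires", "comment", "commenturl")

def parse_set_cookie3_attrs(line):
    """Split 'name=value; attr=val; flag' into (name, value, standard, rest)."""
    parts = [p.strip() for p in line.split(";")]
    name, _, value = parts[0].partition("=")
    standard = {k: False for k in BOOLEAN_ATTRS}
    rest = {}
    for part in parts[1:]:
        k, sep, v = part.partition("=")
        k = k.strip()
        v = v.strip().strip('"') if sep else None
        lc = k.lower()
        # Known attributes are case-normalised; a flag with no value is True.
        if lc in BOOLEAN_ATTRS:
            standard[lc] = True if v is None else v
        elif lc in VALUE_ATTRS:
            standard[lc] = v
        else:
            # Don't lose case distinction for unknown fields.
            rest[k] = v
    return name, value, standard, rest
```

This mirrors the loading loop's behaviour of defaulting every boolean attribute to False and treating a bare attribute name (no `=`) as True.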
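`Browser._add_referer_header` in `_mechanize.py` strips the URL fragment before using the current URL as a Referer, per RFC 2616 section 14.36. The same effect with the Python 3 standard library (`referer_for` is a hypothetical helper; mechanize itself uses `_rfc3986.urlsplit`/`urlunsplit`):

```python
from urllib.parse import urlsplit, urlunsplit

def referer_for(url):
    """Return url with its fragment removed, suitable as a Referer value."""
    parts = urlsplit(url)
    # SplitResult is a namedtuple, so _replace gives us a copy sans fragment.
    return urlunsplit(parts._replace(fragment=""))
```

Dropping only the fragment keeps the path and query intact, which is what the `parts[:-1]+(None,)` tuple surgery in `_add_referer_header` does.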
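The `History` class used by `Browser.back()` is a plain LIFO stack that keeps popping (request, response) pairs until it has gone back n steps *and* found a non-None response. A minimal standalone version with plain strings standing in for request/response objects (`MiniHistory` is illustrative only, not the mechanize class):

```python
class MiniHistory:
    def __init__(self):
        self._history = []  # LIFO stack of (request, response) pairs

    def add(self, request, response):
        self._history.append((request, response))

    def back(self, n, current_response):
        response = current_response
        request = None
        # Keep popping until we've gone back n steps AND have a response,
        # mirroring History.back's "n > 0 or response is None" loop.
        while n > 0 or response is None:
            try:
                request, response = self._history.pop()
            except IndexError:
                raise RuntimeError("already at start of history")
            n -= 1
        return request, response
```

Note the loop condition: even after n reaches zero it keeps popping while the response is None, so `back()` always lands on a page that actually has a response.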