mechanize-0.4.5/.gitignore

/build/
/dist/
/docs/_*
*.egg-info/
*.py[co]
.venv*

mechanize-0.4.5/COPYRIGHT

Files: *
Copyright: Copyright (C) 2008-2017 Kovid Goyal, John J Lee, Gisle Aas, Johnny Lee, Andy Lester
License: BSD-3-clause-like

mechanize-0.4.5/ChangeLog

This isn't really in proper GNU ChangeLog format, it just happens to look that way. 2019-12-22 Kovid Goyal * 0.4.5 release * Add a set_html() method to the browser object 2019-11-07 Kovid Goyal * 0.4.4 release * URLs passed into mechanize now automatically have URL unsafe characters percent encoded. This is necessary because newer versions of python disallow processing of URLs with unsafe characters. Note that this means values returned by get_full_url(), get_selector() etc will be percent encoded. 
2019-08-18 Kovid Goyal * 0.4.3 release * When filling forms with unicode strings automatically encode them into the correct encoding for the HTML page being viewed * Guess content type when uploading files if not specified * py3 compat - Have the version of simple cookies be 0 rather than None 2019-04-12 Kovid Goyal * 0.4.2 release * A couple of python 3 specific fixes for proxy authorization and adding controls to forms 2019-03-16 Kovid Goyal * 0.4.1 release * A couple of python 3 specific fixes for servers with missing robots.txt files and also errors when using basic/digest auth 2019-01-16 Kovid Goyal * 0.4.0 release * Python 3 compatibility * Add a finalize_request_headers callback to Browser to allow users full control of what headers are sent with every request * Preserve header ordering when making HTTP requests 2018-09-11 Kovid Goyal * 0.3.7 release * Fix processing of http-equiv meta tags incorrectly lower casing the content * Fix error when a textbox contained within a form contains unicode characters 2017-10-13 Kovid Goyal * 0.3.6 release. * Use html5-parser for parsing HTML, when available instead of html5lib for a big performance boost. * Fix error when trying to submit forms with non-ascii values on systems where the default encoding is ascii. * Fix errors on python environments with broken threading 2017-06-24 Kovid Goyal * 0.3.5 release. * Fix error when trying to open pages that contain HTML entities that decode to unicode characters in their sections 2017-05-05 Kovid Goyal * 0.3.3 release. * Add get() and __getitem__ methods to the response object for convenient access to response headers 2017-04-29 Kovid Goyal * 0.3.2 release. * Allow overriding of Host headers via addheaders * Fix using unicode strings in addheaders and trying to send data with a request failing 2017-03-17 Kovid Goyal * 0.3.1 release. * Allow easily selecting forms based on HTML attributes of the
tag * Allow specifying the HTTP method when creating requests * Convenience API to set headers * Convenience API for dealing with cookies * Create full documentation at: https://mechanize.readthedocs.io/en/latest/ 2017-03-15 Kovid Goyal * 0.3.0 release. * Support HTML 5 (all html is now parsed using html5lib) * Implement cloning of browser instances via Browser.__copy__() * Make gzip content-encoding non-experimental. Always handle it if sent by the server, regardless of set_handle_gzip(). * mechanize now requires python >= 2.7.0 * When processing cookies that have a blank (unset) path, assume the path is /. Mimics modern browser behavior. * Support PyPy (added to continuous integration testing) * Make the global urlopen/urlretrieve methods threadsafe * Add support for user supplied CA certificates * Fix gzip not being requested on HTTPS connections * Normalize the case of HTTP headers in requests * Fix proxy authentication for https connections not working * Size of codebase reduced by 10,000 lines of code (40%) * Backward incompatibility: The factory keyword argument to Browser is no longer allowed * Backward incompatibility: Browser.forms() and Browser.links() return unicode strings instead of byte strings * Backward incompatibility: When searching for a form control if more than one control matches, an AmbiguityError is always raised * Backward incompatibility: There is no longer a mechanize.ParseError and mechanize.ParseResponse class. Parsing now uses the HTML 5 algorithm, which is designed to not fail on malformed markup. * Backward incompatibility: For links that do not have any text the text attribute is now always an empty string instead of None or an empty string. * Backward incompatibility: Remove support for the HTML tag which was deprecated in HTML 4 and removed in HTML 5 2011-03-31 John J Lee * 0.2.5 release. 
* This is essentially a no-changes release to fix easy_install breakage caused by a SourceForge issue * Sourceforge is returning invalid HTTP responses, make download links point to PyPI instead * Include cookietest.cgi in source distribution * Note new IETF cookie standardisation effort 2010-10-28 John J Lee * 0.2.4 release. * Fix IndexError on empty Content-type header value. (GH-18) * Fall back to another encoding if an unknown one is declared. Fixes traceback on unknown encoding in Content-type header. (GH-30) 2010-10-16 John J Lee * 0.2.3 release. * Fix str(ParseError()) traceback. (GH-25) * Add equality methods to mechanize.Cookie . (GH-29) 2010-07-17 John J Lee * 0.2.2 release. * Officially support Python 2.7 (no changes were required) * Fix TypeError on .open()ing ftp: URL (only affects Python 2.4 and 2.5) * Don't include HTTPSHandler in __all__ if it's not available 2010-05-16 John J Lee * 0.2.1 release. * API change: Change argument order of HTTPRedirectHandler.redirect_request() to match urllib2. * Fix failure to use bundled BeautifulSoup for forms. (GH-15) * Fix default cookie path where request path has query containing / character. (http://bugs.python.org/issue3704) * Fix failure to raise on click for nonexistent label. (GH-16) * Documentation fixes. 2010-04-22 John J Lee * 0.2.0 release. * Behaviour change: merged upstream urllib2 change (allegedly a "bug fix") to return a response for all 2** HTTP responses (e.g. 206 Partial Content). Previously, only 200 caused a response object to be returned. All other HTTP response codes resulted in a response object being raised as an exception. * Behaviour change: Use of mechanize classes with `urllib2` (and vice-versa) is no longer supported. However, existing classes implementing the urllib2 Handler interface are likely to work unchanged with mechanize. Removed RequestUpgradeProcessor, ResponseUpgradeProcessor, SeekableProcessor. * ClientForm has been merged into mechanize. 
This means that mechanize has no dependencies other than Python itself. The ClientForm API is still available -- to switch from ClientForm to mechanize, just s/ClientForm/mechanize in your source code, and ensure any use of the module logging logger named "ClientForm" is updated to use the new logger name "mechanize.forms". I probably won't do further standalone releases of ClientForm. * Stop monkey-patching Python stdlib. * Merge fixes from urllib2 trunk * Close file objects on .read() failure in .retrieve() * Fix a python 2.4 bug due to buggy urllib.splithost * Fix Python 2.4 syntax error in _firefox3cookiejar * Fix __init__.py typo that hid mechanize.seek_wrapped_response and mechanize.str2time. Fixes http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=465206 * Fix an obvious bug with experimental firefox 3 cookiejar support. It's still experimental and not properly tested. * Change documentation to not require a .port attribute on request objects, since that's unused. * Doc fixes * Added mechanize.urljoin (RFC 3986 compliant function for joining a base URI with a URI reference) * Merge of ClientForm (see above). * Moved to git (from SVN) http://github.com/jjlee/mechanize * Created an issue tracker http://github.com/jjlee/mechanize/issues * Docs are now in markdown format (thanks John Gabriele). * Website rearranged. The old website has been archived at http://wwwsearch.sourceforge.net/old/ . The new website is essentially just the mechanize pages, rearranged and cleaned up a bit. * Source code rearranged for easier merging with upstream urllib2 * Fully automated release process. * New test runner. Single test suite; tests create their own HTTP server fixtures (server fixtures are cached where possible for speed). 2009-02-07 John J Lee * 0.1.11 release. * Fix quadratic performance in number of .read() calls (and add an automated performance test). 2008-12-03 John J Lee * 0.1.10 release. 
* Add support for Python 2.6: Raise URLError on file: URL errors, not IOError (port of upstream urllib2 fix). Add support for Python 2.6's per-connection timeouts: Add timeout arguments to urlopen(), Request constructor, .open(), and .open_novisit(). * Drop support for Python 2.3 * Add Content-length header to Request object (httplib bug that prevented doing that was fixed in Python 2.4). There's no change in what is actually sent over the wire here, just in what headers get added to the Request object. * Fix AttributeError on .retrieve() with a Request (as opposed to URL string) argument * Don't change CookieJar state in .make_cookies(). * Fix AttributeError in case where .make_cookies() or .cookies_for_request() is called before other methods like .extract_cookies() or .make_cookie_header() * Fixes affecting version cookie-attribute (http://bugs.python.org/issue3924). * Silence module logging's "no handlers could be found for logger mechanize" warning in a way that doesn't clobber attempts to set log level sometimes * Don't use private attribute of request in request upgrade handler (what was I thinking??) * Don't call setup() on import of setup.py * Add new public function effective_request_host * Add .get_policy() method to CookieJar * Add method CookieJar.cookies_for_request() * Fix documented interface required of requests and responses (and add some tests for this!) * Allow either .is_unverifiable() or .unverifiable on request objects (preferring the former) * Note that there's a new functional test for digest auth, which fails when run against the sourceforge site (which is the default). It looks like this reflects the fact that digest auth has been fairly broken since it was introduced in urllib2. I don't plan to fix this myself. 2008-09-24 John J Lee * 0.1.9 release. * Fix ImportError if sqlite3 not available * Fix a couple of functional tests not to wait 5 seconds each 2008-09-13 John J Lee * 0.1.8 release. * Close sockets. 
This only affects Python 2.5 (and later) - earlier versions of Python were unaffected. See http://bugs.python.org/issue1627441 * Make title parsing follow Firefox behaviour wrt child elements (previously the behaviour differed between Factory and RobustFactory). * Fix BeautifulSoup RobustLinksFactory (hence RobustFactory) link text parsing for case of link text containing tags (Titus Brown) * Fix issue where more tags after caused default parser to raise an exception * Handle missing cookie max-age value. Previously, a warning was emitted in this case. * Fix thoroughly broken digest auth (still need functional test!) (trebor74hr@...) * Handle cookies containing embedded tabs in mozilla format files * Remove an assertion about mozilla format cookies file contents (raise LoadError instead) * Fix MechanizeRobotFileParser.set_opener() * Fix selection of global form using .select_form() (Titus Brown) * Log skipped Refreshes * Stop tests from clobbering files that happen to be lying around in cwd (!) * Use SO_REUSEADDR for local test server. * Raise exception if local test server fails to start. * Tests no longer (accidentally) depend on third-party coverage module * The usual docs and test fixes. * Add convenience method Browser.open_local_file(filename) * Add experimental support for Firefox 3 cookie jars ("cookies.sqlite"). Requires Python 2.5 * Fix a _gzip.py NameError (gzip support is experimental) 2007-05-31 John J Lee <jjl@pobox.com> * 0.1.7b release. * Sub-requests should not usually be visiting, so make it so. In fact the visible behaviour wasn't really broken here, since .back() skips over None responses (which is odd in itself, but won't be changed until after stable release is branched). However, this patch does change visible behaviour in that it creates new Request objects for sub-requests (e.g. basic auth retries) where previously we just mutated the existing Request object. 
* Changes to sort out abuse of by SeekableProcessor and ResponseUpgradeProcessor (latter shouldn't have been public in the first place) and resulting confusing / unclear / broken behaviour. Deprecate SeekableProcessor and ResponseUpgradeProcessor. Add SeekableResponseOpener. Remove SeekableProcessor and ResponseUpgradeProcessor from Browser. Move UserAgentBase.add_referer_header() to Browser (it was on by default, breaking UserAgent, and should never really have been there). * Fix HTTP proxy support: r29110 meant that Request.get_selector() didn't take into account the change to .__r_host (Thanks tgates@...). * Redirected robots.txt fetch no longer results in another attempted robots.txt fetch to check the redirection is allowed! * Fix exception raised by RFC 3986 implementation with urljoin(base, '/..') * Fix two multiple-response-wrapping bugs. * Add missing import in tests (caused failure on Windows). * Set svn:eol-style to native for all text files in SVN. * Add some tests for upgrade_response(). * Add a functional test for 302 + 404 case. * Add an -l option to run the functional tests against a local twisted.web-based server (you need Twisted installed for this to work). This is much faster than running against wwwsearch.sourceforge.net * Add -u switch to skip unittests (and only run the doctests). 2007-01-07 John J Lee <jjl@pobox.com> * 0.1.6b release * Add mechanize.ParseError class, document it as part of the mechanize.Factory interface, and raise it from all Factory implementations. This is backwards-compatible, since the new exception derives from the old exceptions. * Bug fix: Truncation when there is no full .read() before navigating to the next page, and an old response is read after navigation. This happened e.g. with r = br.open(); r.readline(); br.open(url); r.read(); br.back() . * Bug fix: when .back() caused a reload, it was returning the old response, not the .reload()ed one. 
* Bug fix: .back() was not returning a copy of the response, which presumably would cause seek position problems. * Bug fix: base tag without href attribute would override document URL with a None value, causing a crash (thanks Nathan Eror). * Fix .set_response() to close current response first. * Fix non-idempotent behaviour of Factory.forms() / .links() . Previously, if for example you got a ParseError during execution of .forms(), you could call it again and have it not raise an exception, because it started out where it left off! * Add a missing copy.copy() to RobustFactory . * Fix redirection to 'URIs' that contain characters that are not allowed in URIs (thanks Riko Wichmann). Also, Request constructor now logs a module logging warning about any such bad URIs. * Add .global_form() method to Browser to support form controls whose HTML elements are not descendants of any FORM element. * Add a new method .visit_response() . This creates a new history entry from a response object, rather than just changing the current visited response. This is useful e.g. when you want to use Browser features in a handler. * Misc minor bug fixes. 2006-10-25 John J Lee <jjl@pobox.com> * 0.1.5b release: Update setuptools dependencies to depend on ClientForm>=0.2.5 (for an important bug fix affecting fragments in URLs). There are no other changes in this release -- this release was done purely so that people upgrading to the latest version of mechanize will get the latest ClientForm too. 2006-10-14 John J Lee <jjl@pobox.com> * 0.1.4b release: (skipped a version deliberately for obscure reasons) * Improved auth & proxies support. * Follow RFC 3986. * Add a .set_cookie() method to Browser . * Add Browser.open_novisit() and Request.visit to allow fetching files without affecting Browser state. * UserAgent and Browser are now subclasses of UserAgentBase. 
UserAgent's only role in life above what UserAgentBase does is to provide the .set_seekable_responses() method (it lives there because Browser depends on seekable responses, because that's how browser history is implemented). * Bundle BeautifulSoup 2.1.1. No more dependency pain! Note that BeautifulSoup is, and always was, optional, and that mechanize will eventually switch to BeautifulSoup version 3, at which point it may well stop bundling BeautifulSoup. Note also that the module is only used internally, and is not available as a public attribute of the package. If you dare, you can import it ("from mechanize import _beautifulsoup"), but beware that it will go away later, and that the API of BeautifulSoup will change when the upgrade to 3 happens. Also, BeautifulSoup support (mainly RobustFactory) is still a little experimental and buggy. * Fix HTTP-EQUIV with no content attribute case (thanks Pratik Dam). * Fix bug with quoted META Refresh URL (thanks Nilton Volpato). * Fix crash with </base> tag (yajdbgr02@...). * Somebody found a server that (incorrectly) depends on HTTP header case, so follow the Title-Case convention. Note that the Request headers interface(s), which were (somewhat oddly -- this is an inheritance from urllib2 that should really be fixed in a better way than it is currently) always case-sensitive still are; the only thing that changed is what actually eventually gets sent over the wire. * Use mechanize (not urllib) to open robots.txt. Don't consult RobotFileParser instance about non-HTTP URLs. * Fix OpenerDirector.retrieve(), which was very broken (thanks Duncan Booth). * Crash in a much more obvious way if trying to use OpenerDirector after .close() . 
* .reload() on .back() if necessary (necessary iff response was not fully .read() on first .open()ing ) * Strip fragments before retrieving URLs (fixed Request.get_selector() to strip fragment) * Fix catching HTTPError subclasses while still preserving all their response behaviour * Correct over-enthusiastic documented guarantees of closeable_response . * Fix assumption that httplib.HTTPMessage treats dict-style __setitem__ as append rather than set (where on earth did I get that from?). * Expose History in mechanize/__init__.py (though interface is still experimental). * Lots of other "internals" bugs fixed (thanks to reports / patches from Benji York especially, also Titus Brown, Duncan Booth, and me ;-), where I'm not 100% sure exactly when they were introduced, so not listing them here in detail. * Numerous other minor fixes. * Some code cleanup. 2006-05-21 John J Lee <jjl@pobox.com> * 0.1.2b release: * mechanize now exports the whole urllib2 interface. * Pull in bugfixed auth/proxy support code from Python 2.5. * Bugfix: strip leading and trailing whitespace from link URLs * Fix .any_response() / .any_request() methods to have ordering consistent with rest of handlers rather than coming before all of them. * Tell cookie-handling code about new TLDs. * Remove Browser.set_seekable_responses() (they always are anyway). * Show in web page examples how to munge responses and how to do proxy/auth. * Rename 0.1.* changes document 0.1.0-changes.txt --> 0.1-changes.txt. * In 0.1 changes document, note change of logger name from "ClientCookie" to "mechanize" * Add something about response objects to changes document * Improve Browser.__str__ * Accept regexp strings as well as regexp objects when finding links. * Add crappy gzip transfer encoding support. This is off by default and warns if you turn it on (hopefully will get better later :-). * A bit of internal cleanup following merge with pullparser / ClientCookie. 
2006-05-06 John J Lee <jjl@pobox.com> * 0.1.1a release: * Merge ClientCookie and pullparser with mechanize. * Response object fixes. * Remove accidental dependency on BeautifulSoup introduced in 0.1.0a (the BeautifulSoup support is still here, but BeautifulSoup is not required to use mechanize). 2006-05-03 John J Lee <jjl@pobox.com> * 0.1.0a release: * Stop trying to record precise dates in changelog, since that's silly ;-) * A fair number of interface changes: see 0.1.0-changes.txt. * Depend on recent ClientCookie with copy.copy()able response objects. * Don't do broken XHTML handling by default (need to review code before switching this back on, e.g. should use a real XML parser for first-try at parsing). To get the old behaviour, pass i_want_broken_xhtml_support=True to mechanize.DefaultFactory / .RobustFactory constructor. * Numerous small bug fixes. * Documentation & setup.py fixes. * Don't use cookielib, to avoid having to work around Python 2.4 RFC 2109 bug, and to avoid my braindead thread synchronisation code in cookielib :-((((( (I haven't encountered specific breakage due to latter, but since it's braindead I may as well avoid it). 2005-11-30 John J Lee <jjl@pobox.com> * Fixed setuptools support. * Release 0.0.11a. 2005-11-19 John J Lee <jjl@pobox.com> * Release 0.0.10a. 2005-11-17 John J Lee <jjl@pobox.com> * Fix set_handle_referer. 2005-11-12 John J Lee <jjl@pobox.com> * Fix history (Gary Poster). * Close responses on reload (Gary Poster). * Don't depend on SSL support (Gary Poster). 2005-10-31 John J Lee <jjl@pobox.com> * Add setuptools support. 2005-10-30 John J Lee <jjl@pobox.com> * Don't mask AttributeError exception messages from ClientForm. * Document intent of .links() vs. .get_links_iter(); Rename LinksFactory method. * Remove pullparser import dependency. * Remove Browser.urltags (now an argument to LinksFactory). * Document Browser constructor as taking keyword args only (and change positional arg spec). 
* Cleanup of lazy parsing (may fix bugs, not sure...). 2005-10-28 John J Lee <jjl@pobox.com> * Support ClientForm backwards_compat switch. 2005-08-28 John J Lee <jjl@pobox.com> * Apply optimisation patch (Stephan Richter). 2005-08-15 John J Lee <jjl@pobox.com> * Close responses (ie. close the file handles but leave response still .read()able &c., thanks to the response objects we're using) (aurel@nexedi.com). 2005-08-14 John J Lee <jjl@pobox.com> * Add missing argument to UserAgent's _add_referer_header stub. * Doc and comment improvements. 2005-06-28 John J Lee <jjl@pobox.com> * Allow specifying parser class for equiv handling. * Ensure correct default constructor args are passed to HTTPRefererProcessor. * Allow configuring details of Refresh handling. * Switch to tolerant parser. 2005-06-11 John J Lee <jjl@pobox.com> * Do .seek(0) after link parsing in a finally block. * Regard text/xhtml as HTML. * Fix 2.4-compatibility bugs. * Fix spelling of '_equiv' feature string. 2005-05-30 John J Lee <jjl@pobox.com> * Turn on Referer, Refresh and HTTP-Equiv handling by default. 2005-05-08 John J Lee <jjl@pobox.com> * Fix .reload() to not update history (thanks to Titus Brown). * Use cookielib where available 2005-03-01 John J Lee <jjl@pobox.com> * Fix referer bugs: Don't send URL fragments; Don't add in Referer header in redirected request unless original request had a Referer header. 2005-02-19 John J Lee <jjl@pobox.com> * Allow supplying own mechanize.FormsFactory, so eg. can use ClientForm.XHTMLFormParser. Also allow supplying own Request class, and use sensible defaults for this. Now depends on ClientForm 0.1.17. Side effect is that, since we use the correct Request class by default, there's (I hope) no need for using RequestUpgradeProcessor in Browser._add_referer_header() :-) 2005-01-30 John J Lee <jjl@pobox.com> * Released 0.0.9a. 2005-01-05 John J Lee <jjl@pobox.com> * Fix examples (scraped sites have changed). * Fix .set_*() method boolean arguments. 
* The .response attribute is now a method, .response() * Don't depend on BaseProcessor (no longer exists). 2004-05-18 John J Lee <jjl@pobox.com> * Released 0.0.8a: * Added robots.txt observance, controlled by * BASE element has attribute 'href', not 'uri'! (patch from Jochen Knuth) * Fixed several bugs in handling of Referer header. * Link.__eq__ now returns False instead of raising AttributeError on comparison with non-Link (patch from Jim Jewett) * Removed dependencies on HTTPS support in Python and on ClientCookie.HTTPRobotRulesProcessor 2004-01-18 John J Lee <jjl@pobox.com> * Added robots.txt observance, controlled by UserAgent.set_handle_robots(). This is now on by default. * Removed set_persistent_headers() method -- just use .addheaders, as in base class. 2004-01-09 John J Lee <jjl@pobox.com> * Removed unnecessary dependence on SSL support in Python. Thanks to Krzysztof Kowalczyk for bug report. * Released 0.0.7a. 2004-01-06 John J Lee <jjl@pobox.com> * Link instances may now be passed to .click_link() and .follow_link(). * Added a new example program, pypi.py. 2004-01-05 John J Lee <jjl@pobox.com> * Released 0.0.5a. * If <title> tag was missing, links and forms would not be parsed. Also, base element (giving base URI) was ignored. Now parse title lazily, and get base URI while parsing links. Also, fixed ClientForm to take note of base element. Thanks to Phillip J. Eby for bug report. * Released 0.0.6a. 2004-01-04 John J Lee <jjl@pobox.com> * Fixed _useragent._replace_handler() to update self.handlers correctly. * Updated required pullparser version check. * Visiting a URL now deselects form (sets self.form to None). * Only first Content-Type header is now checked by ._viewing_html(), if there are more than one. * Stopped using getheaders from ClientCookie -- don't need it, since depend on Python 2.2, which has .getheaders() method on responses. Improved comments. * .open() now resets .response to None. 
Also rearranged .open() a bit so instance remains in consistent state on failure. * .geturl() now checks for non-None .response, and raises Browser. * .back() now checks for non-None .response, and doesn't attempt to parse if it's None. * .reload() no longer adds new history item. * Documented tag argument to .find_link(). * Fixed a few places where non-keyword arguments for .find_link() were silently ignored. Now raises ValueError. 2004-01-02 John J Lee <jjl@pobox.com> * Use response_seek_wrapper instead of seek_wrapper, which broke use of responses after they're closed. * (Fixed response_seek_wrapper in ClientCookie.) * Fixed adding of Referer header. Thanks to Per Cederqvist for bug report. * Released 0.0.4a. * Updated required ClientCookie version check. 2003-12-30 John J Lee <jjl@pobox.com> * Added support for character encodings (for matching link text). * Released 0.0.3a. 2003-12-28 John J Lee <jjl@pobox.com> * Attribute lookups are no longer forwarded to .response -- you have to do it explicitly. * Added .geturl() method, which just delegates to .response. * Big rehash of UserAgent, which was broken. Added a test. * Discovered that zip() doesn't raise an exception when its arguments are of different length, so several tests could pass when they should have failed. Fixed. * Fixed <A/> case in ._parse_html(). * Released 0.0.2a. 2003-12-27 John J Lee <jjl@pobox.com> * Added and improved docstrings. * Browser.form is now a public attribute. Also documented Browser's public attributes. * Added base_url and absolute_url attributes to Link. * Tidied up .open(). Relative URL Request objects are no longer converted to absolute URLs -- they should probably be absolute in the first place anyway. * Added proper Referer handling (the handler in ClientCookie is a hack that only covers a restricted case). * Added click_link method, for symmetry with .click() / .submit() methods (which latter apply to forms). 
Of these methods, .click/.click_link() returns a request, and .submit/.follow_link() actually .open()s the request. * Updated broken example code. 2003-12-24 John J Lee <jjl@pobox.com> * Modified setup.py so can easily register with PyPI. 2003-12-22 John J Lee <jjl@pobox.com> * Released 0.0.1a.

mechanize-0.4.5/LICENSE

Copyright 2017 Kovid Goyal, John J Lee, Gisle Aas, Johnny Lee, Andy Lester

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

mechanize-0.4.5/MANIFEST.in

include LICENSE
include COPYRIGHT
include MANIFEST.in
include README.rst
include ChangeLog
include *.py
recursive-include examples *.py
recursive-include examples/forms *.dat *.txt *.html *.cgi *.py
recursive-include test/test_form_data *.html
recursive-include test *.py *.doctest *.special_doctest
recursive-include test-tools *.py *.cgi

mechanize-0.4.5/README.rst

mechanize - Automate interaction with HTTP web servers
##########################################################

|pypi| |build|

.. contents::


Major features
-----------------

Stateful programmatic web browsing in Python

- The browser class `mechanize.Browser` implements the interface of `urllib2.OpenerDirector`, so any URL can be opened not just `http`.

- Easy HTML form filling.

- Convenient link parsing and following.

- Browser history (`.back()` and `.reload()` methods).

- The `Referer` HTTP header is added properly (optional).

- Automatic observance of `robots.txt <http://www.robotstxt.org/wc/norobots.html>`_.

- Automatic handling of HTTP-Equiv and Refresh.


Installation
-----------------

To install for normal usage:

.. code-block:: bash

    sudo pip2 install mechanize

To install for development:

.. code-block:: bash

    git clone https://github.com/python-mechanize/mechanize.git
    cd mechanize
    sudo pip2 install -e .

To install manually, simply add the `mechanize` sub-directory somewhere on your `PYTHONPATH`.


Documentation
---------------

See https://mechanize.readthedocs.io/en/latest/


Credits
-----------------

python-mechanize was the creation of John J. Lee. Maintenance was taken over by Kovid Goyal in 2017.
Much of the code was originally derived from the work of the following people:

- Gisle Aas -- [libwww-perl]

- Jeremy Hylton (and many others) -- [urllib2]

- Andy Lester -- [WWW::Mechanize]

- Johnny Lee (coincidentally-named) -- MSIE CookieJar Perl code from which
  mechanize's support for that is derived.

Also:

- Gary Poster and Benji York at Zope Corporation -- contributed significant
  changes to the HTML forms code

- Ronald Tschalar -- provided help with Netscape cookies

Thanks also to the many people who have contributed bug reports and patches.

.. |pypi| image:: https://img.shields.io/pypi/v/mechanize.svg?label=version
    :target: https://pypi.python.org/pypi/mechanize
    :alt: Latest version released on PyPi

.. |build| image:: https://dev.azure.com/divok/mechanize/_apis/build/status/python-mechanize.mechanize?branchName=master
    :target: https://dev.azure.com/divok/mechanize/_build/latest?definitionId=3&branchName=master
    :alt: Build status of the master branch

mechanize-0.4.5/azure-pipelines.yml

# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
- master

jobs:

- job: 'Linux'
  pool:
    vmImage: 'Ubuntu-latest'
  strategy:
    matrix:
      Python27:
        python.version: '2.7'
      Python35:
        python.version: '3.5'
      Python37:
        python.version: '3.7'
    maxParallel: 3

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '$(python.version)'
      architecture: 'x64'

  - script: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt
      pip install twisted six
      sudo apt-get --yes update
      sudo apt-get --yes install libxml2-dev libxslt-dev
      pip install --no-binary lxml html5-parser
    displayName: 'Install dependencies'

  - script: |
      python setup.py test
    displayName: 'test'

- job: 'macOS'
  pool:
    vmImage: 'macOS-latest'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.7'
      architecture: 'x64'

  - script: |
      python -m pip install --upgrade pip
      pip install twisted six
      pip install -r requirements.txt
    displayName: 'Install dependencies'

  - script: |
      python setup.py test
    displayName: 'test'

- job: 'Windows'
  pool:
    vmImage: 'windows-latest'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.7'
      architecture: 'x64'

  - script: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt
      pip install twisted six
    displayName: 'Install dependencies'

  - script: |
      python setup.py test
    displayName: 'test'

mechanize-0.4.5/docs/Makefile

# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
SPHINXPROJ    = mechanize
SOURCEDIR     = .
BUILDDIR      = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

mechanize-0.4.5/docs/advanced.rst

Advanced topics
==================

.. _threading:

Thread safety
---------------

The global :func:`mechanize.urlopen()` and :func:`mechanize.urlretrieve()` functions are thread safe. However, mechanize browser instances **are not** thread safe. If you want to use a mechanize Browser instance in multiple threads, clone it using `copy.copy(browser_object)`. The clone will share the same, thread safe cookie jar, and have the same settings/handlers as the original, but all other state is not shared, making the clone safe to use in a different thread.

Using custom CA certificates
-------------------------------

mechanize supports the same mechanism for using custom CA certificates as python >= 2.7.9. To change the certificates a mechanize browser instance uses, call the :meth:`mechanize.Browser.set_ca_data()` method on it.

.. _debugging:

Debugging
--------------

Hints for debugging programs that use mechanize.

.. _cookies:

Cookies
^^^^^^^^^^

A common mistake is to use :func:`mechanize.urlopen()` *and* to call the `.extract_cookies()` and `.add_cookie_header()` methods on the cookie jar yourself. If you use `mechanize.urlopen()` (or `OpenerDirector.open()`), the module handles extraction and adding of cookies by itself, so you should not call `.extract_cookies()` or `.add_cookie_header()`.

Are you sure the server is sending you any cookies in the first place? Maybe the server is keeping track of state in some other way (`HIDDEN` HTML form entries (possibly in a separate page referenced by a frame), URL-encoded session keys, IP address, HTTP `Referer` headers)? Perhaps some embedded script in the HTML is setting cookies (see below)? Turn on :ref:`logging`.

When you `.save()` to or `.load()`/`.revert()` from a file, single-session cookies will expire unless you explicitly request otherwise with the `ignore_discard` argument. This may be your problem if you find cookies are going away after saving and loading.

.. code-block:: python

    import mechanize
    cj = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    mechanize.install_opener(opener)
    r = mechanize.urlopen("http://foobar.com/")
    cj.save("/some/file", ignore_discard=True, ignore_expires=True)

JavaScript code can set cookies; mechanize does not support this. See :ref:`jsfaq`.

General
^^^^^^^^^

Enable :ref:`logging`.

Sometimes, a server wants particular HTTP headers set to the values it expects. For example, the `User-Agent` header may need to be set (:meth:`mechanize.Browser.set_header()`) to a value like that of a popular browser.

Check that the browser is able to do manually what you're trying to achieve programmatically. Make sure that what you do manually is *exactly* the same as what you're trying to do from Python -- you may simply be hitting a server bug that only gets revealed if you view pages in a particular order, for example.
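
As a minimal sketch of the `User-Agent` advice above, the header can be set once on the browser so that it accompanies every request. The user-agent string below is only a placeholder chosen for illustration, not a value any particular server requires; copy the real one from your own browser's developer tools.

.. code-block:: python

    import mechanize

    # Placeholder value (an assumption for illustration); substitute the
    # string that your own browser sends.
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"

    br = mechanize.Browser()
    # set_header() stores the header in br.addheaders, so it is sent with
    # all subsequent requests made through this browser.
    br.set_header("User-Agent", USER_AGENT)

    # Inspect what will be sent, without touching the network.
    sent = {name.lower(): value for name, value in br.addheaders}
    print(sent["user-agent"])

Checking `br.addheaders` before making any request is a cheap way to confirm that the header you intended is the one that will actually go out.
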

Try comparing the headers and data that your program sends with those that a browser sends. Often this will give you the clue you need. You can use the developer tools in any browser to see exactly what the browser sends and receives.

If nothing is obviously wrong with the requests your program is sending and you're out of ideas, you can reliably locate the problem by copying the headers that a browser sends, and then changing headers until your program stops working again. Temporarily switch to explicitly sending individual HTTP headers (by calling `.add_header()`, or by using `httplib` directly). Start by sending exactly the headers that Firefox or Chrome send. You may need to make sure that a valid session ID is sent -- the one you got from your browser may no longer be valid. If that works, you can begin the tedious process of changing your headers and data until they match what your original code was sending. You should end up with a minimal set of changes. If you think that reveals a bug in mechanize, please report it.

.. _logging:

Logging
^^^^^^^^^

To enable logging to stdout:

.. code-block:: python

    import sys, logging
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.DEBUG)

You can reduce the amount of information shown by setting the level to `logging.INFO` instead of `logging.DEBUG`, or by only enabling logging for one of the following logger names instead of `"mechanize"`:

* `"mechanize"`: Everything.

* `"mechanize.cookies"`: Why particular cookies are accepted or rejected and why
  they are or are not returned. Requires logging enabled at the `DEBUG` level.

* `"mechanize.http_responses"`: HTTP response body data.

* `"mechanize.http_redirects"`: HTTP redirect information.

HTTP headers
^^^^^^^^^^^^^

An example showing how to enable printing of HTTP headers to stdout, logging of HTTP response bodies, and logging of information about redirections:

.. code-block:: python

    import sys, logging
    import mechanize

    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.DEBUG)

    browser = mechanize.Browser()
    browser.set_debug_http(True)
    browser.set_debug_responses(True)
    browser.set_debug_redirects(True)
    response = browser.open("http://python.org/")

Alternatively, you can examine request and response objects to see what's going on. Note that requests may involve "sub-requests" in cases such as redirection, in which case you will not see everything that's going on just by examining the original request and final response.

mechanize-0.4.5/docs/browser_api.rst

.. _browser_api:

Browser API
======================================================

.. module:: mechanize._mechanize
   :synopsis: The API for mechanize browsers

API documentation for the mechanize :class:`Browser` object. You can create a mechanize :class:`Browser` instance as:

.. code-block:: python

    from mechanize import Browser
    br = Browser()

.. contents:: Contents

The Browser
----------------

.. autoclass:: mechanize.Browser
   :members:
   :inherited-members:

The Request
--------------

.. autoclass:: mechanize.Request
   :members:
   :inherited-members:

The Response
---------------

Response objects in mechanize are seekable :class:`file`-like objects that support some additional methods, depending on the protocol used for the connection. The documentation below is for HTTP(s) responses, as these are the most common.

Additional methods present for HTTP responses:

.. class:: HTTPResponse

    .. attribute:: code

        The HTTP status code

    .. method:: getcode()

        Return HTTP status code

    .. method:: geturl()

        Return the URL of the resource retrieved, commonly used to determine if a redirect was followed

    .. method:: get_all_header_names(normalize=True)

        Return a list of all header names. When `normalize` is `True`, the case of the header names is normalized.

    .. method:: get_all_header_values(name, normalize=True)

        Return a list of all values for the specified header `name` (which is case-insensitive). Since headers in HTTP can be specified multiple times, the returned value is always a list. See :meth:`rfc822.Message.getheaders`.

    .. method:: info()

        Return the headers of the response as a :class:`rfc822.Message` instance.

    .. method:: __getitem__(header_name)

        Return the *last* HTTP Header matching the specified name as string. mechanize Response objects act like dictionaries for convenient access to header values. For example: :code:`response['Date']`. You can access header values using the header names, case-insensitively. Note that when more than one header with the same name is present, only the value of the last header is returned; use :meth:`get_all_header_values()` to get the values of all headers.

    .. method:: get(header_name, default=None)

        Return the header value for the specified `header_name` or `default` if the header is not present. See :meth:`__getitem__`.

Miscellaneous
-----------------

.. autoclass:: mechanize.Link
   :members:

.. autoclass:: mechanize.History
   :members:

.. automodule:: mechanize._html
   :members: content_parser

mechanize-0.4.5/docs/conf.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# mechanize documentation build configuration file, created by
# sphinx-quickstart on Thu Mar 16 11:15:03 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('..'))
from mechanize._version import __version__ as mechanize_version  # noqa

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings.
# They can be extensions coming with Sphinx (named 'sphinx.ext.*') or your
# custom ones.
extensions = [
    'sphinx.ext.viewcode', 'sphinx.ext.autodoc', 'sphinx.ext.intersphinx'
]
intersphinx_mapping = {'python': ('https://docs.python.org/2.7', None)}

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffixes as a list of strings:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = 'mechanize'
copyright = '2017, Kovid Goyal'
author = 'Kovid Goyal'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.

# The short X.Y version.
version = '.'.join(map(str, mechanize_version[:3]))
# The full version, including alpha/beta/rc tags.
release = version

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# These patterns also affect html_static_path and html_extra_path
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False

# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = 'alabaster'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# -- Options for HTMLHelp output ------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'mechanizedoc'

# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'mechanize.tex', 'mechanize Documentation', 'Kovid Goyal',
     'manual'),
]

# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, 'mechanize', 'mechanize Documentation', [author], 1)]

# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files.
# List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'mechanize', 'mechanize Documentation', author, 'mechanize',
     'One line description of project.', 'Miscellaneous'),
]

mechanize-0.4.5/docs/faq.rst

Frequently Asked Questions
=============================

.. contents:: Contents
  :depth: 2
  :local:

General
--------

Which version of Python do I need?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mechanize works on Python 2 (>= 2.7) and Python 3 (>= 3.5).

What dependencies does mechanize need?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. literalinclude:: ../requirements.txt

What license does mechanize use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mechanize is licensed under the `BSD-3-clause <https://opensource.org/licenses/BSD-3-Clause>`_ license.

Usage
------

I'm not getting the HTML page I expected to see?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See :ref:`debugging`.

Is JavaScript supported?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No, sorry. See :ref:`jsfaq`.

My HTTP response data is truncated?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The response objects of `mechanize.Browser` support the `.seek()` method, and can still be used after `.close()` has been called.
Response data is not fetched until it is needed, so navigation away from a URL before fetching all of the response will truncate it. Call `response.get_data()` before navigation if you don't want that to happen.

Is there any example code?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Look in the `examples/` directory. Note that the examples on the forms page are executable as-is. Contributions of example code would be very welcome!

Cookies
-------

Which HTTP cookie protocols does mechanize support?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Netscape and `RFC 2965 <http://www.ietf.org/rfc/rfc2965.txt>`_. RFC 2965 handling is switched off by default.

What about RFC 2109?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RFC 2109 cookies are currently parsed as Netscape cookies, and treated by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled, or as Netscape cookies otherwise.

Why don't I have any cookies?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See :ref:`cookies`.

My response claims to be empty, but I know it's not?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Did you call `response.read()` (e.g., in a debug statement), then forget that all the data has already been read? In that case, you may want to use `mechanize.response_seek_wrapper`. `mechanize.Browser` always returns seekable responses, so it's not necessary to use this explicitly in that case.

What's the difference between the `.load()` and `.revert()` methods of `CookieJar`?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`.load()` *appends* cookies from a file. `.revert()` discards all existing cookies held by the `CookieJar` first (but it won't lose any existing cookies if the loading fails).

Is it threadsafe?
~~~~~~~~~~~~~~~~~~~

See :ref:`threading`.

How do I do `X`?
~~~~~~~~~~~~~~~~~~~~

Refer to the API documentation in :doc:`browser_api`.

Forms
----------

How do I figure out what control names and values to use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`print(form)` is usually all you need. In your code, things like the `HTMLForm.items` attribute of :class:`mechanize.HTMLForm` instances can be useful to inspect forms at runtime. Note that it's possible to use item labels instead of item names, which can be useful; use the `by_label` arguments to the various methods, and the `.get_value_by_label()` / `.set_value_by_label()` methods on `ListControl`.

What do those `'*'` characters mean in the string representations of list controls?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A `*` next to an item means that item is selected.

What do those parentheses (round brackets) mean in the string representations of list controls?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parentheses `(foo)` around an item mean that item is disabled.

Why doesn't `<some control>` turn up in the data returned by `.click*()` when that control has non-`None` value?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either the control is disabled, or it is not successful for some other reason. 'Successful' (see `HTML 4 specification <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.2>`_) means that the control will cause data to get sent to the server.

Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for `RADIO` and multiple-selection `SELECT` controls?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because by default, it follows browser behaviour when setting the initially-selected items in list controls that have no items explicitly selected in the HTML.

Why does `.click()`-ing on a button not work for me?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Clicking on a `RESET` button doesn't do anything, by design - this is a library for web automation, not an interactive browser. Even in an interactive browser, clicking on `RESET` sends nothing to the server, so there is little point in having `.click()` do anything special here.

Clicking on a `BUTTON TYPE=BUTTON` doesn't do anything either, also by design. This time, the reason is that `BUTTON` is only in the HTML standard so that one can attach JavaScript callbacks to its events. Their execution may result in information getting sent back to the server. mechanize, however, knows nothing about these callbacks, so it can't do anything useful with a click on a `BUTTON` whose type is `BUTTON`.

Generally, JavaScript may be messing things up in all kinds of ways. See :ref:`jsfaq`.

How do I change `INPUT TYPE=HIDDEN` field values (for example, to emulate the effect of JavaScript code)?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As with any control, set the control's `readonly` attribute to false.

.. code-block:: python

    form.find_control("foo").readonly = False  # allow changing .value of control foo
    form.set_all_readonly(False)  # allow changing the .value of all controls

I'm having trouble debugging my code.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See :ref:`debugging`.

I have a control containing a list of integers. How do I select the one whose value is nearest to the one I want?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    def closest_int_value(form, ctrl_name, value):
        # Item names are strings; convert them to integers for comparison.
        values = [int(item.name) for item in form.find_control(ctrl_name).items]
        # Pick the value with the smallest absolute distance to the target.
        return str(min(values, key=lambda v: abs(v - value)))

    form["distance"] = [closest_int_value(form, "distance", 23)]

Miscellaneous
-------------------

I want to see what my web browser is doing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the developer tools for your browser (you may have to install them first). These provide excellent views into all HTTP requests/responses in the browser.

.. _jsfaq:

JavaScript is messing up my web-scraping. What do I do?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

JavaScript is used in web pages for many purposes -- for example: creating content that was not present in the page at load time, submitting or filling in parts of forms in response to user actions, setting cookies, etc. mechanize does not provide any support for JavaScript.

If you come across this in a page you want to automate, you have a few options. Here they are, roughly in order of simplicity:

* Figure out what the JavaScript is doing and emulate it in your Python code.
  The simplest case is if the JavaScript is setting some cookies. In that case
  you can inspect the cookies in your browser and emulate setting them in
  mechanize with :meth:`mechanize.Browser.set_simple_cookie()`.

* More complex is to use your browser developer tools to see exactly what
  requests are sent by the browser and emulate them in mechanize by using
  :class:`mechanize.Request` to create the request manually and open it with
  :meth:`mechanize.Browser.open()`.

* Third is to use some browser automation framework/library to scrape the
  site instead of using mechanize. These libraries typically drive a headless
  version of a full browser that can execute all JavaScript. They are typically
  much slower than using mechanize and far more resource intensive, but do work
  as a last resort.
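
The first two options above can be sketched roughly as follows. This is only an illustration under stated assumptions: the cookie name and value, the domain, the URL and the extra header are all hypothetical placeholders, and the real ones have to be read from your browser's developer tools.

.. code-block:: python

    import mechanize

    br = mechanize.Browser()

    # Option 1: reproduce a cookie that the page's JavaScript would have set.
    # "session_hint", "abc123" and "example.com" are made-up values.
    br.set_simple_cookie("session_hint", "abc123", "example.com", path="/")
    cookie_names = [cookie.name for cookie in br.cookiejar]

    # Option 2: build by hand the request that the JavaScript would have sent.
    # The URL and the header are likewise placeholders.
    request = mechanize.Request(
        "http://www.example.com/api/items",
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    # response = br.open(request)  # uncomment to actually send the request

Nothing here touches the network until `br.open()` is called, so the cookie jar and the request can be inspected first and compared with what the real browser sends.
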

mechanize-0.4.5/docs/forms_api.rst

HTML Forms API
==============

Forms in HTML documents are represented by :class:`mechanize.HTMLForm`. Every form is a collection of controls. The different types of controls are represented by the various classes documented below.

.. autoclass:: mechanize.HTMLForm
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.Control
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.ScalarControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.TextControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.FileControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.IgnoreControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.ListControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.RadioControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.CheckboxControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.SelectControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.SubmitControl
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: mechanize.ImageControl
   :members:
   :inherited-members:
   :show-inheritance:

mechanize-0.4.5/docs/index.rst

mechanize
=====================================

Stateful programmatic web browsing in Python. Browse pages programmatically with easy HTML form filling and clicking of links.

.. toctree::
   :maxdepth: 2
   :caption: Table of Contents:

   faq
   browser_api
   forms_api
   advanced

Quickstart
------------

The examples below are written for a website that does not exist (`example.com`), so cannot be run.

.. code-block:: python

    import re
    import mechanize

    br = mechanize.Browser()
    br.open("http://www.example.com/")
    # follow second link with element text matching regular expression
    response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
    print(br.title())
    print(response1.geturl())
    print(response1.info())  # headers
    print(response1.read())  # body

    br.select_form(name="order")
    # Browser passes through unknown attributes (including methods)
    # to the selected HTMLForm.
    br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
    # Submit current form.
Browser calls .close() on the current response on # navigation, so this closes response1 response2 = br.submit() # print currently selected form (don't call .submit() on this, use br.submit()) print(br.form) response3 = br.back() # back to cheese shop (same data as response1) # the history mechanism returns cached response objects # we can still use the response, even though it was .close()d response3.get_data() # like .seek(0) followed by .read() response4 = br.reload() # fetches from server for form in br.forms(): print(form) # .links() optionally accepts the keyword args of .follow_/.find_link() for link in br.links(url_regex="python.org"): print(link) br.follow_link(link) # takes EITHER Link instance OR keyword args br.back() You may control the browser's policy by using the methods of `mechanize.Browser`'s base class, `mechanize.UserAgent`. For example: .. code-block:: python br = mechanize.Browser() # Explicitly configure proxies (Browser will attempt to set good defaults). # Note the userinfo ("joe:password@") and port number (":3128") are optional. br.set_proxies({"http": "joe:password@myproxy.example.com:3128", "ftp": "proxy.example.com", }) # Add HTTP Basic/Digest auth username and password for HTTP proxy access. # (equivalent to using "joe:password@..." form above) br.add_proxy_password("joe", "password") # Add HTTP Basic/Digest auth username and password for website access. br.add_password("http://example.com/protected/", "joe", "password") # Add an extra header to all outgoing requests, you can also # re-order or remove headers in this function. br.finalize_request_headers = lambda request, headers: headers.__setitem__( 'My-Custom-Header', 'Something') # Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML). br.set_handle_equiv(False) # Ignore robots.txt. Do not do this without thought and consideration. 
    br.set_handle_robots(False)
    # Don't add Referer (sic) header
    br.set_handle_referer(False)
    # Don't handle Refresh redirections
    br.set_handle_refresh(False)
    # Don't handle cookies
    br.set_cookiejar(None)
    # Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
    # default: no need to do this unless you have some reason to use a
    # particular cookiejar). Here cj is a CookieJar you have already created.
    br.set_cookiejar(cj)
    # Tell the browser to send the Accept-Encoding: gzip header to the server
    # to indicate it supports gzip Content-Encoding
    br.set_request_gzip(True)
    # Do not verify SSL certificates
    import ssl
    br.set_ca_data(context=ssl._create_unverified_context(cert_reqs=ssl.CERT_NONE))
    # Log information about HTTP redirects and Refreshes.
    br.set_debug_redirects(True)
    # Log HTTP response bodies (i.e. the HTML, most of the time).
    br.set_debug_responses(True)
    # Print HTTP headers.
    br.set_debug_http(True)

    # To make sure you're seeing all debug output:
    import logging
    import sys
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)

    # Sometimes it's useful to process bad headers or bad HTML:
    response = br.response()  # this is a copy of response
    headers = response.info()  # this is a HTTPMessage
    headers["Content-type"] = "text/html; charset=utf-8"
    # the response body is bytes, so use bytes literals here
    response.set_data(response.get_data().replace(b"<!---", b"<!--"))
    br.set_response(response)


mechanize exports the complete interface of `urllib2`:

.. code-block:: python

    import mechanize
    response = mechanize.urlopen("http://www.example.com/")
    print(response.read())

When using mechanize, anything you would normally import from `urllib2` should
be imported from mechanize instead.

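The ``finalize_request_headers`` hook from the configuration example above may be easier to read as a named function than as a lambda. The sketch below is illustrative only: the function name ``finalize_headers`` is hypothetical, and it assumes the ``headers`` argument behaves like an ordered, mutable mapping keyed by header name, as the ``headers.__setitem__`` call in the example implies. The header names and their casing are stand-ins, so check what mechanize actually passes before relying on them.

```python
from collections import OrderedDict


def finalize_headers(request, headers):
    """Hypothetical callback: add one header and move User-Agent to the end."""
    headers["My-Custom-Header"] = "Something"
    # Re-insert User-Agent, if present, so that it ends up last in the mapping.
    user_agent = headers.pop("User-Agent", None)
    if user_agent is not None:
        headers["User-Agent"] = user_agent


# Stand-in for the mapping mechanize would pass; real names and values differ.
headers = OrderedDict([("User-Agent", "demo-agent"), ("Accept", "*/*")])
finalize_headers(None, headers)  # the request argument is unused in this sketch
print(list(headers.items()))

# With a real browser, the function would be attached like this:
# br = mechanize.Browser()
# br.finalize_request_headers = finalize_headers
```

Because the function mutates the mapping in place and returns nothing, it can be exercised with a plain ``OrderedDict`` before being attached to a browser.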
Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

mechanize-0.4.5/examples/
mechanize-0.4.5/examples/forms/
mechanize-0.4.5/examples/forms/data.dat
Let's pretend this is a binary file.
mechanize-0.4.5/examples/forms/data.txt
Text, text, text.

Blah.
mechanize-0.4.5/examples/forms/echo.cgi
#!/usr/bin/python
# -*-python-*-

print "Content-Type: text/html\n"
import sys
import os
import string
import cgi

from types import ListType

print "<html><head><title>Form submission parameters</title></head>"
form = cgi.FieldStorage()
print "<p>Received parameters:</p>"
print "<pre>"
for k in form.keys():
    v = form[k]
    if isinstance(v, ListType):
        vs = []
        for item in v:
            vs.append(item.value)
        text = string.join(vs, ", ")
    else:
        text = v.value
    print "%s: %s" % (cgi.escape(k), cgi.escape(text))
print "</pre></html>"
mechanize-0.4.5/examples/forms/example.html
Example