Scrapy-1.0.3/setup.cfg
----------------------

[bdist_rpm]
doc_files = docs AUTHORS INSTALL LICENSE README.rst

[egg_info]
tag_build =
tag_date = 0
tag_svn_revision = 0

Scrapy-1.0.3/PKG-INFO
---------------------

Metadata-Version: 1.1
Name: Scrapy
Version: 1.0.3
Summary: A high-level Web Crawling and Web Scraping framework
Home-page: http://scrapy.org
Author: Pablo Hoffman
Author-email: pablo@pablohoffman.com
License: BSD
Description: same as README.rst (reproduced in full below)
Platform: UNKNOWN
Classifier: Framework :: Scrapy
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Scrapy-1.0.3/setup.py
---------------------

from os.path import dirname, join
from setuptools import setup, find_packages


with open(join(dirname(__file__), 'scrapy/VERSION'), 'rb') as f:
    version = f.read().decode('ascii').strip()


setup(
    name='Scrapy',
    version=version,
    url='http://scrapy.org',
    description='A high-level Web Crawling and Web Scraping framework',
    long_description=open('README.rst').read(),
    author='Scrapy developers',
    maintainer='Pablo Hoffman',
    maintainer_email='pablo@pablohoffman.com',
    license='BSD',
    packages=find_packages(exclude=('tests', 'tests.*')),
    include_package_data=True,
    zip_safe=False,
    entry_points={
        'console_scripts': ['scrapy = scrapy.cmdline:execute']
    },
    classifiers=[
        'Framework :: Scrapy',
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: BSD License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Topic :: Internet :: WWW/HTTP',
        'Topic :: Software Development :: Libraries :: Application Frameworks',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ],
    install_requires=[
        'Twisted>=10.0.0',
        'w3lib>=1.8.0',
        'queuelib',
        'lxml',
        'pyOpenSSL',
        'cssselect>=0.9',
        'six>=1.5.2',
        'service_identity',
    ],
)
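The ``console_scripts`` entry point in setup.py above is what creates the
``scrapy`` command: setuptools generates a small wrapper script that imports
and runs ``scrapy.cmdline:execute``. As a quick sanity check after installing,
something like the following should work; the version line is what a clean
1.0.3 install is expected to print::

    $ pip install Scrapy==1.0.3
    $ scrapy version
    Scrapy 1.0.3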
Scrapy-1.0.3/README.rst
-----------------------

======
Scrapy
======

.. image:: https://img.shields.io/pypi/v/Scrapy.svg
   :target: https://pypi.python.org/pypi/Scrapy
   :alt: PyPI Version

.. image:: https://img.shields.io/travis/scrapy/scrapy/master.svg
   :target: http://travis-ci.org/scrapy/scrapy
   :alt: Build Status

.. image:: https://img.shields.io/badge/wheel-yes-brightgreen.svg
   :target: https://pypi.python.org/pypi/Scrapy
   :alt: Wheel Status

Overview
========

Scrapy is a fast high-level web crawling and web scraping framework, used to
crawl websites and extract structured data from their pages. It can be used
for a wide range of purposes, from data mining to monitoring and automated
testing.

For more information, including a list of features, check the Scrapy homepage
at: http://scrapy.org

Requirements
============

* Python 2.7
* Works on Linux, Windows, Mac OSX, BSD

Install
=======

The quick way::

    pip install scrapy

For more details see the install section in the documentation:
http://doc.scrapy.org/en/latest/intro/install.html

Releases
========

You can download the latest stable and development releases from:
http://scrapy.org/download/

Documentation
=============

Documentation is available online at http://doc.scrapy.org/ and in the
``docs`` directory.

Community (blog, twitter, mail list, IRC)
=========================================

See http://scrapy.org/community/

Contributing
============

See http://doc.scrapy.org/en/master/contributing.html

Companies using Scrapy
======================

See http://scrapy.org/companies/

Commercial Support
==================

See http://scrapy.org/support/
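To make the overview above concrete, here is a minimal spider of the kind this
release runs; the spider name, start URL and selectors are invented for
illustration. Scrapy 1.0 accepts plain dicts as scraped items, and a
single-file spider like this can be run without a project via
``scrapy runspider example.py``::

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'                      # hypothetical spider name
        start_urls = ['http://example.com/']  # hypothetical start page

        def parse(self, response):
            # emit one structured record per page
            yield {'url': response.url,
                   'title': response.css('title::text').extract_first()}
            # follow every link on the page and parse those pages too
            for href in response.css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)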
Scrapy-1.0.3/MANIFEST.in
------------------------

include README.rst
include AUTHORS
include INSTALL
include LICENSE
include MANIFEST.in
include scrapy/VERSION
include scrapy/mime.types
recursive-include scrapy/templates *
recursive-include scrapy license.txt
recursive-include docs *
prune docs/build
recursive-include extras *
recursive-include bin *
recursive-include tests *

Scrapy-1.0.3/LICENSE
-------------------

Copyright (c) Scrapy developers.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    1. Redistributions of source code must retain the above copyright notice,
       this list of conditions and the following disclaimer.

    2. Redistributions in binary form must reproduce the above copyright
       notice, this list of conditions and the following disclaimer in the
       documentation and/or other materials provided with the distribution.

    3. Neither the name of Scrapy nor the names of its contributors may be
       used to endorse or promote products derived from this software without
       specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Scrapy-1.0.3/INSTALL
-------------------

For information about installing Scrapy see:

* docs/intro/install.rst (local file)
* http://doc.scrapy.org/en/latest/intro/install.html (online version)

Scrapy-1.0.3/AUTHORS
-------------------

Scrapy was brought to life by Shane Evans while hacking a scraping framework
prototype for Mydeco (mydeco.com). It soon became maintained, extended and
improved by Insophia (insophia.com), with the initial sponsorship of Mydeco to
bootstrap the project.

In mid-2011, Scrapinghub became the new official maintainer.

Here is the list of the primary authors & contributors:

* Pablo Hoffman
* Daniel Graña
* Martin Olveyra
* Gabriel García
* Michael Cetrulo
* Artem Bogomyagkov
* Damian Canabal
* Andres Moreira
* Ismael Carnales
* Matías Aguirre
* German Hoffmann
* Anibal Pacheco
* Bruno Deferrari
* Shane Evans
* Ezequiel Rivero
* Patrick Mezard
* Rolando Espinoza
* Ping Yin
* Lucian Ursu
* Shuaib Khan
* Didier Deshommes
* Vikas Dhiman
* Jochen Maes
* Darian Moody
* Jordi Lonch
* Zuhao Wan
* Steven Almeroth
* Tom Mortimer-Jones
* Chris Tilden
* Alexandr N Zamaraev
* Emanuel Schorsch
* Michal Danilak
* Natan Lao
* Hasnain Lakhani
* Pedro Faustino
* Alex Cepoi
* Ilya Baryshev
* Libor Nenadál
* Jae-Myoung Yu
* Vladislav Poluhin
* Marc Abramowitz
* Valentin-Costel Hăloiu
* Jason Yeo
* Сергей Прохоров
* Simon Ratne
* Julien Duponchelle
* Juan Picca
* Nicolás Ramírez
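For reference, the MANIFEST.in above is what selects the files that end up in
a source distribution such as this very tarball; building one from a checkout
is plain setuptools usage, nothing specific to Scrapy::

    $ python setup.py sdist   # should leave Scrapy-1.0.3.tar.gz under dist/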
Scrapy-1.0.3/tests/test_webclient.py
------------------------------------

"""
from twisted.internet import defer
Tests borrowed from the twisted.web.client tests.
""" import os from six.moves.urllib.parse import urlparse from twisted.trial import unittest from twisted.web import server, static, error, util from twisted.internet import reactor, defer from twisted.test.proto_helpers import StringTransport from twisted.python.filepath import FilePath from twisted.protocols.policies import WrappingFactory from scrapy.core.downloader import webclient as client from scrapy.http import Request, Headers def getPage(url, contextFactory=None, *args, **kwargs): """Adapted version of twisted.web.client.getPage""" def _clientfactory(*args, **kwargs): timeout = kwargs.pop('timeout', 0) f = client.ScrapyHTTPClientFactory(Request(*args, **kwargs), timeout=timeout) f.deferred.addCallback(lambda r: r.body) return f from twisted.web.client import _makeGetterFactory return _makeGetterFactory(url, _clientfactory, contextFactory=contextFactory, *args, **kwargs).deferred class ParseUrlTestCase(unittest.TestCase): """Test URL parsing facility and defaults values.""" def _parse(self, url): f = client.ScrapyHTTPClientFactory(Request(url)) return (f.scheme, f.netloc, f.host, f.port, f.path) def testParse(self): lip = '127.0.0.1' tests = ( ("http://127.0.0.1?c=v&c2=v2#fragment", ('http', lip, lip, 80, '/?c=v&c2=v2')), ("http://127.0.0.1/?c=v&c2=v2#fragment", ('http', lip, lip, 80, '/?c=v&c2=v2')), ("http://127.0.0.1/foo?c=v&c2=v2#frag", ('http', lip, lip, 80, '/foo?c=v&c2=v2')), ("http://127.0.0.1:100?c=v&c2=v2#fragment", ('http', lip+':100', lip, 100, '/?c=v&c2=v2')), ("http://127.0.0.1:100/?c=v&c2=v2#frag", ('http', lip+':100', lip, 100, '/?c=v&c2=v2')), ("http://127.0.0.1:100/foo?c=v&c2=v2#frag", ('http', lip+':100', lip, 100, '/foo?c=v&c2=v2')), ("http://127.0.0.1", ('http', lip, lip, 80, '/')), ("http://127.0.0.1/", ('http', lip, lip, 80, '/')), ("http://127.0.0.1/foo", ('http', lip, lip, 80, '/foo')), ("http://127.0.0.1?param=value", ('http', lip, lip, 80, '/?param=value')), ("http://127.0.0.1/?param=value", ('http', lip, lip, 80, '/?param=value')), ("http://127.0.0.1:12345/foo", ('http', lip+':12345', lip, 12345, '/foo')), ("http://spam:12345/foo", ('http', 'spam:12345', 'spam', 12345, '/foo')), ("http://spam.test.org/foo", ('http', 'spam.test.org', 'spam.test.org', 80, '/foo')), ("https://127.0.0.1/foo", ('https', lip, lip, 443, '/foo')), ("https://127.0.0.1/?param=value", ('https', lip, lip, 443, '/?param=value')), ("https://127.0.0.1:12345/", ('https', lip+':12345', lip, 12345, '/')), ("http://scrapytest.org/foo ", ('http', 'scrapytest.org', 'scrapytest.org', 80, '/foo')), ("http://egg:7890 ", ('http', 'egg:7890', 'egg', 7890, '/')), ) for url, test in tests: self.assertEquals(client._parse(url), test, url) def test_externalUnicodeInterference(self): """ L{client._parse} should return C{str} for the scheme, host, and path elements of its return tuple, even when passed an URL which has previously been passed to L{urlparse} as a C{unicode} string. 
""" badInput = u'http://example.com/path' goodInput = badInput.encode('ascii') urlparse(badInput) scheme, netloc, host, port, path = self._parse(goodInput) self.assertTrue(isinstance(scheme, str)) self.assertTrue(isinstance(netloc, str)) self.assertTrue(isinstance(host, str)) self.assertTrue(isinstance(path, str)) self.assertTrue(isinstance(port, int)) class ScrapyHTTPPageGetterTests(unittest.TestCase): def test_earlyHeaders(self): # basic test stolen from twisted HTTPageGetter factory = client.ScrapyHTTPClientFactory(Request( url='http://foo/bar', body="some data", headers={ 'Host': 'example.net', 'User-Agent': 'fooble', 'Cookie': 'blah blah', 'Content-Length': '12981', 'Useful': 'value'})) self._test(factory, "GET /bar HTTP/1.0\r\n" "Content-Length: 9\r\n" "Useful: value\r\n" "Connection: close\r\n" "User-Agent: fooble\r\n" "Host: example.net\r\n" "Cookie: blah blah\r\n" "\r\n" "some data") # test minimal sent headers factory = client.ScrapyHTTPClientFactory(Request('http://foo/bar')) self._test(factory, "GET /bar HTTP/1.0\r\n" "Host: foo\r\n" "\r\n") # test a simple POST with body and content-type factory = client.ScrapyHTTPClientFactory(Request( method='POST', url='http://foo/bar', body='name=value', headers={'Content-Type': 'application/x-www-form-urlencoded'})) self._test(factory, "POST /bar HTTP/1.0\r\n" "Host: foo\r\n" "Connection: close\r\n" "Content-Type: application/x-www-form-urlencoded\r\n" "Content-Length: 10\r\n" "\r\n" "name=value") # test a POST method with no body provided factory = client.ScrapyHTTPClientFactory(Request( method='POST', url='http://foo/bar' )) self._test(factory, "POST /bar HTTP/1.0\r\n" "Host: foo\r\n" "Content-Length: 0\r\n" "\r\n") # test with single and multivalued headers factory = client.ScrapyHTTPClientFactory(Request( url='http://foo/bar', headers={ 'X-Meta-Single': 'single', 'X-Meta-Multivalued': ['value1', 'value2'], })) self._test(factory, "GET /bar HTTP/1.0\r\n" "Host: foo\r\n" "X-Meta-Multivalued: value1\r\n" "X-Meta-Multivalued: value2\r\n" "X-Meta-Single: single\r\n" "\r\n") # same test with single and multivalued headers but using Headers class factory = client.ScrapyHTTPClientFactory(Request( url='http://foo/bar', headers=Headers({ 'X-Meta-Single': 'single', 'X-Meta-Multivalued': ['value1', 'value2'], }))) self._test(factory, "GET /bar HTTP/1.0\r\n" "Host: foo\r\n" "X-Meta-Multivalued: value1\r\n" "X-Meta-Multivalued: value2\r\n" "X-Meta-Single: single\r\n" "\r\n") def _test(self, factory, testvalue): transport = StringTransport() protocol = client.ScrapyHTTPPageGetter() protocol.factory = factory protocol.makeConnection(transport) self.assertEqual( set(transport.value().splitlines()), set(testvalue.splitlines())) return testvalue def test_non_standard_line_endings(self): # regression test for: http://dev.scrapy.org/ticket/258 factory = client.ScrapyHTTPClientFactory(Request( url='http://foo/bar')) protocol = client.ScrapyHTTPPageGetter() protocol.factory = factory protocol.headers = Headers() protocol.dataReceived("HTTP/1.0 200 OK\n") protocol.dataReceived("Hello: World\n") protocol.dataReceived("Foo: Bar\n") protocol.dataReceived("\n") self.assertEqual(protocol.headers, Headers({'Hello': ['World'], 'Foo': ['Bar']})) from twisted.web.test.test_webclient import ForeverTakingResource, \ ErrorResource, NoLengthResource, HostHeaderResource, \ PayloadResource, BrokenDownloadResource class WebClientTestCase(unittest.TestCase): def _listen(self, site): return reactor.listenTCP(0, site, interface="127.0.0.1") def setUp(self): name = 
self.mktemp() os.mkdir(name) FilePath(name).child("file").setContent("0123456789") r = static.File(name) r.putChild("redirect", util.Redirect("/file")) r.putChild("wait", ForeverTakingResource()) r.putChild("error", ErrorResource()) r.putChild("nolength", NoLengthResource()) r.putChild("host", HostHeaderResource()) r.putChild("payload", PayloadResource()) r.putChild("broken", BrokenDownloadResource()) self.site = server.Site(r, timeout=None) self.wrapper = WrappingFactory(self.site) self.port = self._listen(self.wrapper) self.portno = self.port.getHost().port def tearDown(self): return self.port.stopListening() def getURL(self, path): return "http://127.0.0.1:%d/%s" % (self.portno, path) def testPayload(self): s = "0123456789" * 10 return getPage(self.getURL("payload"), body=s).addCallback(self.assertEquals, s) def testHostHeader(self): # if we pass Host header explicitly, it should be used, otherwise # it should extract from url return defer.gatherResults([ getPage(self.getURL("host")).addCallback(self.assertEquals, "127.0.0.1:%d" % self.portno), getPage(self.getURL("host"), headers={"Host": "www.example.com"}).addCallback(self.assertEquals, "www.example.com")]) def test_getPage(self): """ L{client.getPage} returns a L{Deferred} which is called back with the body of the response if the default method B{GET} is used. """ d = getPage(self.getURL("file")) d.addCallback(self.assertEquals, "0123456789") return d def test_getPageHead(self): """ L{client.getPage} returns a L{Deferred} which is called back with the empty string if the method is C{HEAD} and there is a successful response code. """ def _getPage(method): return getPage(self.getURL("file"), method=method) return defer.gatherResults([ _getPage("head").addCallback(self.assertEqual, ""), _getPage("HEAD").addCallback(self.assertEqual, "")]) def test_timeoutNotTriggering(self): """ When a non-zero timeout is passed to L{getPage} and the page is retrieved before the timeout period elapses, the L{Deferred} is called back with the contents of the page. """ d = getPage(self.getURL("host"), timeout=100) d.addCallback(self.assertEquals, "127.0.0.1:%d" % self.portno) return d def test_timeoutTriggering(self): """ When a non-zero timeout is passed to L{getPage} and that many seconds elapse before the server responds to the request. the L{Deferred} is errbacked with a L{error.TimeoutError}. """ finished = self.assertFailure( getPage(self.getURL("wait"), timeout=0.000001), defer.TimeoutError) def cleanup(passthrough): # Clean up the server which is hanging around not doing # anything. connected = self.wrapper.protocols.keys() # There might be nothing here if the server managed to already see # that the connection was lost. 
if connected: connected[0].transport.loseConnection() return passthrough finished.addBoth(cleanup) return finished def testNotFound(self): return getPage(self.getURL('notsuchfile')).addCallback(self._cbNoSuchFile) def _cbNoSuchFile(self, pageData): self.assert_('404 - No Such Resource' in pageData) def testFactoryInfo(self): url = self.getURL('file') scheme, netloc, host, port, path = client._parse(url) factory = client.ScrapyHTTPClientFactory(Request(url)) reactor.connectTCP(host, port, factory) return factory.deferred.addCallback(self._cbFactoryInfo, factory) def _cbFactoryInfo(self, ignoredResult, factory): self.assertEquals(factory.status, '200') self.assert_(factory.version.startswith('HTTP/')) self.assertEquals(factory.message, 'OK') self.assertEquals(factory.response_headers['content-length'], '10') def testRedirect(self): return getPage(self.getURL("redirect")).addCallback(self._cbRedirect) def _cbRedirect(self, pageData): self.assertEquals(pageData, '\n\n \n \n' ' \n \n ' 'click here\n \n\n') Scrapy-1.0.3/tests/test_utils_url.pyc0000664000175000017500000001725212562424322020571 0ustar travistravis00000000000000 'Uc@swddlZddlmZddlmZmZmZdgZdejfdYZ e dkrsej ndS(iN(tSpider(turl_is_from_any_domainturl_is_from_spidertcanonicalize_urlsscrapy.utils.urlt UrlUtilsTestcBs>eZdZdZdZdZdZdZRS(cCs d}|jt|dg|jt|dgd}|jt|dg|jt|dgd}|jt|dg|jt|dgd}|jt|d g|jt|d gd }|jt|d g|jt|d d gdS(Ns/http://www.wheele-bin-art.co.uk/get/product/123swheele-bin-art.co.uks art.co.uks+http://wheele-bin-art.co.uk/get/product/123s/http://www.Wheele-Bin-Art.co.uk/get/product/123swheele-bin-art.CO.UKsWHEELE-BIN-ART.CO.UKs$http://192.169.0.15:8080/mypage.htmls192.169.0.15:8080s 192.169.0.15sjavascript:%20document.orderform_2581_1190810811.mode.value=%27add%27;%20javascript:%20document.orderform_2581_1190810811.submit%28%29stestdomain.coms.testdomain.com(t assertTrueRt assertFalse(tselfturl((s8/home/travis/build/scrapy/scrapy/tests/test_utils_url.pyttest_url_is_from_any_domain scCsktdd}|jtd||jtd||jtd||jtd|dS(Ntnames example.coms%http://www.example.com/some/page.htmls%http://sub.example.com/some/page.htmls%http://www.example.org/some/page.htmls%http://www.example.net/some/page.html(RRRR(Rtspider((s8/home/travis/build/scrapy/scrapy/tests/test_utils_url.pyttest_url_is_from_spider s cCsrdtfdY}|jtd||jtd||jtd||jtd|dS(NtMySpidercBseZdZRS(s example.com(t__name__t __module__R (((s8/home/travis/build/scrapy/scrapy/tests/test_utils_url.pyR (ss%http://www.example.com/some/page.htmls%http://sub.example.com/some/page.htmls%http://www.example.org/some/page.htmls%http://www.example.net/some/page.html(RRRR(RR ((s8/home/travis/build/scrapy/scrapy/tests/test_utils_url.pyt(test_url_is_from_spider_class_attributes's cCstdddddg}|jtd||jtd||jtd||jtd ||jtd ||jtd |tdddtd }|jtd|tdddd }|jtd|dS(NR s example.comtallowed_domainss example.orgs example.nets%http://www.example.com/some/page.htmls%http://sub.example.com/some/page.htmls!http://example.com/some/page.htmls%http://www.example.org/some/page.htmls%http://www.example.net/some/page.htmls$http://www.example.us/some/page.html(s example.coms example.net(s example.coms example.net(RRRRtset(RR ((s8/home/travis/build/scrapy/scrapy/tests/test_utils_url.pyt,test_url_is_from_spider_with_allowed_domains/scCsdtfdY}|jtd||jtd||jtd||jtd||jtd||jtd|dS( NR cBseZdZdZRS(s example.coms example.orgs example.net(s example.orgs example.net(RRR R(((s8/home/travis/build/scrapy/scrapy/tests/test_utils_url.pyR 
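The ``getPage`` helper defined at the top of this module wraps
ScrapyHTTPClientFactory so the tests read like twisted.web.client code; the
same pattern works in any trial test. A minimal sketch, assuming ``url``
points at a server like the one built in ``setUp`` above::

    from twisted.internet import defer

    @defer.inlineCallbacks
    def fetch_body(url):
        # getPage returns a Deferred that fires with the response body
        body = yield getPage(url, timeout=10)
        defer.returnValue(body)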
Scrapy-1.0.3/tests/test_utils_url.py
------------------------------------

import unittest

from scrapy.spiders import Spider
from scrapy.utils.url import url_is_from_any_domain, url_is_from_spider, canonicalize_url

__doctests__ = ['scrapy.utils.url']


class UrlUtilsTest(unittest.TestCase):

    def test_url_is_from_any_domain(self):
        url = 'http://www.wheele-bin-art.co.uk/get/product/123'
        self.assertTrue(url_is_from_any_domain(url, ['wheele-bin-art.co.uk']))
        self.assertFalse(url_is_from_any_domain(url, ['art.co.uk']))

        url = 'http://wheele-bin-art.co.uk/get/product/123'
        self.assertTrue(url_is_from_any_domain(url, ['wheele-bin-art.co.uk']))
        self.assertFalse(url_is_from_any_domain(url,
['art.co.uk'])) url = 'http://www.Wheele-Bin-Art.co.uk/get/product/123' self.assertTrue(url_is_from_any_domain(url, ['wheele-bin-art.CO.UK'])) self.assertTrue(url_is_from_any_domain(url, ['WHEELE-BIN-ART.CO.UK'])) url = 'http://192.169.0.15:8080/mypage.html' self.assertTrue(url_is_from_any_domain(url, ['192.169.0.15:8080'])) self.assertFalse(url_is_from_any_domain(url, ['192.169.0.15'])) url = 'javascript:%20document.orderform_2581_1190810811.mode.value=%27add%27;%20javascript:%20document.orderform_2581_1190810811.submit%28%29' self.assertFalse(url_is_from_any_domain(url, ['testdomain.com'])) self.assertFalse(url_is_from_any_domain(url+'.testdomain.com', ['testdomain.com'])) def test_url_is_from_spider(self): spider = Spider(name='example.com') self.assertTrue(url_is_from_spider('http://www.example.com/some/page.html', spider)) self.assertTrue(url_is_from_spider('http://sub.example.com/some/page.html', spider)) self.assertFalse(url_is_from_spider('http://www.example.org/some/page.html', spider)) self.assertFalse(url_is_from_spider('http://www.example.net/some/page.html', spider)) def test_url_is_from_spider_class_attributes(self): class MySpider(Spider): name = 'example.com' self.assertTrue(url_is_from_spider('http://www.example.com/some/page.html', MySpider)) self.assertTrue(url_is_from_spider('http://sub.example.com/some/page.html', MySpider)) self.assertFalse(url_is_from_spider('http://www.example.org/some/page.html', MySpider)) self.assertFalse(url_is_from_spider('http://www.example.net/some/page.html', MySpider)) def test_url_is_from_spider_with_allowed_domains(self): spider = Spider(name='example.com', allowed_domains=['example.org', 'example.net']) self.assertTrue(url_is_from_spider('http://www.example.com/some/page.html', spider)) self.assertTrue(url_is_from_spider('http://sub.example.com/some/page.html', spider)) self.assertTrue(url_is_from_spider('http://example.com/some/page.html', spider)) self.assertTrue(url_is_from_spider('http://www.example.org/some/page.html', spider)) self.assertTrue(url_is_from_spider('http://www.example.net/some/page.html', spider)) self.assertFalse(url_is_from_spider('http://www.example.us/some/page.html', spider)) spider = Spider(name='example.com', allowed_domains=set(('example.com', 'example.net'))) self.assertTrue(url_is_from_spider('http://www.example.com/some/page.html', spider)) spider = Spider(name='example.com', allowed_domains=('example.com', 'example.net')) self.assertTrue(url_is_from_spider('http://www.example.com/some/page.html', spider)) def test_url_is_from_spider_with_allowed_domains_class_attributes(self): class MySpider(Spider): name = 'example.com' allowed_domains = ('example.org', 'example.net') self.assertTrue(url_is_from_spider('http://www.example.com/some/page.html', MySpider)) self.assertTrue(url_is_from_spider('http://sub.example.com/some/page.html', MySpider)) self.assertTrue(url_is_from_spider('http://example.com/some/page.html', MySpider)) self.assertTrue(url_is_from_spider('http://www.example.org/some/page.html', MySpider)) self.assertTrue(url_is_from_spider('http://www.example.net/some/page.html', MySpider)) self.assertFalse(url_is_from_spider('http://www.example.us/some/page.html', MySpider)) def test_canonicalize_url(self): # simplest case self.assertEqual(canonicalize_url("http://www.example.com/"), "http://www.example.com/") # always return a str assert isinstance(canonicalize_url(u"http://www.example.com"), str) # append missing path self.assertEqual(canonicalize_url("http://www.example.com"), 
"http://www.example.com/") # typical usage self.assertEqual(canonicalize_url("http://www.example.com/do?a=1&b=2&c=3"), "http://www.example.com/do?a=1&b=2&c=3") self.assertEqual(canonicalize_url("http://www.example.com/do?c=1&b=2&a=3"), "http://www.example.com/do?a=3&b=2&c=1") self.assertEqual(canonicalize_url("http://www.example.com/do?&a=1"), "http://www.example.com/do?a=1") # sorting by argument values self.assertEqual(canonicalize_url("http://www.example.com/do?c=3&b=5&b=2&a=50"), "http://www.example.com/do?a=50&b=2&b=5&c=3") # using keep_blank_values self.assertEqual(canonicalize_url("http://www.example.com/do?b=&a=2", keep_blank_values=False), "http://www.example.com/do?a=2") self.assertEqual(canonicalize_url("http://www.example.com/do?b=&a=2"), "http://www.example.com/do?a=2&b=") self.assertEqual(canonicalize_url("http://www.example.com/do?b=&c&a=2", keep_blank_values=False), "http://www.example.com/do?a=2") self.assertEqual(canonicalize_url("http://www.example.com/do?b=&c&a=2"), "http://www.example.com/do?a=2&b=&c=") self.assertEqual(canonicalize_url(u'http://www.example.com/do?1750,4'), 'http://www.example.com/do?1750%2C4=') # spaces self.assertEqual(canonicalize_url("http://www.example.com/do?q=a space&a=1"), "http://www.example.com/do?a=1&q=a+space") self.assertEqual(canonicalize_url("http://www.example.com/do?q=a+space&a=1"), "http://www.example.com/do?a=1&q=a+space") self.assertEqual(canonicalize_url("http://www.example.com/do?q=a%20space&a=1"), "http://www.example.com/do?a=1&q=a+space") # normalize percent-encoding case (in paths) self.assertEqual(canonicalize_url("http://www.example.com/a%a3do"), "http://www.example.com/a%A3do"), # normalize percent-encoding case (in query arguments) self.assertEqual(canonicalize_url("http://www.example.com/do?k=b%a3"), "http://www.example.com/do?k=b%A3") # non-ASCII percent-encoding in paths self.assertEqual(canonicalize_url("http://www.example.com/a do?a=1"), "http://www.example.com/a%20do?a=1"), self.assertEqual(canonicalize_url("http://www.example.com/a %20do?a=1"), "http://www.example.com/a%20%20do?a=1"), self.assertEqual(canonicalize_url("http://www.example.com/a do\xc2\xa3.html?a=1"), "http://www.example.com/a%20do%C2%A3.html?a=1") # non-ASCII percent-encoding in query arguments self.assertEqual(canonicalize_url(u"http://www.example.com/do?price=\xa3500&a=5&z=3"), u"http://www.example.com/do?a=5&price=%C2%A3500&z=3") self.assertEqual(canonicalize_url("http://www.example.com/do?price=\xc2\xa3500&a=5&z=3"), "http://www.example.com/do?a=5&price=%C2%A3500&z=3") self.assertEqual(canonicalize_url("http://www.example.com/do?price(\xc2\xa3)=500&a=1"), "http://www.example.com/do?a=1&price%28%C2%A3%29=500") # urls containing auth and ports self.assertEqual(canonicalize_url(u"http://user:pass@www.example.com:81/do?now=1"), u"http://user:pass@www.example.com:81/do?now=1") # remove fragments self.assertEqual(canonicalize_url(u"http://user:pass@www.example.com/do?a=1#frag"), u"http://user:pass@www.example.com/do?a=1") self.assertEqual(canonicalize_url(u"http://user:pass@www.example.com/do?a=1#frag", keep_fragments=True), u"http://user:pass@www.example.com/do?a=1#frag") # dont convert safe characters to percent encoding representation self.assertEqual(canonicalize_url( "http://www.simplybedrooms.com/White-Bedroom-Furniture/Bedroom-Mirror:-Josephine-Cheval-Mirror.html"), "http://www.simplybedrooms.com/White-Bedroom-Furniture/Bedroom-Mirror:-Josephine-Cheval-Mirror.html") # urllib.quote uses a mapping cache of encoded characters. 
        # when parsing
        # an already percent-encoded url, it will fail if that url was not
        # percent-encoded as utf-8, that's why canonicalize_url must always
        # convert the urls to string. the following test asserts that
        # functionality.
        self.assertEqual(canonicalize_url(u'http://www.example.com/caf%E9-con-leche.htm'),
                         'http://www.example.com/caf%E9-con-leche.htm')

        # domains are case insensitive
        self.assertEqual(canonicalize_url("http://www.EXAMPLE.com/"),
                         "http://www.example.com/")

        # quoted slash and question sign
        self.assertEqual(canonicalize_url("http://foo.com/AC%2FDC+rocks%3f/?yeah=1"),
                         "http://foo.com/AC%2FDC+rocks%3F/?yeah=1")
        self.assertEqual(canonicalize_url("http://foo.com/AC%2FDC/"),
                         "http://foo.com/AC%2FDC/")


if __name__ == "__main__":
    unittest.main()

Scrapy-1.0.3/tests/test_utils_template.py
-----------------------------------------

__doctests__ = ['scrapy.utils.template']

Scrapy-1.0.3/tests/test_utils_spider.py
---------------------------------------

import unittest

from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output, iter_spider_classes

from scrapy.spiders import CrawlSpider


class MyBaseSpider(CrawlSpider):
    pass  # abstract spider


class MySpider1(MyBaseSpider):
    name = 'myspider1'


class MySpider2(MyBaseSpider):
    name = 'myspider2'


class UtilsSpidersTestCase(unittest.TestCase):

    def test_iterate_spider_output(self):
        i = BaseItem()
        r = Request('http://scrapytest.org')
        o = object()

        self.assertEqual(list(iterate_spider_output(i)), [i])
        self.assertEqual(list(iterate_spider_output(r)), [r])
        self.assertEqual(list(iterate_spider_output(o)), [o])
        self.assertEqual(list(iterate_spider_output([r, i, o])), [r, i, o])

    def test_iter_spider_classes(self):
        import tests.test_utils_spider
        it = iter_spider_classes(tests.test_utils_spider)
        self.assertEqual(set(it), {MySpider1, MySpider2})


if __name__ == "__main__":
    unittest.main()
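Taken together, the assertions in test_utils_url.py above pin down the
contract of ``canonicalize_url``: sort query arguments, append a missing path,
normalize percent-encoding and lowercase the domain. Condensed into a
doctest-style snippet, with values lifted straight from the assertions::

    >>> from scrapy.utils.url import canonicalize_url
    >>> canonicalize_url("http://www.example.com/do?c=1&b=2&a=3")
    'http://www.example.com/do?a=3&b=2&c=1'
    >>> canonicalize_url("http://www.EXAMPLE.com/")
    'http://www.example.com/'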
Scrapy-1.0.3/tests/test_utils_sitemap.py
----------------------------------------

import unittest

from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots


class SitemapTest(unittest.TestCase):

    def test_sitemap(self):
        s = Sitemap(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2009-08-16</lastmod>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>http://www.example.com/Special-Offers.html</loc>
    <lastmod>2009-08-16</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>""")
        assert s.type == 'urlset'
        self.assertEqual(list(s),
                         [{'priority': '1', 'loc': 'http://www.example.com/',
                           'lastmod': '2009-08-16', 'changefreq': 'daily'},
                          {'priority': '0.8',
                           'loc': 'http://www.example.com/Special-Offers.html',
                           'lastmod': '2009-08-16', 'changefreq': 'weekly'}])

    def test_sitemap_index(self):
        s = Sitemap(b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
  </sitemap>
</sitemapindex>""")
        assert s.type == 'sitemapindex'
        self.assertEqual(list(s),
                         [{'loc': 'http://www.example.com/sitemap1.xml.gz',
                           'lastmod': '2004-10-01T18:23:17+00:00'},
                          {'loc': 'http://www.example.com/sitemap2.xml.gz',
                           'lastmod': '2005-01-01'}])

    def test_sitemap_strip(self):
        """Assert we can deal with trailing spaces inside <loc> tags - we've
        seen those
        """
        s = Sitemap(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc> http://www.example.com/ </loc>
    <lastmod> 2009-08-16 </lastmod>
    <changefreq> daily </changefreq>
    <priority> 1 </priority>
  </url>
  <url>
    <loc>http://www.example.com/2</loc>
    <lastmod />
  </url>
</urlset>""")
        self.assertEqual(list(s),
                         [{'priority': '1', 'loc': 'http://www.example.com/',
                           'lastmod': '2009-08-16', 'changefreq': 'daily'},
                          {'loc': 'http://www.example.com/2', 'lastmod': ''},
                          ])
    def test_sitemap_wrong_ns(self):
        """We have seen sitemaps with wrongs ns. Presumably, Google still works
        with these, though is not 100% confirmed"""
        s = Sitemap(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url xmlns="">
    <loc>http://www.example.com/</loc>
    <lastmod>2009-08-16</lastmod>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url xmlns="">
    <loc>http://www.example.com/2</loc>
    <lastmod />
  </url>
</urlset>""")
        self.assertEqual(list(s),
                         [{'priority': '1', 'loc': 'http://www.example.com/',
                           'lastmod': '2009-08-16', 'changefreq': 'daily'},
                          {'loc': 'http://www.example.com/2', 'lastmod': ''},
                          ])

    def test_sitemap_wrong_ns2(self):
        """We have seen sitemaps with wrongs ns. Presumably, Google still works
        with these, though is not 100% confirmed"""
        s = Sitemap(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url xmlns="">
    <loc>http://www.example.com/</loc>
    <lastmod>2009-08-16</lastmod>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url xmlns="">
    <loc>http://www.example.com/2</loc>
    <lastmod />
  </url>
</urlset>""")
        assert s.type == 'urlset'
        self.assertEqual(list(s),
                         [{'priority': '1', 'loc': 'http://www.example.com/',
                           'lastmod': '2009-08-16', 'changefreq': 'daily'},
                          {'loc': 'http://www.example.com/2', 'lastmod': ''},
                          ])

    def test_sitemap_urls_from_robots(self):
        robots = """User-agent: *
Disallow: /aff/
Disallow: /wl/

# Search and shopping refining
Disallow: /s*/*facet
Disallow: /s*/*tags

# Sitemap files
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap-product-index.xml

# Forums
Disallow: /forum/search/
Disallow: /forum/active/
"""
        self.assertEqual(list(sitemap_urls_from_robots(robots)),
                         ['http://example.com/sitemap.xml',
                          'http://example.com/sitemap-product-index.xml'])

    def test_sitemap_blanklines(self):
        """Assert we can deal with starting blank lines before <xml> tag"""
        s = Sitemap(b"""\

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

<sitemap>
<loc>http://www.example.com/sitemap1.xml</loc>
<lastmod>2013-07-15</lastmod>
</sitemap>

<sitemap>
<loc>http://www.example.com/sitemap2.xml</loc>
<lastmod>2013-07-15</lastmod>
</sitemap>

<sitemap>
<loc>http://www.example.com/sitemap3.xml</loc>
<lastmod>2013-07-15</lastmod>
</sitemap>
</sitemapindex>""")
        self.assertEqual(list(s), [
            {'lastmod': '2013-07-15', 'loc': 'http://www.example.com/sitemap1.xml'},
            {'lastmod': '2013-07-15', 'loc': 'http://www.example.com/sitemap2.xml'},
            {'lastmod': '2013-07-15', 'loc': 'http://www.example.com/sitemap3.xml'},
        ])

    def test_comment(self):
        s = Sitemap(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <!-- this is a comment on which the parser might raise an exception if implemented incorrectly -->
  </url>
</urlset>""")
        self.assertEqual(list(s), [
            {'loc': 'http://www.example.com/'}
        ])

    def test_alternate(self):
        s = Sitemap(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/english/</loc>
    <xhtml:link rel="alternate" hreflang="de"
                href="http://www.example.com/deutsch/"/>
    <xhtml:link rel="alternate" hreflang="de-ch"
                href="http://www.example.com/schweiz-deutsch/"/>
    <xhtml:link rel="alternate" hreflang="en"
                href="http://www.example.com/english/"/>
  </url>
</urlset>""")
        self.assertEqual(list(s), [
            {'loc': 'http://www.example.com/english/',
             'alternate': ['http://www.example.com/deutsch/',
                           'http://www.example.com/schweiz-deutsch/',
                           'http://www.example.com/english/']
             }
        ])

    def test_xml_entity_expansion(self):
        s = Sitemap(b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo [
<!ELEMENT foo ANY >
<!ENTITY xxe SYSTEM "file:///etc/passwd" >
]>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://127.0.0.1:8000/&xxe;</loc>
  </url>
</urlset>""")
        self.assertEqual(list(s), [{'loc': 'http://127.0.0.1:8000/'}])


if __name__ == '__main__':
    unittest.main()
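In crawling code the ``Sitemap`` class is normally fed bytes fetched from a
site rather than literals, often starting from robots.txt; a sketch of that
flow, where ``fetch_bytes`` stands in for whatever download mechanism is
actually used::

    from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots

    def sitemap_locs(robots_txt, fetch_bytes):
        # robots.txt advertises sitemap locations via "Sitemap:" lines ...
        for sitemap_url in sitemap_urls_from_robots(robots_txt):
            s = Sitemap(fetch_bytes(sitemap_url))
            # ... and each sitemap entry is a dict with at least a 'loc' key
            for entry in s:
                yield entry['loc']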
Scrapy-1.0.3/tests/test_utils_signal.py
---------------------------------------

from testfixtures import LogCapture
from twisted.trial import unittest
from twisted.python.failure import Failure
from twisted.internet import defer, reactor

from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.signal import send_catch_log, send_catch_log_deferred


class SendCatchLogTest(unittest.TestCase):

    @defer.inlineCallbacks
    def test_send_catch_log(self):
        test_signal = object()
        handlers_called = set()

        dispatcher.connect(self.error_handler, signal=test_signal)
        dispatcher.connect(self.ok_handler, signal=test_signal)
        with LogCapture() as l:
            result = yield defer.maybeDeferred(
                self._get_result, test_signal, arg='test',
                handlers_called=handlers_called
            )

        assert self.error_handler in handlers_called
        assert self.ok_handler in handlers_called
        self.assertEqual(len(l.records), 1)
        record = l.records[0]
        self.assertIn('error_handler', record.getMessage())
        self.assertEqual(record.levelname, 'ERROR')
        self.assertEqual(result[0][0], self.error_handler)
        self.assert_(isinstance(result[0][1], Failure))
        self.assertEqual(result[1], (self.ok_handler, "OK"))

        dispatcher.disconnect(self.error_handler, signal=test_signal)
        dispatcher.disconnect(self.ok_handler, signal=test_signal)

    def _get_result(self, signal, *a, **kw):
        return send_catch_log(signal, *a, **kw)

    def error_handler(self, arg, handlers_called):
        handlers_called.add(self.error_handler)
        a = 1/0

    def ok_handler(self, arg, handlers_called):
        handlers_called.add(self.ok_handler)
        assert arg == 'test'
        return "OK"


class SendCatchLogDeferredTest(SendCatchLogTest):

    def _get_result(self, signal, *a, **kw):
        return send_catch_log_deferred(signal, *a, **kw)


class SendCatchLogDeferredTest2(SendCatchLogTest):

    def ok_handler(self, arg, handlers_called):
        handlers_called.add(self.ok_handler)
        assert arg == 'test'
        d = defer.Deferred()
        reactor.callLater(0, d.callback, "OK")
        return d

    def _get_result(self, signal, *a, **kw):
        return send_catch_log_deferred(signal, *a, **kw)


class SendCatchLogTest2(unittest.TestCase):

    def test_error_logged_if_deferred_not_supported(self):
        test_signal = object()
        test_handler = lambda: defer.Deferred()
        dispatcher.connect(test_handler, test_signal)
        with LogCapture() as l:
            send_catch_log(test_signal)
        self.assertEqual(len(l.records), 1)
        self.assertIn("Cannot return deferreds from signal handler", str(l))
        dispatcher.disconnect(test_handler, test_signal)
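``send_catch_log`` is how Scrapy delivers its own signals without letting one
faulty handler break the others: as the tests above assert, exceptions raised
by handlers are logged and returned as Failures instead of propagating. A
minimal standalone sketch; the signal object and handler are invented for the
example::

    from scrapy.xlib.pydispatch import dispatcher
    from scrapy.utils.signal import send_catch_log

    my_signal = object()   # any unique object can act as a signal
    seen = []

    def on_event(arg):
        seen.append(arg)   # invented handler

    dispatcher.connect(on_event, signal=my_signal)
    # returns a list of (handler, result-or-Failure) pairs
    results = send_catch_log(my_signal, arg='test')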
Scrapy-1.0.3/tests/test_utils_serialize.py
------------------------------------------

import json
import unittest
import datetime
from decimal import Decimal

from twisted.internet import defer

from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.http import Request, Response


class JsonEncoderTestCase(unittest.TestCase):

    def setUp(self):
        self.encoder = ScrapyJSONEncoder()

    def test_encode_decode(self):
        dt = datetime.datetime(2010, 1, 2, 10, 11, 12)
        dts = "2010-01-02 10:11:12"
        d = datetime.date(2010, 1, 2)
        ds = "2010-01-02"
        t = datetime.time(10, 11, 12)
        ts = "10:11:12"
        dec = Decimal("1000.12")
        decs = "1000.12"

        for input, output in [('foo', 'foo'), (d, ds), (t, ts), (dt, dts),
                              (dec, decs), (['foo', d], ['foo', ds])]:
            self.assertEqual(self.encoder.encode(input), json.dumps(output))

    def test_encode_deferred(self):
        self.assertIn('Deferred', self.encoder.encode(defer.Deferred()))

    def test_encode_request(self):
        r = Request("http://www.example.com/lala")
        rs = self.encoder.encode(r)
        self.assertIn(r.method, rs)
        self.assertIn(r.url, rs)

    def test_encode_response(self):
        r = Response("http://www.example.com/lala")
        rs = self.encoder.encode(r)
        self.assertIn(r.url, rs)
        self.assertIn(str(r.status), rs)
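``ScrapyJSONEncoder`` exists so that values ``json.dumps`` rejects out of the
box (dates, times, Decimals, Requests, Responses, Deferreds) can still be
serialized, for example by the feed exporters. A small illustration based on
the behaviour asserted above::

    import datetime
    from scrapy.utils.serialize import ScrapyJSONEncoder

    encoder = ScrapyJSONEncoder()
    encoder.encode({'when': datetime.date(2010, 1, 2)})
    # -> '{"when": "2010-01-02"}'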
Content-Type: text/html Some bodyi s4HTTP/1.1 6666 Content-Type: text/html Some body(Rt assertEqualR(tselftr1((s=/home/travis/build/scrapy/scrapy/tests/test_utils_response.pyttest_response_httprepr s  %%cstdd}fd}td|}t|d|sKtd|jtttd|dtdS(Ns&http:///www.example.com/some/page.htmlsM test page test body csht|j}tjj|s6|jdd}nt|j}d|ksdtdtS(Nsfile://tss tag not added( RtpathtostexiststreplacetopentreadtAssertionErrortTrue(tburlRtbbody(R(s=/home/travis/build/scrapy/scrapy/tests/test_utils_response.pyt browser_opens R t _openfuncsBrowser not calledtdebug(RRRt assertRaisest TypeErrorRR(RR Rtresponse((Rs=/home/travis/build/scrapy/scrapy/tests/test_utils_response.pyttest_open_in_browsers cCs|tddd}tddd}tddd}|jt|d|jt|d |jt|d dS( Nshttp://www.example.comR s Dummy blahablsdfsal& s Dummy blahablsdfsal& s"