Release notes

Scrapy 2.5.1 (2021-10-05)

  • Security bug fix:

    If you use HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for HTTP authentication, any request exposes your credentials to the request target.

    To prevent exposure of authentication credentials to unintended domains, you must now also set a new spider attribute, http_auth_domain, and point it to the specific domain to which the authentication credentials must be sent.

    If the http_auth_domain spider attribute is not set, the domain of the first request will be considered the HTTP authentication target, and authentication credentials will only be sent in requests targeting that domain.

    If you need to send the same HTTP authentication credentials to multiple domains, you can use w3lib.http.basic_auth_header() instead to set the value of the Authorization header of your requests.

    If you really want your spider to send the same HTTP authentication credentials to any domain, set the http_auth_domain spider attribute to None.

    Finally, if you use scrapy-splash, note that this version of Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will need to upgrade scrapy-splash to a later version for it to continue to work.
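
As a minimal sketch of the attributes described above, the class below groups them in a mixin. The class name, credentials, and domain are hypothetical; in a real project these attributes would sit on a scrapy.Spider subclass, for example by listing the mixin before scrapy.Spider in the class bases.

```python
class HttpAuthAttributes:
    """Hypothetical mixin holding the attributes read by HttpAuthMiddleware."""

    # Credentials for HTTP authentication (placeholder values).
    http_user = "someuser"
    http_pass = "somepass"

    # Only requests targeting this domain carry the credentials.
    # Setting it to None instead would send them to any domain.
    http_auth_domain = "intranet.example.com"
```

Setting http_auth_domain explicitly avoids relying on the fallback to the domain of the first request.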


Scrapy 2.5.0 (2021-04-06)

Highlights:

  • Official Python 3.9 support
  • Experimental HTTP/2 support
  • New get_retry_request() function to retry requests from spider callbacks
  • New headers_received signal that allows stopping downloads early
  • New Response.protocol attribute

Deprecation removals


Deprecations

  • The scrapy.utils.py36 module is now deprecated in favor of scrapy.utils.asyncgen. (:issue:`4900`)


New features


Bug fixes


Documentation


Quality Assurance


Scrapy 2.4.1 (2020-11-17)


Scrapy 2.4.0 (2020-10-11)

Highlights:

  • Python 3.5 support has been dropped.

  • The file_path method of media pipelines can now access the source item.

    This allows you to set a download file path based on item data.

  • The new item_export_kwargs key of the :setting:`FEEDS` setting lets you define keyword parameters to pass to item exporter classes.

  • You can now choose whether feed exports overwrite or append to the output file.

    For example, when using the crawl or runspider commands, you can use the -O option instead of -o to overwrite the output file.

  • Zstd-compressed responses are now supported if zstandard is installed.

  • In settings, where the import path of a class is required, it is now possible to pass a class object instead.
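
The feed export highlights above could look roughly like this in a project's settings.py. This is a sketch under assumptions: the file name is illustrative, and the overwrite key and the export_empty_fields parameter are not named in these notes, so check them against the feed exports documentation for your version.

```python
# settings.py fragment (file name and option values are illustrative)
FEEDS = {
    "items.json": {
        "format": "json",
        # Assumed key: overwrite the output file instead of appending to it.
        "overwrite": True,
        # Keyword parameters passed to the item exporter class.
        # export_empty_fields is an assumed exporter parameter.
        "item_export_kwargs": {"export_empty_fields": True},
    },
}
```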

Modified requirements


Backward-incompatible changes

  • CookiesMiddleware once again discards cookies defined in Request.headers.

    We decided to revert this bug fix, introduced in Scrapy 2.2.0, because it was reported that the current implementation could break existing code.

    If you need to set cookies for a request, use the Request.cookies parameter.

    A future version of Scrapy will include a new, better implementation of the reverted bug fix.

    (:issue:`4717`, :issue:`4823`)


Deprecation removals

  • scrapy.extensions.feedexport.S3FeedStorage no longer reads the values of access_key and secret_key from the running project settings when they are not passed to its __init__ method; you must either pass those parameters to its __init__ method or use S3FeedStorage.from_crawler (:issue:`4356`, :issue:`4411`, :issue:`4688`)
  • Rule.process_request no longer admits callables which expect a single request parameter, rather than both request and response (:issue:`4818`)
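
A callable for Rule.process_request that takes both parameters might look like the sketch below. The filtering policy and the stand-in objects are hypothetical, and it assumes that returning None drops the request; only the two-parameter signature comes from the note above.

```python
from types import SimpleNamespace


def process_request(request, response):
    """Two-parameter callable of the kind Rule.process_request now requires.

    Returns the request to keep it, or None to drop it (assumed behavior).
    """
    # Hypothetical policy: ignore links found on pages under /archive/.
    if "/archive/" in response.url:
        return None
    return request


# Stand-in objects so the sketch runs without a crawl.
fake_request = SimpleNamespace(url="https://example.com/page")
normal_page = SimpleNamespace(url="https://example.com/index")
archive_page = SimpleNamespace(url="https://example.com/archive/2019")

kept = process_request(fake_request, normal_page)
dropped = process_request(fake_request, archive_page)
```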


Deprecations


New features


Bug fixes


Documentation


Quality assurance


Scrapy 2.3.0 (2020-08-04)

Highlights:

  • Feed exports now support Google Cloud Storage as a storage backend

  • The new :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` setting lets you deliver output items in batches of up to the specified number of items.

    It also serves as a workaround for delayed file delivery, which causes Scrapy to only start item delivery after the crawl has finished when using certain storage backends (S3, FTP, and now GCS).

  • The base implementation of item loaders has been moved into a separate library, itemloaders, allowing usage from outside Scrapy and a separate release schedule
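
A batched feed export could be configured along these lines. This is a sketch: the batch size and file name are illustrative, and it assumes the feed URI needs a batch placeholder such as %(batch_id)d so that each batch goes to its own file.

```python
# settings.py fragment (file name and batch size are illustrative)
FEED_EXPORT_BATCH_ITEM_COUNT = 100

FEEDS = {
    # Assumption: %(batch_id)d is replaced with the batch sequence number.
    "items-%(batch_id)d.json": {"format": "json"},
}

# Plain Python string formatting shows how such a placeholder expands.
first_batch_name = "items-%(batch_id)d.json" % {"batch_id": 1}
```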

Deprecation removals

  • Removed the following classes and their parent modules from scrapy.linkextractors:

    • htmlparser.HtmlParserLinkExtractor

    • regex.RegexLinkExtractor

    • sgml.BaseSgmlLinkExtractor

    • sgml.SgmlLinkExtractor

    Use LinkExtractor instead (:issue:`4356`, :issue:`4679`)


Deprecations

  • The scrapy.utils.python.retry_on_eintr function is now deprecated (:issue:`4683`)


New features


Bug fixes


Documentation


Quality assurance


Scrapy 2.2.1 (2020-07-17)

  • The startproject command no longer makes unintended changes to the permissions of files in the destination folder, such as removing execution permissions (:issue:`4662`, :issue:`4666`)


Scrapy 2.2.0 (2020-06-24)

Highlights:

Backward-incompatible changes

  • Support for Python 3.5.0 and 3.5.1 has been dropped; Scrapy now refuses to run with a Python version lower than 3.5.2, which introduced typing.Type (:issue:`4615`)


Deprecations


New features


Bug fixes


Documentation


Quality assurance


Scrapy 2.1.0 (2020-04-24)

Highlights:

  • New :setting:`FEEDS` setting to export to multiple feeds
  • New Response.ip_address attribute

Backward-incompatible changes

  • AssertionError exceptions triggered by assert statements have been replaced by new exception types, to support running Python in optimized mode (see -O) without changing Scrapy’s behavior in any unexpected ways.

    If you catch an AssertionError exception from Scrapy, update your code to catch the corresponding new exception.

    (:issue:`4440`)
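
The motivation can be illustrated without Scrapy. In the sketch below, ValueError is only a stand-in, since these notes do not name the new exception types, and both function names are hypothetical.

```python
def check_positive_with_assert(value):
    # The assert statement is skipped when Python runs with -O,
    # so invalid input would slip through in optimized mode.
    assert value > 0, "value must be positive"
    return value


def check_positive_with_raise(value):
    # An explicit raise runs regardless of optimization flags.
    if value <= 0:
        raise ValueError("value must be positive")
    return value
```

Code that previously caught AssertionError around such a check would need to catch the explicit exception type instead.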


Deprecation removals


Deprecations


New features


Bug fixes

  • Request serialization no longer breaks for callbacks that are spider attributes which are assigned a function with a different name (:issue:`4500`)
  • None values in allowed_domains no longer cause a TypeError exception (:issue:`4410`)
  • Zsh completion no longer allows options after arguments (:issue:`4438`)
  • zope.interface 5.0.0 and later versions are now supported (:issue:`4447`, :issue:`4448`)
  • Spider.make_requests_from_url, deprecated in Scrapy 1.4.0, now issues a warning when used (:issue:`4412`)


Documentation


Quality assurance


Scrapy 2.0.1 (2020-03-18)


Scrapy 2.0.0 (2020-03-03)

Highlights:

Backward-incompatible changes


Deprecation removals

  • The Scrapy shell no longer provides a sel proxy object; use response.selector instead (:issue:`4347`)
  • LevelDB support has been removed (:issue:`4112`)
  • The following functions have been removed from scrapy.utils.python: isbinarytext, is_writable, setattr_default, stringify_dict (:issue:`4362`)


Deprecations


New features


Bug fixes


Documentation


Quality assurance


Changes to scheduler queue classes

The following changes may impact any custom queue classes of all types:

  • The push method no longer receives a second positional parameter containing request.priority * -1. If you need that value, get it from the first positional parameter, request, instead, or use the new priority() method in scrapy.core.scheduler.ScrapyPriorityQueue subclasses.

The following changes may impact custom priority queue classes:

  • In the __init__ method or the from_crawler or from_settings class methods:
    • The parameter that used to contain a factory function, qfactory, is now passed as a keyword parameter named downstream_queue_cls.
    • A new keyword parameter has been added: key. It is a string that is always an empty string for memory queues and indicates the :setting:`JOB_DIR` value for disk queues.
    • The parameter for disk queues that contains data from the previous crawl, startprios or slot_startprios, is now passed as a keyword parameter named startprios.
    • The serialize parameter is no longer passed. The disk queue class must take care of request serialization on its own before writing to disk, using the request_to_dict() and request_from_dict() functions from the scrapy.utils.reqser module.

The following changes may impact custom disk and memory queue classes:

  • The signature of the __init__ method is now __init__(self, crawler, key).

The following changes affect specifically the ScrapyPriorityQueue and DownloaderAwarePriorityQueue classes from scrapy.core.scheduler and may affect subclasses:

  • In the __init__ method, most of the changes described above apply.

    __init__ may still receive all parameters as positional parameters, however:

    • downstream_queue_cls, which replaced qfactory, must be instantiated differently.

      qfactory was instantiated with a priority value (integer).

      Instances of downstream_queue_cls should be created using the new ScrapyPriorityQueue.qfactory or DownloaderAwarePriorityQueue.pqfactory methods.

    • The new key parameter displaced the startprios parameter 1 position to the right.

  • The following class attributes have been added:

    • crawler

    • downstream_queue_cls (details above)

    • key (details above)

  • The serialize attribute has been removed (details above)

The following changes affect specifically the ScrapyPriorityQueue class and may affect subclasses:

  • A new priority() method has been added which, given a request, returns request.priority * -1.

    It is used in push() to make up for the removal of its priority parameter.

  • The spider attribute has been removed. Use crawler.spider instead.
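
The effect of negating the priority can be sketched with the standard library. FakeRequest is a hypothetical stand-in for a request, and heapq stands in for the underlying queue on the assumption that lower keys are served first; this is not Scrapy's implementation.

```python
import heapq
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Stand-in for a request; only the fields this sketch needs."""
    url: str
    priority: int = 0


def priority(request):
    # Same computation the notes describe for priority().
    return request.priority * -1


# heapq pops the smallest key first, so negating the priority makes the
# request with the highest priority come out first.
heap = []
requests = [
    FakeRequest("https://example.com/low", priority=0),
    FakeRequest("https://example.com/high", priority=10),
]
for index, request in enumerate(requests):
    # index breaks ties so FakeRequest objects are never compared.
    heapq.heappush(heap, (priority(request), index, request))

first = heapq.heappop(heap)[2]
```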

The following changes affect specifically the DownloaderAwarePriorityQueue class and may affect subclasses:

  • A new pqueues attribute offers a mapping of downloader slot names to the corresponding instances of downstream_queue_cls.

(:issue:`3884`)


Scrapy 1.8.1 (2021-10-05)

  • Security bug fix:

    If you use HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for HTTP authentication, any request exposes your credentials to the request target.

    To prevent exposure of authentication credentials to unintended domains, you must now also set a new spider attribute, http_auth_domain, and point it to the specific domain to which the authentication credentials must be sent.

    If the http_auth_domain spider attribute is not set, the domain of the first request will be considered the HTTP authentication target, and authentication credentials will only be sent in requests targeting that domain.

    If you need to send the same HTTP authentication credentials to multiple domains, you can use w3lib.http.basic_auth_header() instead to set the value of the Authorization header of your requests.

    If you really want your spider to send the same HTTP authentication credentials to any domain, set the http_auth_domain spider attribute to None.

    Finally, if you use scrapy-splash, note that this version of Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will need to upgrade scrapy-splash to a later version for it to continue to work.


Scrapy 1.8.0 (2019-10-28)

Highlights:

Backward-incompatible changes

See also Deprecation removals below.


New features


Bug fixes


Documentation


Deprecation removals


Deprecations


Other changes


Scrapy 1.7.4 (2019-10-21)

Revert the fix for :issue:`3804` (:issue:`3819`), which has a few undesired side effects (:issue:`3897`, :issue:`3976`).

As a result, when an item loader is initialized with an item, ItemLoader.load_item() once again makes later calls to ItemLoader.get_output_value() or ItemLoader.load_item() return empty data.


Scrapy 1.7.3 (2019-08-01)

Enforce lxml 4.3.5 or lower for Python 3.4 (:issue:`3912`, :issue:`3918`).


Scrapy 1.7.2 (2019-07-23)

Fix Python 2 support (:issue:`3889`, :issue:`3893`, :issue:`3896`).


Scrapy 1.7.1 (2019-07-18)

Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.


Scrapy 1.7.0 (2019-07-18)

Note

Make sure you install Scrapy 1.7.1. The Scrapy 1.7.0 package in PyPI is the result of an erroneous commit tagging and does not include all the changes described below.


Highlights:

  • Improvements for crawls targeting multiple domains
  • A cleaner way to pass arguments to callbacks
  • A new class for JSON requests
  • Improvements for rule-based spiders
  • New features for feed exports

Backward-incompatible changes

  • 429 is now part of the :setting:`RETRY_HTTP_CODES` setting by default

    This change is backward incompatible. If you don’t want to retry 429, you must override :setting:`RETRY_HTTP_CODES` accordingly.

  • Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler no longer accept a Spider subclass instance; they now accept only a Spider subclass.

    Spider subclass instances were never meant to work, and they were not working as one would expect: instead of using the passed Spider subclass instance, their from_crawler method was called to generate a new instance.

  • Non-default values for the :setting:`SCHEDULER_PRIORITY_QUEUE` setting may stop working. Scheduler priority queue classes now need to handle Request objects instead of arbitrary Python data structures.

  • An additional crawler parameter has been added to the __init__ method of the Scheduler class. Custom scheduler subclasses which don’t accept arbitrary parameters in their __init__ method might break because of this change.

    For more information, see :setting:`SCHEDULER`.
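
For the :setting:`RETRY_HTTP_CODES` change listed above, opting out of retries for 429 could look like the fragment below. The list is an assumption about the defaults of this release minus 429; compare it with the defaults of your installed version before using it.

```python
# settings.py fragment
# Assumed default retry codes for this release, with 429 left out.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
```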

See also Deprecation removals below.


New features


Bug fixes


Documentation


Deprecation removals

The following deprecated APIs have been removed (:issue:`3578`):

  • scrapy.conf (use Crawler.settings)
  • From scrapy.core.downloader.handlers:
    • http.HttpDownloadHandler (use http10.HTTP10DownloadHandler)
  • scrapy.loader.ItemLoader._get_values (use _get_xpathvalues)
  • scrapy.loader.XPathItemLoader (use ItemLoader)
  • scrapy.log (see Logging)
  • From scrapy.pipelines:
    • files.FilesPipeline.file_key (use file_path)
    • images.ImagesPipeline.file_key (use file_path)
    • images.ImagesPipeline.image_key (use file_path)
    • images.ImagesPipeline.thumb_key (use thumb_path)
  • From both scrapy.selector and scrapy.selector.lxmlsel:
    • HtmlXPathSelector (use Selector)
    • XmlXPathSelector (use Selector)
    • XPathSelector (use Selector)
    • XPathSelectorList (use Selector)
  • From scrapy.selector.csstranslator:
  • From Selector:
    • _root (both the __init__ method argument and the object property, use root)
    • extract_unquoted (use getall)
    • select (use xpath)
  • From SelectorList:
    • extract_unquoted (use getall)
    • select (use xpath)
    • x (use xpath)
  • scrapy.spiders.BaseSpider (use Spider)
  • From Spider (and subclasses):
  • scrapy.spiders.spiders (use SpiderLoader)
  • scrapy.telnet (use scrapy.extensions.telnet)
  • From scrapy.utils.python:
    • str_to_unicode (use to_unicode)
    • unicode_to_str (use to_bytes)
  • scrapy.utils.response.body_or_str

The following deprecated settings have also been removed (:issue:`3578`):


Deprecations

  • The queuelib.PriorityQueue value for the :setting:`SCHEDULER_PRIORITY_QUEUE` setting is deprecated. Use scrapy.pqueues.ScrapyPriorityQueue instead.
  • process_request callbacks passed to Rule that do not accept two arguments are deprecated.
  • The following modules are deprecated:
  • The scrapy.utils.datatypes.MergeDict class is deprecated for Python 3 code bases. Use ChainMap instead. (:issue:`3878`)
  • The scrapy.utils.gz.is_gzipped function is deprecated. Use scrapy.utils.gz.gzip_magic_number instead.
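
For the MergeDict deprecation above, the standard library replacement can be used as follows. The dictionary names and values are illustrative.

```python
from collections import ChainMap

defaults = {"timeout": 30, "retries": 2}
overrides = {"retries": 5}

# Lookups search the mappings left to right, so overrides win.
merged = ChainMap(overrides, defaults)

retries = merged["retries"]
timeout = merged["timeout"]
```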


Other changes


Scrapy 1.6.0 (2019-01-30)

Highlights:

  • better Windows support;
  • Python 3.7 compatibility;
  • big documentation improvements, including a switch from .extract_first() + .extract() API to .get() + .getall() API;
  • feed exports, FilePipeline and MediaPipeline improvements;
  • better extensibility: :signal:`item_error` and :signal:`request_reached_downloader` signals; from_crawler support for feed exporters, feed storages and dupefilters.
  • scrapy.contracts fixes and new features;
  • telnet console security improvements, first released as a backport in Scrapy 1.5.2 (2019-01-22);
  • clean-up of the deprecated code;
  • various bug fixes, small new features and usability improvements across the codebase.

Selector API changes

While these are not changes in Scrapy itself, but rather in the parsel library which Scrapy uses for xpath/css selectors, these changes are worth mentioning here. Scrapy now depends on parsel >= 1.5, and Scrapy documentation is updated to follow recent parsel API conventions.

The most visible change is that the .get() and .getall() selector methods are now preferred over .extract_first() and .extract(). We feel that these new methods result in more concise and readable code. See extract() and extract_first() for more details.

Note

There are currently no plans to deprecate .extract() and .extract_first() methods.


Another useful new feature is the introduction of Selector.attrib and SelectorList.attrib properties, which make it easier to get attributes of HTML elements. See Selecting element attributes.

CSS selectors are cached in parsel >= 1.5, which makes them faster when the same CSS path is used many times. This is very common in case of Scrapy spiders: callbacks are usually called several times, on different pages.

If you’re using custom Selector or SelectorList subclasses, a backward incompatible change in parsel may affect your code. See parsel changelog for a detailed description, as well as for the full list of improvements.


Telnet console

Backward incompatible: Scrapy’s telnet console now requires username and password. See Telnet Console for more details. This change fixes a security issue; see Scrapy 1.5.2 (2019-01-22) release notes for details.


New extensibility features

  • from_crawler support is added to feed exporters and feed storages. This, among other things, allows access to Scrapy settings from custom feed storages and exporters (:issue:`1605`, :issue:`3348`).
  • from_crawler support is added to dupefilters (:issue:`2956`); this allows access to e.g. settings or a spider from a dupefilter.
  • :signal:`item_error` is fired when an error happens in a pipeline (:issue:`3256`);
  • :signal:`request_reached_downloader` is fired when Downloader gets a new Request; this signal can be useful e.g. for custom Schedulers (:issue:`3393`).
  • new SitemapSpider sitemap_filter() method which allows selecting sitemap entries based on their attributes in SitemapSpider subclasses (:issue:`3512`).
  • Lazy loading of Downloader Handlers is now optional; this enables better initialization error handling in custom Downloader Handlers (:issue:`3394`).


New FilePipeline and MediaPipeline features


scrapy.contracts improvements

  • Exceptions in contracts code are handled better (:issue:`3377`);
  • dont_filter=True is used for contract requests, which allows testing different callbacks with the same URL (:issue:`3381`);
  • the request_cls attribute in Contract subclasses allows using different Request classes in contracts, for example FormRequest (:issue:`3383`).
  • Fixed errback handling in contracts, e.g. for cases where a contract is executed for a URL which returns a non-200 response (:issue:`3371`).


Usability improvements

  • more stats for RobotsTxtMiddleware (:issue:`3100`)
  • INFO log level is used to show telnet host/port (:issue:`3115`)
  • a message is added to IgnoreRequest in RobotsTxtMiddleware (:issue:`3113`)
  • better validation of url argument in Response.follow (:issue:`3131`)
  • non-zero exit code is returned from Scrapy commands when error happens on spider initialization (:issue:`3226`)
  • Link extraction improvements: “ftp” is added to scheme list (:issue:`3152`); “flv” is added to common video extensions (:issue:`3165`)
  • better error message when an exporter is disabled (:issue:`3358`);
  • scrapy shell --help mentions syntax required for local files (./file.html) (:issue:`3496`)
  • Referer header value is added to RFPDupeFilter log messages (:issue:`3588`)


Bug fixes

  • fixed issue with extra blank lines in .csv exports under Windows (:issue:`3039`);
  • proper handling of pickling errors in Python 3 when serializing objects for disk queues (:issue:`3082`)
  • flags are now preserved when copying Requests (:issue:`3342`);
  • FormRequest.from_response clickdata shouldn’t ignore elements with input[type=image] (:issue:`3153`).
  • FormRequest.from_response should preserve duplicate keys (:issue:`3247`)


Documentation improvements


Deprecation removals

Compatibility shims for pre-1.0 Scrapy module names are removed (:issue:`3318`):

  • scrapy.command
  • scrapy.contrib (with all submodules)
  • scrapy.contrib_exp (with all submodules)
  • scrapy.dupefilter
  • scrapy.linkextractor
  • scrapy.project
  • scrapy.spider
  • scrapy.spidermanager
  • scrapy.squeue
  • scrapy.stats
  • scrapy.statscol
  • scrapy.utils.decorator

See Module Relocations for more information, or use suggestions from Scrapy 1.5.x deprecation warnings to update your code.

Other deprecation removals:

  • Deprecated scrapy.interfaces.ISpiderManager is removed; please use scrapy.interfaces.ISpiderLoader.
  • Deprecated CrawlerSettings class is removed (:issue:`3327`).
  • Deprecated Settings.overrides and Settings.defaults attributes are removed (:issue:`3327`, :issue:`3359`).


Other improvements, cleanups


Scrapy 1.5.2 (2019-01-22)

  • Security bugfix: the telnet console extension can be easily exploited by rogue websites POSTing content to http://localhost:6023. We haven’t found a way to exploit it from Scrapy, but it is very easy to trick a browser into doing so, which elevates the risk for local development environments.

    The fix is backward incompatible: it enables telnet user-password authentication by default with a randomly generated password. If you can’t upgrade right away, please consider changing :setting:`TELNETCONSOLE_PORT` from its default value.

    See the telnet console documentation for more information.

  • Backport the fix for a CI build failure under the GCE environment caused by a boto import error.
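
For projects that cannot upgrade right away, moving the telnet console off its default port could look like the fragment below. It assumes the setting takes a [first, last] port range and that the default range starts at 6023, the port mentioned above; the range shown is an arbitrary example.

```python
# settings.py fragment for projects that cannot upgrade immediately
# Assumption: the value is a [first, last] port range.
TELNETCONSOLE_PORT = [7023, 7073]
```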


Scrapy 1.5.1 (2018-07-12)

This is a maintenance release with important bug fixes, but no new features:


Scrapy 1.5.0 (2017-12-29)

This release brings small new features and improvements across the codebase. Some highlights:

  • Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
  • Crawling with proxy servers becomes more efficient, as connections to proxies can be reused now.
  • Warnings, exception and logging messages are improved to make debugging easier.
  • The scrapy parse command now allows setting custom request meta via the --meta argument.
  • Compatibility with Python 3.6, PyPy and PyPy3 is improved; PyPy and PyPy3 are now supported officially, by running tests on CI.
  • Better default handling of HTTP 308, 522 and 524 status codes.
  • Documentation is improved, as usual.

Backward Incompatible Changes

  • Scrapy 1.5 drops support for Python 3.3.
  • Default Scrapy User-Agent now uses https link to scrapy.org (:issue:`2983`). This is technically backward-incompatible; override :setting:`USER_AGENT` if you relied on old value.
  • Logging of settings overridden by custom_settings is fixed; this is technically backward-incompatible because the logger changes from [scrapy.utils.log] to [scrapy.crawler]. If you’re parsing Scrapy logs, please update your log parsers (:issue:`1343`).
  • LinkExtractor now ignores the m4v extension by default; this is a change in behavior.
  • 522 and 524 status codes are added to RETRY_HTTP_CODES (:issue:`2851`)


New features


Bug fixes


Docs


Scrapy 1.4.0 (2017-05-18)

Scrapy 1.4 does not bring that many breathtaking new features but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings. And if you’re using Twisted version 17.1.0 or above, FTP is now available with Python 3.

There’s a new response.follow method for creating requests; it is now a recommended way to create Requests in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages over creating scrapy.Request objects directly:

  • it handles relative URLs;
  • it works properly with non-ascii URLs on non-UTF8 pages;
  • in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.

For example, instead of this:

for href in response.css('li.page a::attr(href)').extract():
    url = response.urljoin(href)
    yield scrapy.Request(url, self.parse, encoding=response.encoding)

One can now write this:

for a in response.css('li.page a'):
    yield response.follow(a, self.parse)

Link extractors are also improved. They work similarly to what a regular modern browser would do: leading and trailing whitespace are removed from attributes (think href="   http://example.com") when building Link objects. This whitespace-stripping also happens for action attributes with FormRequest.

Please also note that link extractors do not canonicalize URLs by default anymore. This was puzzling users every now and then, and it’s not what browsers do in fact, so we removed that extra transformation on extracted links.

For those of you wanting more control on the Referer: header that Scrapy sends when following links, you can set your own Referrer Policy. Prior to Scrapy 1.4, the default RefererMiddleware would simply and blindly set it to the URL of the response that generated the HTTP request (which could leak information on your URL seeds). By default, Scrapy now behaves much like your regular browser does. And this policy is fully customizable with W3C standard values (or with something really custom of your own if you wish). See :setting:`REFERRER_POLICY` for details.

To make Scrapy spiders easier to debug, Scrapy logs more stats by default in 1.4: memory usage stats, detailed retry stats, detailed HTTP error code stats. A similar change is that HTTP cache path is also visible in logs now.

Last but not least, Scrapy now has the option to make JSON and XML items more human-readable, with newlines between items and even custom indenting offset, using the new :setting:`FEED_EXPORT_INDENT` setting.
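
The referrer policy and feed indentation settings named in the preceding paragraphs could be set in a project's settings.py as sketched below. Both values are illustrative; "same-origin" is one of the W3C Referrer Policy names.

```python
# settings.py fragment (values are illustrative)
# One of the W3C Referrer Policy names.
REFERRER_POLICY = "same-origin"
# Number of spaces used to indent each level of JSON/XML feed output.
FEED_EXPORT_INDENT = 4
```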

Enjoy! (Or read on for the rest of changes in this release.)

Deprecations and Backward Incompatible Changes


New Features


Bug fixes


Cleanups & Refactoring


Documentation


Scrapy 1.3.3 (2017-03-10)

Bug fixes

  • Make SpiderLoader raise ImportError again by default for missing dependencies and wrong :setting:`SPIDER_MODULES`. These exceptions were silenced as warnings since 1.3.0. A new setting is introduced to toggle between warning or exception if needed; see :setting:`SPIDER_LOADER_WARN_ONLY` for details.


Scrapy 1.3.2 (2017-02-13)

Bug fixes

  • Preserve request class when converting to/from dicts (utils.reqser) (:issue:`2510`).
  • Use consistent selectors for author field in tutorial (:issue:`2551`).
  • Fix TLS compatibility in Twisted 17+ (:issue:`2558`)


Scrapy 1.3.1 (2017-02-08)

New features

  • Support 'True' and 'False' string values for boolean settings (:issue:`2519`); you can now do something like scrapy crawl myspider -s REDIRECT_ENABLED=False.
  • Support kwargs with response.xpath() to use XPath variables and ad-hoc namespace declarations; this requires at least Parsel v1.1 (:issue:`2457`).
  • Add support for Python 3.6 (:issue:`2485`).
  • Run tests on PyPy (warning: some tests still fail, so PyPy is not supported yet).


Bug fixes

  • Enforce DNS_TIMEOUT setting (:issue:`2496`).
  • Fix view command; it was a regression in v1.3.0 (:issue:`2503`).
  • Fix tests regarding *_EXPIRES settings with Files/Images pipelines (:issue:`2460`).
  • Fix name of generated pipeline class when using basic project template (:issue:`2466`).
  • Fix compatibility with Twisted 17+ (:issue:`2496`, :issue:`2528`).
  • Fix scrapy.Item inheritance on Python 3.6 (:issue:`2511`).
  • Enforce numeric values for components order in SPIDER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES, EXTENSIONS and SPIDER_CONTRACTS (:issue:`2420`).


Documentation


Cleanups

  • Remove redundant check in MetaRefreshMiddleware (:issue:`2542`).
  • Faster checks in LinkExtractor for allow/deny patterns (:issue:`2538`).
  • Remove dead code supporting old Twisted versions (:issue:`2544`).


Scrapy 1.3.0 (2016-12-21)

This release comes rather soon after 1.2.2 for one main reason: we found that releases from 0.18 up to and including 1.2.2 use some backported code from Twisted (scrapy.xlib.tx.*), even if newer Twisted modules are available. Scrapy now uses twisted.web.client and twisted.internet.endpoints directly. (See also cleanups below.)

As it is a major change, we wanted to get the bug fix out quickly while not breaking any projects using the 1.2 series.

New Features

  • MailSender now accepts single strings as values for to and cc arguments (:issue:`2272`)
  • scrapy fetch url, scrapy shell url and fetch(url) inside Scrapy shell now follow HTTP redirections by default (:issue:`2290`); See fetch and shell for details.
  • HttpErrorMiddleware now logs errors with INFO level instead of DEBUG; this is technically backward incompatible so please check your log parsers.
  • By default, logger names now use a long-form path, e.g. [scrapy.extensions.logstats], instead of the shorter “top-level” variant of prior releases (e.g. [scrapy]); this is backward incompatible if you have log parsers expecting the short logger name part. You can switch back to short logger names using :setting:`LOG_SHORT_NAMES` set to True.


Dependencies & Cleanups

  • Scrapy now requires Twisted >= 13.1 which is the case for many Linux distributions already.
  • As a consequence, we got rid of scrapy.xlib.tx.* modules, which copied some of Twisted code for users stuck with an “old” Twisted version
  • ChunkedTransferMiddleware is deprecated and removed from the default downloader middlewares.


Scrapy 1.2.3 (2017-03-03)

  • Packaging fix: disallow unsupported Twisted versions in setup.py


Scrapy 1.2.2 (2016-12-06)

Bug fixes

  • Fix a cryptic traceback when a pipeline fails on open_spider() (:issue:`2011`)
  • Fix embedded IPython shell variables (fixing :issue:`396` that re-appeared in 1.2.0, fixed in :issue:`2418`)
  • A couple of patches when dealing with robots.txt:


Documentation


Other changes

  • Advertise conda-forge as Scrapy’s official conda channel (:issue:`2387`)
  • More helpful error messages when trying to use .css() or .xpath() on non-Text Responses (:issue:`2264`)
  • startproject command now generates a sample middlewares.py file (:issue:`2335`)
  • Add more dependencies’ version info in scrapy version verbose output (:issue:`2404`)
  • Remove all *.pyc files from source distribution (:issue:`2386`)


Scrapy 1.2.1 (2016-10-21)

Bug fixes

  • Include OpenSSL’s more permissive default ciphers when establishing TLS/SSL connections (:issue:`2314`).
  • Fix “Location” HTTP header decoding on non-ASCII URL redirects (:issue:`2321`).


Documentation


Other changes

  • Removed www. from start_urls in built-in spider templates (:issue:`2299`).


Scrapy 1.2.0 (2016-10-03)

New Features


Bug fixes

  • DefaultRequestHeaders middleware now runs before UserAgent middleware (:issue:`2088`). Warning: this is technically backward incompatible, though we consider this a bug fix.
  • HTTP cache extension and plugins that use the .scrapy data directory now work outside projects (:issue:`1581`). Warning: this is technically backward incompatible, though we consider this a bug fix.
  • Selector does not allow passing both response and text anymore (:issue:`2153`).
  • Fixed logging of wrong callback name with scrapy parse (:issue:`2169`).
  • Fix for an odd gzip decompression bug (:issue:`1606`).
  • Fix for selected callbacks when using CrawlSpider with scrapy parse (:issue:`2225`).
  • Fix for invalid JSON and XML files when spider yields no items (:issue:`872`).
  • Implement flush() for StreamLogger avoiding a warning in logs (:issue:`2125`).
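
To illustrate the last item: an object that stands in for a stream such as sys.stdout needs flush() as well as write(), because callers routinely flush streams. The following is a minimal sketch using only the standard library; the class name LoggerStream is hypothetical and this is not Scrapy's StreamLogger code.

```python
import logging


class LoggerStream:
    """File-like object that forwards written text to a logger (illustrative)."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, buf):
        # Log each line separately, without trailing whitespace.
        for line in buf.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())

    def flush(self):
        # Delegate to the logger's handlers so buffered output is emitted.
        for handler in self.logger.handlers:
            handler.flush()
```

Without flush(), code that calls stream.flush() on the replacement object would fail with an AttributeError.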


Refactoring


Tests & Requirements

Scrapy’s new requirements baseline is Debian 8 “Jessie”. It was previously Ubuntu 12.04 Precise. In practice, this means that we run continuous integration tests with at least these (main) package versions: Twisted 14.0, pyOpenSSL 0.14, lxml 3.4.

Scrapy may very well work with older versions of these packages (the code base still has switches for older Twisted versions for example) but it is not guaranteed (because it’s not tested anymore).


Documentation


Scrapy 1.1.4 (2017-03-03)

  • Packaging fix: disallow unsupported Twisted versions in setup.py


Scrapy 1.1.3 (2016-09-22)

Bug fixes

  • Class attributes for subclasses of ImagesPipeline and FilesPipeline work as they did before 1.1.1 (:issue:`2243`, fixes :issue:`2198`)


Documentation


Scrapy 1.1.2 (2016-08-18)

Bug fixes

  • Introduce a missing :setting:`IMAGES_STORE_S3_ACL` setting to override the default ACL policy in ImagesPipeline when uploading images to S3 (note that default ACL policy is “private” – instead of “public-read” – since Scrapy 1.1.0)
  • :setting:`IMAGES_EXPIRES` default value set back to 90 (the regression was introduced in 1.1.1)
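
As a sketch of how these two settings might be used, a project that relied on publicly readable images could opt back in from settings.py. The "public-read" value is the policy named in the note above; whether you want it is a project decision.

```python
# settings.py (fragment)

# Make uploaded images publicly readable again; the default is "private".
IMAGES_STORE_S3_ACL = "public-read"

# Days before a stored image is considered expired (90 is the default).
IMAGES_EXPIRES = 90
```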


Scrapy 1.1.1 (2016-07-13)

Bug fixes


New features


Documentation


Tests

  • Upgrade py.test requirement on Travis CI and pin pytest-cov to 2.2.1 (:issue:`2095`)


Scrapy 1.1.0 (2016-05-11)

This 1.1 release brings a lot of interesting features and bug fixes:

  • Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See Beta Python 3 Support for more details and some limitations.
  • Hot new features:
  • These bug fixes may require your attention:
    • Don’t retry bad requests (HTTP 400) by default (:issue:`1289`). If you need the old behavior, add 400 to :setting:`RETRY_HTTP_CODES`.
    • Fix shell files argument handling (:issue:`1710`, :issue:`1550`). If you try scrapy shell index.html, it will try to load the URL http://index.html; use scrapy shell ./index.html to load a local file.
    • Robots.txt compliance is now enabled by default for newly-created projects (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded before proceeding with the crawl (:issue:`1735`). If you want to disable this behavior, update :setting:`ROBOTSTXT_OBEY` in settings.py file after creating a new project.
    • Exporters now work on unicode, instead of bytes by default (:issue:`1080`). If you use PythonItemExporter, you may want to update your code to disable binary mode which is now deprecated.
    • Accept XML node names containing dots as valid (:issue:`1533`).
    • When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now “private” instead of “public”. Warning: backward incompatible! You can use :setting:`FILES_STORE_S3_ACL` to change it.
    • We’ve reimplemented canonicalize_url() for more correct output, especially for URLs with non-ASCII characters (:issue:`1947`). This could change link extractors’ output compared to previous Scrapy versions. This may also invalidate some cache entries you could still have from pre-1.1 runs. Warning: backward incompatible!
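
Several of the items above can be reverted through settings. The fragment below is a sketch of what that might look like in settings.py; the retry list assumes the default codes of this release with 400 appended, and "public-read" is assumed to be the ACL you want for uploaded files.

```python
# settings.py (fragment)

# Retry HTTP 400 responses again (assumed default list plus 400).
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 400]

# Opt out of robots.txt compliance in a newly created project.
ROBOTSTXT_OBEY = False

# Use a public ACL for files uploaded to S3 by FilesPipeline.
FILES_STORE_S3_ACL = "public-read"
```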

Keep reading for more details on other improvements and bug fixes.

Beta Python 3 Support

We have been hard at work to make Scrapy run on Python 3. As a result, now you can run spiders on Python 3.3, 3.4 and 3.5 (Twisted >= 15.5 required). Some features are still missing (and some may never be ported).

Almost all builtin extensions/middlewares are expected to work. However, we are aware of some limitations in Python 3:

  • Scrapy does not work on Windows with Python 3
  • Sending emails is not supported
  • FTP download handler is not supported
  • Telnet console is not supported


Additional New Features and Enhancements


Deprecations and Removals

  • Added to_bytes and to_unicode, deprecated str_to_unicode and unicode_to_str functions (:issue:`778`).
  • binary_is_text is introduced, to replace use of isbinarytext (but with inverse return value) (:issue:`1851`)
  • The optional_features set has been removed (:issue:`1359`).
  • The --lsprof command line option has been removed (:issue:`1689`). Warning: backward incompatible, but doesn’t break user code.
  • The following datatypes were deprecated (:issue:`1720`):
    • scrapy.utils.datatypes.MultiValueDictKeyError
    • scrapy.utils.datatypes.MultiValueDict
    • scrapy.utils.datatypes.SiteNode
  • The previously bundled scrapy.xlib.pydispatch library was deprecated and replaced by pydispatcher.


Relocations


Bugfixes


Scrapy 1.0.7 (2017-03-03)

  • Packaging fix: disallow unsupported Twisted versions in setup.py


Scrapy 1.0.6 (2016-05-04)

  • FIX: RetryMiddleware is now robust to non-standard HTTP status codes (:issue:`1857`)
  • FIX: Filestorage HTTP cache was checking wrong modified time (:issue:`1875`)
  • DOC: Support for Sphinx 1.4+ (:issue:`1893`)
  • DOC: Consistency in selectors examples (:issue:`1869`)
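
The first fix above concerns status codes that have no standard reason phrase. The idea can be sketched with the standard library alone; status_message is a hypothetical helper and the fallback text is an assumption, not Scrapy's implementation.

```python
from http import HTTPStatus


def status_message(status):
    """Return 'CODE Reason', tolerating codes outside the standard registry."""
    try:
        phrase = HTTPStatus(int(status)).phrase
    except ValueError:
        # Non-standard code: fall back instead of raising.
        phrase = "Unknown Status"
    return f"{status} {phrase}"


print(status_message(404))
print(status_message(999))
```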


Scrapy 1.0.5 (2016-02-04)


Scrapy 1.0.4 (2015-12-30)


Scrapy 1.0.3 (2015-08-11)


Scrapy 1.0.2 (2015-08-06)


Scrapy 1.0.1 (2015-07-01)


Scrapy 1.0.0 (2015-06-19)

You will find a lot of new features and bugfixes in this major release. Make sure to check our updated overview to get a glimpse of some of the changes, along with our brushed-up tutorial.

Support for returning dictionaries in spiders

Declaring and returning Scrapy Items is no longer necessary to collect the scraped data from your spider; you can now return explicit dictionaries instead.

Classic version

import scrapy


class MyItem(scrapy.Item):
    url = scrapy.Field()

class MySpider(scrapy.Spider):
    def parse(self, response):
        return MyItem(url=response.url)

New version

import scrapy


class MySpider(scrapy.Spider):
    def parse(self, response):
        return {'url': response.url}

Per-spider settings (GSoC 2014)

A project from the last Google Summer of Code accomplished an important redesign of the mechanism used for populating settings, introducing explicit priorities to override any given setting. As an extension of that goal, we included a new priority level for settings that apply exclusively to a single spider, allowing them to redefine project settings.

Start using it by defining a custom_settings class variable in your spider:

import scrapy


class MySpider(scrapy.Spider):
    custom_settings = {
        "DOWNLOAD_DELAY": 5.0,
        "RETRY_ENABLED": False,
    }

Read more about settings population: Settings
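
The priority idea can be pictured with a toy model: a value is only replaced by one set at an equal or higher priority. This is a sketch for illustration, not Scrapy's settings class; PrioritizedSettings is a hypothetical name, and the priority names and numbers are assumed to mirror Scrapy's levels.

```python
# Priority levels, lowest to highest (assumed to mirror Scrapy's names).
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


class PrioritizedSettings:
    """Toy settings store: a value is replaced only by an equal or higher priority."""

    def __init__(self):
        self._values = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority="project"):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        if current is None or level >= current[1]:
            self._values[name] = (value, level)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return default if entry is None else entry[0]


settings = PrioritizedSettings()
settings.set("DOWNLOAD_DELAY", 0, priority="default")
settings.set("DOWNLOAD_DELAY", 2.0, priority="project")
settings.set("DOWNLOAD_DELAY", 5.0, priority="spider")   # per-spider value wins
settings.set("DOWNLOAD_DELAY", 1.0, priority="project")  # ignored: lower priority
print(settings.get("DOWNLOAD_DELAY"))
```

In this model the per-spider value survives the later project-level assignment, which is the behavior custom_settings relies on.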


Python Logging

Scrapy 1.0 has moved away from Twisted logging in favor of Python’s built-in logging module as the default logging system. We’re maintaining backward compatibility for most of the old custom interface for calling logging functions, but you’ll get warnings prompting you to switch to the Python logging API entirely.

Old version

from scrapy import log
log.msg('MESSAGE', log.INFO)

New version

import logging
logging.info('MESSAGE')

Logging from spiders remains the same, but in addition to the log() method you’ll have access to a custom logger created for the spider to issue log events:

import scrapy


class MySpider(scrapy.Spider):
    def parse(self, response):
        self.logger.info('Response received')

Read more in the logging documentation: Logging
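
Outside spiders, the standard library's named loggers work the same way. The following sketch uses only the logging module; the logger name and the message contents are placeholders.

```python
import logging

# Configure the root logger once, typically at program start.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
)

# A named logger identifies the component that emitted each message.
logger = logging.getLogger("mycustomlogger")  # hypothetical name

logger.info("Crawl started")
# Arguments are interpolated lazily, only if the record is actually emitted.
logger.warning("Retrying %s (attempt %d)", "http://example.com", 2)
```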


Crawler API refactoring (GSoC 2014)

Another milestone of the last Google Summer of Code was a refactoring of the internal API, aiming at simpler and easier usage. Check the new core interface in: Core API

A common situation where you will face these changes is while running Scrapy from scripts. Here’s a quick example of how to run a Spider manually with the new API:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()

Bear in mind this feature is still under development and its API may change until it reaches a stable status.

See more examples for scripts running Scrapy: Common Practices


Module Relocations

There’s been a large rearrangement of modules to improve the general structure of Scrapy. The main changes were separating various subpackages into new projects and dissolving both scrapy.contrib and scrapy.contrib_exp into top-level packages. Backward compatibility was kept for internal relocations; when importing deprecated modules, expect warnings indicating their new location.

Full list of relocations

Outsourced packages

Note

These extensions went through some minor changes, e.g. some setting names were changed. Please check the documentation in each new repository to get familiar with the new usage.


Old location -> New location
scrapy.commands.deploy -> scrapyd-client (See other alternatives here: Deploying Spiders)
scrapy.contrib.djangoitem -> scrapy-djangoitem
scrapy.webservice -> scrapy-jsonrpc

scrapy.contrib_exp and scrapy.contrib dissolutions

Old location -> New location
scrapy.contrib_exp.downloadermiddleware.decompression -> scrapy.downloadermiddlewares.decompression
scrapy.contrib_exp.iterators -> scrapy.utils.iterators
scrapy.contrib.downloadermiddleware -> scrapy.downloadermiddlewares
scrapy.contrib.exporter -> scrapy.exporters
scrapy.contrib.linkextractors -> scrapy.linkextractors
scrapy.contrib.loader -> scrapy.loader
scrapy.contrib.loader.processor -> scrapy.loader.processors
scrapy.contrib.pipeline -> scrapy.pipelines
scrapy.contrib.spidermiddleware -> scrapy.spidermiddlewares
scrapy.contrib.spiders -> scrapy.spiders
scrapy.contrib.closespider -> scrapy.extensions.*
scrapy.contrib.corestats -> scrapy.extensions.*
scrapy.contrib.debug -> scrapy.extensions.*
scrapy.contrib.feedexport -> scrapy.extensions.*
scrapy.contrib.httpcache -> scrapy.extensions.*
scrapy.contrib.logstats -> scrapy.extensions.*
scrapy.contrib.memdebug -> scrapy.extensions.*
scrapy.contrib.memusage -> scrapy.extensions.*
scrapy.contrib.spiderstate -> scrapy.extensions.*
scrapy.contrib.statsmailer -> scrapy.extensions.*
scrapy.contrib.throttle -> scrapy.extensions.*

Plural renames and Modules unification

Old location -> New location
scrapy.command -> scrapy.commands
scrapy.dupefilter -> scrapy.dupefilters
scrapy.linkextractor -> scrapy.linkextractors
scrapy.spider -> scrapy.spiders
scrapy.squeue -> scrapy.squeues
scrapy.statscol -> scrapy.statscollectors
scrapy.utils.decorator -> scrapy.utils.decorators

Class renames

Old location -> New location
scrapy.spidermanager.SpiderManager -> scrapy.spiderloader.SpiderLoader

Settings renames

Old location -> New location
SPIDER_MANAGER_CLASS -> SPIDER_LOADER_CLASS
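
When updating a code base, the module relocation tables above can be applied mechanically. The helper below is a hypothetical sketch covering a subset of the rows; it assumes that submodule names below each prefix are unchanged, which the tables do not guarantee, so treat its output as a starting point.

```python
# Old prefix -> new prefix, taken from the relocation tables above (subset).
RELOCATIONS = {
    "scrapy.contrib_exp.iterators": "scrapy.utils.iterators",
    "scrapy.contrib.downloadermiddleware": "scrapy.downloadermiddlewares",
    "scrapy.contrib.exporter": "scrapy.exporters",
    "scrapy.contrib.linkextractors": "scrapy.linkextractors",
    "scrapy.contrib.loader.processor": "scrapy.loader.processors",
    "scrapy.contrib.loader": "scrapy.loader",
    "scrapy.contrib.pipeline": "scrapy.pipelines",
    "scrapy.contrib.spidermiddleware": "scrapy.spidermiddlewares",
    "scrapy.contrib.spiders": "scrapy.spiders",
    "scrapy.dupefilter": "scrapy.dupefilters",
    "scrapy.spider": "scrapy.spiders",
    "scrapy.statscol": "scrapy.statscollectors",
}


def relocate(module_path):
    """Return the relocated dotted path, or the path unchanged if no rule applies."""
    # Try longer prefixes first so the most specific rule wins.
    for old in sorted(RELOCATIONS, key=len, reverse=True):
        if module_path == old or module_path.startswith(old + "."):
            return RELOCATIONS[old] + module_path[len(old):]
    return module_path


print(relocate("scrapy.contrib.spiders.crawl"))
print(relocate("scrapy.contrib.loader.processor"))
```

Matching on whole dotted components keeps unrelated modules such as scrapy.spidermanager from being caught by the scrapy.spider rule.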


Changelog

New Features and Enhancements

Deprecations and Removals

Relocations

Documentation

Bugfixes

Python 3 In Progress Support

Tests

Code refactoring

  • CSVFeedSpider cleanup: use iterate_spider_output (:issue:`1079`)
  • remove unnecessary check from scrapy.utils.spider.iter_spider_output (:issue:`1078`)
  • Pydispatch pep8 (:issue:`992`)
  • Removed unused ‘load=False’ parameter from walk_modules() (:issue:`871`)
  • For consistency, use job_dir helper in SpiderState extension. (:issue:`805`)
  • rename “sflo” local variables to less cryptic “log_observer” (:issue:`775`)


Scrapy 0.24.6 (2015-04-20)


Scrapy 0.24.5 (2015-02-25)


Scrapy 0.24.4 (2014-08-09)


Scrapy 0.24.3 (2014-08-09)


Scrapy 0.24.2 (2014-07-08)


Scrapy 0.24.1 (2014-06-27)

  • Fix deprecated CrawlerSettings and increase backward compatibility with .defaults attribute (:commit:`8e3f20a`)


Scrapy 0.24.0 (2014-06-26)

Enhancements


Bugfixes


Scrapy 0.22.2 (released 2014-02-14)


Scrapy 0.22.1 (released 2014-02-08)


Scrapy 0.22.0 (released 2014-01-17)

Enhancements


Fixes


Scrapy 0.20.2 (released 2013-12-09)


Scrapy 0.20.1 (released 2013-11-28)

  • include_package_data is required to build wheels from published sources (:commit:`5ba1ad5`)
  • process_parallel was leaking the failures on its internal deferreds. closes #458 (:commit:`419a780`)


Scrapy 0.20.0 (released 2013-11-08)

Enhancements


Bugfixes


Other

  • Dropped Python 2.6 support (:issue:`448`)
  • Add cssselect python package as install dependency
  • Drop libxml2 and multi selector’s backend support; lxml is required from now on.
  • Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
  • Running test suite now requires mock python library (:issue:`390`)


Thanks

Thanks to everyone who contributed to this release!

List of contributors sorted by number of commits:

69 Daniel Graña <dangra@...>
37 Pablo Hoffman <pablo@...>
13 Mikhail Korobov <kmike84@...>
 9 Alex Cepoi <alex.cepoi@...>
 9 alexanderlukanin13 <alexander.lukanin.13@...>
 8 Rolando Espinoza La fuente <darkrho@...>
 8 Lukasz Biedrycki <lukasz.biedrycki@...>
 6 Nicolas Ramirez <nramirez.uy@...>
 3 Paul Tremberth <paul.tremberth@...>
 2 Martin Olveyra <molveyra@...>
 2 Stefan <misc@...>
 2 Rolando Espinoza <darkrho@...>
 2 Loren Davie <loren@...>
 2 irgmedeiros <irgmedeiros@...>
 1 Stefan Koch <taikano@...>
 1 Stefan <cct@...>
 1 scraperdragon <dragon@...>
 1 Kumara Tharmalingam <ktharmal@...>
 1 Francesco Piccinno <stack.box@...>
 1 Marcos Campal <duendex@...>
 1 Dragon Dave <dragon@...>
 1 Capi Etheriel <barraponto@...>
 1 cacovsky <amarquesferraz@...>
 1 Berend Iwema <berend@...>

Scrapy 0.18.4 (released 2013-10-10)


Scrapy 0.18.3 (released 2013-10-03)


Scrapy 0.18.2 (released 2013-09-03)

  • Backport scrapy check command fixes and backward compatible multi crawler process (:issue:`339`)


Scrapy 0.18.1 (released 2013-08-27)


Scrapy 0.18.0 (released 2013-08-09)

  • Many improvements to the test suite run using Tox, including a way to test on pypi
  • Handle GET parameters for AJAX crawlable urls (:commit:`3fe2a32`)
  • Use lxml recover option to parse sitemaps (:issue:`347`)
  • Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
  • Support disabling HttpCompressionMiddleware using a flag setting (:issue:`359`)
  • Support xml namespaces using iternodes parser in XMLFeedSpider (:issue:`12`)
  • Support dont_cache request meta flag (:issue:`19`)
  • Bugfix scrapy.utils.gz.gunzip broken by changes in python 2.7.4 (:commit:`4dc76e`)
  • Bugfix url encoding on SgmlLinkExtractor (:issue:`24`)
  • Bugfix TakeFirst processor shouldn’t discard zero (0) value (:issue:`59`)
  • Support nested items in xml exporter (:issue:`66`)
  • Improve cookies handling performance (:issue:`77`)
  • Log dupe filtered requests once (:issue:`105`)
  • Split redirection middleware into status and meta based middlewares (:issue:`78`)
  • Use HTTP1.1 as default downloader handler (:issue:`109` and :issue:`318`)
  • Support xpath form selection on FormRequest.from_response (:issue:`185`)
  • Bugfix unicode decoding error on SgmlLinkExtractor (:issue:`199`)
  • Bugfix signal dispatching on PyPy interpreter (:issue:`205`)
  • Improve request delay and concurrency handling (:issue:`206`)
  • Add RFC2616 cache policy to HttpCacheMiddleware (:issue:`212`)
  • Allow customization of messages logged by engine (:issue:`214`)
  • Multiple improvements to DjangoItem (:issue:`217`, :issue:`218`, :issue:`221`)
  • Extend Scrapy commands using setuptools entry points (:issue:`260`)
  • Allow spider allowed_domains value to be set/tuple (:issue:`261`)
  • Support settings.getdict (:issue:`269`)
  • Simplify internal scrapy.core.scraper slot handling (:issue:`271`)
  • Added Item.copy (:issue:`290`)
  • Collect idle downloader slots (:issue:`297`)
  • Add ftp:// scheme downloader handler (:issue:`329`)
  • Added downloader benchmark webserver and spider tools (see Benchmarking)
  • Moved persistent (on disk) queues to a separate project (queuelib) which Scrapy now depends on
  • Add Scrapy commands using external libraries (:issue:`260`)
  • Added --pdb option to scrapy command line tool
  • Added XPathSelector.remove_namespaces, which allows removing all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in Selectors.
  • Several improvements to spider contracts
  • New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections
  • MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
  • added from_crawler method to spiders
  • added system tests with mock server
  • more improvements to macOS compatibility (thanks Alex Cepoi)
  • several more cleanups to singletons and multi-spider support (thanks Nicolas Ramirez)
  • support custom download slots
  • added --spider option to “shell” command.
  • log overridden settings when Scrapy starts
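
The TakeFirst fix in the list above is easy to picture: a processor that tests truthiness drops a legitimate 0, while one that skips only None and the empty string keeps it. Both functions below are illustrative sketches, not the library's code, and the "buggy" variant is an assumption about what the earlier logic looked like.

```python
def take_first_buggy(values):
    # Truthiness test: wrongly skips 0, False and other falsy values.
    for value in values:
        if value:
            return value


def take_first(values):
    # Skip only None and the empty string, so 0 is a valid result.
    for value in values:
        if value is not None and value != "":
            return value


print(take_first_buggy([None, 0, 5]))
print(take_first([None, 0, 5]))
```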

Thanks to everyone who contributed to this release. Here is a list of contributors sorted by number of commits:

130 Pablo Hoffman <pablo@...>
 97 Daniel Graña <dangra@...>
 20 Nicolás Ramírez <nramirez.uy@...>
 13 Mikhail Korobov <kmike84@...>
 12 Pedro Faustino <pedrobandim@...>
 11 Steven Almeroth <sroth77@...>
  5 Rolando Espinoza La fuente <darkrho@...>
  4 Michal Danilak <mimino.coder@...>
  4 Alex Cepoi <alex.cepoi@...>
  4 Alexandr N Zamaraev (aka tonal) <tonal@...>
  3 paul <paul.tremberth@...>
  3 Martin Olveyra <molveyra@...>
  3 Jordi Llonch <llonchj@...>
  3 arijitchakraborty <myself.arijit@...>
  2 Shane Evans <shane.evans@...>
  2 joehillen <joehillen@...>
  2 Hart <HartSimha@...>
  2 Dan <ellisd23@...>
  1 Zuhao Wan <wanzuhao@...>
  1 whodatninja <blake@...>
  1 vkrest <v.krestiannykov@...>
  1 tpeng <pengtaoo@...>
  1 Tom Mortimer-Jones <tom@...>
  1 Rocio Aramberri <roschegel@...>
  1 Pedro <pedro@...>
  1 notsobad <wangxiaohugg@...>
  1 Natan L <kuyanatan.nlao@...>
  1 Mark Grey <mark.grey@...>
  1 Luan <luanpab@...>
  1 Libor Nenadál <libor.nenadal@...>
  1 Juan M Uys <opyate@...>
  1 Jonas Brunsgaard <jonas.brunsgaard@...>
  1 Ilya Baryshev <baryshev@...>
  1 Hasnain Lakhani <m.hasnain.lakhani@...>
  1 Emanuel Schorsch <emschorsch@...>
  1 Chris Tilden <chris.tilden@...>
  1 Capi Etheriel <barraponto@...>
  1 cacovsky <amarquesferraz@...>
  1 Berend Iwema <berend@...>

Scrapy 0.16.5 (released 2013-05-30)


Scrapy 0.16.4 (released 2013-01-23)


Scrapy 0.16.3 (released 2012-12-07)


Scrapy 0.16.2 (released 2012-11-09)


Scrapy 0.16.1 (released 2012-10-26)


Scrapy 0.16.0 (released 2012-10-18)

Scrapy changes:

  • added Spiders Contracts, a mechanism for testing spiders in a formal/reproducible way
  • added options -o and -t to the runspider command
  • documented AutoThrottle extension and added to extensions installed by default. You still need to enable it with :setting:`AUTOTHROTTLE_ENABLED`
  • major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (stats_spider_opened, etc). Stats are much simpler now; backward compatibility is kept on the Stats Collector API and signals.
  • added process_start_requests() method to spider middlewares
  • dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
  • dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
  • documented Core API
  • lxml is now the default selectors backend instead of libxml2
  • ported FormRequest.from_response() to use lxml instead of ClientForm
  • removed modules: scrapy.xlib.BeautifulSoup and scrapy.xlib.ClientForm
  • SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content type (:commit:`10ed28b`)
  • StackTraceDump extension: also dump trackref live references (:commit:`fe2ce93`)
  • nested items now fully supported in JSON and JSONLines exporters
  • added :reqmeta:`cookiejar` Request meta key to support multiple cookie sessions per spider
  • decoupled encoding detection code to w3lib.encoding, and ported Scrapy code to use that module
  • dropped support for Python 2.5. See https://blog.scrapinghub.com/2012/02/27/scrapy-0-15-dropping-support-for-python-2-5/
  • dropped support for Twisted 2.5
  • added :setting:`REFERER_ENABLED` setting, to control referer middleware
  • changed default user agent to: Scrapy/VERSION (+http://scrapy.org)
  • removed (undocumented) HTMLImageLinkExtractor class from scrapy.contrib.linkextractors.image
  • removed per-spider settings (to be replaced by instantiating multiple crawler objects)
  • USER_AGENT spider attribute will no longer work, use user_agent attribute instead
  • DOWNLOAD_TIMEOUT spider attribute will no longer work, use download_timeout attribute instead
  • removed ENCODING_ALIASES setting, as encoding auto-detection has been moved to the w3lib library
  • promoted DjangoItem to main contrib
  • LogFormatter methods now return dicts (instead of strings) to support lazy formatting (:issue:`164`, :commit:`dcef7b0`)
  • downloader handlers (:setting:`DOWNLOAD_HANDLERS` setting) now receive settings as the first argument of the __init__ method
  • replaced memory usage accounting with (more portable) resource module, removed scrapy.utils.memory module
  • removed signal: scrapy.mail.mail_sent
  • removed TRACK_REFS setting, now trackrefs is always enabled
  • DBM is now the default storage backend for HTTP cache middleware
  • the number of log messages (per level) is now tracked through Scrapy stats (stat name: log_count/LEVEL)
  • the number of received responses is now tracked through Scrapy stats (stat name: response_received_count)
  • removed scrapy.log.started attribute
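
The log_count/LEVEL stat mentioned above amounts to counting log messages per level. The sketch below shows the idea with the standard logging module, purely for illustration: LogCountHandler is a hypothetical class, a Counter stands in for the stats collector, and this is not how this Scrapy version implements it.

```python
import logging
from collections import Counter


class LogCountHandler(logging.Handler):
    """Count emitted records per level under 'log_count/LEVEL' keys."""

    def __init__(self, stats):
        super().__init__()
        self.stats = stats  # a Counter standing in for the stats collector

    def emit(self, record):
        self.stats[f"log_count/{record.levelname}"] += 1


stats = Counter()
logger = logging.getLogger("log_count_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo from printing through the root logger
logger.addHandler(LogCountHandler(stats))

logger.info("first")
logger.info("second")
logger.error("boom")
print(dict(stats))
```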


Scrapy 0.14.4


Scrapy 0.14.3

  • forgot to include pydispatch license. #118 (:commit:`fd85f9c`)
  • include egg files used by testsuite in source distribution. #118 (:commit:`c897793`)
  • update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107 (:commit:`2548dcc`)
  • added note to docs/topics/firebug.rst about google directory being shut down (:commit:`668e352`)
  • don’t discard slot when empty, just save in another dict in order to recycle if needed again. (:commit:`8e9f607`)
  • do not fail handling unicode xpaths in libxml2 backed selectors (:commit:`b830e95`)
  • fixed minor mistake in Request objects documentation (:commit:`bf3c9ee`)
  • fixed minor defect in link extractors documentation (:commit:`ba14f38`)
  • removed some obsolete remaining code related to sqlite support in Scrapy (:commit:`0665175`)


Scrapy 0.14.2


Scrapy 0.14.1


Scrapy 0.14

New features and settings

  • check the documentation for more details
  • Added builtin caching DNS resolver (:rev:`2728`)
  • Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: scaws (https://github.com/scrapinghub/scaws) (:rev:`2706`, :rev:`2714`)
  • Moved spider queues to scrapyd: scrapy.spiderqueue -> scrapyd.spiderqueue (:rev:`2708`)
  • Moved sqlite utils to scrapyd: scrapy.utils.sqlite -> scrapyd.sqlite (:rev:`2781`)
  • Real support for returning iterators on start_requests() method. The iterator is now consumed during the crawl when the spider is getting idle (:rev:`2704`)
  • Added :setting:`REDIRECT_ENABLED` setting to quickly enable/disable the redirect middleware (:rev:`2697`)
  • Added :setting:`RETRY_ENABLED` setting to quickly enable/disable the retry middleware (:rev:`2694`)
  • Added CloseSpider exception to manually close spiders (:rev:`2691`)
  • Improved encoding detection by adding support for HTML5 meta charset declaration (:rev:`2690`)
  • Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (:rev:`2688`)
  • Added SitemapSpider (see documentation in Spiders page) (:rev:`2658`)
  • Added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) (:rev:`2657`)
  • Make handling of gzipped responses more robust (#319, :rev:`2643`). Now Scrapy will try and decompress as much as possible from a gzipped response, instead of failing with an IOError.
  • Simplified MemoryDebugger extension to use stats for dumping memory debugging info (:rev:`2639`)
  • Added new command to edit spiders: scrapy edit (:rev:`2636`) and -e flag to genspider command that uses it (:rev:`2653`)
  • Changed default representation of items to pretty-printed dicts. (:rev:`2631`). This improves default logging by making log more readable in the default case, for both Scraped and Dropped lines.
  • Added :signal:`spider_error` signal (:rev:`2628`)
  • Added :setting:`COOKIES_ENABLED` setting (:rev:`2625`)
  • Stats are now dumped to Scrapy log (default value of :setting:`STATS_DUMP` setting has been changed to True). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
  • Added support for dynamically adjusting download delay and maximum concurrent requests (:rev:`2599`)
  • Added new DBM HTTP cache storage backend (:rev:`2576`)
  • Added listjobs.json API to Scrapyd (:rev:`2571`)
  • CsvItemExporter: added join_multivalued parameter (:rev:`2578`)
  • Added namespace support to xmliter_lxml (:rev:`2552`)
  • Improved cookies middleware by making COOKIES_DEBUG nicer and documenting it (:rev:`2579`)
  • Several improvements to Scrapyd and Link extractors

  • Code rearranged and removed

    • Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: (:rev:`2630`)
      • original item_scraped signal was removed
      • original item_passed signal was renamed to item_scraped
      • old log lines Scraped Item... were removed
      • old log lines Passed Item... were renamed to Scraped Item... lines and downgraded to DEBUG level
    • Reduced Scrapy codebase by stripping part of Scrapy code into two new libraries:
      • w3lib (several functions from scrapy.utils.{http,markup,multipart,response,url}, done in :rev:`2584`)
    • Removed unused function: scrapy.utils.request.request_info() (:rev:`2577`)
    • Removed googledir project from examples/googledir. There’s now a new example project called dirbot available on GitHub: https://github.com/scrapy/dirbot
    • Removed support for default field values in Scrapy items (:rev:`2616`)
    • Removed experimental crawlspider v2 (:rev:`2632`)
    • Removed scheduler middleware to simplify architecture. Duplicates filter is now done in the scheduler itself, using the same dupe filtering class as before (DUPEFILTER_CLASS setting) (:rev:`2640`)
    • Removed support for passing urls to scrapy crawl command (use scrapy parse instead) (:rev:`2704`)
    • Removed deprecated Execution Queue (:rev:`2704`)
    • Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (:rev:`2780`)
    • Removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead) (:rev:`2789`)
    • Renamed attributes of core components: downloader.sites -> downloader.slots, scraper.sites -> scraper.slots (:rev:`2717`, :rev:`2718`)
    • Renamed setting CLOSESPIDER_ITEMPASSED to :setting:`CLOSESPIDER_ITEMCOUNT` (:rev:`2655`). Backward compatibility kept.
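
The more robust gzip handling listed under the new features of this release can be pictured with the standard library: a streaming decompressor returns whatever the available input yields, where a strict one-shot call raises on a truncated body. This is a sketch of the idea, not Scrapy's code; the sample data and the truncation point are arbitrary.

```python
import gzip
import zlib

original = b"".join(b"line %d\n" % i for i in range(5000))
compressed = gzip.compress(original)
truncated = compressed[: len(compressed) // 2]  # simulate a cut-off response

# Strict decoding rejects the incomplete stream.
try:
    gzip.decompress(truncated)
    strict_ok = True
except (EOFError, OSError, zlib.error):
    strict_ok = False

# Lenient decoding: 16 + MAX_WBITS tells zlib to expect a gzip header,
# and decompress() returns the output that the available input produces.
decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
partial = decoder.decompress(truncated)

print(strict_ok, len(partial), len(original))
```

The recovered bytes are a prefix of the original payload, which is often enough to extract useful data from a broken response.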


    Scrapy 0.12

    The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

    New features and improvements

    • Passed item is now sent in the item argument of the :signal:`item_passed` signal (#273)
    • Added verbose option to scrapy version command, useful for bug reports (#298)
    • HTTP cache now stored by default in the project data dir (#279)
    • Added project data storage directory (#276, #277)
    • Documented file structure of Scrapy projects (see command-line tool doc)
    • New lxml backend for XPath selectors (#147)
    • Per-spider settings (#245)
    • Support exit codes to signal errors in Scrapy commands (#248)
    • Added -c argument to scrapy shell command
    • Made libxml2 optional (#260)
    • New deploy command (#261)
    • Added :setting:`CLOSESPIDER_PAGECOUNT` setting (#253)
    • Added :setting:`CLOSESPIDER_ERRORCOUNT` setting (#254)


    Scrapyd changes

    • Scrapyd now uses one process per spider
    • It stores one log file per spider run, and rotates them, keeping the latest 5 logs per spider (by default)
    • A minimal web ui was added, available at http://localhost:6800 by default
    • There is now a scrapy server command to start a Scrapyd server of the current project


    Changes to settings

    • added HTTPCACHE_ENABLED setting (False by default) to enable HTTP cache middleware
    • changed HTTPCACHE_EXPIRATION_SECS semantics: now zero means “never expire”.
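
The new HTTPCACHE_EXPIRATION_SECS semantics can be expressed as a single comparison. The function below is a hypothetical sketch of such a freshness check, not the cache storage's actual code.

```python
import time


def is_expired(stored_at, expiration_secs, now=None):
    """Return True if an entry stored at `stored_at` (epoch seconds) is stale."""
    now = time.time() if now is None else now
    # Zero means "never expire"; this sketch treats negative values the same way.
    return 0 < expiration_secs < now - stored_at


print(is_expired(stored_at=0, expiration_secs=0, now=10**9))    # zero: never expires
print(is_expired(stored_at=0, expiration_secs=3600, now=7200))  # older than the limit
print(is_expired(stored_at=0, expiration_secs=3600, now=1800))  # still fresh
```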


    Deprecated/obsoleted functionality

    • Deprecated runserver command in favor of server command which starts a Scrapyd server. See also: Scrapyd changes
    • Deprecated queue command in favor of using Scrapyd schedule.json API. See also: Scrapyd changes
    • Removed the LxmlItemLoader (experimental contrib which never graduated to main contrib)


    Scrapy 0.10

    The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

    New features and improvements

    • New Scrapy service called scrapyd for deploying Scrapy crawlers in production (#218) (documentation available)
    • Simplified Images pipeline usage which doesn’t require subclassing your own images pipeline now (#217)
    • Scrapy shell now shows the Scrapy log by default (#206)
    • Refactored execution queue in a common base code and pluggable backends called “spider queues” (#220)
    • New persistent spider queue (based on SQLite) (#198), available by default, which allows starting Scrapy in server mode and then scheduling spiders to run.
    • Added documentation for Scrapy command-line tool and all its available sub-commands. (documentation available)
    • Feed exporters with pluggable backends (#197) (documentation available)
    • Deferred signals (#193)
    • Added two new methods to item pipeline open_spider(), close_spider() with deferred support (#195)
    • Support for overriding default request headers per spider (#181)
    • Replaced default Spider Manager with one with similar functionality but not depending on Twisted Plugins (#186)
    • Split Debian package into two packages - the library and the service (#187)
    • Scrapy log refactoring (#188)
    • New extension for keeping persistent spider contexts among different runs (#203)
    • Added dont_redirect request.meta key for avoiding redirects (#233)
    • Added dont_retry request.meta key for avoiding retries (#234)
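
As an illustration of the new pipeline hooks in the list above, an item pipeline is a plain class, so the two methods can be sketched without any imports. CollectPipeline is a hypothetical example; it is synchronous and does not show the deferred support mentioned above.

```python
class CollectPipeline:
    """Hypothetical item pipeline that gathers items while a spider runs."""

    def open_spider(self, spider):
        # Called when the spider is opened: acquire resources here.
        self.items = []
        self.closed = False

    def close_spider(self, spider):
        # Called when the spider is closed: release resources here.
        self.closed = True

    def process_item(self, item, spider):
        # Uses the (item, spider) argument order introduced in this release.
        self.items.append(item)
        return item
```

Pairing the two hooks gives a pipeline one place to set up and one place to tear down per-spider state such as file handles or database connections.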


    Command-line tool changes

    • New scrapy command which replaces the old scrapy-ctl.py (#199)
      • there is only one global scrapy command now, instead of one scrapy-ctl.py per project
      • Added scrapy.bat script for running more conveniently from Windows
    • Added bash completion to command-line tool (#210)
    • Renamed command start to runserver (#209)


    API changes

    • url and body attributes of Request objects are now read-only (#230)
    • Request.copy() and Request.replace() now also copy their callback and errback attributes (#231)
    • Removed UrlFilterMiddleware from scrapy.contrib (already disabled by default)
    • Offsite middleware doesn’t filter out any request coming from a spider that doesn’t have an allowed_domains attribute (#225)
    • Removed Spider Manager load() method. Now spiders are loaded in the __init__ method itself.
    • Changes to Scrapy Manager (now called “Crawler”):
      • scrapy.core.manager.ScrapyManager class renamed to scrapy.crawler.Crawler
      • scrapy.core.manager.scrapymanager singleton moved to scrapy.project.crawler
    • Moved module: scrapy.contrib.spidermanager to scrapy.spidermanager
    • Spider Manager singleton moved from scrapy.spider.spiders to the spiders attribute of the scrapy.project.crawler singleton.
    • moved Stats Collector classes: (#204)
      • scrapy.stats.collector.StatsCollector to scrapy.statscol.StatsCollector
      • scrapy.stats.collector.SimpledbStatsCollector to scrapy.contrib.statscol.SimpledbStatsCollector
    • default per-command settings are now specified in the default_settings attribute of command object class (#201)
    • changed arguments of Item pipeline process_item() method from (spider, item) to (item, spider)
      • backward compatibility kept (with deprecation warning)
    • moved scrapy.core.signals module to scrapy.signals
      • backward compatibility kept (with deprecation warning)
    • moved scrapy.core.exceptions module to scrapy.exceptions
      • backward compatibility kept (with deprecation warning)
    • added handles_request() class method to BaseSpider
    • dropped scrapy.log.exc() function (use scrapy.log.err() instead)
    • dropped component argument of scrapy.log.msg() function
    • dropped scrapy.log.log_level attribute
    • Added from_settings() class methods to Spider Manager and Item Pipeline Manager
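The new process_item() argument order described above can be sketched as follows. This is an illustration under stated assumptions, not code from Scrapy: the pipeline class name is hypothetical, a plain dict stands in for an item, and a bare object stands in for a spider.

```python
class PriceDefaultPipeline:
    """Hypothetical item pipeline using the new (item, spider) argument order."""

    def process_item(self, item, spider):
        # The item now comes first and the spider second.
        item.setdefault("price", 0)
        return item


# Stand-ins for illustration only: a dict as the item, object() as the spider.
result = PriceDefaultPipeline().process_item({"title": "book"}, object())
```

A pipeline still written with the old (spider, item) order would fall under the backward-compatibility path mentioned above, which emits a deprecation warning.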


    Changes to settings

    • Added HTTPCACHE_IGNORE_SCHEMES setting to ignore certain schemes on HttpCacheMiddleware (#225)
    • Added SPIDER_QUEUE_CLASS setting which defines the spider queue to use (#220)
    • Added KEEP_ALIVE setting (#220)
    • Removed SERVICE_QUEUE setting (#220)
    • Removed COMMANDS_SETTINGS_MODULE setting (#201)
    • Renamed REQUEST_HANDLERS to DOWNLOAD_HANDLERS and made download handlers classes (instead of functions)


    Scrapy 0.9

    The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

    New features and improvements


    API changes


    Changes to default settings


    Scrapy 0.8

    The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.

    New features


    Backward-incompatible changes

    • Changed scrapy.utils.response.get_meta_refresh() signature (:rev:`1804`)
    • Removed deprecated scrapy.item.ScrapedItem class - use scrapy.item.Item instead (:rev:`1838`)
    • Removed deprecated scrapy.xpath module - use scrapy.selector instead. (:rev:`1836`)
    • Removed deprecated core.signals.domain_open signal - use core.signals.domain_opened instead (:rev:`1822`)
    • log.msg() now receives a spider argument (:rev:`1822`)
      • Old domain argument has been deprecated and will be removed in 0.9. For spiders, you should always use the spider argument and pass spider references. If you really want to pass a string, use the component argument instead.
    • Changed core signals domain_opened, domain_closed, domain_idle
    • Changed Item pipeline to use spiders instead of domains
      • The domain argument of the process_item() item pipeline method was changed to spider; the new signature is: process_item(spider, item) (:rev:`1827` | #105)
      • To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain.
    • Changed Stats API to use spiders instead of domains (:rev:`1849` | #113)
      • StatsCollector was changed to receive spider references (instead of domains) in its methods (set_value, inc_value, etc).
      • added StatsCollector.iter_spider_stats() method
      • removed StatsCollector.list_domains() method
      • Also, Stats signals were renamed and now pass around spider references (instead of domains). Here’s a summary of the changes:
      • To quickly port your code (to work with Scrapy 0.8) just use spider.domain_name where you previously used domain. spider_stats contains exactly the same data as domain_stats.
    • CloseDomain extension moved to scrapy.contrib.closespider.CloseSpider (:rev:`1833`)
      • Its settings were also renamed:
        • CLOSEDOMAIN_TIMEOUT to CLOSESPIDER_TIMEOUT
        • CLOSEDOMAIN_ITEMCOUNT to CLOSESPIDER_ITEMCOUNT
    • Removed deprecated SCRAPYSETTINGS_MODULE environment variable - use SCRAPY_SETTINGS_MODULE instead (:rev:`1840`)
    • Renamed setting: REQUESTS_PER_DOMAIN to CONCURRENT_REQUESTS_PER_SPIDER (:rev:`1830`, :rev:`1844`)
    • Renamed setting: CONCURRENT_DOMAINS to CONCURRENT_SPIDERS (:rev:`1830`)
    • Refactored HTTP Cache middleware
    • HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization which was removed. (:rev:`1843`)
    • Renamed exception: DontCloseDomain to DontCloseSpider (:rev:`1859` | #120)
    • Renamed extension: DelayedCloseDomain to SpiderCloseDelay (:rev:`1861` | #121)
    • Removed obsolete scrapy.utils.markup.remove_escape_chars function - use scrapy.utils.markup.replace_escape_chars instead (:rev:`1865`)
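When porting a project to Scrapy 0.8, the setting renames listed above could be applied mechanically. The sketch below assumes the settings are available as a plain dict; the RENAMED_SETTINGS table and the port_settings helper are hypothetical names for illustration, and only the old and new setting names are taken from the list above.

```python
# Old-to-new setting names, copied from the Scrapy 0.8 list above.
RENAMED_SETTINGS = {
    "CLOSEDOMAIN_TIMEOUT": "CLOSESPIDER_TIMEOUT",
    "CLOSEDOMAIN_ITEMCOUNT": "CLOSESPIDER_ITEMCOUNT",
    "REQUESTS_PER_DOMAIN": "CONCURRENT_REQUESTS_PER_SPIDER",
    "CONCURRENT_DOMAINS": "CONCURRENT_SPIDERS",
}


def port_settings(old):
    """Return a copy of a settings dict using the 0.8 names (hypothetical helper)."""
    # Names without an entry in the table are kept unchanged.
    return {RENAMED_SETTINGS.get(name, name): value for name, value in old.items()}


ported = port_settings(
    {"CONCURRENT_DOMAINS": 8, "CLOSEDOMAIN_TIMEOUT": 3600, "BOT_NAME": "mybot"}
)
```

Settings that were not renamed pass through untouched, so the helper can be run over a whole settings dict.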

    Scrapy 0.7

    First release of Scrapy.