Coroutines

New in version 2.0.


Scrapy has partial support for the coroutine syntax.

Supported callables

The following callables may be defined as coroutines using async def, and hence use coroutine syntax (e.g. await, async for, async with):

  • Request callbacks.
  • The process_item() method of item pipelines.
  • The process_request(), process_response() and process_exception() methods of downloader middlewares.
  • Signal handlers that support deferreds.

Usage

There are several use cases for coroutines in Scrapy. Code that would return Deferreds when written for previous Scrapy versions, such as downloader middlewares and signal handlers, can be rewritten to be shorter and cleaner:

from itemadapter import ItemAdapter

class DbPipeline:
    def _update_item(self, data, item):
        adapter = ItemAdapter(item)
        adapter['field'] = data
        return item

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # "db" stands in for some Deferred-based database client.
        dfd = db.get_some_data(adapter['id'])
        dfd.addCallback(self._update_item, item)
        return dfd

becomes:

from itemadapter import ItemAdapter

class DbPipeline:
    async def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Await the Deferred directly instead of attaching a callback.
        adapter['field'] = await db.get_some_data(adapter['id'])
        return item

Coroutines may be used to call asynchronous code. This includes other coroutines, functions that return Deferreds, and functions that return awaitable objects such as asyncio.Future. This means you can use many useful Python libraries that provide such code:

import treq
from scrapy import Spider

class MySpiderDeferred(Spider):
    # ...
    async def parse(self, response):
        additional_response = await treq.get('https://additional.url')
        additional_data = await treq.content(additional_response)
        # ... use response and additional_data to yield items and requests

import aiohttp
from scrapy import Spider

class MySpiderAsyncio(Spider):
    # ...
    async def parse(self, response):
        async with aiohttp.ClientSession() as session:
            async with session.get('https://additional.url') as additional_response:
                additional_data = await additional_response.text()
        # ... use response and additional_data to yield items and requests

Note

Many libraries that use coroutines, such as aio-libs, require the asyncio event loop; to use them you need to enable asyncio support in Scrapy.
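
As a minimal sketch of what enabling asyncio support looks like, you switch Scrapy to the asyncio-based Twisted reactor via the TWISTED_REACTOR setting in your project settings (the surrounding settings module is assumed project boilerplate):

# settings.py
# Install the asyncio-based reactor so asyncio libraries can be used.
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'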


Note

If you want to await on Deferreds while using the asyncio reactor, you need to wrap them.
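
For example, here is a hedged sketch of awaiting a Deferred under the asyncio reactor using scrapy.utils.defer.maybe_deferred_to_future, which is available in Scrapy 2.6.0 and later (in older versions, Twisted's Deferred.asFuture() can serve the same purpose; "db" is again a placeholder for a Deferred-based client):

from scrapy.utils.defer import maybe_deferred_to_future

class DbPipeline:
    async def process_item(self, item, spider):
        # db.get_some_data() returns a Deferred; wrap it so it can be
        # awaited while the asyncio reactor is installed.
        data = await maybe_deferred_to_future(db.get_some_data(item['id']))
        item['field'] = data
        return item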


Common use cases for asynchronous code include:

  • requesting data from websites, databases and other services (in callbacks, pipelines and middlewares);
  • storing data in databases (in pipelines and middlewares);
  • delaying the spider initialization until some external event (in the spider_opened signal handler; see the sketch after this list);
  • calling asynchronous Scrapy methods like ExecutionEngine.download (see the screenshot pipeline example).
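
As a sketch of the spider_opened case, a spider can connect a coroutine as a signal handler and await some external setup before the crawl proceeds; open_connection_pool() below is a hypothetical placeholder for whatever asynchronous initialization you need:

from scrapy import Spider, signals

class MySpider(Spider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_spider_opened, signal=signals.spider_opened)
        return spider

    async def on_spider_opened(self, spider):
        # Hypothetical asynchronous setup; the engine waits for this
        # handler to finish before the crawl starts.
        self.db = await open_connection_pool()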