Recording a Scrapy crawl

Scrapy Recorder is a set of Scrapy middlewares that let you record a Scrapy crawl and store it in a Graph Manager.

This is useful for running frontier tests without having to crawl the entire site again, or even without using Scrapy at all.

Activating the recorder

The recorder uses two middlewares: CrawlRecorderSpiderMiddleware and CrawlRecorderDownloaderMiddleware.

To activate recording in your Scrapy project, add them to the SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES settings:

SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.recording.CrawlRecorderSpiderMiddleware': 1000,
})

DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.recording.CrawlRecorderDownloaderMiddleware': 1000,
})
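
The update calls above only work if both dictionaries already exist in your settings module. As a minimal sketch, a settings.py fragment could define them first. The empty starting dictionaries and the helper variable recording_module are assumptions for illustration; the middleware paths and the priority of 1000 are the ones shown above.

```python
# settings.py (sketch)

# Assumption: the project defines no custom middlewares yet, so both
# settings start out as empty dicts.
SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}

# Hypothetical helper variable; lowercase so it is not read as a setting.
recording_module = 'frontera.contrib.scrapy.middlewares.recording'

SPIDER_MIDDLEWARES.update({
    recording_module + '.CrawlRecorderSpiderMiddleware': 1000,
})

DOWNLOADER_MIDDLEWARES.update({
    recording_module + '.CrawlRecorderDownloaderMiddleware': 1000,
})
```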

Choosing your storage engine

The recorder uses Graph Manager internally to store crawled pages, so you can choose between different storage engines.

You can set the storage engine with the RECORDER_STORAGE_ENGINE setting:

RECORDER_STORAGE_ENGINE = 'sqlite:///my_record.db'
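
If you keep one recording per spider, the value can be built rather than hard-coded. The following is only a sketch: it assumes the setting accepts any URL of the sqlite:///<path> form shown above, and both the recording_url helper and the recordings directory are hypothetical names (the directory would need to exist before the crawl starts).

```python
from pathlib import PurePosixPath


def recording_url(directory, spider_name):
    """Build a SQLite URL of the form 'sqlite:///<directory>/<spider_name>.db'."""
    db_path = PurePosixPath(directory) / (spider_name + '.db')
    return 'sqlite:///' + str(db_path)


# Hypothetical layout: one database file per spider under ./recordings
RECORDER_STORAGE_ENGINE = recording_url('recordings', 'myspider')
```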

You can also choose to drop the database tables or only clear their content with these settings:

RECORDER_STORAGE_DROP_ALL_TABLES = True
RECORDER_STORAGE_CLEAR_CONTENT = True

Running the crawl

Just run your Scrapy spider as usual from the command line:

scrapy crawl myspider

Once it’s finished you should have the recording available and ready for use.
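
With a SQLite storage engine, one quick way to confirm that a recording was written is to open the file with Python's standard sqlite3 module and list its tables. This sketch makes no assumption about the table names Graph Manager creates; list_tables is a hypothetical helper, and my_record.db is the file name from the storage engine example above.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist,
    # in which case the result is an empty list.
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Example: list_tables('my_record.db')
```

An empty list after a finished crawl suggests that recording was disabled or that the crawl wrote to a different file.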

To disable recording, override the RECORDER_ENABLED setting:

scrapy crawl myspider -s RECORDER_ENABLED=False
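
When crawls are launched from a script, the same override can be passed programmatically. The sketch below builds the command line shown above with one -s NAME=VALUE pair per setting and runs it with subprocess; crawl_command and run_crawl are hypothetical helper names, and the scrapy executable is assumed to be on the PATH.

```python
import subprocess


def crawl_command(spider_name, **overrides):
    """Build a `scrapy crawl` argument list with one `-s NAME=VALUE` per override."""
    command = ['scrapy', 'crawl', spider_name]
    # Sort the overrides so the generated command is deterministic.
    for name, value in sorted(overrides.items()):
        command += ['-s', '{}={}'.format(name, value)]
    return command


def run_crawl(spider_name, **overrides):
    """Run the crawl and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(crawl_command(spider_name, **overrides), check=True)


# Example: run_crawl('myspider', RECORDER_ENABLED=False)
```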

Recorder settings

Here’s a list of all available Scrapy Recorder settings, in alphabetical order, along with their default values.

RECORDER_ENABLED

Default: True

Activates or deactivates the recording middlewares.

RECORDER_STORAGE_CLEAR_CONTENT

Default: True

Deletes the table content from the storage database in Graph Manager.

RECORDER_STORAGE_DROP_ALL_TABLES

Default: True

Drops the storage database tables in Graph Manager.

RECORDER_STORAGE_ENGINE

Default: None

Sets the Graph Manager storage engine used to store the recording.