Using the Frontier with Scrapy¶
Using Crawl Frontier is quite easy, it includes a set of Scrapy middlewares that encapsulates frontier usage and can be easily configured using Scrapy settings.
Activating the frontier¶
The frontier uses 2 different middlewares: CrawlFrontierSpiderMiddleware and CrawlFrontierDownloaderMiddleware.
To activate the frontier in your Scrapy project, just add them to the SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES settings:
SPIDER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierDownloaderMiddleware': 1000,
})
Create a Crawl Frontier settings.py file and add it to your Scrapy settings:
FRONTIER_SETTINGS = 'tutorial/frontier/settings.py'
Organizing files¶
When using frontier with a Scrapy project, we propose the following directory structure:
my_scrapy_project/
my_scrapy_project/
frontier/
__init__.py
settings.py
middlewares.py
backends.py
spiders/
...
__init__.py
settings.py
scrapy.cfg
These are basically:
- my_scrapy_project/frontier/settings.py: the frontier settings file.
- my_scrapy_project/frontier/middlewares.py: the middlewares used by the frontier.
- my_scrapy_project/frontier/backends.py: the backend(s) used by the frontier.
- my_scrapy_project/spiders: the Scrapy spiders folder
- my_scrapy_project/settings.py: the Scrapy settings file
- scrapy.cfg: the Scrapy config file
Running the Crawl¶
Just run your Scrapy spider as usual from the command line:
scrapy crawl myspider
In case you need to disable frontier, you can do it by overriding the FRONTIER_ENABLED setting:
scrapy crawl myspider -s FRONTIER_ENABLED=False
Frontier Scrapy settings¶
Here’s a list of all available Crawl Frontier Scrapy settings, in alphabetical order, along with their default values and the scope where they apply:
FRONTIER_SCHEDULER_CONCURRENT_REQUESTS¶
Default: 256
Number of concurrent requests that the middleware will maintain while asking for next pages.
FRONTIER_SCHEDULER_INTERVAL¶
Default: 0.01
Interval of number of requests check in seconds. Indicates how often the frontier will be asked for new pages if there is gap for new requests.