Using the Frontier with Scrapy
Using Frontera with Scrapy is straightforward: it ships with a set of Scrapy middlewares and a Scrapy scheduler that encapsulate Frontera usage and can be configured through Scrapy settings.
Activating the frontier
Frontera uses two middlewares, SchedulerSpiderMiddleware and SchedulerDownloaderMiddleware, and its own scheduler, FronteraScheduler.
To activate Frontera in your Scrapy project, add them to the SPIDER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES and SCHEDULER settings:
# The .update() calls assume SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES
# are already defined as dicts earlier in the Scrapy settings file.
SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
Create a Frontera settings.py file and point to it from your Scrapy settings with FRONTERA_SETTINGS:
FRONTERA_SETTINGS = 'tutorial/frontera/settings.py'
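As a rough idea of what the Frontera settings file might contain, here is a minimal sketch. The setting names (BACKEND, MAX_REQUESTS, MAX_NEXT_REQUESTS), the backend path and the values are assumptions based on commonly documented Frontera settings; they are not taken from this page and may differ between releases, so check them against the settings reference for your installed version.

```python
# my_scrapy_project/frontera/settings.py (illustrative sketch only)

# Backend that stores requests and decides crawl order.
# Assumption: this dotted path differs between Frontera releases.
BACKEND = 'frontera.contrib.backends.memory.FIFO'

# Stop the crawl after this many requests (assumed name, arbitrary value).
MAX_REQUESTS = 2000

# Upper bound on requests handed to Scrapy per batch (assumed name, arbitrary value).
MAX_NEXT_REQUESTS = 10
```

Keeping these values in a separate file means the frontier can be tuned without touching the Scrapy settings.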
Organizing files
When using Frontera with a Scrapy project, we suggest the following directory structure:
my_scrapy_project/
    my_scrapy_project/
        frontera/
            __init__.py
            settings.py
            middlewares.py
            backends.py
        spiders/
            ...
        __init__.py
        settings.py
    scrapy.cfg
The main files and folders are:
- my_scrapy_project/frontera/settings.py: the Frontera settings file.
- my_scrapy_project/frontera/middlewares.py: the middlewares used by Frontera.
- my_scrapy_project/frontera/backends.py: the backend(s) used by Frontera.
- my_scrapy_project/spiders: the Scrapy spiders folder.
- my_scrapy_project/settings.py: the Scrapy settings file.
- scrapy.cfg: the Scrapy config file.
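If you want to create this layout in one go, a small helper along the following lines could do it. This is only a sketch: the `scaffold` function and the `LAYOUT` list are hypothetical names introduced here, not part of Frontera or Scrapy, and the files are created empty. The contents of the spiders folder are left out because the layout does not list them.

```python
from pathlib import Path

# Relative paths from the proposed layout; entries ending in "/" are directories.
LAYOUT = [
    "scrapy.cfg",
    "my_scrapy_project/__init__.py",
    "my_scrapy_project/settings.py",
    "my_scrapy_project/frontera/__init__.py",
    "my_scrapy_project/frontera/settings.py",
    "my_scrapy_project/frontera/middlewares.py",
    "my_scrapy_project/frontera/backends.py",
    "my_scrapy_project/spiders/",
]


def scaffold(root):
    """Create the layout under root; return the files that were newly created."""
    root = Path(root)
    created = []
    for rel in LAYOUT:
        path = root / rel
        if rel.endswith("/"):
            # Directory entry: create it (and any parents) if missing.
            path.mkdir(parents=True, exist_ok=True)
            continue
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            # Existing files are never overwritten.
            path.touch()
            created.append(path)
    return created
```

Calling `scaffold("my_scrapy_project")` would create the outer project folder with the inner package inside it. Running it a second time creates nothing, since existing files are skipped.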
Running the Crawl
Just run your Scrapy spider as usual from the command line:
scrapy crawl myspider