Crawl Frontier at a glance
Crawl Frontier is an application framework meant to be used as part of a crawling system, allowing you to easily define and manage the tasks related to a crawl frontier.
Although it was originally designed for Scrapy, it offers generic frontier functionality and can also be used with any other crawling framework or system.
This document introduces the concepts behind Crawl Frontier so that you can get an idea of how it works and decide whether it suits your needs.
1. Create your crawler
Create your Scrapy project as you usually do. Enter a directory where you’d like to store your code and then run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These are basically:
- scrapy.cfg: the project configuration file
- tutorial/: the project’s Python module; you’ll later import your code from here.
- tutorial/items.py: the project’s items file.
- tutorial/pipelines.py: the project’s pipelines file.
- tutorial/settings.py: the project’s settings file.
- tutorial/spiders/: a directory where you’ll later put your spiders.
2. Integrate your crawler with the frontier
Add the Scrapy Crawl Frontier middlewares to your settings:
SPIDER_MIDDLEWARES.update({
    'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
    'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierDownloaderMiddleware': 1000,
})
Create a Crawl Frontier settings.py file and add it to your Scrapy settings:
FRONTIER_SETTINGS = 'tutorial/frontier/settings.py'
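The `.update()` calls above only work if `SPIDER_MIDDLEWARES` and `DOWNLOADER_MIDDLEWARES` are already defined in your Scrapy settings module. As a hedged sketch, assuming your generated `tutorial/settings.py` does not define them yet, the relevant part of the file could look like this. The middleware paths, priorities, and frontier settings path are the ones shown above; the empty initial dicts are an assumption about your project.

```python
# tutorial/settings.py (excerpt): a sketch, not a canonical settings file.

# Assumption: these dicts are not defined earlier in the file, so create them
# before calling .update(); otherwise Python raises NameError.
SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}

# Register the Crawl Frontier middlewares with the priorities given above.
SPIDER_MIDDLEWARES.update({
    'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
    'crawlfrontier.contrib.scrapy.middlewares.frontier.CrawlFrontierDownloaderMiddleware': 1000,
})

# Path to the Crawl Frontier settings file, as given above.
FRONTIER_SETTINGS = 'tutorial/frontier/settings.py'
```

If your settings module already defines either dict, drop the corresponding empty assignment so that existing entries are not discarded.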
3. Choose your backend
In your frontier settings file, select a built-in backend, such as the in-memory BFS backend:
BACKEND = 'crawlfrontier.contrib.backends.memory.heapq.BFS'
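To give an intuition for what a BFS ordering means for a frontier (assuming BFS here stands for breadth-first search), the following is a small self-contained sketch. It is not Crawl Frontier's backend implementation: `ToyBFSFrontier`, `add_page`, and `get_next_pages` are hypothetical names chosen for illustration. The sketch keeps pending URLs in a `heapq` priority queue keyed by link depth, so shallower pages are handed to the crawler first.

```python
import heapq
import itertools


class ToyBFSFrontier:
    """Hypothetical in-memory frontier that returns the shallowest URLs first."""

    def __init__(self):
        self._heap = []                    # entries: (depth, insertion order, url)
        self._seen = set()                 # URLs already scheduled, to skip duplicates
        self._counter = itertools.count()  # tie-breaker: keeps insertion order per depth

    def add_page(self, url, depth):
        """Schedule a URL once; return False if it was already scheduled."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (depth, next(self._counter), url))
        return True

    def get_next_pages(self, max_pages):
        """Pop up to max_pages URLs, lowest depth first."""
        batch = []
        while self._heap and len(batch) < max_pages:
            _depth, _order, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch


frontier = ToyBFSFrontier()
frontier.add_page('http://example.com/', depth=0)
frontier.add_page('http://example.com/a/b', depth=2)
frontier.add_page('http://example.com/a', depth=1)
frontier.add_page('http://example.com/a', depth=1)  # duplicate, ignored

crawl_order = frontier.get_next_pages(max_pages=10)
print(crawl_order)
```

Negating the depth in the heap key would return the deepest pages first instead, which is one way to picture how swapping the backend changes the crawl policy.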
4. Run the spider
Run your Scrapy spider as usual from the command line:
scrapy crawl myspider
And that’s it! Your spider is now running integrated with Crawl Frontier.
What else?
You’ve seen a simple example of how to use Crawl Frontier with Scrapy, but this only scratches the surface. Crawl Frontier provides many powerful features for making frontier management easy and efficient, such as:
- Easy built-in integration with Scrapy and any other crawler through its API.
- Creating different crawling logic/policies by defining your own backend.
- Built-in support for database storage for crawled pages.
- Support for extending Crawl Frontier by plugging your own functionality using middlewares.
- Built-in middlewares for:
    - Extracting domain info from page URLs.
    - Creating unique fingerprints for page URLs and domain names.
- Creating fake sitemaps and reproducing crawls without a crawler using the Graph Manager.
- Tools for easy frontier testing.
- Recording your Scrapy crawls and using them later for frontier testing.
- A logging facility that you can hook into for catching errors and debugging your frontiers.
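The domain and fingerprint middlewares listed above can be pictured with a small standard-library sketch. This is not the code of Crawl Frontier's built-in middlewares: the function names `extract_domain_info` and `fingerprint` are hypothetical, and SHA-1 is assumed here only as an example hash.

```python
import hashlib
from urllib.parse import urlparse


def extract_domain_info(url):
    """Return basic domain info parsed from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    return {
        'scheme': parsed.scheme,
        'netloc': parsed.netloc,      # host plus optional port, case preserved
        'hostname': parsed.hostname,  # lower-cased host without the port
    }


def fingerprint(value):
    """Return a hex SHA-1 digest usable as a fixed-length key for a string."""
    return hashlib.sha1(value.encode('utf-8')).hexdigest()


url = 'http://www.Example.com:8080/some/page?id=1'
domain = extract_domain_info(url)

# One fingerprint for the page URL and one for its domain name.
page_fp = fingerprint(url)
domain_fp = fingerprint(domain['hostname'])

print(domain['hostname'])
print(page_fp)
print(domain_fp)
```

Fixed-length fingerprints of this kind are convenient as lookup keys when a frontier needs to check whether a page or domain has been seen before.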
What’s next?
The next steps are to install Crawl Frontier and read the architecture overview and API docs. Thanks for your interest!