Frontera 0.8 documentation

Frontera is a web crawling toolbox for building crawlers of any scale and purpose. It includes:

a crawl frontier framework that manages when and what to crawl and checks whether the crawling goal has been accomplished,
workers, Scrapy wrappers, and data bus components to scale and distribute the crawler.

Frontera contains the components needed to build a fully operational web crawler with Scrapy. Although it was originally designed for Scrapy, it can also be used with any other crawling framework or system.

Introduction

The purpose of this chapter is to introduce you to the concepts behind Frontera so that you can get an idea of how it works and decide if it is suited to your needs.

Frontera at a glance: Understand what Frontera is and how it can help you.
Run modes: High-level architecture and Frontera run modes.
Quick start single process: Using Scrapy as a container for running Frontera.
Quick start distributed mode: With SQLite and ZeroMQ.
Cluster setup guide: Setting up a clustered version of Frontera on multiple machines with HBase and Kafka.

Using Frontera

Installation Guide: How to install Frontera, and its dependency options.
Crawling strategies: A list of built-in crawling strategies.
Frontier objects: Understand the classes used to represent requests and responses.
Middlewares: Filter or alter information for links and documents.
Canonical URL Solver: Identify and make use of the canonical URL of a document.
Backends: Built-in backends, and tips on implementing your own.
Message bus: Built-in message bus reference.
Writing custom crawling strategy: Implementing your own crawling strategy.
Using the Frontier with Scrapy: Learn how to use Frontera with Scrapy.
Settings: Settings reference.

Advanced usage

What is a Crawl Frontier?: Learn Crawl Frontier theory.
Graph Manager: Define fake crawls of websites to test your frontier.
Recording a Scrapy crawl: Create Scrapy crawl recordings and reproduce them later.
Fine tuning of Frontera cluster: Cluster deployment and fine tuning information.
DNS Service: A few words about DNS service setup.

Developer documentation

Architecture overview: See how Frontera works and its different components.
Frontera API: Learn how to use the frontier.
Using the Frontier with Requests: Learn how to use Frontera with Requests.
Examples: Some example projects and scripts using Frontera.
Tests: How to run and write Frontera tests.
Logging: A list of loggers for use with Python's native logging system.
Testing a Frontier: Test your frontier in an easy way.
Contribution guidelines: How to contribute.
Glossary: Glossary of terms.