Production broad crawling

These are the topics you need to consider when deploying Frontera-based broad crawler in production system.

DNS Service

Along with what was mentioned in Prerequisites you may need also a dedicated DNS Service with caching. Especially, if your crawler is expected to generate substantial number of DNS queries. It is true for breadth-first crawling, or any other strategies, implying accessing large number of websites, within short period of time.

Because of huge load DNS service may get blocked by your network provider eventually. We recommend to setup a dedicated DNS instance locally on every spider machine with upstream using massive DNS caches like OpenDNS or Verizon.

Choose message bus

There are two options available:

  • Kafka, requires properly configured partitions,
  • ZeroMQ (default), requires broker process.

Configuring Kafka

The main thing to do here is to set the number of partitions for OUTGOING_TOPIC equal to the number of spider instances and for INCOMING_TOPIC equal to number of strategy worker instances. For other topics it makes sense to set more than one partition to better distribute the load across Kafka cluster.

Kafka throughput is key performance issue, make sure that Kafka brokers have enough IOPS, and monitor the network load.

Configuring ZeroMQ

ZeroMQ requires almost no configuration except hostname and base port where to bind it’s sockets. Please see ZMQ_HOSTNAME and ZMQ_BASE_PORT settings for more detail. ZeroMQ also requires distributed frontera broker process running and accessible to connect. See Start cluster.

Configure Frontera workers

There are two type of workers: DB and Strategy.

DB worker is doing three tasks in particular:

  • Reading spider log stream and updates metadata in DB,
  • Consult lags in message bus, gets new batches and pushes them to spider feed,
  • Reads scoring log stream and updates DB with new score and schedule URLs to download if needed.

Strategy worker is reading spider log, calculating score, deciding if URL needs to be crawled and pushes update_score events to scoring log.

Before setting it up you have to decide how many spider instances you need. One spider is able to download and parse about 700 pages/minute in average. Therefore if you want to fetch 1K per second you probably need about 10 spiders. For each 4 spiders you would need one pair of workers (strategy and DB). If your strategy worker is lightweight (not processing content for example) then 1 strategy worker per 15 spider instances could be enough.

Your spider log stream should have as much partitions as strategy workers you need. Each strategy worker is assigned to specific partition using option SCORING_PARTITION_ID.

Your spider feed stream, containing new batches should have as much partitions as spiders you will have in your cluster.

Now, let’s create a Frontera workers settings file under frontera subfolder and name it

from frontera.settings.default_settings import MIDDLEWARES

MAX_NEXT_REQUESTS = 128     # Size of batch to generate per partition, should be consistent with
                            # CONCURRENT_REQUESTS in spider. General recommendation is 5-7x CONCURRENT_REQUESTS

# Url storage
BACKEND = 'distributed_frontera.contrib.backends.hbase.HBaseBackend'
HBASE_THRIFT_HOST = 'localhost'


# Logging

You should add there settings related to message bus you have chosen. Default is ZeroMQ, running on local host.

Configure Frontera spiders

Next step is to create own Frontera settings file for every spider instance. Often it’s a good idea to name settings file according to partition ids assigned. E.g.

from distributed_frontera.settings.default_settings import MIDDLEWARES

MAX_NEXT_REQUESTS = 256     # Should be consistent with MAX_NEXT_REQUESTS set for Frontera worker


# Crawl frontier backend
BACKEND = 'distributed_frontera.backends.remote.KafkaOverusedBackend'
SPIDER_PARTITION_ID = 0                 # Partition ID assigned

# Logging

Again, add message bus related options.

You should end up having as much settings files as spider instances your system will have. You can also store permanent options in common module, and import it’s contents from each instance-specific config file.

It is recommended to run spiders on a dedicated machines, they quite likely to consume lots of CPU and network bandwidth.

The same thing have to be done for strategy workers, each strategy worker should have it’s own partition id (see SCORING_PARTITION_ID) assigned in config files named

Starting the cluster

First, let’s start storage worker. It’s recommended to dedicate one worker instance for new batches generation and others for the rest. Batch generation instance isn’t much dependent on the count of spider instances, but saving to storage is. Here is how to run all in the same process:

# start DB worker, enabling batch generation, DB saving and scoring log consumption
$ python -m frontera.worker.db --config frontera.worker_settings

Next, let’s start strategy worker with sample strategy for crawling the internet in Breadth-first manner.:

$ python -m frontera.worker.strategy --config frontera.strategy0 --strategy frontera.worker.strategies.bfs
$ python -m frontera.worker.strategy --config frontera.strategy1 --strategy frontera.worker.strategies.bfs
$ python -m frontera.worker.strategy --config frontera.strategyN --strategy frontera.worker.strategies.bfs

You should notice that all processes are writing messages to the output. It’s ok if nothing is written in streams, because of absence of seed URLs in the system.

Let’s put our seeds in text file, one URL per line. Starting the spiders::

$ scrapy crawl tutorial -L INFO -s FRONTERA_SETTINGS=frontera.settings0 -s SEEDS_SOURCE = 'seeds.txt'
$ scrapy crawl tutorial -L INFO -s FRONTERA_SETTINGS=frontera.settings1
$ scrapy crawl tutorial -L INFO -s FRONTERA_SETTINGS=frontera.settings2
$ scrapy crawl tutorial -L INFO -s FRONTERA_SETTINGS=frontera.settings3
$ scrapy crawl tutorial -L INFO -s FRONTERA_SETTINGS=frontera.settingsN

You should end up with N spider processes running. Each should read it’s own Frontera config, and first one is using SEEDS_SOURCE variable to pass seeds to Frontera cluster.

After some time seeds will pass the streams and get scheduled for downloading by workers. Crawler is bootstrapped.