Run modes¶
A diagram showing the architecture of the run modes:

Mode | Parent class | Components needed | Available backends |
---|---|---|---|
Single process | Backend | single process running the crawler | Memory, SQLAlchemy |
Distributed spiders | Backend | spiders and single db worker | Memory, SQLAlchemy |
Distributed backends | DistributedBackend | spiders, strategy worker(s) and db worker(s) | SQLAlchemy, HBase |
Single process¶
Frontera is instantiated in the same process as the fetcher (for example, in Scrapy). To achieve that, set the BACKEND setting to a storage backend that is a subclass of Backend. This run mode is suitable for a small number of documents and for applications that are not time-critical.
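As a rough illustration, a settings module for this mode can be as small as the sketch below. The dotted class path is an assumption based on the memory backends bundled with Frontera, so check it against the version you have installed.

```python
# settings.py: single-process run mode (illustrative sketch).
# Assumption: this dotted path names a Backend subclass shipped with Frontera;
# verify it against your installed version before relying on it.
BACKEND = 'frontera.contrib.backends.memory.FIFO'
```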
Distributed spiders¶
Spiders are distributed and the backend isn’t. The backend runs in the db worker and communicates with the spiders over the message bus.
- In spider processes, set BACKEND to MessageBusBackend.
- In the DB worker, BACKEND should point to a Backend subclass.
- Every spider process should have its own SPIDER_PARTITION_ID, starting from 0 to SPIDER_FEED_PARTITIONS.
- Both spiders and workers should have their MESSAGE_BUS setting set to the message bus class of your choice, along with any other implementation-dependent settings.
This mode is suitable for applications where documents must be fetched quickly but their number is relatively small.
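The settings for this mode could be generated per process roughly as follows. This is only a sketch: `spider_settings` and `db_worker_settings` are hypothetical helpers, the dotted class paths are assumptions to verify against your Frontera version, and the range check assumes that partition IDs are zero-based and strictly below SPIDER_FEED_PARTITIONS.

```python
# Illustrative sketch of per-process settings for the "distributed spiders" mode.
# All dotted class paths below are assumptions; verify them before use.
MESSAGE_BUS = 'frontera.contrib.messagebus.zeromq.MessageBus'
SPIDER_FEED_PARTITIONS = 2  # one partition per spider process


def spider_settings(partition_id):
    """Return settings for one spider process (hypothetical helper)."""
    # Assumption: valid IDs are 0 .. SPIDER_FEED_PARTITIONS - 1.
    if not 0 <= partition_id < SPIDER_FEED_PARTITIONS:
        raise ValueError(f'partition id out of range: {partition_id}')
    return {
        # Spiders talk to the backend only through the message bus.
        'BACKEND': 'frontera.contrib.backends.remote.messagebus.MessageBusBackend',
        'MESSAGE_BUS': MESSAGE_BUS,
        'SPIDER_FEED_PARTITIONS': SPIDER_FEED_PARTITIONS,
        'SPIDER_PARTITION_ID': partition_id,
    }


def db_worker_settings():
    """Return settings for the single DB worker (hypothetical helper)."""
    return {
        # The real storage backend (a Backend subclass) lives in the DB worker.
        'BACKEND': 'frontera.contrib.backends.sqlalchemy.FIFO',
        'MESSAGE_BUS': MESSAGE_BUS,
    }


# One settings dict per spider process, each with a distinct partition ID.
all_spiders = [spider_settings(i) for i in range(SPIDER_FEED_PARTITIONS)]
```

Keeping the message bus class in one shared constant makes it harder for spiders and the worker to drift apart in configuration.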
Distributed spiders and backend¶
Spiders and backend are distributed. The backend is divided into two parts: the strategy worker and the db worker. Each strategy worker instance is assigned its own part of the spider log.
- In spider processes, set BACKEND to MessageBusBackend.
- In DB and strategy (SW) workers, BACKEND should point to a DistributedBackend subclass, and the selected backend has to be configured.
- Every spider process should have its own SPIDER_PARTITION_ID, starting from 0 to SPIDER_FEED_PARTITIONS. The latter must also be accessible to all DB worker instances.
- Every SW worker process should have its own SCORING_PARTITION_ID, starting from 0 to SPIDER_LOG_PARTITIONS. The latter must be accessible to all SW worker instances.
- Both spiders and workers should have their MESSAGE_BUS setting set to the message bus class of your choice, and the selected message bus has to be configured.
Out of the box, only the Kafka message bus and the SQLAlchemy and HBase distributed backends can be used in this mode.
This mode is suitable for broad crawling and a large number of pages.
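The three roles in this mode could be configured along the lines of the sketch below. The helper functions are hypothetical, the dotted class paths are assumptions to check against your Frontera version, and the range check assumes zero-based partition IDs strictly below the corresponding partition count. Settings specific to the chosen backend and message bus (for example connection details) are left out.

```python
# Illustrative sketch of the "distributed spiders and backend" mode.
# All dotted class paths below are assumptions; verify them before use.
COMMON = {
    'MESSAGE_BUS': 'frontera.contrib.messagebus.kafkabus.MessageBus',
    'SPIDER_FEED_PARTITIONS': 2,  # one partition per spider process
    'SPIDER_LOG_PARTITIONS': 2,   # one partition per strategy worker
}
DISTRIBUTED_BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
MESSAGE_BUS_BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'


def _check(partition_id, count):
    # Assumption: IDs are zero-based and strictly below the partition count.
    if not 0 <= partition_id < count:
        raise ValueError(f'partition id {partition_id} not in [0, {count})')


def spider(partition_id):
    """Settings for one spider process (hypothetical helper)."""
    _check(partition_id, COMMON['SPIDER_FEED_PARTITIONS'])
    return dict(COMMON, BACKEND=MESSAGE_BUS_BACKEND,
                SPIDER_PARTITION_ID=partition_id)


def strategy_worker(partition_id):
    """Settings for one strategy worker process (hypothetical helper)."""
    _check(partition_id, COMMON['SPIDER_LOG_PARTITIONS'])
    return dict(COMMON, BACKEND=DISTRIBUTED_BACKEND,
                SCORING_PARTITION_ID=partition_id)


def db_worker():
    """Settings for a DB worker; it shares the partition counts in COMMON."""
    return dict(COMMON, BACKEND=DISTRIBUTED_BACKEND)
```

Each helper returns a fresh copy built from `COMMON`, so the partition counts and message bus stay identical across spiders and workers while only the role-specific keys differ.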