Backends

A DistributedBackend is used to separate the higher level code of the crawling strategy from the low level storage API. Queue, Metadata, States and DomainMetadata are inner components of the DistributedBackend.

The DistributedBackend is meant to instantiate and hold references to objects of the above mentioned classes. Frontera is bundled with database and in-memory implementations of Queue, Metadata, States and DomainMetadata, which can be combined in your custom backends or used standalone by directly instantiating a specific variant of FrontierManager.

DistributedBackend methods are called by the FrontierManager after Middleware, using hooks for Request and Response processing according to the frontier data flow.

Unlike Middleware, which can have many different instances activated, only one DistributedBackend can be used per frontier.

Activating a backend

To activate a specific backend, set it through the BACKEND setting.

Here’s an example:

BACKEND = 'frontera.contrib.backends.memory.MemoryDistributedBackend'

Keep in mind that some backends may require additional configuration through particular settings. See the backends documentation for more info.

Writing your own backend

Each backend component is a single Python class that inherits from DistributedBackend and uses one or more of Queue, Metadata, States and DomainMetadata.

The FrontierManager will communicate with the active backend through the methods described below.

class frontera.core.components.Backend

Interface definition for frontier backend.

Methods

frontier_start()

Called when the frontier starts, see starting/stopping the frontier.

Returns:None.
frontier_stop()

Called when the frontier stops, see starting/stopping the frontier.

Returns:None.
finished()

Quick check if crawling is finished. This is called frequently, so make sure calls are lightweight.

Returns:boolean
page_crawled(response)

This method is called every time a page has been crawled.

Parameters:response (object) – The Response object for the crawled page.
Returns:None.
request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • request (object) – The Request object that failed during crawling.
  • error (string) – A string identifier for the error.
Returns:

None.

get_next_requests(max_n_requests, **kwargs)

Returns a list of next requests to be crawled.

Parameters:
  • max_n_requests (int) – Maximum number of requests to be returned by this method.
  • kwargs (dict) – Arbitrary parameters from the downloader component.
Returns:

list of Request objects.

Class Methods

classmethod from_manager(manager)

Class method called from the FrontierManager, passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    return cls(settings=manager.settings)

Properties

queue
Returns:associated Queue object
states
Returns:associated States object
metadata
Returns:associated Metadata object
class frontera.core.components.DistributedBackend

Interface definition for a distributed frontier backend. It is meant to be used in the strategy worker and the DB worker.

Inherits all methods of Backend, and has two more class methods, which are called during strategy worker and DB worker instantiation.

classmethod DistributedBackend.strategy_worker(manager)
classmethod DistributedBackend.db_worker(manager)
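
For illustration, a minimal distributed backend wiring the bundled in-memory components together might look like the sketch below. The class name, the constructor arguments of the bundled components and the settings used here are assumptions; check the source of the memory backend for the exact signatures:

from frontera.core.components import DistributedBackend
from frontera.contrib.backends.memory import MemoryMetadata, MemoryQueue, MemoryStates


class MyDistributedBackend(DistributedBackend):
    def __init__(self, manager):
        settings = manager.settings
        self._metadata = MemoryMetadata()
        self._states = MemoryStates(settings.get('STATE_CACHE_SIZE'))      # assumed argument
        self._queue = MemoryQueue(settings.get('SPIDER_FEED_PARTITIONS'))  # assumed argument
        self._domain_metadata = dict()  # the memory backend uses a native dict

    @classmethod
    def from_manager(cls, manager):
        return cls(manager)

    @classmethod
    def strategy_worker(cls, manager):
        # Called when the backend is instantiated inside the strategy worker.
        return cls(manager)

    @classmethod
    def db_worker(cls, manager):
        # Called when the backend is instantiated inside the DB worker.
        return cls(manager)

    @property
    def queue(self):
        return self._queue

    @property
    def states(self):
        return self._states

    @property
    def metadata(self):
        return self._metadata

    def frontier_stop(self):
        self._states.flush()

    # Remaining Backend methods (frontier_start, page_crawled, request_error,
    # get_next_requests, finished) are omitted for brevity.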

A backend should communicate with low-level storage by means of the following classes:

Metadata

Is used to store the contents of the crawl.

class frontera.core.components.Metadata

Interface definition for a frontier metadata class. This class is responsible for storing document metadata, including content, and is optimized for a write-only data flow.

Methods

request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • request (object) – The Request object that failed during crawling.
  • error (string) – A string identifier for the error.
page_crawled(response)

This method is called every time a page has been crawled.

Parameters:response (object) – The Response object for the crawled page.

Known implementations are: MemoryMetadata and sqlalchemy.components.Metadata.
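
As a rough illustration, a dict-based Metadata (not the bundled MemoryMetadata) could implement the two methods above as follows; the meta[b'fingerprint'] key is an assumption about how requests are fingerprinted:

from frontera.core.components import Metadata


class DictMetadata(Metadata):
    def __init__(self):
        self._pages = {}

    def page_crawled(self, response):
        # Store the crawled page, keyed by its fingerprint.
        self._pages[response.meta[b'fingerprint']] = response

    def request_error(self, page, error):
        # Remember the failed request together with the error identifier.
        self._pages[page.meta[b'fingerprint']] = (page, error)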

Queue

Is a priority queue used to persist requests scheduled for crawling.

class frontera.core.components.Queue

Interface definition for a frontier queue class. The queue has priorities and partitions.

Methods

get_next_requests(max_n_requests, partition_id, **kwargs)

Returns a list of next requests to be crawled, and excludes them from internal storage.

Parameters:
  • max_n_requests (int) – Maximum number of requests to be returned by this method.
  • partition_id (int) – Id of the partition to get requests from.
  • kwargs (dict) – Arbitrary parameters from the downloader component.
Returns:

list of Request objects.

schedule(batch)

Schedules new documents for download from batch, and updates their scores in metadata.

Parameters:batch – a list of tuples (fingerprint, score, request, schedule); if schedule is True, the document is scheduled for download, if False, only its score is updated in metadata.
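
For illustration, a batch passed to schedule() could look like this (the fingerprints, scores and request objects are made up):

batch = [
    ('1be68ff556fd0bbe5802d1a100850da29f7f15b1', 0.8, request1, True),   # schedule for download
    ('6e0a4e6a1df69bb1b0c837fdf0ab1329b937c5f2', 0.1, request2, False),  # only update the score
]
queue.schedule(batch)
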
count()

Returns count of documents in the queue.

Returns:int

Known implementations are: MemoryQueue and sqlalchemy.components.Queue.
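
On the consuming side, the DB worker drains the queue per partition, roughly as in this sketch (the batch size, partition id and the send_to_spider_feed helper are illustrative):

# Pull up to 256 requests from partition 0 and remove them from the queue.
requests = queue.get_next_requests(256, 0)
for request in requests:
    send_to_spider_feed(request)  # hypothetical: hand the request over to a spider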

States

Is a storage used for checking and storing link states, where a state is a short integer representing one of the states described in frontera.core.components.States.

class frontera.core.components.States

Interface definition for a link states management class. This class is responsible for providing the actual link state and persisting state changes in a batch-oriented manner.

Methods

update_cache(objs)

Reads the state from the meta['state'] field of each request in objs and stores it in the internal cache.

Parameters:objs – list or tuple of Request objects.
set_states(objs)

Sets the meta['state'] field from the cache for every request in objs.

Parameters:objs – list or tuple of Request objects.
flush()

Flushes internal cache to storage.

fetch(fingerprints)

Fetches states from persistent storage into the internal cache.

Parameters:fingerprints – a list of document fingerprints whose states should be read.

Known implementations are: MemoryStates and sqlalchemy.components.States.
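
The strategy worker typically uses States in a fetch/set/update/flush cycle, roughly as sketched below (the requests variable and the run_crawling_strategy helper are illustrative):

fingerprints = [r.meta[b'fingerprint'] for r in requests]
states.fetch(fingerprints)       # load states from storage into the internal cache
states.set_states(requests)      # copy cached states into each request's meta[b'state']
run_crawling_strategy(requests)  # hypothetical: the strategy reads and updates meta[b'state']
states.update_cache(requests)    # write the updated states back into the cache
states.flush()                   # persist the cache to storage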

DomainMetadata

Is used to store per-domain flags, counters or even robots.txt contents, to help the crawling strategy maintain features such as a per-domain limit on the number of crawled pages or automatic banning.

class frontera.core.components.DomainMetadata

Interface definition for a domain metadata storage. Its main purpose is to store per-domain metadata using Python-friendly structures. It is meant to be used by the crawling strategy to store counters and flags in the low-level facilities provided by the backend.

Methods

__setitem__(key, value)

Puts the (key, value) pair in storage.

Parameters:
  • key – str
  • value – Any
__getitem__(key)

Retrieves the value associated with key from storage. Raises KeyError if key is absent.

Parameters:key – str
Return value:Any
__delitem__(key)

Removes the tuple associated with key from storage. Raises KeyError if key is absent.

Parameters:key – str
__contains__(key)

Checks if key is present in the storage.

Parameters:key – str
Returns:boolean

Known implementations are: native dict and sqlalchemy.components.DomainMetadata.
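
Because DomainMetadata behaves like a dict, a crawling strategy can use it directly; the key names, the limit and the domain_metadata accessor below are illustrative:

dm = backend.domain_metadata  # assumed accessor holding the DomainMetadata instance

key = 'crawled_pages_example.com'
if key not in dm:
    dm[key] = 0
dm[key] += 1  # per-domain crawled pages counter

if dm[key] > 10000:
    dm['banned_example.com'] = True  # flag the domain once the limit is exceeded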

Built-in backend reference

This article describes all backend components that come bundled with Frontera.

Memory backend

This implementation uses the heapq module to store the request queue and native dicts for everything else; it is meant for educational or testing purposes only.

SQLAlchemy backends

This implementation uses RDBMS storage via the SQLAlchemy library.

By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.

If you need to use your own declarative SQLAlchemy models, you can do so with the SQLALCHEMYBACKEND_MODELS setting.

For a complete list of all settings used for the SQLAlchemy backends, check the settings section.
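
A typical configuration might look like this snippet; the engine URL is just an example and the setting names are assumed from the SQLAlchemy backend settings reference:

BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'  # any SQLAlchemy engine URL
SQLALCHEMYBACKEND_ENGINE_ECHO = False               # set True to log generated SQL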

HBase backend

Is more suitable for large scale web crawlers. The settings reference can be found in HBase backend. Consider tuning the block cache so that the states of an average-size website fit within one block. To achieve this, it is recommended to use hostname_local_fingerprint, which keeps documents from the same host close together. This function can be selected with the URL_FINGERPRINT_FUNCTION setting.
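
For example, to select this fingerprint function (the dotted path is assumed from the Frontera utilities module):

URL_FINGERPRINT_FUNCTION = 'frontera.utils.fingerprint.hostname_local_fingerprint'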

Redis backend

This is similar to the HBase backend. It is suitable for large scale crawlers that still have a limited scope. It is recommended to ensure Redis is allowed to use enough memory to store all the data the crawler needs. If Redis runs out of memory, the crawler will log this and continue; any metadata or queue items that cannot be written are lost.

In case of connection errors, the crawler will attempt to reconnect three times. If the third attempt at connecting to Redis fails, the worker will skip that Redis operation and continue operating.