Backends¶
A DistributedBackend is used to separate the higher level crawling strategy code from the low level storage API. Queue, Metadata, States and DomainMetadata are inner components of the DistributedBackend, which is meant to instantiate and hold references to objects of the above mentioned classes. Frontera is bundled with database and in-memory implementations of Queue, Metadata, States and DomainMetadata, which can be combined in your custom backends or used standalone by directly instantiating a specific variant of FrontierManager.
DistributedBackend methods are called by the FrontierManager after Middleware, using hooks for Request and Response processing, according to the frontier data flow.
Unlike Middleware, which can have many different instances activated, only one DistributedBackend can be used per frontier.
Activating a backend¶
To activate a specific backend, set it through the BACKEND setting.
Here’s an example:
BACKEND = 'frontera.contrib.backends.memory.MemoryDistributedBackend'
Keep in mind that some backends may need additional configuration through particular settings. See the backends documentation for more info.
Writing your own backend¶
Each backend component is a single Python class inheriting from DistributedBackend and using one or more of Queue, Metadata, States and DomainMetadata.
The FrontierManager will communicate with the active backend through the methods described below.
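As an illustrative sketch of the interface, a minimal backend implementing these methods might look like the following. Note this is a self-contained stand-in: a real implementation would subclass frontera.core.components.DistributedBackend and delegate to the component classes covered below, and the internal `_queue` list here is only a placeholder for a proper Queue component.

```python
class MyBackend:
    """Illustrative backend skeleton; a real one subclasses
    frontera.core.components.DistributedBackend (omitted here so the
    sketch stays self-contained)."""

    def __init__(self, settings=None):
        self.settings = settings
        self._queue = []          # stand-in for a real Queue component
        self._finished = False

    @classmethod
    def from_manager(cls, manager):
        # Called by FrontierManager, passing the manager itself.
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass                      # open connections, allocate resources

    def frontier_stop(self):
        pass                      # flush and release resources

    def finished(self):
        # Called frequently; keep this check cheap.
        return self._finished and not self._queue

    def page_crawled(self, response):
        pass                      # update stored metadata for the page

    def request_error(self, request, error):
        pass                      # record the string error identifier

    def get_next_requests(self, max_n_requests, **kwargs):
        # Hand out at most max_n_requests pending requests.
        batch = self._queue[:max_n_requests]
        self._queue = self._queue[max_n_requests:]
        return batch
```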
- class frontera.core.components.Backend¶
Interface definition for a frontier backend.
Methods
- frontier_start()¶
Called when the frontier starts; see starting/stopping the frontier.
Returns: None.
- frontier_stop()¶
Called when the frontier stops; see starting/stopping the frontier.
Returns: None.
- finished()¶
Quick check whether crawling is finished. Called frequently, so please make sure calls are lightweight.
Returns: boolean
- page_crawled(response)¶
This method is called every time a page has been crawled.
Parameters: response (object) – The Response object for the crawled page.
Returns: None.
- request_error(request, error)¶
This method is called each time an error occurs while crawling a page.
Parameters:
- request (object) – The Request object that failed with an error.
- error (string) – A string identifier for the error.
Returns: None.
- get_next_requests(max_n_requests, **kwargs)¶
Returns a list of next requests to be crawled.
Parameters:
- max_n_requests (int) – Maximum number of requests to be returned by this method.
- kwargs (dict) – Arbitrary keyword arguments passed from the downloader component.
Returns: list of Request objects.
Class Methods
- classmethod from_manager(manager)¶
Class method called from FrontierManager, passing the manager itself.
Example of usage:

def from_manager(cls, manager):
    return cls(settings=manager.settings)
- class frontera.core.components.DistributedBackend¶
Interface definition for a distributed frontier backend, meant to be used in the strategy worker and the DB worker.
Inherits all methods of Backend, and has two additional class methods, which are called during strategy worker and DB worker instantiation.
The backend should communicate with low-level storage by means of the following classes:
Metadata¶
Used to store the contents of the crawl.
- class frontera.core.components.Metadata¶
Interface definition for a frontier metadata class. This class is responsible for storing document metadata, including content, and is optimized for a write-only data flow.
Known implementations are: MemoryMetadata and sqlalchemy.components.Metadata.
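As a rough illustration of a write-oriented metadata store, the sketch below records crawl outcomes keyed by URL. The method names mirror the Backend hooks above, but this is an assumption for illustration only; the `_docs` dict stands in for whatever persistent storage a real implementation would use.

```python
class SketchMetadata:
    """Illustrative write-only metadata store; not Frontera's
    MemoryMetadata, just the idea described above."""

    def __init__(self):
        self._docs = {}           # stand-in for a persistent document store

    def page_crawled(self, response):
        # Write-only flow: record the outcome, never read it back here.
        self._docs[response.url] = {'status_code': response.status_code}

    def request_error(self, request, error):
        self._docs[request.url] = {'error': error}
```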
Queue¶
A priority queue, used to persist requests scheduled for crawling.
- class frontera.core.components.Queue¶
Interface definition for a frontier queue class. The queue has priorities and partitions.
Methods
- get_next_requests(max_n_requests, partition_id, **kwargs)¶
Returns a list of next requests to be crawled, and excludes them from internal storage.
Parameters:
- max_n_requests (int) – Maximum number of requests to be returned by this method.
- partition_id (int) – Partition to get the requests from.
- kwargs (dict) – Arbitrary keyword arguments passed from the downloader component.
Returns: list of Request objects.
- schedule(batch)¶
Schedules new documents for download from batch, and updates scores in metadata.
Parameters: batch – list of tuples (fingerprint, score, request, schedule); if schedule is True, the document needs to be scheduled for download, if False, only the score is updated in metadata.
- count()¶
Returns the count of documents in the queue.
Returns: int
Known implementations are: MemoryQueue and sqlalchemy.components.Queue.
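The interface above can be sketched with an in-memory, heapq-based structure. This is an illustrative sketch, not Frontera's MemoryQueue: partition assignment by hashing the fingerprint and the monotonic tie-breaker are assumptions made to keep the example small and deterministic.

```python
import heapq
from itertools import count


class SketchQueue:
    """Illustrative in-memory queue with priorities and partitions."""

    def __init__(self, partitions=1):
        self._heaps = {p: [] for p in range(partitions)}
        self._tie = count()       # tie-breaker so heapq never compares requests

    def schedule(self, batch):
        # batch: iterable of (fingerprint, score, request, schedule) tuples.
        for fingerprint, score, request, schedule in batch:
            if schedule:
                partition = hash(fingerprint) % len(self._heaps)
                # Negate the score: heapq is a min-heap, so the highest
                # score is popped first.
                heapq.heappush(self._heaps[partition],
                               (-score, next(self._tie), request))

    def get_next_requests(self, max_n_requests, partition_id, **kwargs):
        # Pop up to max_n_requests items, excluding them from storage.
        heap, result = self._heaps[partition_id], []
        while heap and len(result) < max_n_requests:
            result.append(heapq.heappop(heap)[2])
        return result

    def count(self):
        return sum(len(h) for h in self._heaps.values())
```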
States¶
A storage used for checking and storing link states, where a state is a short integer, one of the states described in frontera.core.components.States.
- class frontera.core.components.States¶
Interface definition for a link states management class. This class is responsible for providing the actual link state and for persisting state changes in a batch-oriented manner.
Methods
- update_cache(objs)¶
Reads states from the meta['state'] field of each request in objs and stores them in the internal cache.
Parameters: objs – list or tuple of Request objects.
- set_states(objs)¶
Sets the meta['state'] field from the cache for every request in objs.
Parameters: objs – list or tuple of Request objects.
- flush()¶
Flushes the internal cache to storage.
- fetch(fingerprints)¶
Gets states from persistent storage into the internal cache.
Parameters: fingerprints – list of document fingerprints whose states to read.
Known implementations are: MemoryStates and sqlalchemy.components.States.
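The batch-oriented flow described above can be sketched as follows. This is an illustrative stand-in, not Frontera's MemoryStates: the state codes, the use of meta['fingerprint'] as the cache key, and the `_storage` dict (a placeholder for a persistent store) are all assumptions made for the example.

```python
class SketchStates:
    """Illustrative link-state store: states travel in request.meta['state']
    and are persisted in batches via flush()."""

    NOT_CRAWLED, CRAWLED, ERROR = 0, 2, 3   # hypothetical state codes

    def __init__(self):
        self._cache = {}
        self._storage = {}        # stand-in for persistent storage

    def update_cache(self, objs):
        # Read states from meta['state'] into the internal cache.
        for request in objs:
            self._cache[request.meta['fingerprint']] = request.meta['state']

    def set_states(self, objs):
        # Write cached states back into meta['state'] for each request.
        for request in objs:
            request.meta['state'] = self._cache.get(
                request.meta['fingerprint'], self.NOT_CRAWLED)

    def flush(self):
        # Persist the cache in one batch.
        self._storage.update(self._cache)

    def fetch(self, fingerprints):
        # Load the requested fingerprints from storage into the cache.
        for fp in fingerprints:
            if fp in self._storage:
                self._cache[fp] = self._storage[fp]
```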
DomainMetadata¶
Used to store per-domain flags, counters or even robots.txt contents, to help the crawling strategy maintain features like a per-domain limit on the number of crawled pages or automatic banning.
- class frontera.core.components.DomainMetadata¶
Interface definition for a domain metadata storage. Its main purpose is to store per-domain metadata using Python-friendly structures. Meant to be used by the crawling strategy to store counters and flags in low level facilities provided by the Backend.
Methods
- __setitem__(key, value)¶
Puts the (key, value) pair in storage.
Parameters:
- key – str
- value – Any
- __getitem__(key)¶
Retrieves the value associated with key from storage. Raises KeyError if key is absent.
Parameters: key – str
Returns: Any
- __delitem__(key)¶
Removes the pair associated with key from storage. Raises KeyError if key is absent.
Parameters: key – str
- __contains__(key)¶
Checks if key is present in storage.
Parameters: key – str
Returns: boolean
Known implementations are: the native dict and sqlalchemy.components.DomainMetadata.
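Since the interface is just the mapping protocol, a dict-backed sketch is enough to show the shape (illustrative only; as noted above, a plain dict already satisfies it):

```python
class SketchDomainMetadata:
    """Illustrative dict-backed domain metadata store implementing the
    mapping protocol described above."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]     # raises KeyError if key is absent

    def __delitem__(self, key):
        del self._data[key]        # raises KeyError if key is absent

    def __contains__(self, key):
        return key in self._data
```

A crawling strategy could use such a store to keep, say, a hypothetical per-domain page counter: `dm['example.com'] = {'pages': 0}`.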
Built-in backend reference¶
This article describes all backend components that come bundled with Frontera.
Memory backend¶
This implementation uses the heapq module to store the request queue and native dicts for other purposes, and is meant for educational or testing purposes only.
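To see why heapq suits a request queue: it maintains a min-heap, so the smallest item pops first, and negating the score makes the highest-priority request come out first. The URLs below are made up for the example.

```python
import heapq

heap = []
for score, url in [(0.5, 'http://a.example/'),
                   (0.9, 'http://b.example/'),
                   (0.1, 'http://c.example/')]:
    # heapq is a min-heap: negate the score so the highest pops first.
    heapq.heappush(heap, (-score, url))

order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
# order == ['http://b.example/', 'http://a.example/', 'http://c.example/']
```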
SQLAlchemy backends¶
These implementations use RDBMS storage with the SQLAlchemy library.
By default an in-memory SQLite database is used as the storage engine, but any database supported by SQLAlchemy can be used.
If you need to use your own declarative SQLAlchemy models, you can do so with the SQLALCHEMYBACKEND_MODELS setting.
For a complete list of all settings used by the SQLAlchemy backends, check the settings section.
HBase backend¶
More suitable for large scale web crawlers. The settings reference can be found here: HBase backend. Consider tuning the block cache so that the states of an average-size website fit within one block. To achieve this, it is recommended to use hostname_local_fingerprint, which keeps documents from the same host close together. This function can be selected with the URL_FINGERPRINT_FUNCTION setting.
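The idea behind a host-local fingerprint can be sketched as follows. This is an illustrative reimplementation of the concept, not Frontera's hostname_local_fingerprint: the prefix length and hash composition are assumptions. The point is that fingerprints for URLs on the same host share a common prefix, so their state records sort next to each other in key-ordered storage like HBase.

```python
from hashlib import sha1
from urllib.parse import urlparse


def host_local_fingerprint_sketch(url):
    """Illustrative host-local fingerprint: the first 8 hex chars derive
    from the hostname, the rest from the full URL, so keys for the same
    host cluster together (not Frontera's exact algorithm)."""
    host = urlparse(url).hostname or ''
    host_prefix = sha1(host.encode('utf-8')).hexdigest()[:8]
    url_suffix = sha1(url.encode('utf-8')).hexdigest()[8:]
    return host_prefix + url_suffix
```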
Redis backend¶
This backend is similar to the HBase backend. It is suitable for large scale crawlers that still have a limited scope. It is recommended to ensure Redis is allowed to use enough memory to store all the data the crawler needs. If Redis runs out of memory, the crawler will log this and continue; when the crawler is unable to write metadata or queue items to the database, those metadata or queue items are lost.
In case of connection errors, the crawler will attempt to reconnect three times. If the third attempt at connecting to Redis fails, the worker will skip that Redis operation and continue operating.