Backends¶
Frontier Backend is where the crawling logic and policies live; it is essentially the brain of your crawler. Queue, Metadata and States are the classes where all low-level code is meant to be placed, while Backend, by contrast, operates at a higher level. Frontera is bundled with database and in-memory implementations of Queue, Metadata and States, which can be combined in your custom backends or used standalone by directly instantiating FrontierManager and Backend.
Backend methods are called by the FrontierManager after Middleware, using hooks for Request and Response processing according to the frontier data flow.
Unlike Middleware, which can have many different instances activated, only one Backend can be used per frontier.
Activating a backend¶
To activate the frontier backend component, set it through the BACKEND setting.
Here’s an example:
BACKEND = 'frontera.contrib.backends.memory.FIFO'
Keep in mind that some backends may need additional configuration through their own settings. See the documentation of each backend for more info.
Writing your own backend¶
Each backend component is a single Python class inherited from Backend or DistributedBackend and using one or all of Queue, Metadata and States.
FrontierManager will communicate with the active backend through the methods described below.
- class frontera.core.components.Backend¶
  Interface definition for a frontier backend.
Methods
- frontier_start()¶
  Called when the frontier starts, see starting/stopping the frontier.
  Returns: None.
- frontier_stop()¶
  Called when the frontier stops, see starting/stopping the frontier.
  Returns: None.
- finished()¶
  Quick check of whether crawling is finished. It is called often, so make sure the call is lightweight.
  Returns: boolean
- add_seeds(seeds)¶
  This method is called when new seeds are added to the frontier.
  Parameters: seeds (list) – A list of Request objects.
  Returns: None.
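- page_crawled(response, links)¶
  This method is called each time a page has been crawled.
  Parameters:
  - response (object) – The Response object for the crawled page.
  - links (list) – The links extracted from the crawled page.
  Returns: None.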
- request_error(page, error)¶
  This method is called each time an error occurs when crawling a page.
  Parameters:
  - page (object) – The Request object that was crawled with an error.
  - error (string) – A string identifier for the error.
  Returns: None.
- get_next_requests(max_n_requests, **kwargs)¶
  Returns a list of next requests to be crawled.
  Parameters:
  - max_n_requests (int) – Maximum number of requests to be returned by this method.
  - kwargs (dict) – Parameters from the downloader component.
  Returns: list of Request objects.
Class Methods
- from_manager(manager)¶
  Class method called from FrontierManager, passing the manager itself.
  Example of usage:
      @classmethod
      def from_manager(cls, manager):
          # Build the backend from the manager's settings.
          return cls(settings=manager.settings)
Properties
-
- class frontera.core.components.DistributedBackend¶
  Interface definition for a distributed frontier backend. Intended for use in the strategy worker and DB worker.
  Inherits all methods of Backend and has two more class methods, which are called during strategy worker and DB worker instantiation.
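The Backend interface described above can be illustrated with a minimal sketch. It is not a Frontera class: `SketchFIFOBackend` is a hypothetical stand-in that only mirrors the documented method names, treats requests as plain URL strings, and avoids importing Frontera so that it runs on its own. A real backend would inherit from Backend or DistributedBackend and delegate storage to Queue, Metadata and States.

```python
from collections import deque


class SketchFIFOBackend:
    """Hypothetical stand-in mirroring the Backend method names.

    Assumption: requests are plain URL strings. A real backend would
    inherit from frontera.core.components.Backend instead.
    """

    def __init__(self, settings=None):
        self.settings = settings or {}
        self._pending = deque()  # requests waiting to be crawled
        self._seen = set()       # requests already scheduled once

    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass  # nothing to open in this in-memory sketch

    def frontier_stop(self):
        self._pending.clear()

    def finished(self):
        # Lightweight check: crawling is done when nothing is pending.
        return not self._pending

    def add_seeds(self, seeds):
        self._schedule(seeds)

    def page_crawled(self, response, links):
        self._schedule(links)

    def request_error(self, page, error):
        pass  # a real backend would record the error for the request

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = []
        while self._pending and len(batch) < max_n_requests:
            batch.append(self._pending.popleft())
        return batch

    def _schedule(self, requests):
        # One-time visit: a request is queued only the first time it is seen.
        for request in requests:
            if request not in self._seen:
                self._seen.add(request)
                self._pending.append(request)
```

Keeping `finished()` down to a single emptiness check reflects the note that it is called often and should stay cheap.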
Backend should communicate with low-level storage by means of these classes:
Metadata¶
- class frontera.core.components.Metadata¶
  Interface definition for a frontier metadata class. This class is responsible for storing document metadata, including content, and is optimized for a write-only data flow.
Methods
- add_seeds(seeds)¶
  This method is called when new seeds are added to the frontier.
  Parameters: seeds (list) – A list of Request objects.
Known implementations are: MemoryMetadata and sqlalchemy.components.Metadata.
Queue¶
- class frontera.core.components.Queue¶
  Interface definition for a frontier queue class. The queue has priorities and partitions.
Methods
- get_next_requests(max_n_requests, partition_id, **kwargs)¶
  Returns a list of next requests to be crawled, and removes them from internal storage.
  Parameters:
  - max_n_requests (int) – Maximum number of requests to be returned by this method.
  - partition_id – The queue partition to take requests from.
  - kwargs (dict) – Parameters from the downloader component.
  Returns: list of Request objects.
- schedule(batch)¶
  Schedules new documents from batch for download, and updates their scores in metadata.
  Parameters: batch – list of tuples (fingerprint, score, request, schedule). If schedule is True, the document needs to be scheduled for download; if False, only its score in metadata is updated.
- count()¶
  Returns the count of documents in the queue.
  Returns: int
Known implementations are: MemoryQueue and sqlalchemy.components.Queue.
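To make the Queue contract above more concrete, here is a minimal in-memory sketch built on `heapq`. It is not Frontera's MemoryQueue. The class name, the partitioning function, and the assumption that a higher score means a higher priority are all hypothetical choices made for illustration.

```python
import heapq
import itertools


class SketchPartitionedQueue:
    """Hypothetical in-memory queue with priorities and partitions."""

    def __init__(self, partitions=2):
        self._heaps = [[] for _ in range(partitions)]
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def _partition_for(self, fingerprint):
        # Assumption: partition by a stable byte sum of the fingerprint.
        return sum(fingerprint.encode("utf-8")) % len(self._heaps)

    def schedule(self, batch):
        for fingerprint, score, request, schedule in batch:
            if not schedule:
                # Score-only update; metadata storage is out of scope here.
                continue
            heap = self._heaps[self._partition_for(fingerprint)]
            # heapq is a min-heap, so negate the score to pop high scores first.
            heapq.heappush(heap, (-score, next(self._order), request))

    def get_next_requests(self, max_n_requests, partition_id, **kwargs):
        heap = self._heaps[partition_id]
        batch = []
        while heap and len(batch) < max_n_requests:
            # Popping removes the request from internal storage.
            batch.append(heapq.heappop(heap)[2])
        return batch

    def count(self):
        return sum(len(heap) for heap in self._heaps)
```

The monotonically increasing counter in each heap entry means two requests with equal scores are never compared directly, so request objects do not need to be orderable.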
States¶
- class frontera.core.components.States¶
  Interface definition for a document states management class. This class is responsible for providing the actual document states and for persisting state changes in a batch-oriented manner.
Methods
- update_cache(objs)¶
  Reads states from the meta['state'] field of each request in objs and stores them in the internal cache.
  Parameters: objs – list or tuple of Request objects.
- set_states(objs)¶
  Sets the meta['state'] field from the cache for every request in objs.
  Parameters: objs – list or tuple of Request objects.
- flush(force_clear)¶
  Flushes the internal cache to storage.
  Parameters: force_clear – boolean; True signals that the cache should be cleared after the flush.
- fetch(fingerprints)¶
  Loads states from persistent storage into the internal cache.
  Parameters: fingerprints – list of document fingerprints whose states should be read.
Known implementations are: MemoryStates and sqlalchemy.components.States.
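The cache-and-flush cycle described for States can be sketched as follows. This is an illustration under stated assumptions, not Frontera's MemoryStates: `SketchRequest` and `SketchStates` are hypothetical names, a plain dict plays the role of persistent storage, string keys are used in `meta`, and `NOT_CRAWLED = 0` is an assumed default for documents the cache does not know.

```python
class SketchRequest:
    """Hypothetical stand-in for a Request carrying a meta dict."""

    def __init__(self, fingerprint, state=None):
        self.meta = {"fingerprint": fingerprint, "state": state}


class SketchStates:
    """Hypothetical States implementation backed by two dicts."""

    NOT_CRAWLED = 0  # assumed default for unknown documents

    def __init__(self):
        self._cache = {}    # fingerprint -> state, in memory
        self._storage = {}  # stands in for persistent storage

    def update_cache(self, objs):
        for obj in objs:
            self._cache[obj.meta["fingerprint"]] = obj.meta["state"]

    def set_states(self, objs):
        for obj in objs:
            obj.meta["state"] = self._cache.get(
                obj.meta["fingerprint"], self.NOT_CRAWLED)

    def flush(self, force_clear):
        # Persist all cached states in one batch.
        self._storage.update(self._cache)
        if force_clear:
            self._cache.clear()

    def fetch(self, fingerprints):
        # Load only the requested states back into the cache.
        for fingerprint in fingerprints:
            if fingerprint in self._storage:
                self._cache[fingerprint] = self._storage[fingerprint]
```

In this sketch a backend would call `fetch` before `set_states` for a batch of requests, then `update_cache` after changing their states, and `flush` periodically.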
Built-in backend reference¶
This article describes all backend components that come bundled with Frontera.
To find out which Backend is activated by default, check the BACKEND setting.
Basic algorithms¶
Some of the built-in Backend objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.
The differences between them lie in the storage engine used. For instance, memory.FIFO and sqlalchemy.FIFO use the same logic but with different storage engines.
All these backend variations use the same CommonBackend class, which implements a one-time visit crawling policy with a priority queue.
- class frontera.contrib.backends.CommonBackend¶
  The simplest possible backend, performing a one-time crawl: if a page was crawled once, it will not be crawled again.
Memory backends¶
This set of Backend objects uses the heapq module as a queue and native dictionaries as storage for basic algorithms.
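As a rough illustration of how FIFO and LIFO ordering can both sit on top of `heapq`, the sketch below uses an insertion counter as the heap key. This is an assumption about one possible approach, not a description of how Frontera's memory backends are actually written; `visit_order` is a hypothetical helper.

```python
import heapq
import itertools


def visit_order(urls, mode="FIFO"):
    """Return urls in the order a FIFO or LIFO queue would release them."""
    counter = itertools.count()
    heap = []
    for url in urls:
        seq = next(counter)
        # FIFO: oldest (smallest sequence) first; LIFO: newest first.
        key = seq if mode == "FIFO" else -seq
        heapq.heappush(heap, (key, url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Because only the key changes between the two modes, the same queue logic can serve both orderings, which matches the idea that the built-in backends share logic and differ mainly in storage.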
SQLAlchemy backends¶
This set of Backend objects uses SQLAlchemy as storage for basic algorithms.
By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.
If you need to use your own declarative SQLAlchemy models, you can do so with the SQLALCHEMYBACKEND_MODELS setting.
This setting takes a dictionary where each key is the name of the model to define and each value is the model to use.
For a complete list of all settings used for SQLAlchemy backends, check the settings section.
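A settings fragment along these lines may help picture the configuration. Treat it as a sketch: the `SQLALCHEMYBACKEND_ENGINE` setting name and the three model key names are recalled from Frontera's defaults and should be checked against the settings section, and the `myproject.models.*` paths are hypothetical placeholders for your own declarative models.

```python
# Frontera settings module (sketch, see assumptions above).
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'

# Any SQLAlchemy database URL can go here; this one is a file-based SQLite DB.
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'

# Keys name the models to define; values are the models to use.
# 'myproject.models.*' are hypothetical paths to custom declarative models.
SQLALCHEMYBACKEND_MODELS = {
    'MetadataModel': 'myproject.models.MetadataModel',
    'StateModel': 'myproject.models.StateModel',
    'QueueModel': 'myproject.models.QueueModel',
}
```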
- class frontera.contrib.backends.sqlalchemy.FIFO¶
- class frontera.contrib.backends.sqlalchemy.LIFO¶
Revisiting backend¶
Based on a custom SQLAlchemy backend and queue. Crawling starts with seeds. After the seeds are crawled, every new document is scheduled for immediate crawling. On fetching, every new document is scheduled for recrawling after a fixed interval set by SQLALCHEMYBACKEND_REVISIT_INTERVAL.
The current implementation of the revisiting backend has no prioritization. During long-term runs the spider could go idle because no documents are available for crawling, even though there are documents waiting for their scheduled revisit time.
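The fixed-interval revisit policy, and the idle period it can cause, can be sketched with a heap keyed by due time. This is a simplified model rather than the SQLAlchemy-based implementation: `SketchRevisitQueue` is a hypothetical class, times are plain numbers in seconds, and the caller passes the current time explicitly so the behaviour is easy to follow.

```python
import heapq


class SketchRevisitQueue:
    """Hypothetical fixed-interval revisit queue keyed by due time."""

    def __init__(self, revisit_interval):
        self.revisit_interval = revisit_interval  # seconds between visits
        self._heap = []  # entries are (due_time, url)

    def schedule(self, url, now):
        # New documents are due immediately.
        heapq.heappush(self._heap, (now, url))

    def page_crawled(self, url, now):
        # After a fetch, the document becomes due one interval later.
        heapq.heappush(self._heap, (now + self.revisit_interval, url))

    def get_next_requests(self, max_n_requests, now):
        batch = []
        while (self._heap and self._heap[0][0] <= now
               and len(batch) < max_n_requests):
            batch.append(heapq.heappop(self._heap)[1])
        return batch
```

If every known document has already been fetched, `get_next_requests` returns an empty list until the earliest due time passes, which is the idle situation described above.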
HBase backend¶
- class frontera.contrib.backends.hbase.HBaseBackend(manager)¶
  More suitable for large-scale web crawlers. The settings reference can be found under HBase backend. Consider tuning the block cache to fit the states of an average-size website within one block. To achieve this, it is recommended to use hostname_local_fingerprint, which keeps documents from the same host close together. This function can be selected with the URL_FINGERPRINT_FUNCTION setting.
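The idea behind a host-local fingerprint can be sketched as a key whose prefix depends only on the hostname, so that keys for one host sort next to each other in storage. The function below is a hypothetical illustration of that idea and is not Frontera's hostname_local_fingerprint; the choice of CRC32 for the prefix and SHA-1 for the document part is an assumption.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def sketch_host_local_fingerprint(url):
    """Hypothetical fingerprint: host checksum prefix plus URL digest."""
    host = urlparse(url).hostname or ""
    # 8 hex chars derived from the host, shared by every URL on that host.
    host_part = format(zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF, "08x")
    # 40 hex chars derived from the full URL, distinguishing documents.
    doc_part = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return host_part + doc_part
```

With keys shaped like this, a range scan over one host prefix touches adjacent rows, which is what makes fitting a site's states into a single cached block plausible.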