Frontera API

This section documents the Frontera core API, and is intended for developers of middlewares and backends.

Frontera API / Manager

The main entry point to Frontera API is the FrontierManager object, passed to middlewares and backend through the from_manager class method. This object provides access to all Frontera core components, and is the only way for middlewares and backend to access them and hook their functionality into Frontera.

The FrontierManager is responsible for loading the installed middlewares and backend, as well as for managing the data flow around the whole frontier.

Loading from settings

Although FrontierManager can be initialized using parameters, the most common way of doing this is through Frontera Settings.

This can be done through the from_settings class method, using either a string path:

>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings('my_project.frontier.settings')

or a BaseSettings object instance:

>>> from frontera import FrontierManager, Settings
>>> settings = Settings()
>>> settings.MAX_PAGES = 0
>>> frontier = FrontierManager.from_settings(settings)

It can also be initialized without parameters. In this case the frontier will use the default settings:

>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings()

Frontier Manager

class frontera.core.manager.FrontierManager(request_model, response_model, backend, logger, event_log_manager, middlewares=None, test_mode=False, max_requests=0, max_next_requests=0, auto_start=True, settings=None, canonicalsolver=None, db_worker=False, strategy_worker=False)

The FrontierManager object encapsulates the whole frontier, providing an API to interact with it. It is also responsible for loading the different frontier components and for the communication between them.

Parameters:
  • request_model (object/string) – The Request object to be used by the frontier.
  • response_model (object/string) – The Response object to be used by the frontier.
  • backend (object/string) – The Backend object to be used by the frontier.
  • logger (object/string) – The Logger object to be used by the frontier.
  • event_log_manager (object/string) – The EventLogger object to be used by the frontier.
  • middlewares (list) – A list of Middleware objects to be used by the frontier.
  • test_mode (bool) – Activate/deactivate frontier test mode.
  • max_requests (int) – Number of pages after which the frontier would stop (See Finish conditions).
  • max_next_requests (int) – Maximum number of requests returned by get_next_requests method.
  • auto_start (bool) – Activate/deactivate automatic frontier start (See starting/stopping the frontier).
  • settings (object/string) – The Settings object used by the frontier.
  • canonicalsolver (object/string) – The CanonicalSolver object to be used by frontier.
  • db_worker (bool) – True if class is instantiated in DB worker environment
  • strategy_worker (bool) – True if class is instantiated in strategy worker environment

Attributes

request_model

The Request object to be used by the frontier. Can be defined with REQUEST_MODEL setting.

response_model

The Response object to be used by the frontier. Can be defined with RESPONSE_MODEL setting.

backend

The Backend object to be used by the frontier. Can be defined with BACKEND setting.

logger

The Logger object to be used by the frontier. Can be defined with LOGGER setting.

event_log_manager

The EventLogger object to be used by the frontier. Can be defined with EVENT_LOGGER setting.

middlewares

A list of Middleware objects to be used by the frontier. Can be defined with MIDDLEWARES setting.

test_mode

Boolean value indicating if the frontier is using frontier test mode. Can be defined with TEST_MODE setting.

max_requests

Number of pages after which the frontier would stop (See Finish conditions). Can be defined with MAX_REQUESTS setting.

max_next_requests

Maximum number of requests returned by get_next_requests method. Can be defined with MAX_NEXT_REQUESTS setting.

auto_start

Boolean value indicating if automatic frontier start is activated. See starting/stopping the frontier. Can be defined with AUTO_START setting.

settings

The Settings object used by the frontier.

iteration

Current frontier iteration.

n_requests

Number of accumulated requests returned by the frontier.

finished

Boolean value indicating if the frontier has finished. See Finish conditions.

API Methods

start()

Notifies all the components of the frontier start. Typically used for initializations (See starting/stopping the frontier).

Returns: None.

stop()

Notifies all the components of the frontier stop. Typically used for finalizations (See starting/stopping the frontier).

Returns: None.

add_seeds(seeds)

Adds a list of seed requests (seed URLs) as entry point for the crawl.

Parameters: seeds (list) – A list of Request objects.
Returns: None.

get_next_requests(max_next_requests=0, **kwargs)

Returns a list of the next requests to be crawled. Optionally, a maximum number of requests can be passed. If no value is passed, FrontierManager.max_next_requests (MAX_NEXT_REQUESTS setting) is used instead.

Parameters:
  • max_next_requests (int) – Maximum number of requests to be returned by this method.
  • kwargs (dict) – Arbitrary arguments that will be passed to backend.
Returns:

list of Request objects.

page_crawled(response, links=None)

Informs the frontier about the crawl result and extracted links for the current page.

Parameters:
  • response (object) – The Response object for the crawled page.
  • links (list) – A list of Request objects generated from the links extracted for the crawled page.
Returns:

None.

request_error(request, error)

Informs the frontier about a page crawl error. An error identifier must be provided.

Parameters:
  • request (object) – The Request object whose crawl failed.
  • error (string) – A string identifier for the error.
Returns:

None.
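
The API methods above are typically driven from a crawl loop. The following is a minimal sketch of the call order only, not Frontera code: StubFrontier and fetch are hypothetical stand-ins written for this example, and plain URL strings take the place of Request and Response objects.

```python
class StubFrontier:
    """Hypothetical stand-in exposing the same method names as FrontierManager."""

    def __init__(self):
        self.queue = []
        self.crawled = []
        self.errors = []

    def add_seeds(self, seeds):
        self.queue.extend(seeds)

    def get_next_requests(self, max_next_requests=0, **kwargs):
        # In this sketch, 0 means "no explicit limit".
        n = max_next_requests or len(self.queue)
        batch, self.queue = self.queue[:n], self.queue[n:]
        return batch

    def page_crawled(self, response, links=None):
        self.crawled.append(response)
        self.queue.extend(links or [])

    def request_error(self, request, error):
        self.errors.append((request, error))


def fetch(url):
    """Hypothetical downloader: returns (response, links) or raises IOError."""
    if url.endswith("/broken"):
        raise IOError("connection refused")
    if url.endswith("/"):
        return url, ["http://example.com/a", "http://example.com/broken"]
    return url, []


def crawl(frontier, seeds, batch_size=2):
    frontier.add_seeds(seeds)
    while True:
        requests = frontier.get_next_requests(max_next_requests=batch_size)
        if not requests:
            break  # nothing left to crawl
        for request in requests:
            try:
                response, links = fetch(request)
            except IOError as exc:
                # Report the failure with a string identifier.
                frontier.request_error(request, str(exc))
            else:
                # Report the result together with the extracted links.
                frontier.page_crawled(response, links=links)


frontier = StubFrontier()
crawl(frontier, ["http://example.com/"])
```

With a real frontier, the same loop would call the FrontierManager methods instead of the stub, and the crawler would build the Request objects for the extracted links.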

Class Methods

classmethod from_settings(settings=None, db_worker=False, strategy_worker=False)

Returns a FrontierManager instance initialized with the passed settings argument. If no settings are given, the frontier default settings are used.

Starting/Stopping the frontier

Sometimes, frontier components need to perform initialization and finalization operations. The frontier notifies its components of the frontier start and stop through the start() and stop() methods respectively.

By default, auto_start is activated, which means that components are notified as soon as the FrontierManager object is created. If you need finer control over when the different components are initialized, deactivate auto_start and call the frontier API start() and stop() methods manually.

Note

The frontier stop() method is not called automatically when auto_start is active, because the frontier is not aware of the crawling state. If components need to be notified of the frontier end, call the method manually.
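
The notification behaviour described in this section can be modelled with a small sketch. ToyManager and RecordingComponent are hypothetical classes written for illustration; they only mirror the auto_start semantics described here and are not the Frontera implementation.

```python
class RecordingComponent:
    """Hypothetical component that records the notifications it receives."""

    def __init__(self):
        self.events = []

    def frontier_start(self):
        self.events.append("start")

    def frontier_stop(self):
        self.events.append("stop")


class ToyManager:
    """Hypothetical stand-in mirroring the auto_start behaviour."""

    def __init__(self, components, auto_start=True):
        self.components = components
        if auto_start:
            self.start()

    def start(self):
        for component in self.components:
            component.frontier_start()

    def stop(self):
        for component in self.components:
            component.frontier_stop()


# auto_start active: the component is notified when the manager is created.
eager = RecordingComponent()
ToyManager([eager])

# auto_start deactivated: nothing happens until start() is called.
lazy = RecordingComponent()
manual = ToyManager([lazy], auto_start=False)
events_before_start = list(lazy.events)
manual.start()
try:
    pass  # the crawl would run here
finally:
    manual.stop()  # stop() is never called implicitly, so call it explicitly
```

Wrapping the crawl in try/finally is one way to make sure components are notified of the frontier end even if the crawl raises.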

Frontier iterations

Once the frontier is running, the usual process is the one described in the data flow section.

The crawler asks the frontier for the next pages using the get_next_requests() method. Each call for which the frontier returns a non-empty list of pages (data available) counts as one frontier iteration.

The current frontier iteration can be accessed through the iteration attribute.

Finishing the frontier

A crawl can be finished either by the crawler or by Frontera. Frontera finishes once a maximum number of pages has been returned. This limit is controlled by the max_requests attribute (MAX_REQUESTS setting).

If max_requests has a value of 0 (the default value), the frontier will continue indefinitely.

Once the frontier is finished, no more pages will be returned by the get_next_requests method and the finished attribute will be True.
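
The iteration counter and the finish condition can be illustrated with a toy model. ToyFrontier below is a hypothetical class, not Frontera code. It assumes that a batch is trimmed so that the total never exceeds max_requests, which is one plausible reading of the rule above.

```python
class ToyFrontier:
    """Hypothetical model of the iteration and finish rules."""

    def __init__(self, pending, max_requests=0):
        self.pending = list(pending)
        self.max_requests = max_requests  # 0 means no limit
        self.iteration = 0
        self.n_requests = 0

    @property
    def finished(self):
        return bool(self.max_requests) and self.n_requests >= self.max_requests

    def get_next_requests(self, max_next_requests=0):
        if self.finished:
            return []
        limit = max_next_requests or len(self.pending)
        if self.max_requests:
            # Assumption: never hand out more than the remaining budget.
            limit = min(limit, self.max_requests - self.n_requests)
        batch, self.pending = self.pending[:limit], self.pending[limit:]
        if batch:
            # Only non-empty batches count as a frontier iteration.
            self.iteration += 1
            self.n_requests += len(batch)
        return batch


frontier = ToyFrontier(["p1", "p2", "p3", "p4", "p5"], max_requests=3)
first = frontier.get_next_requests(max_next_requests=2)   # two pages
second = frontier.get_next_requests(max_next_requests=2)  # trimmed to one page
third = frontier.get_next_requests(max_next_requests=2)   # finished: empty list
```

After the third call, the toy frontier has performed two iterations and reports itself as finished, even though pages are still pending.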

Component objects

class frontera.core.components.Component

Interface definition for a frontier component. The Component object is the base class for frontier Middleware and Backend objects.

FrontierManager communicates with the active components using the hook methods listed below.

Implementations differ between Middleware and Backend objects, so the methods are not fully described here but in their corresponding sections.

Attributes

name

The component name

Abstract methods

frontier_start()

Called when the frontier starts, see starting/stopping the frontier.

frontier_stop()

Called when the frontier stops, see starting/stopping the frontier.

add_seeds(seeds)

This method is called when new seeds are added to the frontier.

Parameters: seeds (list) – A list of Request objects.

page_crawled(response, links)

This method is called each time a page has been crawled.

Parameters:
  • response (object) – The Response object for the crawled page.
  • links (list) – A list of Request objects generated from the links extracted for the crawled page.

request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • page (object) – The Request object whose crawl failed.
  • error (string) – A string identifier for the error.
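
To make the hook methods concrete, here is a minimal sketch of a component that only counts the notifications it receives. StatsComponent is a hypothetical name. So that the example runs on its own, it is written as a plain class rather than as a subclass of frontera.core.components.Component, and strings stand in for Request and Response objects.

```python
from collections import Counter


class StatsComponent:
    """Hypothetical component implementing the five hook methods."""

    name = "Stats Component"

    def __init__(self):
        self.stats = Counter()

    def frontier_start(self):
        self.stats["started"] += 1

    def frontier_stop(self):
        self.stats["stopped"] += 1

    def add_seeds(self, seeds):
        self.stats["seeds"] += len(seeds)

    def page_crawled(self, response, links):
        self.stats["pages"] += 1
        self.stats["links"] += len(links)

    def request_error(self, page, error):
        # Keep one counter per error identifier.
        self.stats["errors:" + error] += 1


# Call the hooks by hand, in the order a frontier would call them.
component = StatsComponent()
component.frontier_start()
component.add_seeds(["s1", "s2"])
component.page_crawled("s1", ["l1", "l2", "l3"])
component.request_error("s2", "timeout")
component.frontier_stop()
```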

Class Methods

classmethod from_manager(manager)

Class method called from FrontierManager passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    # Build the component from the settings exposed by the manager.
    return cls(settings=manager.settings)
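
A fuller, self-contained sketch of the same pattern follows. DepthLimitMiddleware and the MAX_DEPTH setting are hypothetical names chosen for illustration, and a SimpleNamespace object stands in for the real FrontierManager so that the snippet runs without Frontera installed.

```python
from types import SimpleNamespace


class DepthLimitMiddleware:
    """Hypothetical component; MAX_DEPTH is an illustrative setting name."""

    name = "Depth Limit Middleware"

    def __init__(self, max_depth, test_mode=False):
        self.max_depth = max_depth
        self.test_mode = test_mode

    @classmethod
    def from_manager(cls, manager):
        # Read everything the component needs from the manager in one place.
        max_depth = getattr(manager.settings, "MAX_DEPTH", 0)
        return cls(max_depth=max_depth, test_mode=manager.test_mode)


# Stand-in for the manager that the frontier would pass to from_manager.
fake_manager = SimpleNamespace(
    settings=SimpleNamespace(MAX_DEPTH=3),
    test_mode=False,
)
middleware = DepthLimitMiddleware.from_manager(fake_manager)
```

Keeping the constructor free of any manager reference makes the component easy to instantiate directly in unit tests.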

Test mode

In some cases while testing, frontier components need to act differently than they usually do (for instance, the domain middleware accepts invalid URLs like 'A1' or 'B1' when parsing domain URLs in test mode).

Components can know if the frontier is in test mode via the boolean test_mode attribute.
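
As an illustration, a component might relax its URL validation when test_mode is set. The sketch below is an assumption about how such a switch could look, not the actual domain middleware: DomainExtractor is a hypothetical class, and a SimpleNamespace object stands in for the manager.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


class DomainExtractor:
    """Hypothetical component illustrating a test_mode switch."""

    def __init__(self, test_mode=False):
        self.test_mode = test_mode

    def domain_of(self, url):
        netloc = urlparse(url).netloc
        if netloc:
            return netloc
        if self.test_mode:
            # In test mode, treat placeholder URLs such as 'A1' as their own domain.
            return url
        raise ValueError("not a valid URL: %r" % url)


# Stand-in for a manager running in test mode.
manager = SimpleNamespace(test_mode=True)
relaxed = DomainExtractor(test_mode=manager.test_mode)
strict = DomainExtractor(test_mode=False)

relaxed_domain = relaxed.domain_of("A1")
strict_domain = strict.domain_of("http://example.com/page")
```

Calling strict.domain_of("A1") raises ValueError, while the test-mode instance accepts the same input.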

Other ways of using the frontier

Communication with the frontier can also be done through other mechanisms such as an HTTP API or a queue system. These functionalities are not available for the time being, but hopefully will be included in future versions.