Crawling strategy

Use the frontera.worker.strategies.bfs module as a reference. In general, you need to write a crawling strategy class that implements the following interface:

class frontera.worker.strategies.BaseCrawlingStrategy(manager, mb_stream, states_context)

Interface definition for a crawling strategy.

Before calling these methods, the strategy worker adds a ‘state’ key to the meta field of every Request, holding the state of the URL. For the possible states, refer to the HBaseBackend implementation.

After each of these methods returns, the states from the meta field are passed back and stored in the backend.
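The state round trip can be illustrated with a minimal sketch. It does not import Frontera: the Request class is a stand-in that only carries a meta dict, the numeric state values are hypothetical placeholders for whatever the backend defines, and mark_links is an illustrative helper rather than part of the interface.

```python
class Request:
    """Stand-in for a Frontera Request; only the meta dict matters here."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


# Hypothetical state values; the real ones are defined by the backend.
NOT_CRAWLED, QUEUED, CRAWLED, ERROR = range(4)


def mark_links(links):
    """Mark links that were never crawled as queued and return them."""
    to_schedule = []
    for link in links:
        # The strategy worker is assumed to have filled in meta['state'].
        if link.meta.get('state', NOT_CRAWLED) == NOT_CRAWLED:
            link.meta['state'] = QUEUED  # written back to the backend later
            to_schedule.append(link)
    return to_schedule


links = [
    Request('http://example.com/a', {'state': NOT_CRAWLED}),
    Request('http://example.com/b', {'state': CRAWLED}),
]
new_links = mark_links(links)
```

Only the first link is returned, and its updated state is what would be stored once the method exits.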

Methods

classmethod from_worker(manager, mb_stream, states_context)

Called by the strategy worker to instantiate the strategy.

Parameters:
  • manager – A frontera.core.manager.FrontierManager instance.
  • mb_stream – A frontera.worker.strategy.UpdateScoreStream instance.

Returns: A new instance.

add_seeds(seeds)

Called when an add_seeds event is received from the spider log.

Parameters: seeds (list) – A list of Request objects.

page_crawled(response, links)

Called every time a document is successfully crawled and a page_crawled event is received from the spider log.

Parameters:
  • response (object) – The Response object for the crawled page.
  • links (list) – A list of Request objects generated from the links extracted from the crawled page.

page_error(request, error)

Called every time an error occurs while downloading a page.

Parameters:
  • request (object) – The Request object that failed to download.
  • error (str) – A string identifier for the error.

finished()

Called by the strategy worker after it finishes processing each cycle of the spider log. If this method returns True, the strategy worker reports that the crawling goal has been achieved, then stops and exits.

Returns: bool

close()

Called when the strategy worker is about to close the crawling strategy.

The class can be put in any module and passed to the strategy worker on startup, using either a command-line option or the CRAWLING_STRATEGY setting.
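As a sketch of the settings route, assuming the strategy worker reads a Python settings module and that the setting takes a dotted class path: the path myproject.strategies.MyCrawlingStrategy below is hypothetical and stands for wherever your class lives.

```python
# Settings module read by the strategy worker (file name is up to you).
# The dotted path is a hypothetical example, not a class shipped with Frontera.
CRAWLING_STRATEGY = 'myproject.strategies.MyCrawlingStrategy'
```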

The strategy class is instantiated in the strategy worker and can use its own storage or any other kind of resources. All items from the spider log are passed through these methods. The scores returned do not have to be the same as those in the method arguments. The finished() method is called periodically to check whether the crawling goal has been achieved.
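The life cycle described above can be sketched with stand-ins, without importing Frontera. Everything except the interface method names is an assumption made for illustration: the Request, Response and RecordingStream classes, the send(request, score) signature, the numeric state values, the _schedule helper and the max_pages stopping rule. A real strategy would subclass BaseCrawlingStrategy instead.

```python
class Request:
    """Stand-in for a Frontera Request (hypothetical, illustration only)."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


class Response(Request):
    """Stand-in for a Frontera Response; same shape as Request here."""


class RecordingStream:
    """Stand-in for mb_stream; send() is an assumed method name."""
    def __init__(self):
        self.sent = []

    def send(self, request, score):
        self.sent.append((request.url, score))


NOT_CRAWLED, QUEUED, CRAWLED, ERROR = range(4)  # hypothetical state values


class PageLimitStrategy:
    """Schedules every unseen URL and stops after max_pages crawled pages."""

    def __init__(self, manager, mb_stream, states_context, max_pages=2):
        self.manager = manager
        self.mb_stream = mb_stream
        self.states_context = states_context
        self.max_pages = max_pages
        self.pages = 0

    @classmethod
    def from_worker(cls, manager, mb_stream, states_context):
        return cls(manager, mb_stream, states_context)

    def _schedule(self, request, score):
        # Only queue URLs that have not been seen before.
        if request.meta.get('state', NOT_CRAWLED) == NOT_CRAWLED:
            request.meta['state'] = QUEUED
            self.mb_stream.send(request, score)

    def add_seeds(self, seeds):
        for seed in seeds:
            self._schedule(seed, 1.0)

    def page_crawled(self, response, links):
        response.meta['state'] = CRAWLED
        self.pages += 1
        for link in links:
            self._schedule(link, 0.5)  # the strategy picks its own scores

    def page_error(self, request, error):
        request.meta['state'] = ERROR

    def finished(self):
        return self.pages >= self.max_pages

    def close(self):
        pass  # release storage or other resources here


# Simulate what the strategy worker would do with spider log events.
stream = RecordingStream()
strategy = PageLimitStrategy.from_worker(None, stream, None)
strategy.add_seeds([Request('http://example.com/')])
strategy.page_crawled(Response('http://example.com/'),
                      [Request('http://example.com/a')])
done_after_one = strategy.finished()
strategy.page_crawled(Response('http://example.com/a'), [])
done_after_two = strategy.finished()
strategy.close()
```

In this run the seed is sent with score 1.0 and the extracted link with 0.5, and finished() turns True only after the second page, at which point the worker would stop.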