Crawling strategy

Use the frontera.worker.strategies.bfs module as a reference. In general, you need to write a CrawlingStrategy class implementing the interface:

class frontera.core.components.BaseCrawlingStrategy

Interface definition for a crawling strategy.

Before calling these methods, the strategy worker adds a ‘state’ key to the meta field of every Request, holding the state of the URL. Please refer to the HBaseBackend implementation for the list of states.

After each of these methods returns, the states from the meta field are passed back and stored in the backend.

Methods

add_seeds(seeds)

Called when an add_seeds event is received from the spider log.

Parameters:
  • seeds (list) – A list of Request objects.
Returns:

A dict with fingerprints (as hex strings) as keys and float scores as values. If no scheduling is needed for a URL, its fingerprint should be omitted.
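As a sketch (not Frontera's actual code), and assuming each Request carries its hex fingerprint under a 'fingerprint' key in meta, add_seeds could schedule every seed at maximum priority:

```python
# Minimal stand-in for frontera's Request class, for illustration only.
class Request(object):
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


def add_seeds(seeds):
    """Schedule every seed with the maximum score of 1.0.

    The 'fingerprint' meta key is an assumption here; the real key
    depends on the configured fingerprint middleware.
    """
    return {seed.meta['fingerprint']: 1.0 for seed in seeds}


seeds = [Request('http://example.com', {'fingerprint': 'ab01'})]
print(add_seeds(seeds))  # {'ab01': 1.0}
```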
page_crawled(response, links)

Called every time a document is successfully crawled and a page_crawled event is received from the spider log.

Parameters:
  • response (object) – The Response object for the crawled page.
  • links (list) – A list of Request objects generated from the links extracted from the crawled page.
Returns:

A dict with fingerprints (as hex strings) as keys and float scores as values. If no scheduling is needed for a URL, its fingerprint should be omitted.
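For instance, a strategy might schedule only links it has not seen before and score them by depth, approximating breadth-first order. The sketch below assumes hypothetical 'state', 'depth', and 'fingerprint' meta keys and a placeholder NOT_CRAWLED state value; the real keys and state constants depend on the backend and middleware configuration:

```python
# Stand-in Request for illustration; frontera's real class differs.
class Request(object):
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


NOT_CRAWLED = 0  # placeholder; real state values come from the backend


def page_crawled(response, links):
    """Score each not-yet-crawled link by the inverse of its depth."""
    scores = {}
    for link in links:
        if link.meta.get('state') == NOT_CRAWLED:
            depth = link.meta.get('depth', 0)
            # Deeper links get lower scores, so the frontier
            # approximates breadth-first ordering.
            scores[link.meta['fingerprint']] = 1.0 / (depth + 1)
    return scores


links = [Request('http://example.com/a',
                 {'fingerprint': 'c3d4', 'state': NOT_CRAWLED, 'depth': 1})]
print(page_crawled(None, links))  # {'c3d4': 0.5}
```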

page_error(request, error)

Called every time an error occurs during page downloading.

Parameters:
  • request (object) – The Request object whose fetch failed.
  • error (str) – A string identifier for the error.
Returns:

A dict with one fingerprint (as hex string) as key and a float score as value. If no scheduling is needed, an empty dict should be returned.
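One possible policy is to reschedule transient failures at low priority and drop everything else. The error identifiers below are invented for illustration; the real strings depend on the spider producing the events:

```python
# Stand-in Request for illustration; frontera's real class differs.
class Request(object):
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


# Hypothetical error identifiers; real values depend on the spider.
TRANSIENT_ERRORS = {'DNS_ERROR', 'TIMEOUT'}


def page_error(request, error):
    """Reschedule transient failures at low priority, drop the rest."""
    if error in TRANSIENT_ERRORS:
        return {request.meta['fingerprint']: 0.1}
    return {}  # no scheduling needed


req = Request('http://example.com', {'fingerprint': 'e5f6'})
print(page_error(req, 'TIMEOUT'))   # {'e5f6': 0.1}
print(page_error(req, 'HTTP_404'))  # {}
```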

finished()

Called by the strategy worker after each cycle of spider log processing is finished. If this method returns True, the strategy worker reports that the crawling goal is achieved, stops, and exits.

Returns:bool

The class, named CrawlingStrategy, should be put in a standalone module and passed to the strategy worker using a command line option on startup.

The strategy class is instantiated in the strategy worker and can use its own storage or any other kind of resources. All items from the spider log are passed through these methods. The scores returned don’t have to be the same as those in the method arguments. The finished() method is called periodically to check if the crawling goal is achieved.
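The pieces above can be assembled into a standalone module. The skeleton below is schematic: a real implementation would subclass frontera.core.components.BaseCrawlingStrategy (omitted here to keep the sketch self-contained), and the page limit is an arbitrary example of a crawling goal:

```python
# strategy.py -- standalone module holding the CrawlingStrategy class.

class Request(object):  # stand-in for frontera's Request, illustration only
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


class CrawlingStrategy(object):
    """Schematic strategy: crawl at most max_pages documents.

    A real implementation would subclass
    frontera.core.components.BaseCrawlingStrategy.
    """

    def __init__(self, max_pages=10000):
        # Strategy-local state; any external storage could be used instead.
        self.pages_crawled = 0
        self.max_pages = max_pages

    def add_seeds(self, seeds):
        # Schedule every seed at maximum priority.
        return {s.meta['fingerprint']: 1.0 for s in seeds}

    def page_crawled(self, response, links):
        self.pages_crawled += 1
        # Flat score for extracted links; a real strategy would
        # usually compute something smarter here.
        return {l.meta['fingerprint']: 0.5 for l in links}

    def page_error(self, request, error):
        return {}  # do not reschedule failed pages

    def finished(self):
        # Crawling goal: stop after max_pages successful fetches.
        return self.pages_crawled >= self.max_pages
```

A module like this would then be handed to the strategy worker at startup via its command line option.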