Middlewares

Frontier Middleware sits between the FrontierManager and Backend objects, using hooks for Request and Response processing according to the frontier data flow.

It’s a light, low-level system for filtering and altering Frontier’s requests and responses.

Activating a middleware

To activate a Middleware component, add it to the MIDDLEWARES setting, which is a list whose values can be class paths or instances of Middleware objects.

Here’s an example:

MIDDLEWARES = [
    'frontera.contrib.middlewares.domain.DomainMiddleware',
]

Middlewares are called in the same order in which they are defined in the list. To decide where to place your middleware, pick the position in the list at which you want it to run. The order matters because each middleware performs a different action, and your middleware could depend on some previous (or subsequent) middleware being applied.

Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware's documentation for more info.
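To illustrate why list order matters, here is a minimal sketch that does not use Frontera at all. The functions `add_domain`, `add_domain_fingerprint` and `run_chain` are hypothetical stand-ins for two middlewares and for the manager's call loop; the point is only that the second step relies on data written by the first.

```python
import hashlib

# Toy stand-ins, not Frontera classes: each "middleware" is a plain function
# that receives a dict of request metadata and returns it, possibly modified.

def add_domain(meta):
    # Plays the role of a middleware that records the domain name.
    meta["domain"] = {"name": meta["url"].split("/")[2]}
    return meta

def add_domain_fingerprint(meta):
    # Depends on the "domain" key written by add_domain.
    name = meta["domain"]["name"]
    meta["domain"]["fingerprint"] = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return meta

def run_chain(middlewares, meta):
    # Call every middleware in list order, feeding each result to the next.
    for middleware in middlewares:
        meta = middleware(meta)
    return meta

# Correct order: the domain is available when the fingerprint step runs.
ok = run_chain([add_domain, add_domain_fingerprint],
               {"url": "http://example.com/page"})

# Reversed order: the fingerprint step runs before the domain exists.
try:
    run_chain([add_domain_fingerprint, add_domain],
              {"url": "http://example.com/page"})
    wrong_order_failed = False
except KeyError:
    wrong_order_failed = True
```

With the reversed list the second call fails with a KeyError, which is the kind of dependency problem the ordering rule is meant to avoid.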

Writing your own middleware

Writing your own frontier middleware is easy. Each Middleware component is a single Python class inherited from Component.

FrontierManager will communicate with all active middlewares through the methods described below.

class frontera.core.components.Middleware

Interface definition for a Frontier Middleware

Methods

frontier_start()

Called when the frontier starts. See starting/stopping the frontier.

frontier_stop()

Called when the frontier stops. See starting/stopping the frontier.

add_seeds(seeds)

This method is called when new seeds are added to the frontier.

Parameters: seeds (list) – A list of Request objects.
Returns: Request object list or None

Should either return None or a list of Request objects.

If it returns None, FrontierManager won’t continue processing any other middleware and the seeds will never reach the Backend.

If it returns a list of Request objects, the list will be passed to the next middleware. This process repeats for all active middlewares until the result is finally passed to the Backend.

If you want to filter any seed, just don’t include it in the returned object list.
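As a rough illustration of seed filtering, the sketch below keeps only HTTPS seeds. The `Request` class here is a minimal stand-in defined for the example, and `HttpsOnlySeedsMiddleware` is a hypothetical name; in a real project the class would inherit from frontera.core.components.Middleware and receive real Request objects.

```python
class Request:
    """Minimal stand-in for a frontier Request (assumption for this sketch)."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


class HttpsOnlySeedsMiddleware:
    """Hypothetical middleware that drops every seed that is not HTTPS."""

    def add_seeds(self, seeds):
        # Seeds left out of the returned list never reach the Backend.
        return [seed for seed in seeds if seed.url.startswith("https://")]


seeds = [Request("https://example.com/"), Request("http://example.org/")]
kept = HttpsOnlySeedsMiddleware().add_seeds(seeds)
```

Note that the method still returns a list (possibly empty) rather than None, so later middlewares keep running.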

page_crawled(response, links)

This method is called each time a page has been crawled.

Parameters:
  • response (object) – The Response object for the crawled page.
  • links (list) – A list of Request objects generated from the links extracted for the crawled page.
Returns: Response or None

Should either return None or a Response object.

If it returns None, FrontierManager won’t continue processing any other middleware and the Backend will never be notified.

If it returns a Response object, it will be passed to the next middleware. This process repeats for all active middlewares until the result is finally passed to the Backend.

If you want to filter a page, just return None.
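The chaining rule described above can be sketched as follows. This is not Frontera's own code: `Response` is a minimal stand-in, `process_page_crawled` is a hypothetical helper that imitates how a manager might walk the middleware list, and both middleware classes are made up for the example.

```python
class Response:
    """Minimal stand-in for a frontier Response (assumption for this sketch)."""

    def __init__(self, url, status_code):
        self.url = url
        self.status_code = status_code
        self.meta = {}


class DropErrorPagesMiddleware:
    def page_crawled(self, response, links):
        # Returning None filters the page and stops the chain.
        if response.status_code >= 400:
            return None
        return response


class TagMiddleware:
    def page_crawled(self, response, links):
        # Only runs if every earlier middleware returned a Response.
        response.meta["seen_by_tag"] = True
        return response


def process_page_crawled(middlewares, response, links):
    # Hypothetical helper imitating the rule described in the text.
    for middleware in middlewares:
        response = middleware.page_crawled(response, links)
        if response is None:
            return None  # the Backend would never be notified
    return response  # this is what would be handed to the Backend


chain = [DropErrorPagesMiddleware(), TagMiddleware()]
passed = process_page_crawled(chain, Response("http://example.com/", 200), [])
dropped = process_page_crawled(chain, Response("http://example.com/missing", 404), [])
```

The 200 response travels through both middlewares, while the 404 response is filtered by the first one and never reaches the second.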

request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • page (object) – The Request object whose crawl produced the error.
  • error (string) – A string identifier for the error.
Returns: Request or None

Should either return None or a Request object.

If it returns None, FrontierManager won’t continue processing any other middleware and the Backend will never be notified.

If it returns a Request object, it will be passed to the next middleware. This process repeats for all active middlewares until the result is finally passed to the Backend.

If you want to filter a page error, just return None.
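A small sketch of an error-handling middleware might look like this. The `Request` stand-in, the class name `ErrorStatsMiddleware` and the error identifier "dns_error" are all assumptions made for illustration; the actual identifiers depend on the crawler that reports the errors.

```python
from collections import Counter


class Request:
    """Minimal stand-in for a frontier Request (assumption for this sketch)."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


class ErrorStatsMiddleware:
    """Hypothetical middleware that counts errors by identifier."""

    def __init__(self):
        self.error_counts = Counter()

    def request_error(self, page, error):
        self.error_counts[error] += 1
        # Made-up policy: hide DNS failures from the Backend.
        if error == "dns_error":
            return None
        # Any other error is passed on to the next middleware.
        return page


stats = ErrorStatsMiddleware()
request = Request("http://example.com/")
forwarded = stats.request_error(request, "timeout")
filtered = stats.request_error(request, "dns_error")
```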

Class Methods

from_manager(manager)

Class method called from FrontierManager passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    return cls(settings=manager.settings)
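For context, the sketch below shows the same pattern inside a complete class. `FakeManager` is a stand-in that models only the settings attribute, the settings are represented by a plain dict, and MAX_URL_LENGTH is a hypothetical setting name invented for the example.

```python
class FakeManager:
    """Stand-in for FrontierManager; only the settings attribute is modelled."""

    def __init__(self, settings):
        self.settings = settings


class UrlLengthMiddleware:
    """Hypothetical middleware configured from the manager's settings."""

    def __init__(self, settings):
        # MAX_URL_LENGTH is a made-up setting used only for illustration.
        self.max_length = settings.get("MAX_URL_LENGTH", 2000)

    @classmethod
    def from_manager(cls, manager):
        # Build the instance from whatever the manager exposes.
        return cls(settings=manager.settings)


middleware = UrlLengthMiddleware.from_manager(FakeManager({"MAX_URL_LENGTH": 100}))
default_middleware = UrlLengthMiddleware.from_manager(FakeManager({}))
```

Keeping construction in a class method lets the manager create the middleware from a class path without knowing which settings it needs.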

Built-in middleware reference

This page describes all Middleware components that come with Frontera. For information on how to use them and how to write your own middleware, see the middleware usage guide.

For a list of the components enabled by default (and their orders) see the MIDDLEWARES setting.

DomainMiddleware

class frontera.contrib.middlewares.domain.DomainMiddleware

This Middleware will add a domain info field to every Request.meta and Response.meta if it is activated.

The domain object will contain the following fields:

  • netloc: URL netloc according to RFC 1808 syntax specifications
  • name: Domain name
  • scheme: URL scheme
  • tld: Top level domain
  • sld: Second level domain
  • subdomain: URL subdomain(s)

An example for a Request object:

>>> request.url
'http://www.scrapinghub.com:8080/this/is/an/url'

>>> request.meta['domain']
{
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}

If TEST_MODE is active, it will accept testing URLs and parse single-letter domains:

>>> request.url
'A1'

>>> request.meta['domain']
{
    "name": "A",
    "netloc": "A",
    "scheme": "-",
    "sld": "-",
    "subdomain": "-",
    "tld": "-"
}
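To make the field names concrete, here is a naive sketch that derives the same fields for the first example URL using only the standard library. It is not how DomainMiddleware works internally: `naive_domain_info` is a hypothetical helper that treats the last host label as the TLD, so multi-part suffixes such as "co.uk" and single-label hosts are not handled.

```python
from urllib.parse import urlparse


def naive_domain_info(url):
    # Naive split: assumes the host has at least two labels and that the
    # last label alone is the top level domain.
    parsed = urlparse(url)
    labels = parsed.hostname.split(".")
    tld = labels[-1]
    sld = labels[-2]
    subdomain = ".".join(labels[:-2])
    name = sld + "." + tld
    # The netloc here is rebuilt from the host labels, without the port,
    # to mirror the example output shown earlier.
    netloc = subdomain + "." + name if subdomain else name
    return {
        "name": name,
        "netloc": netloc,
        "scheme": parsed.scheme,
        "sld": sld,
        "subdomain": subdomain,
        "tld": tld,
    }


info = naive_domain_info("http://www.scrapinghub.com:8080/this/is/an/url")
```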

UrlFingerprintMiddleware

class frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware

This Middleware will add a fingerprint field to every Request.meta and Response.meta if it is activated.

The fingerprint will be calculated from the object URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting.
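A custom fingerprint function could look like the sketch below, which ignores the URL fragment before hashing. The function name and the dotted path in the setting are made up, and the example assumes that the setting accepts a dotted path to a function taking a URL string and returning a string.

```python
import hashlib
from urllib.parse import urldefrag


def fragmentless_sha1_fingerprint(url):
    # Hypothetical custom function: drop the "#fragment" part, then hash
    # the remaining URL with SHA-1 and return the hex digest.
    url_without_fragment = urldefrag(url).url
    return hashlib.sha1(url_without_fragment.encode("utf-8")).hexdigest()


# In the settings module (the dotted path is a made-up example):
URL_FINGERPRINT_FUNCTION = "myproject.fingerprints.fragmentless_sha1_fingerprint"

with_fragment = fragmentless_sha1_fingerprint("http://www.scrapinghub.com:8080/page#top")
without_fragment = fragmentless_sha1_fingerprint("http://www.scrapinghub.com:8080/page")
```

Both calls give the same 40-character fingerprint, so the two URLs would be treated as the same document.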

An example for a Request object:

>>> request.url
'http://www.scrapinghub.com:8080'

>>> request.meta['fingerprint']
'60d846bc2969e9706829d5f1690f11dafb70ed18'

frontera.utils.fingerprint.hostname_local_fingerprint(key)

This function is used for URL fingerprinting, which serves to uniquely identify the document in storage. hostname_local_fingerprint builds the fingerprint in two parts: the first 4 bytes are the CRC32 of the host, and the remaining bytes are the MD5 of the rest of the URL. The default option is set to make use of the HBase block cache. All the documents of an average website are expected to fit within one cache block, which can be efficiently read from disk once.

Parameters: key – str URL
Returns: str – hex string encoding 20 bytes
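The layout described above (4 bytes of host CRC32 followed by 16 bytes of MD5) can be sketched with the standard library. This is an illustration under assumptions, not the library's implementation: `host_local_fingerprint_sketch` is a hypothetical name, and the exact way the non-host part of the URL is assembled before hashing is a guess.

```python
import hashlib
import struct
import zlib
from binascii import hexlify
from urllib.parse import urlparse


def host_local_fingerprint_sketch(url):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # First 4 bytes: CRC32 of the host name, as an unsigned 32-bit integer.
    host_crc = zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF
    # Remaining 16 bytes: MD5 of the rest of the URL (assumed composition).
    rest = parsed.path + "?" + parsed.query + "#" + parsed.fragment
    rest_md5 = hashlib.md5(rest.encode("utf-8")).digest()
    # Pack big-endian as 4 + 16 = 20 bytes, then hex-encode to 40 characters.
    return hexlify(struct.pack(">I16s", host_crc, rest_md5)).decode("ascii")


first = host_local_fingerprint_sketch("http://www.scrapinghub.com/page/1")
second = host_local_fingerprint_sketch("http://www.scrapinghub.com/page/2")
```

Because the host checksum comes first, fingerprints for pages of the same host share a common prefix and sort next to each other, which is what allows them to land in the same storage block.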

DomainFingerprintMiddleware

class frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware

This Middleware will add a fingerprint field to the domain field of every Request.meta and Response.meta if it is activated.

The fingerprint will be calculated from the object URL, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting.

An example for a Request object:

>>> request.url
'http://www.scrapinghub.com:8080'

>>> request.meta['domain']
{
    "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}