Backends¶

The frontier Backend is where the crawling logic and policies live. It is responsible for receiving all the crawl info and selecting the next pages to be crawled. The FrontierManager calls it after the Middleware, using hooks for Request and Response processing according to the frontier data flow.

Unlike Middleware, which can have many different instances active at once, only one Backend can be used per frontier.

Depending on the logic implemented, some backends require persistent storage to manage Request and Response object information.

Activating a backend¶

To activate the frontier backend component, set it through the BACKEND setting.

Here’s an example:

BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'

Keep in mind that some backends may need to be enabled through a particular setting. See each backend's documentation for more info.

Writing your own backend¶

Writing your own frontier backend is easy. Each Backend component is a single Python class inherited from Component.

FrontierManager will communicate with the active Backend through the methods described below.
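As a rough illustration of what such a class might look like, here is a minimal sketch. It does not import Crawl Frontier: the Component class below is a local stand-in, and the hook names (add_seeds, page_crawled, request_error, get_next_requests) are assumptions about the interface, so check them against the actual Component reference. For brevity the hooks take plain URL strings, whereas the frontier works with Request and Response objects.

```python
from collections import deque


class Component(object):
    """Local stand-in for the Crawl Frontier Component base class (assumption)."""

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass


class SimpleFIFOBackend(Component):
    """Hypothetical backend that hands out URLs in the order they were discovered."""

    def __init__(self):
        self.queue = deque()  # URLs waiting to be crawled
        self.seen = set()     # URLs already scheduled, used to skip duplicates

    def _schedule(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def add_seeds(self, seeds):
        # Assumed hook: called once with the initial URLs.
        for url in seeds:
            self._schedule(url)

    def page_crawled(self, url, links):
        # Assumed hook: called after a page is fetched, with the links found on it.
        for link in links:
            self._schedule(link)

    def request_error(self, url, error):
        # Assumed hook: a real backend might record the error or retry.
        pass

    def get_next_requests(self, max_n_requests):
        # Assumed hook: return at most max_n_requests URLs to crawl next.
        batch = []
        while self.queue and len(batch) < max_n_requests:
            batch.append(self.queue.popleft())
        return batch
```

All crawl-ordering policy sits in get_next_requests; the other hooks only feed the backend with what the crawler has seen so far.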

Built-in backend reference¶

This page describes all the backend components that come with Crawl Frontier. For information on how to use them and how to write your own backend, see the backend usage guide.

To find out which Backend is activated by default, check the BACKEND setting.

Basic algorithms¶

Some of the built-in Backend objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.

They differ only in the storage engine used. For instance, memory.FIFO and sqlalchemy.FIFO use the same logic but with different storage engines.

Memory backends¶

This set of Backend objects will use a heapq object as storage for basic algorithms.
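To make the idea concrete, the sketch below shows one way a heapq can produce FIFO, LIFO, BFS and DFS ordering simply by changing the priority key. It is an illustration under assumptions, not the code of the built-in backends: HeapFrontier, ORDERINGS and crawl_order are hypothetical names, and pages are reduced to (url, depth) pairs.

```python
import heapq
import itertools


class HeapFrontier(object):
    """Hypothetical in-memory frontier ordered by a pluggable key function."""

    def __init__(self, key):
        self._heap = []
        self._key = key
        self._counter = itertools.count()  # records insertion order

    def push(self, url, depth):
        order = next(self._counter)
        # The insertion counter is the second tuple element, so pages with
        # equal priority come out in the order they were pushed.
        heapq.heappush(self._heap, (self._key(order, depth), order, url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# heapq pops the smallest key first, so "newest" or "deepest" first is
# expressed by negating the value.
ORDERINGS = {
    'FIFO': lambda order, depth: order,
    'LIFO': lambda order, depth: -order,
    'BFS': lambda order, depth: depth,
    'DFS': lambda order, depth: -depth,
}


def crawl_order(name, pages):
    """Return the URLs of (url, depth) pairs in the order the algorithm visits them."""
    frontier = HeapFrontier(ORDERINGS[name])
    for url, depth in pages:
        frontier.push(url, depth)
    return [frontier.pop() for _ in range(len(frontier))]
```

For example, crawl_order('BFS', [('a', 0), ('b', 1), ('c', 2), ('d', 1)]) visits the shallowest pages first, while 'DFS' starts from the deepest one.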

class crawlfrontier.contrib.backends.memory.BASE¶: Base class for in-memory heapq Backend objects.

class crawlfrontier.contrib.backends.memory.FIFO¶: In-memory heapq Backend implementation of FIFO algorithm.

class crawlfrontier.contrib.backends.memory.LIFO¶: In-memory heapq Backend implementation of LIFO algorithm.

class crawlfrontier.contrib.backends.memory.BFS¶: In-memory heapq Backend implementation of BFS algorithm.

class crawlfrontier.contrib.backends.memory.DFS¶: In-memory heapq Backend implementation of DFS algorithm.

class crawlfrontier.contrib.backends.memory.RANDOM¶: In-memory heapq Backend implementation of a random selection algorithm.

SQLAlchemy backends¶

This set of Backend objects will use SQLAlchemy as storage for basic algorithms.

By default they use an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.
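The sketch below illustrates how the same ordering logic can be expressed against a database instead of an in-memory structure. To stay dependency-free it uses the standard-library sqlite3 module directly rather than SQLAlchemy, so it is not how the built-in backends are written. The table layout, the auto-incrementing id column and the helper names make_store and next_pages are all assumptions made for the example.

```python
import sqlite3

# Assumed ORDER BY clause for each basic algorithm. `id` is a hypothetical
# auto-incrementing column standing in for insertion order.
ORDER_BY = {
    'FIFO': 'id ASC',
    'LIFO': 'id DESC',
    'BFS': 'depth ASC, id ASC',
    'DFS': 'depth DESC, id ASC',
}


def make_store(pages):
    """Create an in-memory SQLite table and load (url, depth) pairs into it."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        "CREATE TABLE pages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL UNIQUE,"
        " depth INTEGER NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'NOT CRAWLED')"
    )
    conn.executemany("INSERT INTO pages (url, depth) VALUES (?, ?)", pages)
    conn.commit()
    return conn


def next_pages(conn, algorithm, limit):
    """Return up to `limit` uncrawled URLs in the order the algorithm dictates."""
    query = (
        "SELECT url FROM pages WHERE state = 'NOT CRAWLED' "
        "ORDER BY " + ORDER_BY[algorithm] + " LIMIT ?"
    )
    return [row[0] for row in conn.execute(query, (limit,))]
```

Only the ORDER BY clause changes between algorithms; the storage and the selection query stay the same, which is what lets one base class serve several orderings.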

Request and Response objects are represented by a declarative SQLAlchemy model:

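# Imports the model needs. Base is assumed to be a standard declarative base.
from sqlalchemy import Column, Integer, String, TIMESTAMP, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Page(Base):
    __tablename__ = 'pages'
    __table_args__ = (
        UniqueConstraint('url'),
    )

    class State:
        # Values stored in the `state` column
        NOT_CRAWLED = 'NOT CRAWLED'
        QUEUED = 'QUEUED'
        CRAWLED = 'CRAWLED'
        ERROR = 'ERROR'

    url = Column(String(1000), nullable=False)
    # The fingerprint identifies the page and serves as the primary key
    fingerprint = Column(String(40), primary_key=True, nullable=False, index=True, unique=True)
    depth = Column(Integer, nullable=False)
    created_at = Column(TIMESTAMP, nullable=False)
    status_code = Column(String(20))
    state = Column(String(10))
    error = Column(String(20))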

If you need to create your own models, you can do it by using the DEFAULT_MODELS setting:

DEFAULT_MODELS = {
    'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
}

This setting is a dictionary in which each key is the name of the model to define and each value is the path of the model class to use. For instance, to add a model representing domains:

DEFAULT_MODELS = {
    'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
    'Domain': 'myproject.backends.sqlalchemy.models.Domain',
}

Models can be accessed through the Backend's models attribute, a dictionary keyed by model name.
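As a sketch of how a dictionary of dotted paths like DEFAULT_MODELS could be turned into a dictionary of classes, consider the following. The helpers load_object and load_models are hypothetical names introduced for illustration; how Crawl Frontier actually resolves the setting is not shown here. Standard-library classes stand in for real model classes so the example runs on its own.

```python
from importlib import import_module


def load_object(path):
    """Import and return the object named by a dotted path such as 'package.module.Name'."""
    module_path, _, name = path.rpartition('.')
    if not module_path:
        raise ValueError("%r is not a full dotted path" % path)
    return getattr(import_module(module_path), name)


def load_models(default_models):
    """Map each model name to the class its dotted path points to."""
    return {name: load_object(path) for name, path in default_models.items()}


# Standard-library classes stand in for real model classes here.
EXAMPLE_MODELS = {
    'Page': 'collections.OrderedDict',
    'Domain': 'collections.Counter',
}

models = load_models(EXAMPLE_MODELS)
```

With a mapping like this, backend code can look up models['Domain'] without hard-coding the import path, which is what makes the models swappable from settings.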

For a complete list of the settings used by the SQLAlchemy backends, check the settings section.

class crawlfrontier.contrib.backends.sqlalchemy.BASE¶: Base class for SQLAlchemy Backend objects.

class crawlfrontier.contrib.backends.sqlalchemy.FIFO¶: SQLAlchemy Backend implementation of FIFO algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.LIFO¶: SQLAlchemy Backend implementation of LIFO algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.BFS¶: SQLAlchemy Backend implementation of BFS algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.DFS¶: SQLAlchemy Backend implementation of DFS algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.RANDOM¶: SQLAlchemy Backend implementation of a random selection algorithm.