Testing a Frontier¶
Frontier Tester is a helper class for easy frontier testing.
Basically it runs a fake crawl against a Frontier, crawl info is faked using a Graph Manager instance.
Creating a Frontier Tester¶
FrontierTester needs a Graph Manager and a FrontierManager instances:
>>> from crawlfrontier import FrontierManager, FrontierTester, graphs
>>> graph = graphs.Manager('sqlite:///graph.db') # Crawl fake data loading
>>> frontier = FrontierManager.from_settings() # Create frontier from default settings
>>> tester = FrontierTester(frontier, graph)
Running a Test¶
The tester is now initialized, to run the test just call the method run:
>>> tester.run()
When run method is called the tester will:
- Add all the seeds from the graph.
- Ask the frontier about next pages.
- Fake page response and inform the frontier about page crawl and its links.
Steps 1 and 2 are repeated until crawl or frontier ends.
Once the test is finished, the crawling page sequence is available as a list of frontier Request objects:
Test Parameters¶
In some test cases you may want to add all graph pages as seeds, this can be done with the parameter add_all_pages:
>>> tester.run(add_all_pages=True)
Maximum number of returned pages per get_next_requests call can be set using frontier settings, but also can be modified when creating the FrontierTester with the max_next_pages argument:
>>> tester = FrontierTester(frontier, graph, max_next_pages=10)
An example of use¶
A working example using test data from graphs and basic backends:
from crawlfrontier import FrontierManager, Settings, FrontierTester, graphs
def test_backend(backend):
# Graph
graph = graphs.Manager()
graph.add_site_list(graphs.data.SITE_LIST_02)
# Frontier
settings = Settings()
settings.BACKEND = backend
settings.TEST_MODE = True
frontier = FrontierManager.from_settings(settings)
# Tester
tester = FrontierTester(frontier, graph)
tester.run(add_all_pages=True)
# Show crawling sequence
print '-'*40
print frontier.backend.name
print '-'*40
for page in tester.sequence:
print page.url
if __name__ == '__main__':
test_backend('crawlfrontier.contrib.backends.memory.heapq.FIFO')
test_backend('crawlfrontier.contrib.backends.memory.heapq.LIFO')
test_backend('crawlfrontier.contrib.backends.memory.heapq.BFS')
test_backend('crawlfrontier.contrib.backends.memory.heapq.DFS')