GlossaryΒΆ

spider log
A stream of encoded messages from spiders. Each message is product of extraction from document content. Most of the time it is links, scores, classification results.
scoring log
Contains score updating events and scheduling flag (if link needs to be scheduled for download) going from strategy worker to db worker.
spider feed
A stream of messages from db worker to spiders containing new batches of documents to crawl.
strategy worker
Special type of worker, running the crawling strategy code: scoring the links, deciding if link needs to be scheduled (consults state cache) and when to stop crawling. That type of worker is sharded.
db worker
Is responsible for communicating with storage DB, and mainly saving metadata and content along with retrieving new batches to download.
state cache
In-memory data structure containing information about state of documents, whatever they were scheduled or not. Periodically synchronized with persistent storage.
message bus
Transport layer abstraction mechanism. Provides interface for transport layer abstraction and several implementations.