Canonical URL Solver¶
Is a special middleware object responsible for identifying canonical URL address of the document and modifying request or response metadata accordingly. Canonical URL solver always executes last in the middleware chain, before calling Backend methods.
The main purpose of this component is preventing metadata records duplication and confusing crawler behavior connected with it. The causes of this are: - Different redirect chains could lead to the same document. - The same document can be accessible by more than one different URL.
Well designed system has it’s own, stable algorithm of choosing the right URL for each document. Also see Canonical link element.
Canonical URL solver is instantiated during Frontera Manager initialization using class from
Built-in canonical URL solvers reference¶
Used as default.
Implements a simple CanonicalSolver taking always first URL from redirect chain, if there were redirects. It allows easily to avoid leaking of requests in Frontera (e.g. when request issued by
get_next_requests()never matched in
page_crawled()) at the price of duplicating records in Frontera for pages having more than one URL or complex redirects chains.