LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
deny (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty) it won’t exclude any links.
allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links.
deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links.
deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to scrapy.linkextractors.IGNORED_EXTENSIONS.
Changed in version 2.0: IGNORED_EXTENSIONS now includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz.
restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
restrict_text (str or list) – a single regular expression (or list of regular expressions) that the link’s text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
canonicalize (bool) – canonicalize each extracted url (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you’re using LinkExtractor to follow links it is more robust to keep the default canonicalize=False.
unique (bool) – whether duplicate filtering should be applied to extracted links.
process_value (collections.abc.Callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
You can use the following function in process_value:
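A minimal sketch of such a function, assuming the page contains anchors whose href looks like javascript:goToPage('../other/page.html'); return false. The goToPage name and the handling of other values are illustrative assumptions, not something the text above specifies.

```python
import re


def process_value(value):
    # Pull the first quoted argument out of a javascript:goToPage('...') value.
    match = re.search(r"javascript:goToPage\('(.*?)'", value)
    if match:
        return match.group(1)
    # Any other javascript: value has no usable URL; None drops the link.
    if value.startswith("javascript:"):
        return None
    # Ordinary URLs pass through unchanged.
    return value


print(process_value("javascript:goToPage('../other/page.html'); return false"))
# ../other/page.html
```

The function is then passed to the extractor as LinkExtractor(process_value=process_value), and the returned value is treated like any other extracted attribute value.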
strip (bool) – whether to strip whitespace from extracted attributes. According to the HTML5 standard, leading and trailing whitespace must be stripped from href attributes of <a>, <area> and many other elements, src attribute of <img>, <iframe> elements, etc., so LinkExtractor strips space chars by default. Set strip=False to turn it off (e.g. if you’re extracting URLs from elements or attributes which allow leading/trailing whitespace).