2.12 Recursive Accept/Reject Options

‘-A acclist’ ‘--accept acclist’ ‘-R rejlist’ ‘--reject rejlist’

Specify comma-separated lists of file name suffixes or patterns to accept or reject (see Types of Files). Note that if any of the wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix. In this case, you have to enclose the pattern in quotes to prevent your shell from expanding it, as in ‘-A "*.mp3"’ or ‘-A '*.mp3'’.
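
The effect of quoting can be checked without running Wget at all. The sketch below works in a scratch directory with a hypothetical file named song.mp3 and records what the shell would hand to a program in each case.

```shell
# Scratch directory holding one file that the glob can match
# (the name song.mp3 is only an illustration).
tmp=$(mktemp -d)
cd "$tmp"
touch song.mp3

# Unquoted: the shell replaces *.mp3 with the matching file names
# before any program sees the argument.
set -- -A *.mp3
unquoted="$*"

# Quoted: the pattern is passed through literally, which is what
# Wget needs in order to treat it as a pattern.
set -- -A '*.mp3'
quoted="$*"

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

In the unquoted case the argument becomes the local file name, so Wget would receive a plain suffix list instead of the intended pattern.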

‘--accept-regex urlregex’ ‘--reject-regex urlregex’

Specify a regular expression to accept or reject the complete URL.

‘--regex-type regextype’

Specify the regular expression type. Possible types are ‘posix’ or ‘pcre’. To use the ‘pcre’ type, Wget must be compiled with libpcre support.
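
Before starting a long crawl, it can help to preview which URLs an expression would select. The following is a rough sketch only: it assumes the ‘posix’ type behaves like a POSIX extended regular expression matched anywhere in the URL, and uses ‘grep -E’ as a stand-in for Wget's own matcher. The URLs and the expression are hypothetical.

```shell
# Hypothetical URLs that a recursive crawl might encounter.
urls='https://example.com/docs/index.html
https://example.com/docs/index.html?sort=name
https://example.com/img/logo.png'

# Expression one might pass to --reject-regex: any URL carrying a
# sort= query parameter.
regex='[?&]sort='

# grep -E approximates the matcher; -v keeps the URLs that do NOT
# match, that is, the ones that would survive the rejection.
kept=$(printf '%s\n' "$urls" | grep -Ev "$regex")
printf '%s\n' "$kept"

# The corresponding invocation would look like this (not run here):
#   wget -r --reject-regex '[?&]sort=' https://example.com/docs/
```

Note that the expression is quoted for the same reason as wildcard patterns: characters such as ‘?’ and ‘[’ are special to the shell.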

‘-D domain-list’ ‘--domains=domain-list’

Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on ‘-H’.

‘--exclude-domains domain-list’

Specify the domains that are not to be followed (see Spanning Hosts).
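
Because ‘-D’ does not enable host spanning by itself, it is normally combined with ‘-H’. A minimal sketch, wrapped in a hypothetical helper function named span_within so that the arguments are easy to inspect; the domain names are placeholders.

```shell
# Hypothetical helper: crawl recursively across hosts, but stay inside
# example.com and skip one of its subdomains.
# Argument 1: the URL to start from.
span_within() {
    wget -r -H -D example.com --exclude-domains archive.example.com "$1"
}

# Example call (needs network access, so it is left commented out):
#   span_within https://www.example.com/
```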

‘--follow-ftp’

Follow FTP links from HTML documents. Without this option, Wget will ignore all the FTP links.

‘--follow-tags=list’

Wget has an internal table of HTML tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval. If a user wants only a subset of those tags to be considered, however, he or she should specify such tags in a comma-separated list with this option.

‘--ignore-tags=list’

This is the opposite of the ‘--follow-tags’ option. To skip certain HTML tags when recursively looking for documents to download, specify them in a comma-separated list.

In the past, this option was the best bet for downloading a single page and its requisites, using a command line like:

wget --ignore-tags=a,area -H -k -K -r http://site/document

However, the author of this option came across a page with tags like <LINK REL="home" HREF="/"> and came to the realization that specifying tags to ignore was not enough. One can’t just tell Wget to ignore <LINK>, because then stylesheets will not be downloaded. Now the best bet for downloading a single page and its requisites is the dedicated ‘--page-requisites’ option.

‘--ignore-case’

Ignore case when matching files and directories. This influences the behavior of the ‘-R’, ‘-A’, ‘-I’, and ‘-X’ options, as well as globbing implemented when downloading from FTP sites. For example, with this option, ‘-A "*.txt"’ will match ‘file1.txt’, but also ‘file2.TXT’, ‘file3.TxT’, and so on. The quotes in the example are to prevent the shell from expanding the pattern.

‘-H’ ‘--span-hosts’

Enable spanning across hosts during recursive retrieval (see Spanning Hosts).

‘-L’ ‘--relative’

Follow relative links only. Useful for retrieving a specific home page without any distractions, not even those from the same hosts (see Relative Links).

‘-I list’ ‘--include-directories=list’

Specify a comma-separated list of directories you wish to follow when downloading (see Directory-Based Limits). Elements of list may contain wildcards.

‘-X list’ ‘--exclude-directories=list’

Specify a comma-separated list of directories you wish to exclude from download (see Directory-Based Limits). Elements of list may contain wildcards.
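
The two directory options can be combined to carve an exception out of an included tree. A minimal sketch, using a hypothetical helper function fetch_pub and placeholder directory names; the wildcard element is quoted so that the shell leaves it alone.

```shell
# Hypothetical helper: follow links only under /pub, except for
# /pub/worthless and any directory whose path starts with /pub/tmp.
# Argument 1: the URL to start from.
fetch_pub() {
    wget -r -I /pub -X '/pub/worthless,/pub/tmp*' "$1"
}

# Example call (needs network access, so it is left commented out):
#   fetch_pub https://example.com/pub/
```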

‘-np’ ‘--no-parent’

Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details.
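
A minimal sketch of a recursive retrieval confined to one subtree, using a hypothetical helper function mirror_subtree and a placeholder URL:

```shell
# Hypothetical helper: retrieve everything below the given directory
# without ever climbing to its parent.
# Argument 1: the directory URL to start from.
mirror_subtree() {
    wget -r -np "$1"
}

# Example call (needs network access, so it is left commented out):
#   mirror_subtree https://example.com/docs/manual/
```

For HTTP URLs the trailing slash in the starting URL matters here, since Wget relies on it to tell a directory from a file.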