Directory-Based Limits (GNU Wget 1.21.1-dirty Manual)

Revision as of 03:54, 6 December 2021 by Notes

4.3 Directory-Based Limits

Regardless of other link-following facilities, it is often useful to restrict which files are retrieved based on the directories they reside in. There can be many reasons for this: the home pages may be organized in a reasonable directory structure, or some directories may contain useless information, e.g. /cgi-bin or /dev directories.

Wget offers three different options to deal with this requirement. Each option description lists a short name, a long name, and the equivalent command in .wgetrc.
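As an illustration of the .wgetrc form, the sketch below writes the three commands described in this section to a startup file. The file name ./wgetrc.example and the /pub paths are assumptions chosen for the example; the final command is left as a comment because ‘http://host/’ is only a placeholder.

```shell
#!/bin/sh
# Hypothetical startup file; the name and the /pub paths are illustrative only.
rc=./wgetrc.example

cat > "$rc" <<'EOF'
# Follow only links under /pub ...
include_directories = /pub
# ... but skip /pub/worthless ...
exclude_directories = /pub/worthless
# ... and never ascend above the starting directory.
no_parent = on
EOF

# To use the file for a single run (placeholder host, so not executed here):
# wget --config="$rc" -r http://host/pub/
```

Settings given this way act as defaults, so the same restrictions can still be given directly on the command line with ‘-I’, ‘-X’ and ‘-np’.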

-I list
--include-directories list
include_directories = list

The ‘-I’ option accepts a comma-separated list of directories included in the retrieval. Any other directories will simply be ignored. The directories are absolute paths.

So, if you wish to download from ‘http://host/people/bozo/’ following only links to bozo’s colleagues in the /people directory and the bogus scripts in /cgi-bin, you can specify:

wget -I /people,/cgi-bin http://host/people/bozo/

-X list
--exclude-directories list
exclude_directories = list

The ‘-X’ option is exactly the reverse of ‘-I’: it takes a list of directories excluded from the download. E.g. if you do not want Wget to download things from the /cgi-bin directory, specify ‘-X /cgi-bin’ on the command line.

As with ‘-A’/‘-R’, these two options can be combined for finer control over which subdirectories are downloaded. E.g. if you want to load all the files from the /pub hierarchy except for /pub/worthless, specify ‘-I/pub -X/pub/worthless’.
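The way the two lists interact can be modelled with a short shell sketch. It is a simplified illustration, not Wget's actual implementation: the helper names dir_in_list and dir_allowed are hypothetical, wildcards in the lists are not handled, and directory names are assumed to contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if directory $1 equals, or lies below,
# one of the comma-separated directories in $2.
dir_in_list() {
    dir=$1
    for prefix in $(printf '%s' "$2" | tr ',' ' '); do
        case "$dir/" in
            "${prefix%/}/"*) return 0 ;;   # match only at a directory boundary
        esac
    done
    return 1
}

# Hypothetical helper modelling "-I $2 -X $3" for directory $1.
# An empty list means that the corresponding option was not given.
dir_allowed() {
    dir=$1 includes=$2 excludes=$3
    if [ -n "$includes" ] && ! dir_in_list "$dir" "$includes"; then
        return 1    # an include list exists and the directory is not on it
    fi
    if [ -n "$excludes" ] && dir_in_list "$dir" "$excludes"; then
        return 1    # the directory is explicitly excluded
    fi
    return 0
}

# Model of '-I/pub -X/pub/worthless':
dir_allowed /pub/docs /pub /pub/worthless && echo "fetch /pub/docs"
dir_allowed /pub/worthless/old /pub /pub/worthless || echo "skip /pub/worthless/old"
```

In this model a directory must pass both tests, so an exclusion nested inside an included hierarchy takes precedence over the inclusion.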

-np
--no-parent
no_parent = on

The simplest, and often very useful, way of limiting directories is to disallow retrieval of links that refer to the hierarchy above the beginning directory, i.e. disallowing ascent to the parent directory/directories.

The ‘--no-parent’ option (short ‘-np’) is useful in this case. Using it guarantees that you will never leave the existing hierarchy. Supposing you issue Wget with:

wget -r --no-parent http://somehost/~luzer/my-archive/

You may rest assured that none of the references to /~his-girls-homepage/ or /~luzer/all-my-mpegs/ will be followed. Only the archive you are interested in will be downloaded. Essentially, ‘--no-parent’ is similar to ‘-I/~luzer/my-archive’, only it handles redirections in a more intelligent fashion.

Note that, for HTTP (and HTTPS), the trailing slash is very important to ‘--no-parent’. HTTP has no concept of a “directory”—Wget relies on you to indicate what’s a directory and what isn’t. In ‘http://foo/bar/’, Wget will consider ‘bar’ to be a directory, while in ‘http://foo/bar’ (no trailing slash), ‘bar’ will be considered a filename (so ‘--no-parent’ would be meaningless, as its parent is ‘/’).
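The effect of the trailing slash can be modelled with a small shell sketch. It is an illustration under stated assumptions rather than Wget's own code: the helper names base_dir and stays_below are hypothetical, and only URL paths (not full URLs) are compared.

```shell
#!/bin/sh
# Hypothetical helper: the directory part of a URL path.
# A path ending in "/" is itself the directory; otherwise the last
# component is treated as a filename and stripped.
base_dir() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "${1%/*}/" ;;
    esac
}

# Hypothetical helper: succeeds if link path $2 stays at or below the
# directory derived from start path $1.
stays_below() {
    base=$(base_dir "$1")
    case "$2" in
        "$base"*) return 0 ;;
        *)        return 1 ;;
    esac
}

base_dir /bar/    # prints /bar/
base_dir /bar     # prints /

# With the trailing slash, the sibling directory is outside the hierarchy:
stays_below /~luzer/my-archive/ /~luzer/all-my-mpegs/ || echo "not followed"
# Without it, the base becomes /~luzer/ and the sibling is accepted:
stays_below /~luzer/my-archive /~luzer/all-my-mpegs/ && echo "followed"
```

In the second case the start path is read as a file inside /~luzer/, so everything under /~luzer/ counts as being within the hierarchy.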