WEBCRAWL(1)              FreeBSD General Commands Manual             WEBCRAWL(1)
NAME
WebCrawl - download web sites, following links

SYNOPSIS
webcrawl [ options ] host[:port]/filename directory

DESCRIPTION
WebCrawl is a program designed to download an entire website without user
interaction (although an interactive mode is available).
WebCrawl downloads the page given by host[:port]/filename into the named
destination directory under the compiled-in server root directory (which can
be changed with the -o option; see below). The web address should not
include a leading http://.
It works simply by starting with a single web page, and following
all links from that page to attempt to recreate the directory structure on
the remote server.
As well as downloading the pages, it also rewrites them to use local URLs
wherever a page contains URLs that would otherwise not work on the local
system (e.g. URLs that begin with http:// or with a /).
It stores the downloaded files in a directory structure that
mirrors the original site's, under a directory called
server.domain.com:port. This way, multiple sites can all be loaded
into the same directory structure, and if they link to each other, they can
be rewritten to link to the local, rather than remote, versions.
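For instance (the site name is a placeholder, and the exact directory naming,
such as whether the default port number appears, may differ), a command like
        webcrawl www.site.com/index.html mirror
might leave the downloaded files under a tree of the form
        mirror/www.site.com:80/index.html
        mirror/www.site.com:80/images/logo.gif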
Comprehensive URL selection facilities allow you to describe what
documents you want to download, so that you don't end up downloading much
more than you need.
WebCrawl is written in ANSI C, and should work on any POSIX
system. With minor modifications, it should be possible to make it work on
any operating system that supports TCP/IP sockets. It has been tested only
on Linux.

OPTIONS
URL selection:
-a
        This causes the program to ask the user whether to download a page
        that it has not otherwise been instructed to download (by default,
        this means off-site pages).
-f string
        This causes the program to always follow links to URLs that contain
        the string. You can use this, for example, to prevent a crawl from
        going up beyond a single directory on a site (in conjunction with
        the -x option below); say you wanted to get
        http://www.web-sites.co.uk/jules but not anything else located on
        the same server. You could use the command line:
            webcrawl -x -f /jules www.web-sites.co.uk/jules/ mirror
        Another use would be if a site contained links to pictures, videos,
        or sound clips (for example) on a remote server; you could use the
        following command line to get them:
            webcrawl -f .jpg -f .gif -f .mpg -f .wav -f .au www.site.com/ mirror
        Note that webcrawl always downloads inline images.
-d string
        The opposite of -f, this option tells webcrawl never to get a URL
        containing the string. -d takes priority over all other URL
        selection options (except that it will not stop webcrawl from
        downloading inline images, which are always downloaded).
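        For instance (the string and site name here are only illustrative),
        to mirror a site while skipping any URL containing cgi-bin:
            webcrawl -d cgi-bin www.site.com/ mirror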
-u filename
        Causes webcrawl to log unfollowed links to the file filename.
-x
        Causes webcrawl not to automatically follow links to pages on the
        same server. This is useful in conjunction with the -f option to
        specify a subsection of an entire site to download.
-X
        Causes webcrawl not to automatically download inline images (which
        it would otherwise do even when other options did not indicate that
        the image should be loaded). This is useful in conjunction with the
        -f option to specify a subsection of an entire site to download,
        when even the images concerned need careful selection.
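        As an illustration (the site and path are hypothetical), the
        following restricts the crawl, including its inline images, to URLs
        containing /jules:
            webcrawl -x -X -f /jules www.web-sites.co.uk/jules/ mirror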
Page re-writing:
-n
        Turns off page rewriting completely.
-rx
        Select which URLs to rewrite. Only URLs that begin with / or http:
        are considered for rewriting; all others are always left unchanged.
        This option selects which of these URLs are rewritten to point to
        local files, depending on the value of x:
        a       All absolute URLs are rewritten.
        l       Only URLs that point to pages on the same site are
                rewritten.
        f       (default) Only those URLs for which the file that the
                rewritten URL would point to actually exists are rewritten.
                Note that rewriting occurs after all links in a page have
                been followed (if required), so this is probably the most
                sensible option, and is therefore the default.
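        For example (hypothetical site name), to rewrite only URLs that
        point to pages on the same site, whether or not the target file was
        actually downloaded:
            webcrawl -rl www.site.com/ mirror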
-k
        Keep original filenames. This disables the changing of filenames to
        remove metacharacters that may confuse a web server, and to ensure
        that the filename ends in a correct .html or .htm extension whenever
        the page has a text/html content type. (See Configuration Files
        below for a discussion of how to achieve this with other file
        types.)
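        As a rough illustration of the default behaviour that -k disables
        (the example name is hypothetical, and the case of the hex digits is
        an assumption): a page fetched as report.cgi and served as text/html
        would have .html appended, giving report.cgi.html, and any of the
        default metacharacters ?&*%=# in a name would be replaced by the
        quoting character followed by its hexadecimal ASCII value, so that
        '&' becomes @26.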
-q
        Disable process ID insertion into query filenames. Without this
        flag, and whenever -k is not in use, webcrawl rewrites the filenames
        of queries (defined as any fetch from a web server that includes a
        '?' character in the filename) to include, after the (escaped) '?',
        the process ID of the webcrawl fetching the query in hexadecimal;
        this may be desirable when performing the same query multiple times
        to get different results. This flag disables that behaviour.
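        For example, a query fetched as search.cgi?q=test by a webcrawl
        process whose ID is 0x4b3 might be stored under a name along the
        lines of search.cgi@3f4b3q@3dtest.html; the precise layout is an
        assumption based on the quoting and renaming rules described here
        and under Configuration Files below.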
Recursion limiting:
-l[x] number
        This option is used to limit the depth to which webcrawl will
        search the tree (forest) of interlinked pages. There are two limits
        that may be set: with x given as l, the initial limit is set; with
        x given as r, the limit used after jumping to a remote site is set.
        If x is omitted, both limits are set.
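        For example (the depth values are arbitrary), to follow links up to
        five levels deep from the starting page, but only one level after
        jumping to a remote site (in combination with options that allow
        remote links to be followed at all):
            webcrawl -ll 5 -lr 1 www.site.com/ mirror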
-v
        Increases the program's verbosity. Without this option, no status
        reports are made unless errors occur. Used once, webcrawl will
        report which URLs it is trying to download, and also which links it
        has decided not to follow. -v may be given more than once, but this
        is probably only useful for debugging purposes.
-o dir
        Change the server root directory. This is the directory that the
        path specified at the end of the command line is relative to.
-p dir
        Change the URL rewriting prefix. This is prepended to rewritten
        URLs, and should be a (relative) URL that points to the current
        server root directory. An example of the use of the -o and -p
        options is given below:
            webcrawl -o /home/jules/public_html -p /~jules www.site.com/page.html mirrors
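        With this command, the downloaded files would be stored under
        /home/jules/public_html/mirrors, and rewritten links would begin
        with /~jules/mirrors/ (assuming /~jules is served from
        /home/jules/public_html by the local web server); the exact layout
        of the rewritten URLs is an inference from the description above.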
HTTP-related options:
-A string
        Causes webcrawl to send the specified string as the HTTP
        'User-Agent' value, rather than the compiled-in default (normally
        `Mozilla/4.05 [en] (X11; I; Linux 2.0.27 i586; Nav)', although this
        can be changed in the file web.h at compile time).
-t n
        Specifies a timeout, in seconds. The default behaviour is to give
        up after this length of time from the initial connection attempt.
-T
        Changes the timeout behaviour. With this flag, the timeout occurs
        only if no data is received from the server for the specified
        length of time.
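        For example (the agent string, timeout, and site are placeholders),
        to identify the crawl with a custom User-Agent and to give up only
        if nothing has been received for 60 seconds:
            webcrawl -A MyMirror/1.0 -t 60 -T www.site.com/ mirror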

CONFIGURATION FILES
At present, webcrawl uses configuration files to specify rules for the
rewriting of filenames. It looks for /etc/webcrawl.conf,
/usr/local/etc/webcrawl.conf, and $HOME/.webcrawl, and processes all the
files it finds in that order. Parameters set in one file may be overridden
by subsequent files. Note that it is perfectly possible to use webcrawl
without a configuration file; one is only required for advanced features
that are too complex to configure on the command line.
The overall syntax of a webcrawl configuration file is a set of sections,
each headed by a line of the form [section-name].
At present, only the [rename] section is defined. It may contain the
following commands:
meta string
        Sets the metacharacter list. Any character in the list specified
        will be quoted in filenames produced (unless filename rewriting is
        disabled with the -k option). Quoting is performed by prepending
        the quoting character (default @) to the hexadecimal ASCII value of
        the character being quoted. The default metacharacter list is:
            ?&*%=#
quote char
        Sets the quoting character, as described above. The default is:
            @
type content/type preferred [extra extra ...]
        Sets the list of acceptable extensions for the specified MIME
        content type. The first item in the list is the preferred
        extension; if renaming is not disabled (with the -k option) and the
        extension of a file of this type is not on the list, then the first
        extension on the list will be appended to its name.
        An implicit line is defined internally, which reads:
            type text/html html htm
        This can be overridden; if, say, you preferred the 'htm' extension
        over 'html', you could use:
            type text/html htm html
        in a configuration file to cause .htm extensions to be used
        whenever a new extension was added.
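Putting these commands together, a $HOME/.webcrawl file might look something
like the following (the particular values are only an illustration):
        [rename]
        meta ?&*%=#~
        quote @
        type text/html htm html
        type image/jpeg jpg jpeg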

AUTHOR
WebCrawl was written by Julian R. Hall <jules@acris.co.uk> with suggestions
and prompting by Andy Smith.

BUGS
Bugs should be submitted to Julian Hall at the address above. Please include
information about what architecture, version, etc. you are using.