WEBCRAWL(1)              FreeBSD General Commands Manual             WEBCRAWL(1)
NAME
WebCrawl - download web sites, following links

SYNOPSIS
webcrawl [ options ] host[:port]/filename directory

DESCRIPTION
WebCrawl is a program designed to download an entire website without user
interaction (although an interactive mode is available).
WebCrawl downloads the page given by host[:port]/filename into the named
destination directory under the compiled-in server root directory (which can
be changed with the -o option; see below). The web address should not
include a leading http://.
It works simply by starting with a single web page, and following
all links from that page to attempt to recreate the directory structure on
the remote server.
As well as downloading the pages, it also rewrites them to use local URLs
wherever a page contains URLs that would otherwise not work on the local
system (e.g. URLs that begin with http:// or with a /).
It stores the downloaded files in a directory structure that
mirrors the original site's, under a directory called
server.domain.com:port. This way, multiple sites can all be loaded
into the same directory structure, and if they link to each other, they can
be rewritten to link to the local, rather than remote, versions.
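For instance (the site name is a placeholder, and the exact directory naming,
such as whether the default port number appears, may differ), a command like
        webcrawl www.site.com/index.html mirror
might leave the downloaded files under a tree of the form
        mirror/www.site.com:80/index.html
        mirror/www.site.com:80/images/logo.gif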
Comprehensive URL selection facilities allow you to describe what
documents you want to download, so that you don't end up downloading much
more than you need.
WebCrawl is written in ANSI C, and should work on any POSIX
system. With minor modifications, it should be possible to make it work on
any operating system that supports TCP/IP sockets. It has been tested only
on Linux.

OPTIONS
URL selection:
-a
        This causes the program to ask the user whether to download a page
        that it has not otherwise been instructed to download (by default,
        this means off-site pages).
-f string
        This causes the program to always follow links to URLs that contain
        the string. You can use this, for example, to prevent a crawl from
        going up beyond a single directory on a site (in conjunction with
        the -x option below); say you wanted to get
        http://www.web-sites.co.uk/jules but not anything else located on
        the same server. You could use the command line:
            webcrawl -x -f /jules www.web-sites.co.uk/jules/ mirror
        Another use would be if a site contained links to pictures, videos,
        or sound clips (for example) on a remote server; you could use the
        following command line to get them:
            webcrawl -f .jpg -f .gif -f .mpg -f .wav -f .au www.site.com/ mirror
        Note that webcrawl always downloads inline images.
-d string
        The opposite of -f, this option tells webcrawl never to get a URL
        containing the string. -d takes priority over all other URL
        selection options (except that it will not stop webcrawl from
        downloading inline images, which are always downloaded).
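        For instance (the string and site name here are only illustrative),
        to mirror a site while skipping any URL containing cgi-bin:
            webcrawl -d cgi-bin www.site.com/ mirror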
-u filename
        Causes webcrawl to log unfollowed links to the file filename.
-x
        Causes webcrawl not to automatically follow links to pages on the
        same server. This is useful in conjunction with the -f option to
        specify a subsection of an entire site to download.
-X
        Causes webcrawl not to automatically download inline images (which
        it would otherwise do even when other options did not indicate that
        the image should be loaded). This is useful in conjunction with the
        -f option to specify a subsection of an entire site to download,
        when even the images concerned need careful selection.
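        As an illustration (the site and path are hypothetical), the
        following restricts the crawl, including its inline images, to URLs
        containing /jules:
            webcrawl -x -X -f /jules www.web-sites.co.uk/jules/ mirror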
Page re-writing:
-n
        Turns off page rewriting completely.
-rx
        Select which URLs to rewrite. Only URLs that begin with / or http:
        are considered for rewriting; all others are always left unchanged.
        This option selects which of these URLs are rewritten to point to
        local files, depending on the value of x:
        a       All absolute URLs are rewritten.
        l       Only URLs that point to pages on the same site are
                rewritten.
        f       (default) Only those URLs for which the file that the
                rewritten URL would point to actually exists are rewritten.
                Note that rewriting occurs after all links in a page have
                been followed (if required), so this is probably the most
                sensible option, and is therefore the default.
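        For example (hypothetical site name), to rewrite only URLs that
        point to pages on the same site, whether or not the target file was
        actually downloaded:
            webcrawl -rl www.site.com/ mirror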
-k
        Keep original filenames. This disables the changing of filenames to
        remove metacharacters that may confuse a web server, and to ensure
        that the filename ends in a correct .html or .htm extension whenever
        the page has a text/html content type. (See Configuration Files
        below for a discussion of how to achieve this with other file
        types.)
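        As a rough illustration of the default behaviour that -k disables
        (the example name is hypothetical, and the case of the hex digits is
        an assumption): a page fetched as report.cgi and served as text/html
        would have .html appended, giving report.cgi.html, and any of the
        default metacharacters ?&*%=# in a name would be replaced by the
        quoting character followed by its hexadecimal ASCII value, so that
        '&' becomes @26.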
-q
        Disable process ID insertion into query filenames. Without this
        flag, and whenever -k is not in use, webcrawl rewrites the filenames
        of queries (defined as any fetch from a web server that includes a
        '?' character in the filename) to include, after the (escaped) '?',
        the process ID of the webcrawl fetching the query in hexadecimal;
        this may be desirable when performing the same query multiple times
        to get different results. This flag disables that behaviour.
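        For example, a query fetched as search.cgi?q=test by a webcrawl
        process whose ID is 0x4b3 might be stored under a name along the
        lines of search.cgi@3f4b3q@3dtest.html; the precise layout is an
        assumption based on the quoting and renaming rules described here
        and under Configuration Files below.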
Recursion limiting:
-l[x] number
        This option is used to limit the depth to which webcrawl will
        search the tree (forest) of interlinked pages. There are two limits
        that may be set: with x given as l, the initial limit is set; with
        x given as r, the limit used after jumping to a remote site is set.
        If x is omitted, both limits are set.
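        For example (the depth values are arbitrary), to follow links up to
        five levels deep from the starting page, but only one level after
        jumping to a remote site (in combination with options that allow
        remote links to be followed at all):
            webcrawl -ll 5 -lr 1 www.site.com/ mirror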
-v
        Increases the program's verbosity. Without this option, no status
        reports are made unless errors occur. Used once, webcrawl will
        report which URLs it is trying to download, and also which links it
        has decided not to follow. -v may be given more than once, but this
        is probably only useful for debugging purposes.
-o dir
        Change the server root directory. This is the directory that the
        path specified at the end of the command line is relative to.
-p dir
        Change the URL rewriting prefix. This is prepended to rewritten
        URLs, and should be a (relative) URL that points to the current
        server root directory. An example of the use of the -o and -p
        options is given below:
            webcrawl -o /home/jules/public_html -p /~jules www.site.com/page.html mirrors
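        With this command, the downloaded files would be stored under
        /home/jules/public_html/mirrors, and rewritten links would begin
        with /~jules/mirrors/ (assuming /~jules is served from
        /home/jules/public_html by the local web server); the exact layout
        of the rewritten URLs is an inference from the description above.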
HTTP-related options:
-A string
        Causes webcrawl to send the specified string as the HTTP
        'User-Agent' value, rather than the compiled-in default (normally
        `Mozilla/4.05 [en] (X11; I; Linux 2.0.27 i586; Nav)', although this
        can be changed in the file web.h at compile time).
-t n
        Specifies a timeout, in seconds. The default behaviour is to give
        up after this length of time from the initial connection attempt.
-T
        Changes the timeout behaviour. With this flag, the timeout occurs
        only if no data is received from the server for the specified
        length of time.
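        For example (the agent string, timeout, and site are placeholders),
        to identify the crawl with a custom User-Agent and to give up only
        if nothing has been received for 60 seconds:
            webcrawl -A MyMirror/1.0 -t 60 -T www.site.com/ mirror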

CONFIGURATION FILES
At present, webcrawl uses configuration files to specify rules for the
rewriting of filenames. It looks for /etc/webcrawl.conf,
/usr/local/etc/webcrawl.conf, and $HOME/.webcrawl, and processes all the
files it finds in that order. Parameters set in one file may be overridden
by subsequent files. Note that it is perfectly possible to use webcrawl
without a configuration file; one is only required for advanced features
that are too complex to configure on the command line.
The overall syntax of a webcrawl configuration file is a set of sections,
each headed by a line of the form [section-name].
At present, only the [rename] section is defined. It may contain the
following commands:
meta string
        Sets the metacharacter list. Any character in the list specified
        will be quoted in filenames produced (unless filename rewriting is
        disabled with the -k option). Quoting is performed by prepending
        the quoting character (default @) to the hexadecimal ASCII value of
        the character being quoted. The default metacharacter list is:
            ?&*%=#
quote char
        Sets the quoting character, as described above. The default is:
            @
type content/type preferred [extra extra ...]
        Sets the list of acceptable extensions for the specified MIME
        content type. The first item in the list is the preferred
        extension; if renaming is not disabled (with the -k option) and the
        extension of a file of this type is not on the list, then the first
        extension on the list will be appended to its name.
        An implicit line is defined internally, which reads:
            type text/html html htm
        This can be overridden; if, say, you preferred the 'htm' extension
        over 'html', you could use:
            type text/html htm html
        in a configuration file to cause .htm extensions to be used
        whenever a new extension was added.
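Putting these commands together, a $HOME/.webcrawl file might look something
like the following (the particular values are only an illustration):
        [rename]
        meta ?&*%=#~
        quote @
        type text/html htm html
        type image/jpeg jpg jpeg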

AUTHOR
WebCrawl was written by Julian R. Hall <jules@acris.co.uk> with suggestions
and prompting by Andy Smith.

BUGS
Bugs should be submitted to Julian Hall at the address above. Please include
information about what architecture, version, etc. you are using.