GROMIT

Gromit Web Robot Manual

NAME

gromit - recursively download HTML pages.

VERSION

These notes relate to Version 2.3 (20 April 1998).

SYNOPSIS

gromit [ -mirror | -raw ] [ -q ] [ -debug | +debug ] [ -include URL ] [ -clearinc ] [ -to directory ] [ -nolog | +nolog ] [ -load filename ] [ -save filename ] [ --long_options ] URL . . .

DESCRIPTION

The Gromit Web Robot (Gromit) is a single program that recursively downloads all text files on a site for indexing by AustLII's SINO Search Engine. Gromit is a Targeted Web Spider, and is not designed to traverse the Web generally. Its behaviour is limited to the site specified in the original URL (provided on the command line).

Gromit takes a list of URLs (must be fully qualified - relative URLs are not supported) as a starting point. For each URL given on the command line, it downloads the specified file. If the file is an HTML file (identified as type text/html by the Web server) the file will be parsed for additional links, which may also be downloaded. Whether or not these new links are downloaded depends on command line options and built-in safety logic (see USAGE, below).

Version 2.2 includes new delay mechanisms to tailor the download delay to a particular time of day at the remote server. There have also been changes to logging, mirroring and the way file permissions are set. See the HISTORY file for more information.

NEW IN THIS RELEASE

New features added since 2.2:

OPTIONS

Gromit processes command line options as it finds them. That is, the order of options as they appear on the command line is very important. If you want a particular URL to be mirrored for example, you must specify -mirror before the URL. Options are cumulative (although some will cancel out others, cf -raw to -mirror) and operate on all URLs that follow them. See EXAMPLES for example usage.

-clearinc
Resets the list of allowed domains to the empty list. Only has effect on subsequent URLs. See -include, below. By default, the list of allowed domains will only include the original URL given for downloading.

-debug
Turns on debugging mode for arguments and URLs that follow the flag. Turn off debugging mode with +debug. Debugging mode is very verbose, so consider directing stderr to a file. Note that a debug argument will only take effect from the point on the command line that it appears. So if the arguments are "url1 -debug url2", debugging output will only appear during the downloading of url2. The default is for debugging to be off.

-include URL-regexp
Adds additional URL "domains" to the list of allowed domains. By default, only pages that fall under the original URL (ie they share a common base URL, or appear lower in the directory hierarchy) are downloaded. You can over-ride this behaviour by adding additional URLs to use as allowed domains. Multiple -include options can be used. Their effect is cumulative and operates on all URLs that follow them. To reset the list for subsequent URLs, use the -clearinc flag. See INCLUDE EXAMPLES, below.

-load filename
Follow with a filename to load an initialisation file containing standard settings for Gromit. By default, Gromit will load the first initialisation file it finds, looking in three standard places: (1) the current directory; (2) the current user's home directory; (3) the directory /usr/local/bin. You can load additional files using the -load directive. The effects are cumulative and operate on all subsequent URLs. See INITIALISATION FILES, below. See also -save, below.

-mirror
Turns on mirror mode. In this mode, images are downloaded (as well as other non-text files) and URLs in HTML files are rewritten to point to locally stored documents (where available). In addition, file permission settings are set to world readable. The opposite of -mirror is -raw, which specifies text only mode, with no URL rewriting. Once invoked, the -mirror option applies to all URLs listed after the -mirror flag. The default is raw mode (below).

-nolog | +nolog
From version 2.1, Gromit will log the sites it visited (with time stamp and status responses) to a .SITELOG file, which appears in the host download directory for that site. Use -nolog to turn this behaviour off, or +nolog to re-activate the site logging.

-q
Specifies quiet operation. Only fatal error messages are reported. The default is for quiet mode to be off.

-raw
Specifies raw mode or text only mode. Only documents with MIME type text/html or text/plain are downloaded (MIME types may be guessed from file extensions). File permissions are set to user and group readable only. Links may break because the URLs in the files will not be re-written. This mode is used when all you want is the text of files for later indexing. Once invoked, the -raw option applies to all URLs listed after the -raw flag. Raw mode is the default.

-save filename
Dumps the current Gromit settings to the file you specify. They can then be loaded later either by copying the file to a standard location, or with the -load option (see above). See also INITIALISATION FILES, below. By default, no dump is performed.

-stdin
Tells the program to read its options not from the command line (any additional options are ignored) but from standard input (STDIN). Must be the first argument used. Arguments should be specified on separate lines, and argument processing is terminated when the end of file is reached (eg pressing Ctrl-D on the terminal). Flags that take arguments should be split: the flag appearing on one line (eg -to) and the argument appearing on the next (eg /home/mirrors).

-to
The argument following -to must be the name of an existing directory. This is used as the base directory into which the files are downloaded. The remote directory structure is reproduced by Gromit, using this directory as base. The default (/home/dan/mirrors) is used if no -to argument is present. You can download different URLs to different directories on the one command line (see EXAMPLES, below).

Extended (Long) Options
Greater control over how Gromit runs can be gained using the long options. These directly modify the internal configuration parameters, and allow access to settings not otherwise modifiable from the command line. All long options are preceded with the "double dash" ("--") and expect an argument to follow.

The long options available are: --db_temp, --delay, --delay_offset, --force_cache, --from, --head_delay, --log_file, --max_files, --no_cgi, --no_recurse, --retry_delay, --retry_max, --shortagent, --skip_download, --stack_crash, --stack_trace, --url, --useragent, --webhost, --webpath, --base_dir, --rewrite_urls, --text_only.

The last two options above are best managed using the -mirror and -raw arguments. The -to argument is a synonym for --base_dir.

--base_dir directory_name
Works in exactly the same way as the -to argument.

--db_temp directory
Gromit uses tied hashes and disk storage to reduce its main memory requirements when running, especially over large sites (legislation sites tend to be very large in terms of the number of URLs Gromit must keep track of). This option sets the base temporary directory that Gromit will use to store these files. At runtime, Gromit will add the process id and a random number to create a new unique directory into which it can stuff its database files. Defaults to /tmp/gromit.

--delay seconds or format_string
Sets the number of seconds that the robot will wait between requests. Previous versions measured delays in minutes so watch out for old config files - "1" means 1 second now, not 1 minute. It's possible to use a value of "0" (for testing against your own servers) but system administrators get very upset when you flood their web servers this way. The default is 60 seconds.

From version 2.2 you can use customised delays, so that the delay value used changes depending on the time of day on the remote server. You can therefore go faster at "off-peak" times, but users are expected to check with remote site admins as to when such off-peak times are. See CUSTOM DELAYS.

--delay_offset [-]HH:[-]MM
If Gromit is not able to determine the time at the remote server it will use this default offset to guess the time. Useful when you know the time zone of the server you are targeting. By default it is undefined. See CUSTOM DELAYS.

--force_cache -1 or 0 or 1
Follow this option with a 1 to force Gromit to use local copies (cached copies) of files, even if the remote copy is more recent. Useful for resuming an aborted download, because it will quickly jump to the point where files stopped being downloaded.

Use -1 to force Gromit to download a new copy of the file, even if the remote copy has not been updated. This is useful to force a refresh where local copies have become corrupted.

The default is 0 (use the cached copy unless the remote file is newer).
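
The three --force_cache settings can be summarised as a small decision function. This is only a sketch in Python, not Gromit's own code: the function name and its arguments are hypothetical, and it assumes that a file with no cached copy is always downloaded.

```python
def should_download(force_cache, have_cached_copy, remote_is_newer):
    """Return True if the file should be fetched from the remote server.

    force_cache follows the --force_cache values described above:
       1 -> always use the cached copy when there is one
      -1 -> always download a fresh copy
       0 -> download only when the remote copy is newer than the cache
    """
    if force_cache not in (-1, 0, 1):
        raise ValueError("force_cache must be -1, 0 or 1")
    if not have_cached_copy:
        # Assumption: with nothing to reuse, a download is always needed.
        return True
    if force_cache == 1:
        return False
    if force_cache == -1:
        return True
    return remote_is_newer
```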

--from user@host
Sets the From header field that Gromit will use when contacting web servers. This must be a valid e-mail address that a server administrator can use to contact the operator of the robot. A default is set in the body of the program. The default is the operating user's username, followed by the host Gromit is running on.

--head_delay seconds or -1
Sets a delay period for HEAD requests that is separate from the one used for normal downloads (see --delay for how to set the delay period for downloads). The delay period is the amount of time to wait between requests, in order to minimise Gromit's impact on remote servers. Setting it to 0 causes no delay, which may cause Gromit to flood a server with HEAD requests if there are many cached files to check. If omitted, it defaults to whatever --delay has been set to. To turn it off (ie force use of the normal --delay setting), set it to -1.

--log_file filename
This is the file used for site logging. From version 2.1, Gromit logs all the URLs downloaded for a particular site (with timestamp and status code). The file specified here is created under the host directory into which downloads occur. Defaults to .SITELOG.

--max_files integer
Sets an upper limit on the number of files to download. The default is 10000. If Gromit downloads more files than the set maximum it will abort. This is a safety cut-off to prevent the robot from going crazy, especially if it hits "virtual document" areas. The limit only applies to files actually downloaded; it does not apply to cached files.

--no_cgi 1 or 0
Defaults to 1 -- do not follow CGI links. If Gromit detects that a URL is actually a CGI call it will not attempt to download it by default. Set this field to 0 to force Gromit to call CGIs. This is only recommended where you know what's going on with the CGI. The CGI will still have to be accessible according to robots.txt rules.

--no_recurse 1 or 0
Defaults to 0 -- setting this to 1 means Gromit will not follow any links. Normally useless, but implemented for SCALEplus update compatibility (cases are one file).

--rewrite_urls 1 or 0
If a 1 follows this, URLs will be rewritten in the downloaded file to point to locally cached copies of files. Setting to 0 skips this process. This field is automatically set to 1 when -mirror is specified (and is set to 0 when -raw is specified). The default is 0 (rewriting off).

--retry_delay seconds
If Gromit is unable to connect or times-out waiting for data, the error is considered non-fatal and the operation will be tried again. Before retrying, Gromit will sleep this number of seconds (defaults to 600 -- 10 minutes). See also --retry_max.

--retry_max integer
The maximum number of times Gromit should successively retry to connect to a server that is down. Defaults to 10. The total time Gromit will wait for a down server is therefore retry_delay x retry_max, which by default is 100 minutes (1:40 hours). After that time Gromit will give up on the entire site. An intervening successful connection will reset the counter.
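
The interaction between --retry_delay and --retry_max can be sketched as a small counter. This is an illustration in Python rather than Gromit's actual code; the class and method names are hypothetical.

```python
class RetryCounter:
    """Tracks successive connection failures (hypothetical helper)."""

    def __init__(self, retry_delay=600, retry_max=10):
        self.retry_delay = retry_delay  # seconds to sleep before each retry
        self.retry_max = retry_max      # successive retries allowed
        self.failures = 0

    def record_success(self):
        # An intervening successful connection resets the counter.
        self.failures = 0

    def record_failure(self):
        """Return the seconds to sleep before retrying, or None to give up."""
        self.failures += 1
        if self.failures > self.retry_max:
            return None
        return self.retry_delay

    def max_total_wait(self):
        # Longest time spent waiting on a server that never comes back.
        return self.retry_delay * self.retry_max
```

With the default values, max_total_wait() returns 6000 seconds, which is the 100 minutes quoted above.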

--shortagent string
This is the agent string used by Gromit when checking robots.txt files to see if it is allowed to download a particular page. By default it is set to "Gromit". If you change the --useragent setting, you should change the --shortagent to a stripped down version (eg no version numbers or e-mail address).

--skip_download 0 or 1
If you set this option to 1 no downloads will be performed -- Gromit will skip immediately to massaging HTML for mirroring, which it reads from the cache. This is different from setting --force_cache to 1 because that will cause Gromit to still parse HTML and download new files. Using --skip_download means Gromit will not even check HTML for links. Processing resumes where URL rewriting occurs. The default is 0 (off).

--stack_crash 1 to 255
Set to greater than zero during debugging to treat all stack traces as a fatal error (causes Gromit to quit). This means that even exceptions will result in Gromit exiting, instead of handing control over to exception handlers. Defaults to 0, as is right and just.

--stack_trace 1 or 0
Set to 1 to cause Gromit to print a stack trace whenever a fatal error or exception occurs. Useful in debugging, and may be useful in a production environment to see parameters as they are used internally. Defaults to 0.

--text_only 1 or 0
If a 1 follows this argument, only files matching the MIME type text/html or text/plain will be downloaded. Otherwise, all files linked to will be downloaded. MIME types are taken either from the server's response field, or are guessed from the file extension. The default is 1 (download text files only).

--umask umask
Sets the file creation mask for all downloaded files and created directories. By default, the umask is 027, which creates files readable by the group but not the world. If you specify -mirror on the command line, the umask is changed to 022, which creates world readable files. To over-ride this, you must use the --umask option after the -mirror option. See umask(1) for how to construct a umask.
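
The effect of the two umask values can be checked with a little arithmetic. The sketch below (Python) assumes the usual Unix convention that regular files are requested with mode 0666 and directories with 0777; the modes Gromit itself requests are not stated here, so treat those figures as an assumption. The function name is hypothetical.

```python
def resulting_mode(umask, requested=0o666):
    """Permission bits left after the umask clears its bits from `requested`."""
    return requested & ~umask

print(oct(resulting_mode(0o027)))         # default umask, file      -> 0o640
print(oct(resulting_mode(0o022)))         # -mirror umask, file      -> 0o644
print(oct(resulting_mode(0o027, 0o777)))  # default umask, directory -> 0o750
```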

--url URL
Set this to a URL that explains what your robot is, who you are, and who remote sysadmins can contact if they have problems. Gromit will pass the URL as part of its requests to remote servers.

--useragent string
Sets the User-Agent header field that Gromit will use to identify itself to remote servers. For example, to set up Gromit so that it identifies itself as Netscape version 1.1N, use --useragent Netscape/1.1N. Note that this will cause Gromit to ignore lines in robots.txt files meant for it, because Gromit will now think it is called "Netscape". Gromit will still honour robots.txt files that apply to the name you use or "*". The default user agent is Gromit/X.Y where X.Y is an internal version number.

--webhost fully.qualified.hostname
This option is used when you are mirroring pages. Once all pages from a site are downloaded, mirrored pages are "massaged" from the base URL down. You must supply the name of the web server that will be serving your mirror with this option. By default, --webhost is set to the current server that Gromit is executing on.

--webpath /path
This option is used when you are mirroring pages. Set this to the starting path from where mirrored sites are served (this may be different for each site). For example, if mirrored sites are stored in /pub/mirrors, use --webpath /pub/mirrors. The default directory is /mirrors.

USAGE

The simplest invocation is to call Gromit with a single URL:

% gromit http://www.hcourt.gov.au/

This will cause Gromit to download the URL http://www.hcourt.gov.au/. Because there is no file name at the end of the URL, the contents returned will be placed in the file _index.html (underscore precedes "index.html"). The contents of the file are then parsed and all links extracted. Any links that fall under the original URL will be downloaded. Links outside that scope are ignored. Because we are in text only mode (the default), image links will also be ignored, as will any links that do not appear to be of the MIME type text/html or text/plain.

If we wanted to download images as well, and build a full mirror of the remote site, we would add -mirror before the URL:

% gromit -mirror http://www.hcourt.gov.au/

We may want to download two URLs using the one process. To do this, specify the second URL after the first (you can list as many URLs as your shell will allow). For example:

% gromit -mirror http://www.hcourt.gov.au/ -raw \
        http://netscape.com/help/

This downloads the "hcourt" URL as a mirror (includes images and rewrites URLs) and the second URL as text only (does not rewrite URLs). Note that only extracted links that appear below the original URL (ie lower down in the file hierarchy on the same server) are followed - the Gromit robot is not allowed to wander "off site" (at least in this example).

Once downloaded, the files are stored under the directory specified in the -to argument. The first part of the path is the host name the file came from. Then the file structure of the original URL is reproduced. Assuming that the argument -to /home/mirrors was specified on the command line, the URL http://www.hcourt.gov.au/cases/reported/clr/166.html will be saved as the file /home/mirrors/www.hcourt.gov.au/cases/reported/clr/166.html.
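
The mapping from URL to local file described in this section can be sketched as follows. This is an illustrative Python function, not Gromit's implementation; the function name is hypothetical, and the sketch deliberately ignores port numbers and query strings.

```python
import posixpath
from urllib.parse import urlsplit

def local_path(url, base_dir):
    """Map a fully qualified URL to the file it would be stored in."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError("URL must be fully qualified: %r" % url)
    path = parts.path or "/"
    if path.endswith("/"):
        # No file name at the end of the URL: store as _index.html.
        path += "_index.html"
    # Host name first, then the remote directory structure.
    return posixpath.join(base_dir, parts.hostname, path.lstrip("/"))

print(local_path("http://www.hcourt.gov.au/cases/reported/clr/166.html",
                 "/home/mirrors"))
print(local_path("http://www.hcourt.gov.au/", "/home/mirrors"))
```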

TECHNICAL NOTES

Gromit works by downloading the initial "starting URL" passed to it on the command line. Once downloaded, all links are extracted and examined by the robot to see if they are candidates for downloading. Whether or not a new link will be added to the "URL pool" (used to track which URLs must be downloaded next) depends on

As of version 1.1, Gromit uses a breadth-first iterative download algorithm, rather than the previous depth-first recursive algorithm. This reduces Gromit's memory and stack requirements.
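
A breadth-first, iterative traversal over a URL pool might look like the sketch below (Python). It is a generic illustration of the technique, not Gromit's code: the function name is hypothetical, the `links` dictionary stands in for downloading and parsing HTML, and `in_scope` stands in for Gromit's link-acceptance rules.

```python
from collections import deque

def crawl_order(start_url, links, in_scope):
    """Return URLs in the order a breadth-first crawl would visit them.

    links    -- dict mapping a URL to the URLs found on that page
    in_scope -- predicate deciding whether a link may join the URL pool
    """
    pool = deque([start_url])   # URLs still to download
    seen = {start_url}          # URLs already queued, to avoid repeats
    order = []
    while pool:
        url = pool.popleft()    # FIFO queue gives breadth-first order
        order.append(url)
        for link in links.get(url, []):
            if link not in seen and in_scope(link):
                seen.add(link)
                pool.append(link)
    return order
```

Because the pool is an explicit queue rather than a chain of recursive calls, the call stack stays flat however deep the site is.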

URL rewriting occurs after an entire site has been downloaded. All HTML documents below the mirror root are modified to link to locally-cached copies of documents if they exist, or to point to the original remote server otherwise. Links are not followed -- the files to be converted are read directly from the file system.

If the URL contains a port number (other than port 80) a sub-directory is created under the host name when storing files.

INITIALISATION FILES

From version 1.2, Gromit supports the use of initialisation files (or "rc files") to reproduce standard operating environments. On startup, Gromit looks for an rc file in one of three standard places:
  1. The current directory
  2. The current user's home directory (environment variable $HOME)
  3. The directory /usr/local/bin
Gromit looks for the file .gromitrc in each of these three places in turn, and will use the first one it finds. Normally, Gromit will then stop searching. For example, if a .gromitrc exists in the current directory, no search will be done in the user's home directory. However this default behaviour can be over-ridden, by including the word "chain" at the end of the rc file.
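
The search order can be sketched as below (Python). This is an illustration only: the function names are hypothetical, and the sketch assumes that "chain" appears on a line of its own and that any existing regular file counts as found.

```python
import os

def rc_candidates(cwd, home):
    """The three standard .gromitrc locations, in search order."""
    return [os.path.join(d, ".gromitrc") for d in (cwd, home, "/usr/local/bin")]

def requests_chain(path):
    """True if the rc file has a line consisting of the word 'chain'."""
    with open(path) as handle:
        return any(line.strip() == "chain" for line in handle)

def rc_files_to_load(cwd, home):
    """Return the rc files that would be loaded, in order."""
    loaded = []
    for path in rc_candidates(cwd, home):
        if not os.path.isfile(path):
            continue
        loaded.append(path)
        if not requests_chain(path):
            break  # default: stop at the first rc file found
    return loaded
```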

GENERAL SYNTAX

The rc file is a plain-text ASCII file, each directive appearing on one line. Comments begin with the "#" (hash or pound) character and continue to the end of the line. Comments cannot appear at the end of directives -- they must have their own line.

SET COMMAND

To change a configuration setting, use set, followed by the setting, and then the value. Setting names are the same as those for long options (above), without the leading "--". For example, to set a delay period of 5 minutes, use "set delay 300" (delays are measured in seconds). To clear the URL option, use just "set url".

INCLUDE COMMAND

Adds whatever follows the word "include" to the list of allowed domains. Use "clearinc" to clear the list. See help for -include, above.

ECHO COMMAND

When found, whatever follows the word "echo" will be printed on STDOUT. Useful for debugging purposes, to determine which rc file is being used.

DEBUG COMMAND

Turns debugging ON. Use "debug off" to turn debugging off, or "debug" to turn it back on.

CHAIN REQUEST

Normally Gromit stops looking for additional rc files after finding one it can read. However, your rc file can contain the word "chain", which tells Gromit to keep looking for rc files after processing this one. This has no effect on rc files loaded using the -load directive.
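
The rc file directives described above are simple enough to parse line by line. The following Python sketch shows one way to read them; it is not Gromit's parser. The function name and the returned dictionary layout are hypothetical, and the handling of blank lines (skipped) and unknown directives (rejected) is an assumption.

```python
def parse_rc(text):
    """Parse rc-file text into settings, an include list and flags."""
    state = {"settings": {}, "includes": [], "debug": False,
             "chain": False, "echoed": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line (assumed ignorable) or whole-line comment
        fields = line.split(None, 1)
        word = fields[0]
        rest = fields[1].strip() if len(fields) > 1 else ""
        if word == "set":
            parts = rest.split(None, 1)
            if not parts:
                raise ValueError("set needs a setting name")
            # "set url" with no value clears the setting.
            state["settings"][parts[0]] = parts[1] if len(parts) > 1 else None
        elif word == "include":
            state["includes"].append(rest)
        elif word == "clearinc":
            state["includes"] = []
        elif word == "echo":
            state["echoed"].append(rest)  # recorded here rather than printed
        elif word == "debug":
            state["debug"] = (rest != "off")
        elif word == "chain":
            state["chain"] = True
        else:
            raise ValueError("unknown directive: %r" % line)
    return state
```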

EXAMPLES

Note that command line options are cumulative and take effect only on URLs that follow them. To download a single site:

% gromit http://www.hcourt.gov.au/

To download the site with debugging turned on:

% gromit -debug http://www.hcourt.gov.au/

To mirror the site locally:

% gromit -mirror http://www.hcourt.gov.au/

To download two sites, the second one being mirrored locally:

% gromit http://www.fedct.gov.au/ -mirror http://www.hcourt.gov.au/

To download the first site into /tmp/webindex and the second site to /home/mirrors:

% gromit -to /tmp/webindex http://www.fedct.gov.au/ -to /home/mirrors \
	-mirror http://www.hcourt.gov.au/

And finally, to download the first site into /tmp/webindex, with debugging turned on, and the second site into /home/mirrors, with debugging turned off:

% gromit -to /tmp/webindex -debug http://www.fedct.gov.au/ \
	-to /home/mirrors +debug http://www.hcourt.gov.au/

INCLUDE EXAMPLES

Normally, Gromit will only download URLs that occur below the original URL in the directory hierarchy on the same server. However, you can ask Gromit to "stray" onto other sites and directories as well. For example, a site may have an index, located at /index/acts.html. This page may point to resources located at /acts. We could attempt to download the site this way:

% gromit -mirror http://www.site.com/index/acts.html

The problem here is that the page points to resources in /acts, which is outside the hierarchy (Gromit will only follow URLs that point to /index and below). In order to tell Gromit to follow links to /acts, we add an include directive:

% gromit -mirror -include http://www.site.com/acts \
             http://www.site.com/index/acts.html

If anything in /acts points back to a new page in /index, the new page will be downloaded. If anything in /acts points to a new page in /cases, the new pages will not be downloaded. If we wanted to also download resources in /cases, we would add two -includes:

% gromit -mirror -include http://www.site.com/acts -include \
            http://www.site.com/cases \
            http://www.site.com/index/acts.html
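
The scope rule illustrated by the /index, /acts and /cases examples above can be sketched in Python. This is a simplified model, not Gromit's logic: the function names are hypothetical, and the sketch treats each -include value as a plain URL prefix even though the option is documented as taking a URL regexp.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def base_of(start_url):
    """Directory part of the starting URL, eg .../index/ for .../index/acts.html."""
    parts = urlsplit(start_url)
    path = parts.path
    directory = path if path.endswith("/") else posixpath.dirname(path)
    if not directory.endswith("/"):
        directory += "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))

def in_scope(url, start_url, includes=()):
    """True if url falls under the starting URL's directory or an included prefix."""
    allowed = [base_of(start_url)] + list(includes)
    return any(url.startswith(prefix) for prefix in allowed)

start = "http://www.site.com/index/acts.html"
print(in_scope("http://www.site.com/acts/a1.html", start))
print(in_scope("http://www.site.com/acts/a1.html", start,
               ["http://www.site.com/acts"]))
```

The first call prints False and the second prints True, matching the behaviour described for the -include http://www.site.com/acts example.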

Recall that we can download more than one URL in the same session. If we wanted to download /docs/report.html, we can add it to the command line:

% gromit -mirror -include http://www.site.com/acts \
            http://www.site.com/index/acts.html \
            http://www.site.com/docs/report.html

Because the include list is maintained for each site, anything pointed to in /acts by /docs/report.html (that hasn't already been downloaded) will be downloaded. If we wanted to avoid this happening, we would clear the include list, just prior to downloading /docs/report.html:

% gromit -mirror -include http://www.site.com/acts \
    http://www.site.com/index/acts.html -clearinc \
    http://www.site.com/docs/report.html

There is currently no -exclude mechanism for cutting portions of a site out of a download.

CUSTOM DELAYS

From version 2.2 Gromit supports customised delays (the changes have been made to the underlying GromitUA engine). This allows you to tell Gromit to use different delay periods between downloads, depending on the time of day at the remote server. If the time of day at the remote server cannot be determined, a default is used.

To set a custom delay, use the --delay option on the command line (see OPTIONS) or the set delay command in an rc file (see INITIALISATION FILES). Instead of specifying a simple integer (such as "60" for 60 seconds), specify a string in the form

default,timeA-timeB=secs[,timeA-timeB=secs...]
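Where:

  default        is the delay in seconds used when no time range matches, or when the time at the remote server cannot be determined
  timeA-timeB    is a time range in 24hr format (for example 0800-1800), with timeA lower than timeB
  secs           is the delay in seconds to use within that range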

For example, to wait 60 seconds by default, but wait 120 seconds during normal working hours use "60,0800-1800=120". You must use 24hr time and you must provide a default. Ranges cannot wrap over the midnight hour at this stage, so you'll need to specify multiple ranges if you want that kind of control.

If we took the example above, but modified it so that we could wait 30 seconds outside normal working hours, we'd say "60,0800-1800=120,1801-2359=30,0000-0759=30". Note that you can't say "1801-0759" because time ranges must go from a lower "number" to a higher one. That's why we had to split the range into before and after midnight pairs.
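
The format can be illustrated with a small parser and lookup, sketched here in Python. This is not the GromitUA code: the function names are hypothetical, and the sketch assumes that range end points are inclusive, that the first matching range wins, and that a wrapping range is rejected outright (Gromit itself may simply never match it).

```python
import re

_RANGE = re.compile(r"^(\d{4})-(\d{4})=(\d+)$")

def parse_delay(spec):
    """Parse 'default,timeA-timeB=secs,...' into (default, [(start, end, secs)])."""
    default, *ranges = spec.split(",")
    if not default.strip().isdigit():
        raise ValueError("a default delay is required: %r" % spec)
    parsed = []
    for item in ranges:
        match = _RANGE.match(item.strip())
        if not match:
            raise ValueError("bad range: %r" % item)
        start, end, secs = (int(group) for group in match.groups())
        if start > end:
            # Ranges cannot wrap over midnight.
            raise ValueError("range must run from lower to higher: %r" % item)
        parsed.append((start, end, secs))
    return int(default), parsed

def delay_for(spec, hhmm):
    """Delay in seconds at 24hr time hhmm (an int such as 930), or the
    default when hhmm is None (remote time unknown)."""
    default, ranges = parse_delay(spec)
    if hhmm is None:
        return default
    for start, end, secs in ranges:
        if start <= hhmm <= end:
            return secs
    return default

spec = "60,0800-1800=120,1801-2359=30,0000-0759=30"
print(delay_for(spec, 900), delay_for(spec, 2300), delay_for(spec, None))
```

The last line prints 120 30 60: working hours, after hours, and the default used when the remote time is unknown.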

Gromit connects to the daytime port on the remote server, which gives the server's localtime as an ASCII string (see RFC867). There is no set format for the string, so Gromit might not be able to determine the time -- in which case the default number is used. Also, not all hosts run a daytime server.

If you think you know what the time would be at the remote server (for example, it's in the same time zone as you) you can get Gromit to take its time off the local server and use a default offset to guess the time at the remote server. Do this using the --delay_offset setting.

One final note needs to be made: this feature was not designed for general use on servers whose admins you haven't talked to. It was developed so that SCALEplus admins could tell us to "go slow" at certain times of the day and night when they were doing heavy processing.

NOTES

Gromit is not intended to be used directly by a human operator. Typically, it runs under the control of wallace, a control script that fires off gromit processes over blocks of URLs. See wallaced(1).

Gromit uses recently developed Vomit(tm) technology.

SEE ALSO

wallaced(1); commet(1).

BUGS

URL rewriting with the -mirror flag is broken in this release. It will be fixed when actual mirroring becomes a priority.

A custom delay string of "1800-0700=10" doesn't do what you think it should -- no wrapping over midnight occurs.

RFC867 was never really designed to be used in this way! I'd use RFC868 but that returns time in GMT land, which is useless for our purposes.

AUTHOR

Daniel Austin (dan at austlii.edu.au)
Australasian Legal Information Institute (AustLII)

Gromit Web Toolbox / http://avoca.austlii.edu.au/~dan/gromit/