gromit - recursively download HTML pages.
These notes relate to Version 2.3 (20 April 1998).
gromit [ -mirror | -raw ] [ -q ] [ -debug | +debug ] [ -include URL ] [ -clearinc ] [ -to directory ] [ -nolog | +nolog ] [ -load filename ] [ -save filename ] [ --long_options ] URL . . .
The Gromit Web Robot (Gromit) is a single program that recursively downloads all text files on a site for indexing by AustLII's SINO Search Engine. Gromit is a Targeted Web Spider, and is not designed to traverse the Web generally. Its behaviour is limited to the site specified in the original URL (provided on the command line).
Gromit takes a list of URLs (must be fully qualified - relative URLs are not supported) as a starting point. For each URL given on the command line, it downloads the specified file. If the file is an HTML file (identified as type text/html by the Web server) the file will be parsed for additional links, which may also be downloaded. Whether or not these new links are downloaded depends on command line options, and built in safety logic (see USAGE, below).
Version 2.2 includes new delay mechanisms to tailor the download delay to a particular time of day at the remote server. There have also been changes to logging, mirroring and the way file permissions are set. See the HISTORY file for more information.
New features added since 2.2:
- --shortagent configuration option added. This is the agent string to be used when checking robots.txt files.
- Stack traces and forced exits on fatal errors (or die exceptions). Used only in debugging, enable with --stack_trace 1.
- Fixed URL rewriting to point to locally cached files for mirrors, along with numerous bugs that this triggered.
- The --db_temp directory must exist before Gromit is started (Gromit used to create the directory itself if it wasn't there). It should be owned by the user Gromit runs as, with no other write permissions enabled (eg use mode 640). This is a security fix, to prevent so-called "/tmp races".
- Added slightly more robust handling of non-fatal (transient) network errors. Gromit will sleep for 10 minutes before trying again, up to a maximum of 10 times (see --retry_delay and --retry_max).
- Added --no_cgi argument. By default, Gromit will no longer try to follow CGI links if they can be detected.
- Added --max_consec_errors option and removed internal max_errors code. Gromit will no longer abort after 50 failed downloads (which a large site can reach quite easily). Gromit will abort if it encounters 50 errors in a row -- this is configurable.
- Bug fixes, bug fixes and minor bug fixes.
Gromit processes command line options as it finds them. That is, the order of options as they appear on the command line is very important. If you want a particular URL to be mirrored for example, you must specify -mirror before the URL. Options are cumulative (although some will cancel out others, cf -raw to -mirror) and operate on all URLs that follow them. See EXAMPLES for example usage.
The long options available are: --db_temp, --delay, --delay_offset, --force_cache, --from, --head_delay, --log_file, --max_files, --no_cgi, --no_recurse, --retry_delay, --retry_max, --shortagent, --skip_download, --stack_crash, --stack_trace, --url, --useragent, --webhost, --webpath, --base_dir, --rewrite_urls, --text_only.
The last two options above are best managed using the -mirror and -raw arguments. The -to argument is a synonym for --base_dir.
From version 2.2 you can use customised delays, so that the delay value used changes depending on the time of day on the remote server. You can therefore go faster at "off-peak" times, but users are expected to check with remote site admins as to when such off-peak times are. See CUSTOM DELAYS.
Use -1 to force Gromit to download a new copy of the file, even if the remote copy has not been updated. This is useful to force a refresh where local copies have become corrupted.
The default is 0 (use cache only if remote file newer).
The simplest invocation is to call Gromit with a single URL:
% gromit http://www.hcourt.gov.au/
This will cause Gromit to download the URL http://www.hcourt.gov.au/. Because there is no file name at the end of the URL, the contents returned will be placed in the file _index.html (underscore precedes "index.html"). The contents of the file are then parsed and all links extracted. Any links that fall under the original URL will be downloaded. Links outside that scope are ignored. Because we are in text only mode (the default), image links will also be ignored, as will any links that do not appear to be of the MIME type text/html or text/plain.
If we wanted to download images as well, and build a full mirror of the remote site, we would add -mirror before the URL:
% gromit -mirror http://www.hcourt.gov.au/
We may want to download two URLs using the one process. To do this, specify the second URL after the first (you can list as many URLs as your shell will allow). For example:
% gromit -mirror http://www.hcourt.gov.au/ -raw \
    http://netscape.com/help/
This downloads the "hcourt" URL as a mirror (includes images and rewrites URLs) and the second URL as text only (does not rewrite URLs). Note that only extracted links that appear below the original URL (ie lower down in the file hierarchy on the same server) are followed - the Gromit robot is not allowed to wander "off site" (at least in this example).
Once downloaded, the files are stored under the directory specified in the -to argument. The first part of the path is the host name the file came from. Then the file structure of the original URL is reproduced. Assuming that the argument -to /home/mirrors was specified on the command line, the URL http://www.hcourt.gov.au/cases/reported/clr/166.html will be saved as /home/mirrors/www.hcourt.gov.au/cases/reported/clr/166.html.
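The mapping from URL to local file described above can be sketched in a few lines of Python. This is only an illustration of the rule as documented here, not Gromit's own code: the function name local_path is hypothetical, and the sketch ignores port numbers and query strings.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_path(base_dir, url):
    """Return the local file name for url under base_dir (the -to directory).

    Hypothetical helper: host name first, then the URL's own path.
    A URL with no file name at the end is stored as _index.html.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "_index.html"
    return str(PurePosixPath(base_dir) / parts.hostname / path.lstrip("/"))


if __name__ == "__main__":
    print(local_path("/home/mirrors",
                     "http://www.hcourt.gov.au/cases/reported/clr/166.html"))
    print(local_path("/home/mirrors", "http://www.hcourt.gov.au/"))
```

The first call reproduces the 166.html example given above; the second shows where the _index.html rule would place a directory URL.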
Gromit works by downloading the initial "starting URL" passed to it on the command line. Once downloaded, all links are extracted and examined by the robot to see if they are candidates for downloading.
As of version 1.1, Gromit uses a breadth-first iterative download algorithm, rather than the previous depth-first recursive algorithm. This reduces Gromit's memory and stack requirements.
Whether or not a new link will be added to the "URL pool" (used to track which URLs must be downloaded next) depends on:
- Whether or not text only mode has been specified, in which case, only .txt and .html files will be downloaded. If text-only is overridden (eg with the -mirror option), non-text files may be downloaded as well;
- Where the link fits into the document hierarchy. Links below the starting URL (ie on the same site, but at the same level or lower down in the directory hierarchy) are downloaded, all others are ignored (with the exception of additional -include directives).
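A breadth-first traversal with a URL pool, filtered by the two rules above, might look roughly like the following Python sketch. It is not taken from Gromit's source. The names crawl and get_links are hypothetical, the starting URL is assumed to include a path, links are matched by file suffix only, and treating URLs that end in "/" as text is an assumption of the sketch.

```python
from collections import deque

# Suffixes accepted in text-only mode. Accepting "/" (directory URLs)
# is an assumption of this sketch, not documented behaviour.
TEXT_SUFFIXES = (".html", ".txt", "/")


def crawl(start_url, get_links, text_only=True):
    """Visit pages breadth-first and return the order of download.

    get_links(url) stands in for "download the page and extract its
    links"; it is supplied by the caller.
    """
    # Everything up to and including the last "/" is the scope.
    start_dir = start_url[: start_url.rfind("/") + 1]
    pool = deque([start_url])      # the "URL pool" of pending downloads
    seen = {start_url}
    order = []
    while pool:
        url = pool.popleft()       # first in, first out: breadth-first
        order.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            if not link.startswith(start_dir):
                continue           # outside the starting hierarchy
            if text_only and not link.endswith(TEXT_SUFFIXES):
                continue           # images and the like are skipped
            seen.add(link)
            pool.append(link)
    return order


if __name__ == "__main__":
    site = {
        "http://ex.test/a/": ["http://ex.test/a/b.html",
                              "http://ex.test/a/logo.gif",
                              "http://ex.test/other/c.html"],
        "http://ex.test/a/b.html": ["http://ex.test/a/", "http://ex.test/a/d.txt"],
    }
    print(crawl("http://ex.test/a/", lambda u: site.get(u, [])))
```

Because the pool is a simple queue rather than a chain of recursive calls, memory use grows with the number of pending URLs and not with the depth of the site.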
URL rewriting occurs after an entire site has been downloaded. All HTML documents below the mirror root are modified to link to locally-cached copies of documents if they exist, or to point to the original remote server otherwise. Links are not followed -- the files to be converted are read directly from the file system.
If the URL contains a port number (other than port 80) a sub-directory is created under the host name when storing files.
From version 1.2, Gromit supports the use of initialisation files (or "rc files") to reproduce standard operating environments. On startup, Gromit looks for an rc file in one of three standard places:
- The current directory
- The current user's home directory (environment variable $HOME)
- The directory /usr/local/bin
Gromit looks for the file .gromitrc in each of these three places in turn, and will use the first one it finds. Normally, Gromit will then stop searching. For example, if a .gromitrc exists in the current directory, no search will be done in the user's home directory. However, this default behaviour can be overridden by including the word "chain" at the end of the rc file.
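The search order for .gromitrc, including the effect of "chain", could be modelled as below. This is a sketch under stated assumptions rather than Gromit's implementation: the function names are hypothetical, and "chain" is detected as a line of its own anywhere in the file.

```python
import os

RC_NAME = ".gromitrc"


def default_search_dirs():
    """The three documented places, in the documented order."""
    dirs = [os.getcwd(), os.environ.get("HOME", ""), "/usr/local/bin"]
    return [d for d in dirs if d]


def read_text(path):
    """Return the file's text, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


def rc_files_to_load(dirs, read=read_text):
    """List the rc files that would be processed, in order.

    The search stops at the first readable file unless that file
    asks to "chain" on to the next location.
    """
    loaded = []
    for directory in dirs:
        path = os.path.join(directory, RC_NAME)
        text = read(path)
        if text is None:
            continue
        loaded.append(path)
        wants_chain = any(line.strip() == "chain" for line in text.splitlines())
        if not wants_chain:
            break
    return loaded


if __name__ == "__main__":
    print(rc_files_to_load(default_search_dirs()))
```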
GENERAL SYNTAX
The rc file is a plain-text ASCII file, each directive appearing on one line. Comments begin with the "#" (hash or pound) character and continue to the end of the line. Comments cannot appear at the end of directives -- they must have their own line.
SET COMMAND
To change a configuration setting, use set, followed by the setting, and then the value. Setting names are the same as those for long options (above), without the leading "--". For example, to set a delay period of 5 minutes, use "set delay 5". To clear the URL option, use just "set url".
INCLUDE COMMAND
Adds whatever follows the word "include" to the list of allowed domains. Use "clearinc" to clear the list. See help for -include, above.
ECHO COMMAND
When found, whatever follows the word "echo" will be printed on STDOUT. Useful for debugging purposes, to determine which rc file is being used.
DEBUG COMMAND
Turns debugging ON. Use "debug off" to turn debugging off, or "debug" to turn it back on.
CHAIN REQUEST
Normally, Gromit stops looking for additional rc files after finding one it can read. However, your rc file can contain the word "chain", which tells Gromit to keep looking for rc files after processing this one. This has no effect on rc files loaded using the -load directive.
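To make the rc file grammar concrete, here is a small Python reader for the directives described above (set, include, clearinc, echo, debug, chain). It is an illustrative sketch, not Gromit's parser: parse_rc and the shape of its return value are hypothetical, and unknown directives are rejected as a design choice of the sketch.

```python
def parse_rc(text):
    """Parse rc file text into a dictionary of what it asked for."""
    settings, includes, echoes = {}, [], []
    debug = None          # None means the file never mentioned debug
    chain = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue      # comments occupy a whole line of their own
        word, _, rest = line.partition(" ")
        rest = rest.strip()
        if word == "set":
            name, _, value = rest.partition(" ")
            settings[name] = value.strip()   # "set url" clears the value
        elif word == "include":
            includes.append(rest)
        elif word == "clearinc":
            includes.clear()
        elif word == "echo":
            echoes.append(rest)
        elif word == "debug":
            debug = rest != "off"            # bare "debug" turns it on
        elif word == "chain":
            chain = True
        else:
            raise ValueError("unknown rc directive: " + word)
    return {"settings": settings, "includes": includes,
            "echo": echoes, "debug": debug, "chain": chain}


if __name__ == "__main__":
    sample = """# sample rc file
echo using the rc file in the home directory
set delay 5
include http://www.site.com/acts
debug off
chain
"""
    print(parse_rc(sample))
```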
Note that command line options are cumulative and take effect only on URLs that follow them. To download a single site:
% gromit http://www.hcourt.gov.au/
To download the site with debugging turned on:
% gromit -debug http://www.hcourt.gov.au/
To mirror the site locally:
% gromit -mirror http://www.hcourt.gov.au/
To download two sites, the second one being mirrored locally:
% gromit http://www.fedct.gov.au/ -mirror http://www.hcourt.gov.au/
To download the first site into /tmp/webindex and the second site to /home/mirrors:
% gromit -to /tmp/webindex http://www.fedct.gov.au/ -to /home/mirrors -mirror http://www.hcourt.gov.au/
And finally, to download the first site into /tmp/webindex, with debugging turned on, and the second site into /home/mirrors, with debugging turned off:
% gromit -to /tmp/webindex -debug http://www.fedct.gov.au/ -to /home/mirrors +debug http://www.hcourt.gov.au/
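The "options apply to the URLs that follow them" rule can be illustrated with a short Python model of left-to-right argument processing. This is a sketch only: the function plan is hypothetical, it covers just a handful of the short options, and the starting value of base_dir is an assumption since the real default is not stated here.

```python
def plan(argv):
    """Return a list of (url, settings) pairs, one per URL.

    Each URL receives a snapshot of the options seen so far, so later
    options never affect URLs that appear earlier on the command line.
    """
    state = {"text_only": True, "rewrite_urls": False, "debug": False,
             "base_dir": ".",          # assumed default, for illustration
             "includes": []}
    jobs = []
    args = iter(argv)
    for arg in args:
        if arg == "-mirror":
            state.update(text_only=False, rewrite_urls=True)
        elif arg == "-raw":
            state.update(text_only=True, rewrite_urls=False)
        elif arg == "-debug":
            state["debug"] = True
        elif arg == "+debug":
            state["debug"] = False
        elif arg == "-to":
            state["base_dir"] = next(args)
        elif arg == "-include":
            # Build a new list so earlier snapshots are not changed.
            state["includes"] = state["includes"] + [next(args)]
        elif arg == "-clearinc":
            state["includes"] = []
        else:
            jobs.append((arg, dict(state)))
    return jobs


if __name__ == "__main__":
    for url, settings in plan(["-to", "/tmp/webindex", "-debug",
                               "http://www.fedct.gov.au/",
                               "-to", "/home/mirrors", "+debug",
                               "http://www.hcourt.gov.au/"]):
        print(url, settings)
```

Run on the last example above, the model gives the first URL debugging on and /tmp/webindex, and the second URL debugging off and /home/mirrors.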
Normally, Gromit will only download URLs that occur below the original URL in the directory hierarchy on the same server. However, you can ask Gromit to "stray" onto other sites and directories as well. For example, a site may have an index, located at /index/acts.html. This page may point to resources located at /acts. We could attempt to download the site this way:
% gromit -mirror http://www.site.com/index/acts.html
The problem here is that the site points to resources in /acts, which is outside the hierarchy (Gromit will only follow URLs that point to /index and below). In order to tell Gromit to follow links to /acts, we add an include directive:
% gromit -mirror -include http://www.site.com/acts \
    http://www.site.com/index/acts.html
If anything in /acts points back to a new page in /index, the new page will be downloaded. If anything in /acts points to a new page in /cases, the new page will not be downloaded. If we wanted to also download resources in /cases, we would add two -includes:
% gromit -mirror -include http://www.site.com/acts -include \
    http://www.site.com/cases \
    http://www.site.com/index/acts.html
Recall that we can download more than one URL in the same session. If we wanted to download /docs/report.html, we can add it to the command line:
% gromit -mirror -include http://www.site.com/acts \
    http://www.site.com/index/acts.html \
    http://www.site.com/docs/report.html
Because the include list is maintained for each site, anything pointed to in /acts by /docs/report.html (that hasn't already been downloaded) will be downloaded. If we wanted to avoid this happening, we would clear the include list, just prior to downloading /docs/report.html:
% gromit -mirror -include http://www.site.com/acts \
    http://www.site.com/index/acts.html -clearinc \
    http://www.site.com/docs/report.html
There is currently no -exclude mechanism to cut out portions of a site for download.
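The scope test implied by the -include examples above can be written down as a small Python predicate. This is a sketch of the rule as described, not Gromit's code: in_scope is a hypothetical name, and treating each include URL as a directory prefix is an assumption.

```python
from urllib.parse import urlsplit


def _host_and_path(url):
    parts = urlsplit(url)
    return parts.netloc.lower(), parts.path or "/"


def in_scope(link, start_url, includes=()):
    """Return True if link may be followed.

    A link qualifies if it lies at or below the directory of the
    starting URL, or at or below any URL on the include list.
    """
    link_host, link_path = _host_and_path(link)

    start_host, start_path = _host_and_path(start_url)
    start_dir = start_path[: start_path.rfind("/") + 1]
    if link_host == start_host and link_path.startswith(start_dir):
        return True

    for inc in includes:
        inc_host, inc_path = _host_and_path(inc)
        inc_path = inc_path.rstrip("/")
        if link_host == inc_host and (
            link_path == inc_path or link_path.startswith(inc_path + "/")
        ):
            return True
    return False


if __name__ == "__main__":
    start = "http://www.site.com/index/acts.html"
    inc = ["http://www.site.com/acts"]
    for link in ("http://www.site.com/index/more.html",
                 "http://www.site.com/acts/1990/a1.html",
                 "http://www.site.com/cases/c1.html"):
        print(link, in_scope(link, start, inc))
```

With only /acts on the include list, links into /index and /acts pass the test and links into /cases do not, which matches the behaviour described in the examples.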
From version 2.2 Gromit supports customised delays (the changes have been made to the underlying GromitUA engine). This allows you to tell Gromit to use different delay periods between downloads, depending on the time of day at the remote server. If the time of day at the remote server cannot be determined, a default is used.
To set a custom delay, use the --delay option on the command line (see OPTIONS) or the set delay command in an rc file (see INITIALISATION FILES). Instead of specifying a simple integer (such as "60" for 60 seconds), specify a string in the form
default,timeA-timeB=secs[,timeA-timeB=secs...]
Where:
- default is the number of seconds to wait if the time at the remote server cannot be established; and
- timeA is the time in 24hr format for the start of the range (eg "0900" for 9:00 am); and
- timeB is the end of the range (eg "1700" for 5:00 pm); and
- secs is the time to wait if the time at the remote server falls into the range specified with timeA and timeB. The special value of "wait" means "wait at least until timeB".
For example, to wait 60 seconds by default, but wait 120 seconds during normal working hours use "60,0800-1800=120". You must use 24hr time and you must provide a default. Ranges cannot wrap over the midnight hour at this stage, so you'll need to specify multiple ranges if you want that kind of control.
If we took the example above, but modified it so that we could wait 30 seconds outside normal working hours, we'd say "60,0800-1800=120,1801-2359=30,0000-0759=30". Note that you can't say "1801-0759" because time ranges must go from a lower "number" to a higher one. That's why we had to split the range into before and after midnight pairs.
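The delay string format can be read with a few lines of Python, shown here as a sketch rather than as the GromitUA implementation. The names parse_delay and delay_for are hypothetical. The sketch assumes both ends of a range are inclusive, falls back to the default when no range matches, and passes the special value "wait" through untouched instead of acting on it.

```python
def parse_delay(spec):
    """Split 'default,timeA-timeB=secs,...' into (default, rules)."""
    default, *ranges = spec.split(",")
    rules = []
    for item in ranges:
        span, secs = item.split("=")
        start, end = span.split("-")
        rules.append((start, end, secs if secs == "wait" else int(secs)))
    return int(default), rules


def delay_for(spec, remote_hhmm=None):
    """Return the delay to use for a remote time such as '0930'.

    remote_hhmm is None when the remote time could not be determined.
    Times are compared as zero-padded four-digit strings, so a range
    such as 1800-0700 never matches anything (no midnight wrap).
    """
    default, rules = parse_delay(spec)
    if remote_hhmm is None:
        return default
    for start, end, secs in rules:
        if start <= remote_hhmm <= end:
            return secs
    return default


if __name__ == "__main__":
    spec = "60,0800-1800=120,1801-2359=30,0000-0759=30"
    for when in ("0900", "2300", "0300", None):
        print(when, delay_for(spec, when))
```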
Gromit connects to the daytime port on the remote server, which gives the server's local time as an ASCII string (see RFC 867). There is no set format for the string, so Gromit might not be able to determine the time -- in which case the default number is used. Also, not all hosts run a daytime server.
If you think you know what the time would be at the remote server (for example, it's in the same time zone as you) you can get Gromit to take its time off the local server and use a default offset to guess the time at the remote server. Do this using the --delay_offset setting.
One final note needs to be made: this feature was not designed for general use on servers whose admins you haven't talked to. It was developed so that SCALEplus admins could tell us to "go slow" at certain times of the day and night when they were doing heavy processing.
Gromit is not intended to be used directly by a human operator. Typically, it runs under the control of wallace, a control script that fires off gromit processes over blocks of URLs. See wallaced(1).
Gromit uses recently developed Vomit(tm) technology.
wallaced(1); commet(1).
URL rewriting with the -mirror flag is broken in this release. It will be fixed when actual mirroring becomes a priority.
A custom delay string of "1800-0700=10" doesn't do what you think it should -- no wrapping over midnight occurs.
RFC867 was never really designed to be used in this way! I'd use RFC868 but that returns time in GMT land, which is useless for our purposes.
Daniel Austin (dan at austlii.edu.au)
Australasian Legal Information Institute (AustLII)