GROMIT

AustLII Web Toolbox Manual

NAME

gromit - recursively download HTML pages.
wallaced - control multiple gromit processes.
commet - validate URLs read from an mSQL(tm) database.

VERSION

These notes relate to Version 2.3 (20 April 1998).

SYNOPSIS

See gromit(1), wallaced(1), and commet(1).

GENERAL DESCRIPTION

The AustLII Web Toolbox consists of two web "robots" (also called spiders, or sometimes "bots") and a harness program. A number of small demonstration or utility scripts are also provided in the distribution. Almost all programs are written in Perl 5, and rely extensively on the LWP library.

GROMIT

Gromit is the main workhorse. It recursively downloads URLs, starting from a specific page and working down. It obeys the Robots Exclusion Protocol (robots.txt) and waits a set delay between downloads so that it does not flood remote servers. See the help file gromit(1).
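
The behaviour described above (recursive download, robots.txt checks, a delay between requests) can be illustrated with a short sketch. This is not Gromit's code: Gromit is written in Perl 5 with LWP, while the sketch uses Python's standard library. The names crawl, LinkParser, fetch and example-bot are hypothetical, and reading "working down" as "stay under the start page's directory" is an assumption. The fetch function is passed in so the sketch can be tried without network access.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def crawl(start_url, fetch, robots_lines, delay=1.0,
          agent="example-bot", sleep=time.sleep):
    """Return {url: html} for start_url and the pages reachable below it.

    fetch(url) returns the page text, or None if the download failed.
    robots_lines is the content of the site's robots.txt, split into lines.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)
    # Assumption: "working down" means staying under the start page's directory.
    prefix = start_url.rsplit("/", 1)[0] + "/"
    seen, queue, pages = {start_url}, [start_url], {}
    fetched = 0
    while queue:
        url = queue.pop(0)
        if not robots.can_fetch(agent, url):
            continue                      # excluded by robots.txt
        if fetched:
            sleep(delay)                  # pause between downloads
        fetched += 1
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

For a quick trial, fetch can be the get method of a dictionary that maps URLs to HTML strings, and sleep can be replaced with a function that records the requested delays.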

COMMET

Commet works with an mSQL database of URLs and checks them for validity. Alternatively, it can read its list of URLs from an ASCII file (one URL per line). It normally prints bad URLs to STDOUT; if -update is specified, it also updates the mSQL database with appropriate status fields. Help is provided in commet(1).
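
The file-based mode can be pictured with the following sketch. It is a Python illustration, not Commet's implementation, and it leaves out the mSQL side entirely. The function names are hypothetical, and the rule for what counts as a "bad" URL (no response, or an HTTP status of 400 or above) is an assumption; commet(1) is the authority on the real criteria.

```python
import urllib.error
import urllib.request


def read_urls(lines):
    """Yield one URL per non-blank line."""
    for line in lines:
        url = line.strip()
        if url:
            yield url


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if no response was received."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code                   # the server answered with an error
    except (urllib.error.URLError, OSError, ValueError):
        return None                       # unreachable host or malformed URL


def find_bad_urls(lines, status_of=http_status):
    """Return (url, status) pairs for URLs that fail the check.

    Assumption: a URL is bad if there is no response or the status is >= 400.
    """
    bad = []
    for url in read_urls(lines):
        status = status_of(url)
        if status is None or status >= 400:
            bad.append((url, status))
    return bad
```

Passing an open text file as lines reads one URL per line; the status_of parameter allows the network check to be swapped out when trying the sketch offline.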

WALLACE

Wallace is the human interface for Gromit, and is able to control multiple gromit processes. It also queries an mSQL database to determine which URLs are meant to be downloaded with Gromit, and which options should be passed to Gromit for each download. Only one gromit process is used for each site. You can configure how many processes run in total. See wallaced(1).
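
The two scheduling constraints described above (one gromit process per site, and a configurable total) can be sketched as a small selection function. This is a Python illustration under stated assumptions, not Wallace's code: the names site_of and pick_jobs are hypothetical, a "site" is taken to be the host part of the URL, and the pending list stands in for whatever the mSQL database would supply.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Assumption: a 'site' is the host (and port) part of the URL."""
    return urlsplit(url).netloc.lower()


def pick_jobs(pending, running_sites, max_processes):
    """Return the pending URLs that may be started now.

    pending        -- URLs waiting to be downloaded, in priority order
    running_sites  -- sites that already have a process working on them
    max_processes  -- upper limit on processes running at the same time
    """
    busy = set(running_sites)             # one process per site, so one entry each
    started = []
    for url in pending:
        if len(busy) >= max_processes:
            break                         # total limit reached
        site = site_of(url)
        if site in busy:
            continue                      # this site already has a process
        busy.add(site)
        started.append(url)
    return started
```

With a limit of two and nothing running, a pending list containing two URLs on one host followed by URLs on two other hosts yields the first URL of the first host and the first URL of the second host; the rest wait until a process finishes.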

AUTHOR

The AustLII Web Toolbox is written and maintained by Daniel Austin (Australasian Legal Information Institute).


Gromit Web Toolbox / http://avoca.austlii.edu.au/~dan/gromit/