GROMIT

Wallace Daemon Robot Harness Manual

NAME

wallaced - daemon to control operation of the gromit(1) web robot

VERSION

These notes relate to wallace 2.3 (20 April 1998).

SYNOPSIS

wallaced [ -f config_file ] [ long_options ]

DESCRIPTION

The Wallace Daemon is a harness program for the gromit(1) web robot, and replaces the old wallace.pl script. Gromit is a targeted web spider being run by the Australasian Legal Information Institute. Wallace instructs the web spider as to which sites it should download, and monitors its progress. Wallace runs a number of spider processes at any one time, but limits the maximum number of spiders to a preset limit. When one spider finishes, another is started automatically to download a different site.

Wallace reads the list of links to download from a remote mSQL database using the Perl DBD and DBI modules. The database is expected to be in the format maintained by the Feathers links system, as developed by Geoffrey King at the Australasian Legal Information Institute.

Wallace can also read a list of links from a plain-text ASCII file. See the --db_host option, below.

NEW IN THIS RELEASE

These are the features added since 2.2:

OPTIONS

Most of Wallace's behaviour is controlled through configuration files. You can also modify settings directly on the command line using so-called "long options" (described below).

-f filename
Specifies a configuration file to read options from. The format for the configuration file is specified below -- see CONFIGURATION.

Extended (long) Options
Wallace has a number of settings that can be modified through its extended options. Extended options have the same name as those in the configuration files, but are preceded by two hyphens. Every long option expects an argument to follow.

The long options are: --log_dir, --url_log, --replace_logs, --log_err, --log_ops, --log_sts, --log_pid, --db_host, --db_port, --db_name, --db_lock, --lock_max_retries, --lock_sleep_time, --email_to, --email_exec, --new_only, --filter, --newness, --max_robots, --gromit_exec, --gromit_dir, --gromit_ops, --gromit_conf, --gromit_nice, --just_kidding, --sleep_time, --spare_robots, --refresh_time, --umask.

--db_host [file://]host_or_file
This option defines the location of the database that wallace is to query to get a list of URLs to be visited by the spider. If the argument following --db_host begins with "file://" then the rest of the argument is taken to be a file name. This file will be read for URLs to download (one per line). Wallace will try to clear the file after reading the URLs.

In all other cases whatever follows --db_host will be assumed to be a host name (defaults to bronte). The host computer is expected to be running mSQL. You can specify the port and database name with the --db_port and --db_name arguments respectively.
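The rule for interpreting the --db_host argument can be sketched as follows. This is an illustration in Python, not wallace's own code; the function name parse_db_host and the returned tuples are assumptions made for the example.

```python
FILE_PREFIX = "file://"


def parse_db_host(value):
    """Classify a --db_host argument as a URL list file or an mSQL host.

    Hypothetical helper: returns ("file", path) or ("host", hostname).
    """
    if value.startswith(FILE_PREFIX):
        # Everything after the prefix is taken to be a file name.
        return ("file", value[len(FILE_PREFIX):])
    # In all other cases the argument is assumed to be a host name.
    return ("host", value)


print(parse_db_host("file:///mydir/urllist.txt"))
print(parse_db_host("bondi"))
```

Note that "file:///mydir/urllist.txt" has three slashes: two belong to the prefix and the third begins the absolute path.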

--db_lock filename
This option is only used if --db_host begins with "file://". Its argument is a filename which need not exist, but must be able to be created. Wallace will create the file and then "lock" it (ie get exclusive access to the file) before reading the database file. It will unlock (but not delete) the file after it has finished reading the database file. Programs that interact with wallace should lock the "db_lock" file before writing to the database file. The lockf system call is used to lock the file. This allows file locking over NFS mounted partitions. By default, the lock file is /home/dan/mirrors/logs/wallaced.lock.

--db_name database
This option defines the mSQL name that Wallace will attempt to read. By default, the Feathers database is called "linker". You can use a different database name but it must have the same field names as a Feathers database. This option is ignored if --db_host begins with "file://".

--db_port port
Set to the port number that the mSQL server is listening on. By default, this is port 1114. This option is ignored if --db_host begins with "file://".

--email_to email address list
Takes a list of e-mail addresses, separated by spaces. If you have more than one e-mail address, enclose the list in quotes. Defaults to just root on the local host. Use this to have e-mail sent to a human operator whenever a Gromit process under Wallaced control finishes. To turn off the feature, set the e-mail address to blank (eg --email_to "").

--email_exec full path to mailer
Wallaced expects to be able to use a mail client similar to mailx to send mail. By default, this parameter is set to /bin/mailx. You can use a different mailer by specifying its full path here.

--filter regexp
Used to restrict downloads to particular sites. Regexp is a regular expression (in standard Perl format) that is tested against every URL. If a URL matches the regexp, it is a candidate for downloading (it is still checked against --new_only and --newness, described below). By default, this is set to .* (which means "download everything").

For example, to restrict downloads to the SCALEplus site, use --filter '^http://scaleplus.law.gov.au'.
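The effect of --filter can be tried out with a short sketch. It uses Python's re module rather than Perl, so it only approximates wallace's matching; the function name filter_urls is an assumption made for the example. An unanchored search is used because that is how a Perl pattern match behaves.

```python
import re


def filter_urls(urls, pattern=".*"):
    """Return the URLs that match the pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    # search() matches anywhere in the URL unless the pattern is anchored.
    return [url for url in urls if compiled.search(url)]


urls = [
    "http://scaleplus.law.gov.au/index.html",
    "http://avoca.austlii.edu.au/~dan/gromit/",
]
print(filter_urls(urls, r"^http://scaleplus.law.gov.au"))
```

The unescaped dots in the pattern match any character, not just a literal dot. That is harmless here, but a stricter pattern would write them as \. instead.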

--gromit_conf config_file
Specify the full path to the configuration file for Gromit. By default, no configuration file is passed.

--gromit_dir directory
The directory that Gromit will use to save downloaded files (the base directory). Used by wallace to check if a file has already been downloaded, and if so, how long ago. Defaults to /home/dan/mirrors.

--gromit_exec path
Full path to the Gromit executable. By default, this will be /usr/local/bin/gromit -- but you can specify a different application (such as a beta of the new Gromit).

--gromit_nice 0 to 20
Set the "nice" value for Gromit to run under. Basically this lowers the CPU priority of all Gromit processes. Generally this won't affect performance much, because Gromit is network bound and spends most of its time sleeping in a properly configured system. However when Gromit is rewriting URLs in a downloaded mirror, it can consume very large quantities of CPU time -- lessen the impact by using a positive number here, to lower Gromit's CPU priority.

--gromit_ops options
Any special options that will be passed to Gromit, separated by spaces. Defaults to -to /home/dan/mirrors, which tells Gromit to download into the named directory.

--just_kidding 1 or 0
This must be set to 0 for normal operation. If set to 1, wallace will not start Gromit spiders, but will print out the command line it would have run instead. This is useful for debugging and not much else. In release 2.1 it defaults to 1, so it must be overridden by a configuration file or command line argument.

--lock_max_retries integer
Specifies the maximum number of times that wallace will try to acquire a lock on the database file during an update. Only valid if --db_host begins with "file://". When wallaced "queries" the database file, it will attempt to lock the db_lock file using lockf(2). If it fails it will retry -- up to the maximum specified here. The default is 5.

--lock_sleep_time seconds
The number of seconds to sleep between failed attempts at getting a lock on the db_lock file (see --lock_max_retries, above). Defaults to 30 seconds.
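A program that writes to the database file could follow the same locking protocol. The sketch below is in Python and is not taken from wallace itself; the function names are assumptions, and it reads "maximum retries" as one initial attempt followed by up to max_retries further attempts, which may differ from wallace's own counting.

```python
import fcntl
import time


def acquire_lock(lock_path, max_retries=5, sleep_time=30):
    """Try to take an exclusive lockf lock on lock_path.

    Returns the open file object on success, or None if every attempt fails.
    """
    # Append mode creates the file if it is missing and never truncates it.
    handle = open(lock_path, "a")
    for attempt in range(max_retries + 1):
        try:
            # LOCK_NB makes the call fail at once instead of blocking.
            fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return handle
        except OSError:
            if attempt < max_retries:
                time.sleep(sleep_time)
    handle.close()
    return None


def release_lock(handle):
    """Unlock the file but leave it in place, as wallace does."""
    fcntl.lockf(handle, fcntl.LOCK_UN)
    handle.close()
```

A writer would call acquire_lock, update the database file only if a handle came back, and then call release_lock.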

--log_dir directory_name
Specifies the directory where you want Gromit logs to appear. Gromit logs are named by host and contain a complete record of Gromit's download session for that site. Defaults to /home/dan/mirrors/logs.

--log_XXX file_name
Where XXX is one of err, ops, sts or pid. These options are to set the location of various wallace log files. The first item ("err") is for errors or warnings. The "ops" log is for operations, and records interactions with the Feathers server, download processes started and stopped and other information. The "sts" log is the file to which wallace will dump statistics summaries. Wallace will do this periodically, or you can force it to update the stats file by sending it a USR2 signal. The "pid" file records the process-id of the daemon.
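Forcing a statistics dump from another program might look like the sketch below. It assumes the "pid" file holds nothing but the daemon's process-id as decimal text, which this manual does not state, and the function names are made up for the example.

```python
import os
import signal


def read_pid(pid_file):
    """Read the daemon's process-id from the pid file (assumed format)."""
    with open(pid_file) as handle:
        return int(handle.read().strip())


def request_stats(pid_file):
    """Send USR2 to the daemon so that it updates its stats file."""
    os.kill(read_pid(pid_file), signal.SIGUSR2)
```

The path passed to these functions would be whatever --log_pid is set to.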

--max_robots integer
Used to control the maximum number of Gromit robots that are allowed to run under wallace. Defaults to 5. A larger number will mean quicker downloads overall, but will place extra load on the system. Remember that only one spider is started for each host, so multiple URLs may be assigned to one spider.

--new_only 1 or 0
Set to 1 (the default) to only download sites never seen before. Set to 0 to download files that have been downloaded before (but see --newness, below).

--newness days
If new_only is set to 0, then download any URLs that were downloaded more than n days ago. By default, this option is set to 14 (two weeks).
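The way --new_only and --newness combine can be summarised in a few lines. This is a sketch of the rules as described here, written in Python; the function name and the use of None for "never downloaded" are assumptions made for the example.

```python
def is_candidate(days_since_download, new_only=1, newness=14):
    """Decide whether a URL should be downloaded.

    days_since_download is None if the URL has never been downloaded.
    """
    if days_since_download is None:
        # Never seen before: always a candidate.
        return True
    if new_only:
        # Only new sites are wanted, and this one has been seen.
        return False
    # Re-download only if the last download is older than newness days.
    return days_since_download > newness


print(is_candidate(None))            # never downloaded
print(is_candidate(30))              # seen, new_only=1
print(is_candidate(30, new_only=0))  # seen 30 days ago, newness=14
```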

--refresh_time seconds
The refresh rate specifies how often wallace should contact the Feathers (mSQL) database to check for new downloads (or, if applicable, check the local database file). Defaults to 3600 (1 hour).

--replace_logs 1 or 0
Set to 0 to have Wallace preserve the site-specific logs for each site download. Defaults to 1.

--sleep_time seconds
The number of seconds to sleep between main loops. During the main loop, Gromit spiders are started (if there is room) and signals are responded to. Defaults to 10 seconds.

--spare_robots integer
The number of robots to keep as "spares". Spares are used to handle high-priority downloads, such as new resources. Defaults to 0, for no spares. The number set must be at least one lower than max_robots. For example, with max_robots at its default of 5 and spare_robots set to 2, only 3 spiders will be doing "refresh" downloads (ie revisiting a site previously spidered). The remaining 2 spiders are kept as spares, and are used to download new sites not yet seen before. "Spare" support is very provisional at this stage.
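The arithmetic behind spares is simple enough to check by hand, but a sketch makes the constraint explicit. The function name is an assumption, and how wallace itself reacts to an invalid value is not documented here; raising an error is just a choice made for the example.

```python
def refresh_slots(max_robots=5, spare_robots=0):
    """Return how many spiders may do refresh downloads."""
    # spare_robots must be at least one lower than max_robots.
    if not 0 <= spare_robots <= max_robots - 1:
        raise ValueError("spare_robots must be between 0 and max_robots - 1")
    return max_robots - spare_robots


print(refresh_slots(5, 2))  # 3 refresh spiders, 2 spares
```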

--umask integer
Set this to the umask to be used for all files created by wallaced. By default this is 022, which means all files are world and group readable (but not writeable) and directories are group and world accessible (also not writeable). This should be sensible for most sites, but note that you have to create the --log_dir directory yourself and set the permissions for that directory manually (required if you're using ctrlwal.cgi).
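To see why a umask of 022 gives the permissions described, the calculation can be worked through as below. The requested modes of 0666 for files and 0777 for directories are the usual Unix conventions and are an assumption about how wallaced creates its files.

```python
def effective_mode(requested, umask=0o022):
    """Apply a umask: every bit set in the umask is cleared."""
    return requested & ~umask


print(oct(effective_mode(0o666)))  # files: 0o644 (rw-r--r--)
print(oct(effective_mode(0o777)))  # directories: 0o755 (rwxr-xr-x)
```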

--url_log URL
Set this to a base URL where users can read Gromit log files, if you've made them available via the web. The URL will be included in e-mail reports. It is blank by default.

USAGE

To begin processing those URLs marked "indexable" by the Feathers Links system, simply type:

% wallaced -f ~/etc/wallacedrc

where ~/etc/wallacedrc is the location of your wallace configuration file. You will need to set at least a few options, because the defaults will not work (at the very least, you must over-ride just_kidding).

Once wallaced starts it puts itself into the background. You can monitor its progress by examining the log files, or using the ctrlwal.cgi script (see code for details).

To use a different database and remote host than that offered by the default settings, try:

% wallaced --db_host bondi --db_name devdb

Note that only those records marked for indexing will be checked. In addition, records marked for mirroring will be downloaded by gromit(1) with the -mirror tag turned on. All other records are downloaded in text-only mode (ie with the -raw tag).

OPERATION AND CONTROL

Performance Issues

Wallace first downloads all the URLs in the database that are marked for indexing or mirroring. It then sorts the URLs by host name. URLs are grouped into host bands (that is, they all contain the same host name) and these bands are passed as URL lists to the web spider (gromit) for downloading. Within the spider, safety features are built in to prevent sites from being bombarded with requests, and to prevent the robot wandering "off site."
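Grouping URLs into host bands can be sketched as follows. This is a Python illustration rather than wallace's own Perl; the function name host_bands is an assumption, and it uses the standard urllib.parse module to extract the host name.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_bands(urls):
    """Group URLs by host name, with the hosts in sorted order."""
    bands = defaultdict(list)
    for url in urls:
        # hostname is lower-cased and excludes any port number.
        host = urlsplit(url).hostname or ""
        bands[host].append(url)
    return {host: bands[host] for host in sorted(bands)}


urls = [
    "http://scaleplus.law.gov.au/a.html",
    "http://avoca.austlii.edu.au/~dan/gromit/",
    "http://scaleplus.law.gov.au/b.html",
]
for host, band in host_bands(urls).items():
    print(host, band)
```

Each value in the result is one URL list of the kind that would be handed to a single spider.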

Wallace runs its spiders concurrently. There may be a web spider running for each host band at the same time, up to a maximum of 5 by default; the maximum can be changed with --max_robots. As one spider completes, another is started, until all host bands have been downloaded. This method allows:

Faster Indexing
Because more than one site is indexed at a time, the total time taken to index all sites is greatly reduced.

Controlled Server Load
The operator can control how much of the local system's resources are dedicated to indexing remote sites, by controlling the maximum number of web spiders that can run at any one time. Web spiders affect network performance, as well as CPU and disk performance. On low memory machines, paging may increase as gromit processes are swapped in and out of memory. It is important therefore to stagger the starting of gromit processes, so that not all processes are paged into memory at once.

Spider Safety
A notorious problem with web spiders is that they can saturate a remote site with requests, slowing down the remote server and denying access to other web users. Because URLs are grouped into host bands and each band is handled by a single Gromit process, which works through its URLs one after another, no site receives simultaneous requests from Gromit.
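The gain from running several spiders at once can be estimated with a small simulation. This sketch is not part of wallace: the function name and the per-band durations are made up, and it ignores the delay between spider starts. It only models the rule that a new spider starts as soon as a running one finishes.

```python
import heapq


def total_time(band_durations, max_robots=5):
    """Return the time needed to finish all host bands."""
    running = []  # finish times of the spiders currently running
    clock = 0
    for duration in band_durations:
        if len(running) >= max_robots:
            # All slots are busy: wait for the earliest spider to finish.
            clock = heapq.heappop(running)
        heapq.heappush(running, clock + duration)
    return max(running) if running else 0


bands = [10, 10, 10, 10]  # hypothetical download times per host band
print(total_time(bands, max_robots=1))
print(total_time(bands, max_robots=2))
```

With one spider the four bands take 40 time units; with two they take 20.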

An important performance issue for Wallace is staggering Gromit processes, particularly on low-memory machines. If the maximum number of subprocesses were to be started at once, then each would need to be paged into memory at the same time. And because each Gromit process sleeps the same number of seconds between requests, they're all going to "wake up" and request CPU time at once. On the 32MB SPARC-4 development machine, that led to massive disk thrashing.

To overcome this, you must stagger the starting times of subprocesses. The default performance settings of 5 spiders and a 10 second "wait time" are optimum for low-memory machines. They mean that a download will occur every 10 seconds. These settings are optimum because the default Gromit download delay (the time between downloading documents) is 60 seconds. Five spiders multiplied by 10 seconds means that downloads are evenly spaced throughout the 60 second download period (with the remaining 10 seconds being the delay until the cycle repeats).

If you increase the Gromit delay time, you can afford to either decrease the wait time or increase the number of spiders. Aim to have spider downloads evenly spaced throughout the download delay window. (For example: Gromit delay of 120 seconds -- set wait time to 10 seconds and max spiders to 11. Or wait time to 5 seconds and max spiders at 20.)
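The staggering arithmetic can be checked with a short sketch. The function name is an assumption, and "slack" is used here to mean the time left in the download delay window after the last spider has started, matching the "remaining 10 seconds" in the default case.

```python
def stagger_plan(spiders, wait_time, download_delay):
    """Return the start offsets of each spider and the leftover slack."""
    offsets = [i * wait_time for i in range(spiders)]
    # A negative slack would mean the spiders no longer fit in one window.
    slack = download_delay - spiders * wait_time
    return offsets, slack


print(stagger_plan(5, 10, 60))    # defaults: ([0, 10, 20, 30, 40], 10)
print(stagger_plan(11, 10, 120)[1])  # 10 seconds of slack
print(stagger_plan(20, 5, 120)[1])   # 20 seconds of slack
```

If the slack comes out negative, either reduce the number of spiders or the wait time, or increase the Gromit delay.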

As spiders exit and new ones are started there is some degradation in the staggering effect, so that some spiders might be scheduled to wake up and begin a download within seconds of each other. If this results in degraded system performance, decrease the number of spiders, or increase the wait time. This may also require an increase in gromit download delays.

Logging and Error Messages

The output of each gromit process is directed to a log file. Log files are kept in /home/dan/mirrors/logs by default (see --log_dir). Log files are named by the host to which they pertain, and are raw dumps of the output of the web spider process.

Information contained in the log file includes successful downloads, missing pages and dead links.

CONFIGURATION FILES

On startup, wallace will look for a configuration file ("rc file") in:
  1. The current directory
  2. The current user's home directory
  3. The directory /usr/local/bin
Wallace looks for the file .wallacedrc in each of the above locations, using the first one it finds. The best way to make sure wallace picks up the right file is to use the -f argument to specify a file to read.

GENERAL SYNTAX

The rc file is a plain-text ASCII file, each directive appearing on one line. Comments begin with the "#" (hash or pound) character and continue to the end of the line. Comments cannot appear at the end of directives -- they must start on their own line.

SET COMMAND

To change a configuration setting, use set, followed by the setting, and then the value (leave the value blank to clear the setting). Setting names are the same as those for long options (see above), without the leading "--".

ECHO COMMAND

Causes any text following to be dumped to the error log. Useful for recording that a particular rc file was used.
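A reader for this format might look like the sketch below. It is written in Python and is not wallace's own parser. It assumes that the command, setting name and value are separated by whitespace (for example "set max_robots 3"), which this manual does not spell out, and it collects echo text in a list instead of writing to the error log.

```python
def parse_rc(lines):
    """Parse rc file lines into (settings, echo_messages)."""
    settings = {}
    messages = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments, which start on their own line.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        command = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if command == "set" and rest:
            fields = rest.split(None, 1)
            # A missing value clears the setting.
            settings[fields[0]] = fields[1] if len(fields) > 1 else ""
        elif command == "echo":
            messages.append(rest)
    return settings, messages


example = [
    "# hypothetical rc file",
    "set just_kidding 0",
    "set email_to",
    "echo using the test rc file",
]
print(parse_rc(example))
```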

EXAMPLES

To start 3 spiders (maximum) over all indexable sites:

% wallaced --max_robots 3

To start 5 spiders (maximum) over all indexable sites contained in the dbdev database on the server bondi:

% wallaced --max_robots 5 --db_host bondi --db_name dbdev

To start the default number of spiders, but read the URL list from file /mydir/urllist.txt:

% wallaced --db_host file:///mydir/urllist.txt

NOTES

Wallace and its subprocesses are memory intensive tasks, and so tend to be memory bound rather than CPU bound.

SEE ALSO

gromit(1); commet(1).

BUGS

There is some performance degradation due to decreased staggering over the life of a wallace process (see discussion of PERFORMANCE ISSUES, above).

Some mechanism to dynamically control Gromit processes through Wallace (such as shutdown, suspend, restart) would be useful.

Shared configuration data is the next priority.

AUTHOR

Daniel Austin (dan at austlii.edu.au)
Australasian Legal Information Institute (AustLII)

Gromit Web Toolbox / http://avoca.austlii.edu.au/~dan/gromit/