wallaced - daemon to control operation of the gromit(1) web robot
These notes relate to wallace 2.3 (20 April 1998).
wallaced [ -f config_file ] [ long_options ]
The Wallace Daemon is a harness program for the gromit(1) web robot, and replaces the old wallace.pl script. Gromit is a targeted web spider being run by the Australasian Legal Information Institute. Wallace instructs the web spider as to which sites it should download, and monitors its progress. Wallace runs a number of spider processes at any one time, but limits the maximum number of spiders to a preset limit. When one spider finishes, another is started automatically to download a different site.
Wallace reads the list of links to download from a remote mSQL database using the Perl DBD and DBI modules. The database is expected to be in the format maintained by the Feathers links system, as developed by Geoffrey King at the Australasian Legal Information Institute.
Wallace can also read a list of links from a plain-text ASCII file. See the --db_host option, below.
These are the features added since 2.2:
- Use --gromit_nice to lower robot priorities on a heavily loaded system. Most downloads are network (not CPU) bound, but this feature might be used where you are rewriting URLs as part of a mirror download.
- Users can receive e-mail as and when Gromit processes complete their jobs, with a summary taken from the Gromit log. See --email_to and --email_exec.
- The default umask of 022 is automatically set when the daemon starts (you can override this with the --umask parameter).
- Some bug fixes -- a starvation problem that could be triggered by host name sorting (wallaced may never get to zeta.reticuli.com!). Also fixed a problem that could cause two Gromits to visit the same host at the same time.
Most of Wallace's behaviour is controlled through configuration files. You can also modify settings directly on the command line using so-called "long options" (described below).
The long options are: --log_dir, --url_log, --replace_logs, --log_err, --log_ops, --log_sts, --log_pid, --db_host, --db_port, --db_name, --db_lock, --lock_max_retries, --lock_sleep_time, --email_to, --email_exec, --new_only, --filter, --newness, --max_robots, --gromit_exec, --gromit_dir, --gromit_ops, --gromit_conf, --gromit_nice, --just_kidding, --sleep_time, --spare_robots, --refresh_time, --umask.
Unless the value is a file:// URL naming a plain-text list of links (for example, --db_host file:///mydir/urllist.txt), whatever follows --db_host is assumed to be a host name (defaults to bronte). The host computer is expected to be running mSQL. You can specify the port and database name with the --db_port and --db_name arguments respectively.
For example, to restrict downloads to the SCALEplus site, use --filter '^http://scaleplus.law.gov.au'.
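The effect of such a filter can be illustrated with a short Python sketch. It assumes that the --filter value is treated as a regular expression tested against each URL, which the leading "^" anchor suggests but these notes do not state outright. The function name filter_urls and the candidate URLs are hypothetical, and Python's re module stands in for whatever matching wallaced actually performs.

```python
import re


def filter_urls(urls, pattern):
    """Keep only the URLs that the regular expression matches.

    Assumption: the filter behaves like a regular-expression search
    over the whole URL string.
    """
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# Illustrative candidate list, not taken from any real database.
candidates = [
    "http://scaleplus.law.gov.au/",
    "http://www.austlii.edu.au/",
    "http://mirror.example.org/?u=http://scaleplus.law.gov.au/",
]

# The "^" anchor ties the match to the start of the URL, so the third
# entry is rejected even though it contains the SCALEplus address.
selected = filter_urls(candidates, r"^http://scaleplus.law.gov.au")
print(selected)
```

Note that an unescaped "." in a regular expression matches any character. If a stricter match is wanted, the dots can be escaped (for example scaleplus\.law\.gov\.au).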
To begin processing those URLs marked "indexable" by the Feathers Links system, simply type:
% wallaced -f ~/etc/wallacedrc
where ~/etc/wallacedrc is the location of your wallace configuration file. You will need to set at least a few options, because the defaults will not work (at the very least, you must override just_kidding).
Once wallaced starts it puts itself into the background. You can monitor its progress by examining the log files, or using the ctrlwal.cgi script (see code for details).
To use a different database and remote host than that offered by the default settings, try:
% wallaced --db_host bondi --db_name devdb
Note that only those records marked for indexing will be checked. In addition, records marked for mirroring will be downloaded by gromit(1) with the -mirror tag turned on. All other records are downloaded in text-only mode (i.e. with the -raw tag).
Wallace first downloads all the URLs in the database that are marked for indexing or mirroring. It then sorts the URLs by host name. URLs are grouped into host bands (that is, they all contain the same host name) and these bands are passed as URL lists to the web spider (gromit) for downloading. Within the spider, safety features are built in to prevent sites from being bombarded with requests, and to prevent the robot wandering "off site."
Wallace runs its spiders concurrently. There may be a web spider running for each host band at the same time, up to a maximum of 5. The user can modify the maximum number of spider processes. As one spider completes, another is started, until all host bands have been downloaded. This method allows:
- Faster Indexing: Because more than one site is indexed at a time, the total time taken to index all sites is greatly reduced.
- Controlled Server Load: The operator can control how much of the local system's resources are dedicated to indexing remote sites, by controlling the maximum number of web spiders that can run at any one time. Web spiders affect network performance, as well as CPU and disk performance. On low-memory machines, paging may increase as gromit processes are swapped in and out of memory. It is important therefore to stagger the starting of gromit processes, so that not all processes are paged into memory at once.
- Spider Safety: A notorious problem with web spiders is that they can saturate a remote site with requests, slowing down the remote server and denying access to other web users. Because sites are grouped into bands and each Gromit processes its URLs in consecutive order, no one site is accessed by more than one Gromit at the same time.
PERFORMANCE ISSUES
An important performance issue for Wallace is staggering Gromit processes, particularly on low-memory machines. If the maximum number of subprocesses were to be started at once, then each would need to be paged into memory at the same time. And because each Gromit process sleeps the same number of seconds between requests, they're all going to "wake up" and request CPU time at once. On the 32MB SPARC-4 development machine, that led to massive disk thrashing.
To overcome this, you must stagger the starting times of subprocesses. The default performance settings of 5 spiders and a 10 second "wait time" are optimum for low-memory machines. They mean a download will occur every 10 seconds. These settings are optimum because the default Gromit download delay (the time between downloading documents) is 60 seconds. Five spiders multiplied by 10 seconds means that downloads are evenly spaced throughout the 60 second download period (with the remaining 10 seconds being the delay until the cycle repeats).
If you increase the Gromit delay time, you can afford to either decrease the wait time or increase the number of spiders. Aim to have spider downloads evenly spaced throughout the download delay window. (For example, with a Gromit delay of 120 seconds, set the wait time to 10 seconds and max spiders to 11, or set the wait time to 5 seconds and max spiders to 20.)
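The arithmetic behind these settings can be checked with a small Python sketch. It is only an illustration of the spacing rule described above, not part of wallaced. The function name stagger_plan and its parameters are hypothetical, and the model assumes each spider starts exactly one wait time after the previous one.

```python
def stagger_plan(max_spiders, wait_time, gromit_delay):
    """Return (start offsets in seconds, idle seconds left in the window).

    Assumption: spider i starts i * wait_time seconds into the window,
    and every spider then repeats its download every gromit_delay seconds.
    """
    offsets = [i * wait_time for i in range(max_spiders)]
    slack = gromit_delay - max_spiders * wait_time
    if slack < 0:
        # More start slots than fit in one download delay window.
        raise ValueError("spiders do not fit in one download delay window")
    return offsets, slack


# The default settings, followed by the two 120 second examples above.
for spiders, wait, delay in [(5, 10, 60), (11, 10, 120), (20, 5, 120)]:
    offsets, slack = stagger_plan(spiders, wait, delay)
    print(f"{spiders} spiders, wait {wait}s, delay {delay}s: "
          f"last start at {offsets[-1]}s, {slack}s idle before the cycle repeats")
```

With the defaults this gives starts at 0, 10, 20, 30 and 40 seconds with 10 seconds to spare, matching the description above.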
As spiders exit and new ones are started there is some degradation in the staggering effect, so that some spiders might be scheduled to wake up and begin a download within seconds of each other. If this results in degraded system performance, decrease the number of spiders, or increase the wait time. This may also require an increase in gromit download delays.
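Staggering applies to whole host bands, so it may help to see how the bands described earlier could be built. The following Python sketch groups URLs by host name and picks the bands that would be started first. It is an assumption-laden illustration: the function names host_bands and first_wave are hypothetical, the URLs are made up, and wallaced itself may order or select bands differently.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_bands(urls):
    """Group URLs into bands keyed by host name, sorted by host."""
    bands = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        bands[host].append(url)
    # Sorting by host name mirrors the "sorts the URLs by host name" step.
    return dict(sorted(bands.items()))


def first_wave(bands, max_robots=5):
    """Hosts whose spiders would be started before any spider has exited."""
    return list(bands)[:max_robots]


# Illustrative URLs only.
urls = [
    "http://zeta.example.org/index.html",
    "http://alpha.example.org/a.html",
    "http://alpha.example.org/b.html",
]
bands = host_bands(urls)
print(bands)
print(first_wave(bands, max_robots=1))
```

Each band would then be handed to a single spider, so no host appears in more than one running spider's URL list.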
The output of each gromit process is directed to a log file. Log files are contained in /home/dan/mirrors/log by default. Log files are named by the host to which they pertain, and are raw dumps of the output of the web spider process.
Information contained in the log file includes successful downloads, missing pages and dead links.
On startup, wallace will look for a configuration file ("rc file") in:
- The current directory
- The current user's home directory
- The directory /usr/local/bin
Wallace looks for the file .wallacedrc in each of the above locations, using the first one it finds. The best way to make sure wallace picks up the right file is to use the -f argument to specify a file to read.
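The search order described here can be sketched in Python as follows. This is an illustration only: the function name find_rc_file and its parameters are hypothetical, and it assumes that a file given with -f bypasses the search entirely, which these notes imply but do not spell out.

```python
import os


def find_rc_file(explicit=None, name=".wallacedrc", exists=os.path.isfile):
    """Return the rc file path that would be used, or None if none is found.

    explicit -- path given with -f, if any (assumed to bypass the search)
    exists   -- file-existence check, replaceable for experimentation
    """
    if explicit is not None:
        return explicit
    search_dirs = [
        os.getcwd(),               # the current directory
        os.path.expanduser("~"),   # the current user's home directory
        "/usr/local/bin",
    ]
    for directory in search_dirs:
        path = os.path.join(directory, name)
        if exists(path):
            return path            # the first match wins
    return None


print(find_rc_file())
print(find_rc_file(explicit=os.path.expanduser("~/etc/wallacedrc")))
```

Because the first match wins, a stray .wallacedrc in the current directory would shadow the one in the home directory, which is why passing -f is the safer choice.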
GENERAL SYNTAX
The rc file is a plain-text ASCII file, each directive appearing on one line. Comments begin with the "#" (hash or pound) character and continue to the end of the line. Comments cannot appear at the end of directives -- they must start on their own line.
SET COMMAND
To change a configuration setting, use set, followed by the setting, and then the value (leave the value blank to clear the setting). Setting names are the same as those for long options (see above), without the leading "--".
ECHO COMMAND
Causes any text following the command to be written to the error log. Useful for recording that a particular rc file was used.
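As an illustration of the syntax rules above, here is a minimal Python sketch of a reader for this kind of rc file. It is not wallaced's own parser. The function name parse_rc is hypothetical, the sample values are made up, and the handling of unknown directives (raising an error) is an assumption.

```python
def parse_rc(text):
    """Parse rc-file text into (settings dict, list of echo messages)."""
    settings, echoes = {}, []
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        # Blank lines and whole-line comments are skipped.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        command = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if command == "set":
            fields = rest.split(None, 1)
            if not fields:
                raise ValueError(f"line {lineno}: set needs a setting name")
            # A missing value clears the setting (stored here as "").
            settings[fields[0]] = fields[1] if len(fields) > 1 else ""
        elif command == "echo":
            echoes.append(rest)
        else:
            raise ValueError(f"line {lineno}: unknown directive {command!r}")
    return settings, echoes


# Illustrative rc file; setting names come from the long options list.
SAMPLE = """\
# sample rc file
echo using the sample rc file
set max_robots 3
set db_host bondi
set email_to
"""

settings, echoes = parse_rc(SAMPLE)
print(settings)
print(echoes)
```

Setting names are written without the leading "--", so max_robots here corresponds to the --max_robots long option.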
To start 3 spiders (maximum) over all indexable sites:
% wallaced --max_robots 3
To start 5 spiders (maximum) over all indexable sites contained in the dbdev database on the server bondi:
% wallaced --max_robots 5 --db_host bondi --db_name dbdev
To start the default number of spiders, but read the URL list from file /mydir/urllist.txt:
% wallaced --db_host file:///mydir/urllist.txt
Wallace and its subprocesses are memory intensive tasks, and so tend to be memory bound rather than CPU bound.
gromit(1); commet(1).
There is some performance degradation due to decreased staggering over the life of a wallace process (see discussion of PERFORMANCE ISSUES, above).
Some mechanism to dynamically control Gromit processes through Wallace (such as shutdown, suspend, restart) would be useful.
Shared configuration data is the next priority.
Daniel Austin (dan at austlii.edu.au)
Australasian Legal Information Institute (AustLII)