6.9. Annexure - Some technical information about the targeted web spider
Gromit is a specialist web robot. It targets selected legal web sites, namely a
subset of the URLs contained in AustLII's World Law Index Internet indices,
chosen for their high-value legal content. The Gromit Web Robot (Gromit) is a
single program that recursively downloads all text files on a site for indexing
by AustLII's SINO Search Engine.
Normal operation for remote indexing purposes (as opposed to mirroring) is in
text-only mode: image links are ignored, as are any links that do not appear
to be of MIME type text/html or text/plain.
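By way of illustration only, the following minimal Perl sketch shows the kind
of crawl loop this implies: a queue of URLs is worked through recursively,
only resources of MIME type text/html or text/plain are kept, and only HTML
pages contribute further links. The start URL, agent name and site
restriction are hypothetical, and this is not the actual Gromit source.

    #!/usr/bin/perl -w
    # Sketch only: a queue-based crawl of one site, keeping to text/html and
    # text/plain resources.
    use strict;
    use LWP::RobotUA;
    use HTML::LinkExtor;

    my $ua = LWP::RobotUA->new('Gromit-sketch/0.1', 'webmaster@example.edu.au');

    my @queue = ('http://www.example.edu.au/');   # hypothetical start URL
    my %seen;

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $res = $ua->get($url);
        next unless $res->is_success;

        my $type = $res->content_type;
        next unless $type eq 'text/html' || $type eq 'text/plain';

        # ... hand $res->decoded_content to the local cache for indexing ...

        next unless $type eq 'text/html';   # only HTML pages yield more links
        my $extor = HTML::LinkExtor->new(undef, $url);   # absolutise links
        $extor->parse($res->decoded_content);
        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' && $attr{href};      # skip images etc.
            push @queue, "$attr{href}"
                if "$attr{href}" =~ m{^http://www\.example\.edu\.au/};
        }
    }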
Gromit maintains a local cache of downloaded documents, so that they can be
indexed by AustLII's SINO Search Engine. The cached documents are not available
for browsing or downloading via AustLII's servers - users must go to the
original host in order to browse or download.
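One possible shape for such a cache is sketched below: each downloaded
document is written to a local file whose path is derived from its URL, ready
for SINO to index. The cache root and path scheme shown are assumptions; the
actual Gromit cache layout is not described here.

    # Sketch only: one possible mapping from URL to local cache file.
    use strict;
    use URI;
    use File::Path qw(mkpath);
    use File::Basename qw(dirname);

    my $cache_root = '/cache/gromit';             # hypothetical location

    sub cache_path {
        my ($url) = @_;
        my $uri  = URI->new($url);
        my $path = $uri->path || '/';
        $path .= 'index.html' if $path =~ m{/$};  # directory URL, default name
        return "$cache_root/" . $uri->host . $path;
    }

    sub store {
        my ($url, $content) = @_;
        my $file = cache_path($url);
        mkpath(dirname($file));                   # create directories as needed
        open my $fh, '>', $file or die "cannot write $file: $!";
        print {$fh} $content;
        close $fh;
    }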
Gromit is not intended to be used directly by a human operator. Typically, it
runs under the control of Wallace, a control script that fires off Gromit
processes over blocks of URLs. AustLII's new software for the Links indices,
Feathers, will invoke Gromit processes in relation to those sites selected by
the editors of the indices. A separate version of Gromit to access protected
databases on remote servers, with permission, is also available.
Wallace is a harness program for Gromit. Wallace instructs Gromit as to which
sites it should download, and monitors its progress. Wallace runs several
spider processes at once, up to a preset maximum. When one spider finishes,
another is started automatically to download a different site. Wallace reads
the list of links to download from a remote mSQL database using the Perl DBD
and DBI modules. The database is expected to be in the format maintained by
the Feathers links system.
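The sketch below shows what such a read might look like using DBI. The data
source name, table and column names are assumptions for illustration only,
not the actual Feathers schema.

    # Sketch only: reading the URL list from the Feathers mSQL database.
    use strict;
    use DBI;

    my $dbh = DBI->connect('DBI:mSQL:feathers:dbhost.austlii.edu.au',
                           undef, undef, { RaiseError => 1 });

    # Hypothetical schema: links(url, action), where action marks each link
    # for 'index' or 'mirror'.
    my $sth = $dbh->prepare(
        q{SELECT url FROM links WHERE action = 'index' OR action = 'mirror'});
    $sth->execute;

    my @urls;
    while (my ($url) = $sth->fetchrow_array) {
        push @urls, $url;
    }
    $sth->finish;
    $dbh->disconnect;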
Wallace first retrieves from the database all the URLs that are marked for
indexing or mirroring. It then sorts the URLs by host name and groups them
into host bands (groups of URLs that share the same host name); each band is
passed as a URL list to the web spider (Gromit) for downloading. Wallace runs
its spiders concurrently, with one spider per host band running at the same
time, up to a maximum of 10; the user can modify this maximum. As one spider
completes, another is started, until all host bands have been downloaded.
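In outline, the banding and process management just described might be
implemented along the following lines. The grouping by host and the
10-process ceiling follow the description above; the 'gromit' command line
and the use of fork and exec are illustrative assumptions.

    # Sketch only: grouping URLs into host bands and running one spider per
    # band, never more than $MAX_SPIDERS at once.
    use strict;
    use URI;

    my $MAX_SPIDERS = 10;      # user-configurable ceiling
    my @urls = @ARGV;          # the URLs read from the database

    # Group URLs into host bands.
    my %band;
    push @{ $band{ URI->new($_)->host } }, $_ for @urls;

    my $running = 0;
    for my $host (sort keys %band) {
        if ($running >= $MAX_SPIDERS) {   # wait for a free slot
            wait();
            $running--;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            exec('gromit', @{ $band{$host} });   # child: download one band
            die "exec failed: $!";
        }
        $running++;
    }
    wait() while $running-- > 0;          # reap the remaining spiders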
Gromit is a relatively unobtrusive robot, designed to have minimal impact on
the sites it visits. The robot, designed and implemented by AustLII staff, has
been written in Perl 5, and uses the LWP library. In particular, the
LWP::RobotUA object is used as the basis for Gromit. That module, together with
other measures taken in the program, minimises the robot's impact on web
server performance, as set out below (a brief sketch of the relevant settings
follows the list):
- It obeys the Robots Exclusion Protocol so as not to visit areas where
robots are not welcome. Specifically, it obeys directives in the robots.txt
file in the root directory of servers (see Robots Exclusion at The Web Robots
Pages).
- No single site is accessed more than once by the robot within any
one-minute period.
- The robot caches downloaded documents for later indexing, and will issue a
HEAD request for a page before attempting to download fresh versions of already
cached pages. On those web sites that support such mechanisms, Gromit will take
advantage of the If-Modified-Since and Last-Modified HTTP headers, reducing
server load for those machines.
- A notorious problem with web spiders is that they can saturate a remote
site with requests, slowing down the remote server and denying access to other
web users. Because URLs are grouped into host bands, each site is handled by
a single Gromit process that works through its URLs one at a time, so no site
receives simultaneous requests from Gromit.
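The sketch below illustrates the LWP::RobotUA settings behind the points
listed above: robots.txt is fetched and obeyed automatically, the delay
method enforces the one-request-per-minute rule, and a HEAD request followed
by a conditional GET makes use of the Last-Modified and If-Modified-Since
headers. The agent name, contact address, URL and cached date are
illustrative only.

    # Sketch only: the LWP::RobotUA settings behind the points listed above.
    use strict;
    use LWP::RobotUA;
    use HTTP::Request;

    my $ua = LWP::RobotUA->new('Gromit-sketch/0.1', 'webmaster@example.edu.au');
    $ua->delay(1);    # at most one request per site per minute (delay is in
                      # minutes); robots.txt is fetched and obeyed automatically

    my $url    = 'http://www.example.edu.au/doc.html';   # hypothetical page
    my $cached = 'Mon, 01 Feb 1999 00:00:00 GMT';        # date of cached copy

    # Cheap freshness check, as described above: a HEAD request first, then
    # a conditional GET using If-Modified-Since.
    my $head = $ua->head($url);
    if ($head->is_success && $head->last_modified) {
        my $req = HTTP::Request->new(GET => $url);
        $req->header('If-Modified-Since' => $cached);
        my $res = $ua->request($req);
        if ($res->code == 304) {
            # 304 Not Modified: keep the cached copy.
        }
        elsif ($res->is_success) {
            # Modified: replace the cached copy with $res->decoded_content.
        }
    }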
Gromit is
still under development, and during this initial stage will not be running
unattended. Further information can be obtained on the page `Gromit Web Robot -
Information for Web Managers'.