
9. Appendices


Some technical information about the targeted web spider

Gromit is a specialist web robot. It targets selected legal web sites, namely a subset of the URLs contained in AustLII's Links internet indices, selected for their high value legal content. Gromit Web Robot (Gromit) is a single program that recursively downloads all text files on a site for indexing by AustLII's SINO Search Engine.

We call Gromit a Targeted Web Spider, as it is not designed to traverse the Web generally: its downloading is limited to the site specified in the original URL given when it is invoked. For example, if Gromit is invoked to download the URL http://actag.canberra.edu.au/actag/ (ie the A.C.T. Lawnet site), any linked pages that fall below the original URL (ie lower down in the file hierarchy on the same server) will be downloaded. Linked pages outside that scope are ignored. The Gromit robot is not allowed to wander "off site".
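The scope rule above can be sketched as follows. This is a minimal illustration in Python (Gromit itself is written in Perl 5); the function name `in_scope` is ours, not Gromit's.

```python
from urllib.parse import urlparse

# Sketch of the "stay below the start URL" rule: a linked page is
# followed only if it is on the same host as the start URL and its
# path lies at or below the start URL's path.

def in_scope(start_url, link):
    start = urlparse(start_url)
    target = urlparse(link)
    if target.netloc != start.netloc:
        return False                       # never wander off-site
    return target.path.startswith(start.path)

start = "http://actag.canberra.edu.au/actag/"
print(in_scope(start, "http://actag.canberra.edu.au/actag/whatsnew.html"))  # True
print(in_scope(start, "http://actag.canberra.edu.au/other/index.html"))     # False
print(in_scope(start, "http://www.austlii.edu.au/"))                        # False
```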

Normal operation for remote indexing purposes (as opposed to mirroring) is in text-only mode, so image links are ignored, as are any links that do not appear to be of MIME type text/html or text/plain.
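The text-only filter amounts to a simple check on the response's Content-Type header; a sketch in Python (names are illustrative, not Gromit's own):

```python
# Only responses of MIME type text/html or text/plain are kept for
# indexing in text-only mode; everything else is ignored.

ACCEPTED_TYPES = {"text/html", "text/plain"}

def is_indexable(content_type_header):
    # Strip any parameters, e.g. "; charset=ISO-8859-1"
    mime = content_type_header.split(";")[0].strip().lower()
    return mime in ACCEPTED_TYPES

print(is_indexable("text/html; charset=ISO-8859-1"))  # True
print(is_indexable("image/gif"))                      # False
```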

Gromit maintains a local cache of downloaded documents, so that they can be indexed by AustLII's SINO Search Engine. The cached documents are not available for browsing or downloading via AustLII's servers - users must go to the original host in order to browse or download.

Wallace the Gromit harness

Gromit is not intended to be used directly by a human operator. Typically, it runs under the control of Wallace, a control script that fires off Gromit processes over blocks of URLs. AustLII's new software for the Links indices, Feathers, will invoke Gromit processes in relation to those sites selected by the editors of the indices. A separate version of Gromit to access protected databases on remote servers, with permission, is also available.

Wallace is a harness program for Gromit. Wallace instructs Gromit as to which sites it should download, and monitors its progress. Wallace runs a number of spider processes at any one time, but limits the number of spiders to a preset maximum. When one spider finishes, another is started automatically to download a different site. Wallace reads the list of links to download from a remote mSQL database using the Perl DBD and DBI modules. The database is expected to be in the format maintained by the Feathers links system.

Wallace first retrieves from the database all the URLs that are marked for indexing or mirroring. It then sorts the URLs by host name. URLs are grouped into host bands (that is, groups of URLs sharing the same host name), and these bands are passed as URL lists to the web spider (gromit) for downloading. Wallace runs its spiders concurrently: there may be a web spider running for each host band at the same time, up to a maximum of 10, and the user can modify this maximum. As one spider completes, another is started, until all host bands have been downloaded.
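The scheduling described above can be sketched as follows. This is a Python illustration of the idea only (Wallace is a Perl script that launches separate Gromit processes); `run_spider` stands in for launching one Gromit process over a host band.

```python
from collections import defaultdict
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor

MAX_SPIDERS = 10   # the preset limit mentioned above; user-adjustable

def host_bands(urls):
    """Group URLs into bands sharing the same host name."""
    bands = defaultdict(list)
    for url in sorted(urls):            # sort so each band is ordered
        bands[urlparse(url).netloc].append(url)
    return bands

def run_spider(host, urls):
    # Stand-in for one Gromit process downloading a host band.
    return (host, len(urls))

def wallace(urls):
    """Run up to MAX_SPIDERS bands at once; as one finishes, the
    next band is started, until all bands are done."""
    bands = host_bands(urls)
    with ThreadPoolExecutor(max_workers=MAX_SPIDERS) as pool:
        results = pool.map(lambda item: run_spider(*item), bands.items())
    return dict(results)
```

Because each band is handed to exactly one spider, the pool never runs two workers against the same host at once, which is the point of banding.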

Impact on other sites

Gromit is a relatively unobtrusive robot, designed to have minimal impact on the sites it visits. The robot, designed and implemented by AustLII staff, is written in Perl 5 and uses the LWP library. In particular, the LWP::RobotUA object is used as the basis for Gromit. That module, together with other measures taken in the program, minimises Gromit's impact on the performance of the sites it visits because:


* It obeys the Robots Exclusion Protocol so as to not visit areas where robots are not welcome. Specifically, it obeys directives in the robots.txt file in the root directory of servers (see Robots Exclusion at The Web Robots Pages).


* No one site is accessed twice by the robot within a 2 minute period.


* The robot caches downloaded documents for later indexing, and issues a HEAD request for an already-cached page before attempting to download a fresh version. On web sites that support such mechanisms, Gromit takes advantage of the If-Modified-Since and Last-Modified HTTP headers, reducing load on those servers.


* A notorious problem with web spiders is that they can saturate a remote site with requests, slowing down the remote server and denying access to other web users. Because URLs are grouped into host bands and each spider works through its band sequentially, no one site is ever accessed by more than one Gromit process at a time.
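The cache-revalidation behaviour in the third point can be sketched as follows. This is a simplified Python illustration of the HTTP conditional-request pattern, not Gromit's actual Perl code; the cache entry structure is an assumption.

```python
# Before re-downloading a cached page, the cached Last-Modified value
# is sent back as If-Modified-Since; a 304 (Not Modified) response
# means the cached copy is still current and no body is transferred.

def revalidation_headers(cache_entry):
    headers = {}
    if "last_modified" in cache_entry:
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers

def handle_response(status, cache_entry, body):
    if status == 304:                   # Not Modified: reuse cached body
        return cache_entry["body"]
    cache_entry["body"] = body          # fresh copy replaces the cache
    return body
```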

Gromit is still under development, and during this initial stage will not be running unattended. Further information can be obtained on the page `Gromit Web Robot - Information for Web Managers'.

Mirror sites on AustLII

AustLII has been granted permission to mirror certain legal sites. The Gromit robot is used to download these sites and keep the mirrors updated. When mirroring, Gromit rewrites local URLs to use the mirror copies of documents, and also downloads any graphics or other files that may be referenced there.
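The URL rewriting in mirror mode amounts to mapping links within the mirrored site onto the local mirror path while leaving off-site links untouched; a sketch in Python (the mirror path shown is hypothetical):

```python
# Links that point within the mirrored site are rewritten to the
# local mirror copy; links to other sites are left as-is.

def rewrite_link(link, remote_prefix, mirror_prefix):
    if link.startswith(remote_prefix):
        return mirror_prefix + link[len(remote_prefix):]
    return link

print(rewrite_link(
    "http://actag.canberra.edu.au/actag/whatsnew.html",
    "http://actag.canberra.edu.au/actag/",
    "/au/mirror/actag/"))
# /au/mirror/actag/whatsnew.html
```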

List of proposed development law subject headings for DIAL Index

Accounting and auditing

Administrative law

Alternative dispute resolution

Arbitration

Banking and finance

Bankruptcy, insolvency, reorganization

Capital markets

Consumer protection

Contract

Companies and other business entities

Courts and the judicial system

Deeds and other instruments

Environment

Foreign investment

Industrial relations, labor law

Infrastructure

    Electricity

    Ports

    Roads

    Telecommunications

    Water supply and sanitation

Insurance

Intellectual property

Leases and tenancies

Legal practitioners

Litigation

Maritime law

Media and communications

Mortgages and securities

Natural resources

Practice and procedure

Privatization

Procurement

Public administration

Public health

Real property, land law

Regulatory law, government regulation of business

Social welfare and services

Taxation, revenue, customs

Trade practices

Trade and commerce

Transport

