GROMIT

Gromit Web Robot - Information for Web Managers

Introduction

Gromit is a specialist web robot designed and implemented by programmers at the Australasian Legal Information Institute. It targets legal information web sites based on URLs contained in its Links Database. Web managers and systems administrators who need to contact the Gromit authors can do so by:

Web site administrators who wish to discuss the operation of the robot on their site should contact Geoffrey King or Dan Austin using one of the above contact methods.

Note that a second robot, called Commet, is also run from the AustLII servers. This robot is a links validator and uses the same user-agent field as Gromit for the purposes of robots.txt files.

Stopping Gromit from Accessing your Site

Gromit obeys the Robots Exclusion Protocol. You can control whether or not web robots are allowed to access your site (and which areas of your site they can access) through a special file on your web server called robots.txt. The robots.txt file must exist in the root directory of your web server (not the root directory of the computer itself). For example, AustLII's robots.txt file can be found at /robots.txt.

To prevent Gromit from accessing parts of your site:

  1. Create (or edit) a file called robots.txt in the base directory for your web documents.
  2. The first line of the file will identify which robot you want to control access for. Type User-agent: Gromit to control just AustLII's robot, or use User-agent: * to control access for all robots.
  3. Next, list all the directories that the robot is not allowed to access. A directory name covers all of its sub-directories, so to restrict the whole site, you would type Disallow: /. To stop access to your CGI area, use Disallow: /cgi-bin/. Use as many Disallow lines as needed.
  4. Save the file and make it world readable. Check that you can load it remotely using a web browser. An example robots.txt file is below.
Example robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/
Disallow: /opinions/

The file above applies to all robots, and stops them from accessing the CGI and opinions directories. They can download things in /acts/1997/ and /cases/1997/, but not files stored in the privacy or irt subdirectories. See Additional Information and Links below for further details.
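
If you want to check a robots.txt file before publishing it, a short script can help. The sketch below is not part of Gromit (which is written in Perl); it uses the robots.txt parser from Python's standard library to test the example file above. The helper name is_allowed and the sample paths are illustrative assumptions, and other parsers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt file shown above.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/
Disallow: /opinions/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if `user_agent` may fetch `path` under `robots_txt`.

    Hypothetical helper: parses the text in memory, so no network
    access is needed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Sample paths are made up for illustration.
    for path in (
        "/acts/1997/index.html",
        "/acts/1997/privacy/index.html",
        "/cases/1997/irt/1.html",
        "/opinions/1.html",
        "/cgi-bin/search",
    ):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, "Gromit", path) else "blocked"
        print(f"{path}: {verdict}")
```

Checking the file this way only tells you how one parser reads it, so it is still worth loading the file remotely in a browser as described in step 4.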

Allowing Gromit in

Here we will assume that you already have a robots.txt file, and just want to allow Gromit to access areas that other robots can't. Consider the sample file above. If we wanted to allow Gromit to read /opinions/, we'd add this section to the start of the robots.txt file:

User-agent: Gromit
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/

Now the whole file should read:

User-agent: Gromit
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/

User-agent: *
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/
Disallow: /opinions/

What we have done is create a separate control section for Gromit, so that other web robots cannot read /opinions/, but Gromit can.
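
To confirm that the two sections behave as intended, the combined file can be tested for each robot in turn. This is a minimal sketch using Python's standard robots.txt parser, not Gromit's own code; the robot name "OtherBot" and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt file shown above.
COMBINED_ROBOTS_TXT = """\
User-agent: Gromit
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/

User-agent: *
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/
Disallow: /opinions/
"""


def access_by_agent(robots_txt: str, agents: list, path: str) -> dict:
    """Map each user-agent name to whether it may fetch `path`.

    Hypothetical helper: the file is parsed once, in memory.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


if __name__ == "__main__":
    # "OtherBot" stands in for any robot without its own section.
    for path in ("/opinions/1.html", "/cgi-bin/search"):
        print(path, access_by_agent(COMBINED_ROBOTS_TXT, ["Gromit", "OtherBot"], path))
```

A robot with its own section is governed by that section alone, which is why the Gromit section has to repeat the Disallow lines that should still apply to it.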

Note that robots.txt is a voluntary protocol -- robot designers can ignore it if they want to. However, Gromit will always honour a robots.txt file if it conforms to the correct specification.

Note that there is no "Allow" line (only "Disallow") in the current specification. Some reviews of the robots exclusion protocol are underway.

Performance Characteristics

Gromit is designed to have minimal impact on the sites it visits. The robot has been written in Perl 5, and uses the LWP library. In particular, the LWP::RobotUA object is used as the basis for Gromit. That module, together with other measures taken in the program, minimises impact on web performance by:

  1. Obeying the Robots Exclusion Protocol so as not to visit areas where robots are not welcome (specifically, it obeys directives in the robots.txt file in the root directory of your server);

  2. Ensuring that no one site is accessed twice by the robot within a 1 minute period;

  3. Caching downloaded documents for later indexing, and issuing a HEAD request for a page before attempting to download a fresh version of an already cached page (the robot uses the Last-Modified header in the HEAD response). If you use a server (such as Apache) that returns a Last-Modified header in HEAD responses, your server load from Gromit will be lighter. Some servers (such as Netscape Enterprise Server), however, do not return Last-Modified headers;

  4. Adding "If-Modified-Since" headers to all GET requests, so that servers that support them will not send documents that have not changed since the robot last visited the site;

  5. Following links selectively. Gromit is a directed robot, meaning it is discriminating about the links it follows and downloads. It will only access those sites which are listed in AustLII's Legal Links pages. These pages are selected specifically for their legal information content.
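
The one-minute gap between visits and the conditional requests described in the list above can be sketched as follows. This is an illustration in Python, not Gromit's actual Perl code, and it does not reproduce the LWP::RobotUA interface. The class and function names, the example.org URLs, and the timestamps are all assumptions made for the example. No request is sent; the code only builds the request objects.

```python
from email.utils import formatdate
from urllib.parse import urlsplit
from urllib.request import Request

MIN_DELAY_SECONDS = 60.0  # the 1 minute period mentioned in the list above


class PolitenessTracker:
    """Remembers when each host was last contacted (hypothetical helper)."""

    def __init__(self, min_delay: float = MIN_DELAY_SECONDS) -> None:
        self.min_delay = min_delay
        self._last_visit = {}  # host name -> time of last request (epoch seconds)

    def wait_needed(self, url: str, now: float) -> float:
        """Seconds to wait before the host of `url` may be contacted again."""
        last = self._last_visit.get(urlsplit(url).netloc.lower())
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_visit(self, url: str, now: float) -> None:
        self._last_visit[urlsplit(url).netloc.lower()] = now


def head_request(url: str, user_agent: str = "Gromit") -> Request:
    """Build a HEAD request, used to read Last-Modified without a download."""
    return Request(url, headers={"User-Agent": user_agent}, method="HEAD")


def conditional_get(url: str, cached_at: float, user_agent: str = "Gromit") -> Request:
    """Build a GET request that lets the server skip unchanged documents.

    `cached_at` is the epoch time of the cached copy; it is sent as an
    HTTP date in the If-Modified-Since header.
    """
    headers = {
        "User-Agent": user_agent,
        "If-Modified-Since": formatdate(cached_at, usegmt=True),
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    tracker = PolitenessTracker()
    tracker.record_visit("http://www.example.org/acts/", now=1000.0)
    # Twenty seconds later the same host is still inside the quiet period.
    print(tracker.wait_needed("http://www.example.org/cases/", now=1020.0))
    request = conditional_get("http://www.example.org/acts/", cached_at=0.0)
    print(request.get_method(), request.header_items())
```

A server that supports If-Modified-Since would typically answer such a request with a 304 Not Modified status and no body when the document is unchanged, which is where the saving in server load comes from.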

If you need information on the correct management of robot exclusion files, see Robots Exclusion at The Web Robots Pages. Alternatively, you can contact AustLII at one of the addresses listed above in order to discuss the issues.

Caching Information

Gromit maintains a local cache of downloaded documents, so that they can be indexed by AustLII's SINO Search Engine. The cached documents are not available for download via AustLII's servers.

AustLII has been granted permission to mirror certain legal sites, as a free service to certain web maintainers. The Gromit robot is used to download these sites and keep the mirrors updated. When mirroring, Gromit rewrites local URLs to use the mirror copies of documents, and also downloads any graphics or other files that may be referenced there.
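
For readers unfamiliar with mirroring, the URL rewriting mentioned above might look roughly like the sketch below. It is a guess at the general technique, not a description of how Gromit does it: the function name, the host names and the mirror location are all hypothetical. Links that point to the mirrored site are redirected to the mirror copy, and links to other sites are left alone.

```python
from urllib.parse import urljoin, urlsplit


def rewrite_for_mirror(link: str, page_url: str, origin_host: str, mirror_base: str) -> str:
    """Return the URL that `link` should have in a mirrored copy of a page.

    link        -- href as it appears in the original page (may be relative)
    page_url    -- URL of the original page containing the link
    origin_host -- host name of the site being mirrored (hypothetical)
    mirror_base -- URL under which the mirror copy lives (hypothetical)
    """
    absolute = urljoin(page_url, link)  # resolve relative links first
    parts = urlsplit(absolute)
    if parts.netloc.lower() != origin_host.lower():
        return absolute  # external link: keep pointing at the other site
    rewritten = mirror_base.rstrip("/") + (parts.path or "/")
    if parts.query:
        rewritten += "?" + parts.query
    if parts.fragment:
        rewritten += "#" + parts.fragment
    return rewritten


if __name__ == "__main__":
    page = "http://www.example.org/cases/1997/index.html"
    mirror = "http://mirror.example.net/example/"
    for href in ("../acts/a.html#s2", "/images/logo.gif", "http://other.example.com/x"):
        print(href, "->", rewrite_for_mirror(href, page, "www.example.org", mirror))
```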

Additional Information and Links

Here are some additional links containing information about this and other web robots:


Contact Dan Austin for information on the Gromit Web Robot.