Gromit is a specialist web robot designed and implemented by programmers at the Australasian Legal Information Institute. It targets legal information web sites based on URLs contained in its Links Database. Web managers and systems administrators who need to contact the Gromit authors can do so by:
- Mailing dan at austlii.edu.au (Dan Austin) or geoff @ austlii.edu.au (Geoffrey King); or
- Using the UNIX talk program (also available on other platforms) to issue a talk request to dan@tathra.austlii.edu.au; or
- Phoning Dan Austin (+61 2 9514 3173).
Web site administrators who wish to discuss the operation of the robot on their site should contact Geoffrey King or Dan Austin using one of the above contact methods.
Note that a second robot, called Commet, is also run from the AustLII servers. This robot is a links validator, and uses the same user-agent field as Gromit for the purposes of robots.txt files.
Gromit obeys the Robots Exclusion Protocol. You can control whether or not web robots are allowed to access your site (and what areas of your site they can access) through the use of a special file on your web server called robots.txt. The robots.txt file must exist in the root directory of your web server (not the root directory of the computer itself). For example, AustLII's robots.txt file can be found at /robots.txt.
To prevent Gromit from accessing parts of your site:
- Create (or edit) a file called robots.txt in the base directory for your web documents.
- The first line of the file identifies which robot you want to control access for. Type User-agent: Gromit to control just AustLII's robot, or use User-agent: * to control access for all robots.
- Next, list all the directories that the robot is not allowed to access. A directory name includes all of its sub-directories, so to restrict the whole site, you would type Disallow: /. To stop access to your CGI area, use Disallow: /cgi-bin/. Use as many Disallow lines as needed.
- Save the file and make it world readable. Check that you can load it remotely using a web browser. An example robots.txt file is below.
Example robots.txt file:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /acts/1997/privacy/
    Disallow: /cases/1997/irt/
    Disallow: /opinions/
The file above applies to all robots, and stops them from accessing the CGI and opinions directories. They can download things in /acts/1997/ and /cases/1997/, but not files stored in the privacy or irt subdirectories. See More Information below for further details.
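If you want to check how a robots.txt file like the example will be interpreted before publishing it, a small script can help. The following is a minimal sketch using Python's standard urllib.robotparser module; it is not part of Gromit (which is written in Perl), and the helper name `allowed` and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt file from this page.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/
Disallow: /opinions/
"""


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if `user_agent` may fetch `path` under `robots_txt`.

    `allowed` is a hypothetical helper for this sketch, not a Gromit function.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Illustrative paths only; substitute paths from your own site.
    for path in (
        "/acts/1997/index.html",
        "/acts/1997/privacy/s1.html",
        "/cases/1997/irt/1.html",
        "/opinions/1.html",
        "/cgi-bin/search",
    ):
        verdict = "allowed" if allowed(EXAMPLE_ROBOTS_TXT, "Gromit", path) else "blocked"
        print(f"{path}: {verdict}")
```

Because the file uses User-agent: *, the same answers apply to any robot name you pass in.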
Here we will assume that you already have a robots.txt file, and just want to allow Gromit to access areas that other robots can't. Consider the sample file above. If we wanted to allow Gromit to read /opinions/, we'd add this section to the start of the robots.txt file:
    User-agent: Gromit
    Disallow: /cgi-bin/
    Disallow: /acts/1997/privacy/
    Disallow: /cases/1997/irt/
Now the whole file should read:
    User-agent: Gromit
    Disallow: /cgi-bin/
    Disallow: /acts/1997/privacy/
    Disallow: /cases/1997/irt/

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /acts/1997/privacy/
    Disallow: /cases/1997/irt/
    Disallow: /opinions/
What we have done is create a separate control section for Gromit, so that other web robots cannot read /opinions/, but Gromit can.
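The effect of the separate Gromit section can be checked with a short script. This is a minimal sketch using Python's standard urllib.robotparser module, not Gromit's own Perl code; the helper `may_fetch`, the robot name "OtherBot" and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The combined file from this page: one section for Gromit, one for everyone else.
COMBINED_ROBOTS_TXT = """\
User-agent: Gromit
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/

User-agent: *
Disallow: /cgi-bin/
Disallow: /acts/1997/privacy/
Disallow: /cases/1997/irt/
Disallow: /opinions/
"""


def may_fetch(user_agent: str, path: str) -> bool:
    """Hypothetical helper: evaluate COMBINED_ROBOTS_TXT for one robot and path."""
    parser = RobotFileParser()
    parser.parse(COMBINED_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "OtherBot" stands in for any robot without its own section.
    for agent in ("Gromit", "OtherBot"):
        for path in ("/opinions/1.html", "/cgi-bin/search"):
            verdict = "allowed" if may_fetch(agent, path) else "blocked"
            print(f"{agent} {path}: {verdict}")
```

A robot uses the section that names it and ignores the User-agent: * section, which is why the Gromit section has to repeat the Disallow lines that should still apply to it.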
Note that robots.txt is a voluntary protocol -- robot designers can ignore it if they want to. However, Gromit will always honour a robots.txt file if it conforms to the correct specification.
Note that there is no "Allow" line (only "Disallow") in the current specification. Some reviews of the robots exclusion protocol are underway.
If you need information on the correct management of robot exclusion files, see Robots Exclusion at The Web Robots Pages. Alternatively, you can contact AustLII at one of the addresses listed above in order to discuss the issues.
Gromit is designed to have minimal impact on the sites it visits. The robot has been written in Perl 5, and uses the LWP library. In particular, the LWP::RobotUA object is used as the basis for Gromit. That module, together with other measures taken in the program, minimises impact on web performance by:
- Obeying the Robots Exclusion Protocol so as not to visit areas where robots are not welcome (specifically, it obeys directives in the robots.txt file in the root directory of your server);
- Accessing no one site more than once within a 1 minute period;
- Caching downloaded documents for later indexing, and issuing a HEAD request for a page before attempting to download a fresh version of an already cached page (it uses the Last-Modified header in the HEAD response). If you use a server (such as Apache) that returns a Last-Modified header in HEAD responses, your server load from Gromit will be lighter. Some servers (such as Netscape Enterprise Server) do not return Last-Modified headers, however;
- Adding "If-Modified-Since" headers to all GET requests, in addition to the above, so that servers that support it will not send documents that have not changed since the robot last visited the site;
- Following links selectively. Gromit is a directed robot, meaning it is discriminating about the links it follows and downloads. It will only access those sites which are listed in AustLII's Legal Links pages. These pages are selected specifically for their legal information content.
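Two of the measures above, the one-minute gap between visits to a site and the "If-Modified-Since" header on GET requests, can be illustrated in a few lines. The following is a sketch in Python using only the standard library. It is not Gromit's implementation (Gromit is built on Perl's LWP::RobotUA), and the names `HostThrottle` and `conditional_get` are hypothetical. No network traffic is generated; the code only computes waiting times and builds request objects.

```python
import time
from email.utils import formatdate
from urllib.parse import urlsplit
from urllib.request import Request

MIN_DELAY_SECONDS = 60.0  # "no one site more than once within a 1 minute period"


class HostThrottle:
    """Hypothetical helper: remember when each host was last visited."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock  # injectable so the logic can be checked without waiting
        self._last_visit = {}

    def seconds_to_wait(self, url):
        """How long a polite robot should still wait before fetching `url`."""
        host = urlsplit(url).netloc.lower()
        last = self._last_visit.get(host)
        if last is None:
            return 0.0  # never visited this host
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_visit(self, url):
        """Note that the host of `url` has just been accessed."""
        self._last_visit[urlsplit(url).netloc.lower()] = self.clock()


def conditional_get(url, cached_at_epoch, user_agent="Gromit"):
    """Build (but do not send) a GET request carrying If-Modified-Since.

    `cached_at_epoch` is the time, in seconds since the epoch, at which the
    cached copy was fetched. A server that supports the header can reply
    "304 Not Modified" instead of sending the document again.
    """
    request = Request(url, method="GET")
    request.add_header("User-Agent", user_agent)
    request.add_header("If-Modified-Since", formatdate(cached_at_epoch, usegmt=True))
    return request


if __name__ == "__main__":
    throttle = HostThrottle()
    url = "http://www.example.org/acts/index.html"  # placeholder URL
    print("wait before first visit:", throttle.seconds_to_wait(url))
    throttle.record_visit(url)
    print("wait before second visit:", round(throttle.seconds_to_wait(url)))
    print(conditional_get(url, time.time() - 86400).header_items())
```

Tracking the delay per host rather than per URL matches the wording above: the limit applies to a site as a whole, not to individual pages.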
Gromit does maintain a local cache of downloaded documents, so that they can be indexed by AustLII's SINO Search Engine. The cached documents are not available for download via AustLII's servers.
AustLII has been granted permission to mirror certain legal sites, as a free service to certain web maintainers. The Gromit robot is used to download these sites and keep the mirrors updated. When mirroring, Gromit rewrites local URLs to use the mirror copies of documents, and also downloads any graphics or other files that may be referenced there.
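To make the idea of rewriting local URLs concrete, here is a rough sketch in Python. It only illustrates the general technique and makes no claim about how Gromit performs the rewrite; the function `rewrite_for_mirror`, the host names and the mirror layout are all assumptions invented for the example.

```python
from urllib.parse import urljoin, urlsplit


def rewrite_for_mirror(link, page_url, origin_host, mirror_base):
    """Hypothetical helper: point a link found on `page_url` at the mirror.

    Links to the mirrored site (`origin_host`) are rewritten to live under
    `mirror_base`; links to any other site are returned as absolute URLs
    that still point at the original location.
    """
    absolute = urljoin(page_url, link)  # resolve relative links first
    parts = urlsplit(absolute)
    if parts.netloc.lower() != origin_host.lower():
        return absolute  # off-site link: leave it alone
    rewritten = mirror_base.rstrip("/") + "/" + parts.path.lstrip("/")
    if parts.query:
        rewritten += "?" + parts.query
    if parts.fragment:
        rewritten += "#" + parts.fragment
    return rewritten


if __name__ == "__main__":
    # Placeholder hosts; a real mirror would use its own addresses.
    page = "http://legal.example.org/acts/index.html"
    for link in ("../img/logo.gif", "1997/act1.html#s3", "http://elsewhere.example.com/x"):
        print(link, "->", rewrite_for_mirror(
            link, page, "legal.example.org", "http://mirror.example.net/legal/"))
```

Resolving each link against the page it came from before comparing hosts means that relative links, such as references to graphics, are handled the same way as absolute ones.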
Here are some additional links containing information about this and other web robots:
- Gromit Web Toolbox
An overview of the family of robot programs that make up the Gromit Web Toolbox.
- Gromit Manual Page
This is the manual page of the current development version of Gromit. Earlier versions' manual pages can be accessed from the archive directory (available from the above page).
- Wallace Manual Page
Wallace is the robot manager process (super-daemon) that oversees the operation of the Gromit web robot over indexed sites.
- Commet Manual Page
Commet is a stripped-down version of Gromit, used for links validation only. It uses the same User-Agent field as Gromit for the purposes of robots.txt files.
- Robots: A Tutorial
Information on building your own web robot using the LWP library and Perl.
- Perl Home Page; and libwww-perl Home Page
The LWP module (libwww) is a powerful toolset for Internet and Web programming using the Perl scripting language.