16th Biennial LAWASIA Conference
Seoul, Korea, 7-11 September 1999
The future of legal research via Internet, from an Asian perspective
(Project DIAL and AustLII's World Law)
Presenter: Professor Graham Greenleaf,
University of New South Wales, Australia
http://www2.austlii.edu.au/~graham/ (email: graham@austlii.edu.au)
Co-authors[*]: Graham Greenleaf, Daniel Austin, Philip Chung, Andrew Mowbray, Jill Matthews
and Madeleine Davis of AustLII (http://www.austlii.edu.au/)
Version: This is the first version of a work in progress which
may be found at http://www2.austlii.edu.au/~graham/publications/World_Law/.
A version of the paper was presented at AustLII's Law via the Internet
'99 Conference, 21-23 July 1999, University of Technology, Sydney.
1. The potential - A world-wide law library on the
Internet
Despite its recent development, the Web already contains an astonishing
variety of legal materials from dozens of countries[1].
Significant collections of legislation are already available on the Web
from over 50 countries[2] (http://www.austlii.edu.au/links/DIAL_Index/Legislation/).
The full text is available on the Web of all legislation from almost all
the jurisdictions of the USA, Canada, Australasia, many Latin American
countries and some European countries (such as Norway and Germany), and
extensive collections from many other European countries (such as the United
Kingdom, France, Spain, Portugal). Substantial collections of legislation
are available from many developing countries, including India, Turkey,
Kazakhstan, South Africa, Vietnam, Zambia, China, Mexico and Israel.
There are also extensive collections of case law from about 20
countries, particularly from North America and Australasia and some European
courts, but also courts from India, Korea, Brazil and other countries.
The Parliaments of dozens of countries have Web pages, and these contain
many significant resources concerning legislation and law reform. Law reform
commissions and similar bodies are starting to make their reports and working
papers available via the Web. There are specialist university and other
centres which provide very large specialist collections of materials in
areas such as constitutional law, trade law, the law of the sea and human
rights.
Despite the abundance of valuable legal materials already on the
Web, and the rapidity with which these materials are expanding, these materials
are often very difficult to find, since they are scattered across thousands
of Web sites located all around the world. As we shall see, the tools we
have used for internet legal research until now do not serve us well enough.
1.1. Importance to developing countries
The Internet is still dominated, in terms of both location of Web sites
and location of users, by the developed world of North America, Europe
and Australasia. The preponderance of English-language information on the
Web is in part a reflection of this.
However, the availability of legal information on the World-Wide-Web
is of considerable importance to developing countries, in Asia and elsewhere.
Law libraries with reasonably comprehensive and up-to-date collections
of legislation, case law and law reform reports, are virtually unknown
in the developing world, and the costs of maintaining them are prohibitive.
Access to large online commercial services such as Lexis is also prohibitive
for most lawyers in developing countries. This is the principal reason
that the Asian Development Bank has funded the development of Project DIAL
(Development of the Internet for Asian Law)[3] (http://www.austlii.edu.au/au/special/dial/)
(one subject of this paper): to provide a means by which legislative and
law reform personnel can obtain access to comparative legislative and law
reform models. It is of equal importance for superior Courts in developing
countries to have access to the decisions of similar Courts throughout
the world. It is not only the legislation or case law of the developed
countries that is of interest and importance: a legislator in Mongolia
might find their best model for a Build-Operate-Transfer (BOT) law on a
Kazakhstan Web site, and the Supreme Court of a small Pacific Islands country
is likely to find the most important precedents for its decisions in the
decisions (currently almost completely unobtainable) of similar Pacific
Island states.
Legal information on the Internet is also important to developing
countries for a quite different reason: the provision via the Internet
of comprehensive and up-to-date information about a country's laws and
legal system can be a striking demonstration of the transparency of a country's
legal system. This is likely to be of considerable importance to potential
foreign investors, both in symbolic terms and in facilitating efficient
and low cost legal advice necessary for investment decisions to be made.
A recent example is the proliferation of government agency Web sites in
Mongolia which provide the legislation that they administer[4]:
it would be very difficult and expensive to obtain that legislation by
any other means.
2. The problem - Internet legal research is difficult
2.1. Two types of tools - Catalogs and search engines
There are essentially only two types of tools which help users find legal
materials on the internet, commonly called catalogs and search engines.
Catalogs are where individual Web sites are classified by hand
according to various classificatory schemes. Because of the human intelligence
and effort involved, they are also called 'intellectual' indexes. Usually,
such indexes only provide the title, URL[5]
and perhaps a brief description of each site indexed. Yahoo![6]http://www.yahoo.com/]
is a well known example of an internet-wide catalog or intellectual index
of the Web (ie one which is not law-specific).
'Search engine', in the Web context, has become a shorthand way of referring
to the combination of a 'Web robot' and the data it gathers, and the text
retrieval software used to search that data. A program (variously called
a `Web robot' or `Web spider') traverses the Web, downloading every page
it encounters, so that every word on every page can be converted into one
very large word occurrence index ('concordance') which can be searched
by the text retrieval software. They can also be called 'automatic indexes'.
When the search engine displays a URL as a result of a search, that URL
is to the original site, not to a mirror on the remote site. Alta Vista,
Northern Light, Hot Bot, Excite and Infoseek are well-known
`general purpose' search engines that search an index created by a Web
spider, and where the Web spider goes to all types of subject matter. They
have only existed since Alta Vista's creation in 1996. The principal
advantage of this approach is that it is possible to search every word that
has been indexed, not just the titles and brief summary of what is on the
site. About 85% of internet users use search engines to locate information[7].
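To make the idea of a word occurrence index concrete, the following is a minimal Python sketch. It is not the software used by any of the search engines named above: the page texts and URLs are made-up examples, and `build_concordance` and `search_all` are hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict


def build_concordance(pages):
    """Map each word to the set of URLs of the pages in which it occurs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search_all(index, query):
    """Return the URLs of pages that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, as a Web robot might have downloaded them.
pages = {
    "http://example.org/act1": "Foreign Investment Act 1993",
    "http://example.org/case7": "Appeal concerning foreign judgments",
}
index = build_concordance(pages)
print(sorted(search_all(index, "foreign investment")))
```

Because every word of every page is indexed, the second page can be found by a search for 'judgments' even though a catalog entry for that site might mention only its title.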
Combinations of catalogs and search engines on the one site are
now becoming more common, and such combinations are often referred to as
'portals'. Some catalogs are now attempting the automated or semi-automated
classification of new Web sites located by a Web spider (for example, Excite
Australia[8]).
This approach is potentially promising but has not yet been shown to produce
useful systems for legal research.
Despite the existence of these research aids, finding legal information
on the internet is surprisingly difficult, partly because neither catalogs
nor search engines used alone can provide a satisfactory solution. This
is particularly so in relation to internet-wide (ie not law-specific) catalogs
and search engines. These difficulties will now be summarised[9].
2.2. Catalogs alone are insufficient
Good catalogs (intellectual indexes) for law are hard to find[10].
While there are many multi-country intellectual indexes to law on the internet[11],
none are even remotely comprehensive, and many are US-oriented with only
a slight international gloss. Many are updated only rarely if at all. Some
very good indexes do exist for particular countries (eg Canada, the USA,
Germany and Australia), and many exist for particular subject matter areas,
but they are often difficult to find from the multi-country indexes. It
is therefore difficult to find a good place to start. The coverage of legal
materials in general-purpose internet indexes is no more helpful, as an
inspection of the limited coverage of legal materials in an index such
as Yahoo! (the largest general-purpose catalog) will show[12].
Catalogs are hard to maintain. As the quantity of legal material
on the internet grows, the sites that contain significant legal information
grow so numerous, and some are so large, that it is difficult to maintain
catalogs at all, and particularly to maintain them with any depth of indexing
of each site. The best that can be hoped for is that sites with significant
legal materials are identified in the index, even though there is no detailed
description of their content. For example, it soon becomes impossible to
include in a catalog the content of each piece of legislation, each case,
or each journal article included on a large site.
As a result, we can say that catalogs are inherently shallow
- even when they are good at identifying important law sites, they cannot
index very deeply into those sites.
2.3. General purpose search engines alone are insufficient
The main problem of general purpose search engines for legal research is
their lack of comprehensiveness, but there are numerous other problems
as well[13].
General search engines are not comprehensive. There are very
good internet-wide search engines, but they are nowhere near as comprehensive
as people often assume. A July 1999 report in Nature by Lawrence
and Giles[14]
shows that of eleven major search engines tested (including Alta Vista,
Northern Light, Hot Bot, Excite and Infoseek)
the search coverage they provided of the estimated 800M pages on the
Web was 16% at best (Northern Light) and down to 5.6% (Excite),
with some not named here even lower.
There are a number of reasons, other than the rapidly expanding size
of the Web, for the lack of comprehensiveness of search engines based on
automated Web spiders. They include the following[15] (http://searchenginewatch.com/size.htm):
-
Some robots only index a sample of pages on a particular site (at least
at any one time), and do not continue indexing until they complete all
pages on a site in one session[16] (http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html).
-
Well-behaved robots[17] (http://info.webcrawler.com/mak/projects/robots/robots.html)
adhere to the robot exclusion standard[18] (http://info.webcrawler.com/mak/projects/robots/exclusion.html),
by which Web servers tell robots which pages they may not index on a site.
Because of the effects of some robots on server performance, and for other
reasons, many servers exclude robots. All major search engines observe
robot exclusions.
-
'Non-Web' databases are often accessible on the Web behind search forms,
but the underlying databases cannot be indexed by robots as they are not
in the form of HTML pages.
-
There are some technical problems with frames and with dynamically created
Web pages that mean that they cannot be included in Web spider indexing.
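The robot exclusion standard mentioned in the list above can be illustrated with a short Python sketch using the standard library's `urllib.robotparser` module. The robots.txt content, the robot name `ExampleLawSpider` and the URLs are assumptions made for the example, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt file, as a Web server might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(url, robot_name="ExampleLawSpider"):
    """Return True if the exclusion rules permit this robot to fetch the URL."""
    return parser.can_fetch(robot_name, url)


print(may_index("http://example.org/legislation/act1.html"))
print(may_index("http://example.org/cgi-bin/search?q=act"))
```

A well-behaved robot makes a check of this kind before every request, which is why material placed behind excluded paths never reaches the index.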
To make matters worse, the percentage of the available Web pages that the
search engines are indexing is declining, not increasing, due to the rapid
expansion of the Web from an estimated 320 M pages in December 1997 to
800 M pages in February 1999[19].
The decline in the best search engine was from about 33% to 16% of the
estimated total size of the Web, a dramatic drop. It seems that the Web
spiders of general purpose search engines simply cannot keep up with the
expansion of the Web. The technical problems caused by such expansion may
also be exacerbated by diminishing financial returns from the costs of
trying to keep up. Lawrence and Giles suggest:
Why do search engines index such a small fraction of the Web?
There may be a point beyond which it is not economical for them to improve
their coverage or timeliness. The engines may be limited by the scalability
of their indexing and retrieval technology, or by network bandwidth. Larger
indexes mean greater hardware and maintenance expenses, and more processing
time for at least some queries on retrieval. Therefore, larger indexes
cost more to create, and slow the response time on average. There are diminishing
returns to indexing all of the Web, because most queries made to the search
engines can be satisfied with a relatively small database. Search engines
might generate more revenue by putting resources into other areas (for
example, free e-mail).
The combined coverage of all of the search engines tested by Lawrence and
Giles was estimated at 42% of the total number of Web pages, a decline
from 60% in their 1997 study. They agree that use of meta-search engines
such as MetaCrawler will provide a more comprehensive search. However,
it seems from their research that the best possible coverage would still
only be about 42%, and even then this would depend on maximum coverage
and sorting efficiency of the meta-search engine, so the real figure is
likely to be somewhere between 16% and 42%.
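The percentages above are easier to appreciate as page counts. The following Python sketch simply applies the quoted percentages to the quoted estimates of the size of the Web; since all of the inputs are rounded estimates, the results are indicative only, and `pages_covered` is a hypothetical helper.

```python
# Figures quoted in the text from Lawrence and Giles (rounded estimates).
WEB_SIZE_1997 = 320_000_000   # estimated pages, December 1997
WEB_SIZE_1999 = 800_000_000   # estimated pages, February 1999


def pages_covered(web_size, coverage_percent):
    """Approximate number of pages indexed at a given percentage coverage."""
    return web_size * coverage_percent // 100


best_1997 = pages_covered(WEB_SIZE_1997, 33)      # best engine, 1997
best_1999 = pages_covered(WEB_SIZE_1999, 16)      # best engine, 1999
combined_1999 = pages_covered(WEB_SIZE_1999, 42)  # all engines combined

print(best_1997, best_1999, combined_1999)
print(WEB_SIZE_1999 - combined_1999)  # pages reached by no tested engine
```

On these figures the best-performing index at each date grew in absolute terms, from roughly 106 M to 128 M pages, while its share of the Web halved. Even a perfect meta-search over all of the engines tested would leave some 464 M pages unsearchable.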
There are other significant problems with general purpose search
engines which stem from the ambitious nature of their task of indexing
all information on the Web irrespective of its subject matter:
-
General search engines are often out of date. Lawrence and Giles
concluded from their tests that there was 'evidence that indexing of new
or modified pages can take several months or longer'. As a result, there
may be new pages added in that period that are not indexed and searchable,
even if the site is included in the particular search engine's coverage.
-
General search engines contain too much `noise'. It is difficult
to make searches precise enough to find only legal materials using general
search engines, because they index predominantly non-legal material. It
is usually necessary to try to impose some ad hoc search limitation (in
addition to the real search terms) such as `law or legislation or code
or court' or some such, to try to stem the flood of irrelevant information
(or more likely, to fool the relevance ranking into putting legally oriented
material first).
-
General search engines are difficult to search for law from particular
countries. It is difficult for most users to limit searches to materials
concerning laws of particular countries[20],
and failure to do so will usually result in the search being flooded with
material from North America and other `content rich' parts of the internet.
-
General search engines are biased in favour of 'popular' pages. The
way in which the Web spiders for general search engines work is that they
find many of the pages they index simply by following hypertext links from
other pages. The bias one would therefore expect, in favour of finding
pages which have numerous hypertext links to them, has been confirmed:
there is a direct relationship between the probability of a page being
indexed and the number of links to that page from other pages[21].
New types of search engines are exacerbating the problem as they take into
account in their measurement of relevance ranking the number of other pages
that link to the page concerned[22].
Lawrence and Giles conclude 'For ranking based on popularity, we can see
a trend where popular pages become more popular, while new, unlinked pages
have an increasingly difficult time becoming visible in search-engine listings.
This may delay or even prevent the widespread visibility of new high-quality
information.'
-
General search engines do not provide unbiased world-wide coverage.
Given the limited coverage of general search engines, their bias in favour
of materials which are hypertext linked from other pages, and the fact
that existing catalogs of legal materials on the internet are very heavily
biased in favour of materials from developed countries (particularly North
America), it seems very likely that general search engines are
biased in favour of materials in English, and materials from developed
countries.
-
Some general search engines are selling priority places in their relevance
ranking. This approach distorts the reliability of search engines,
as the 'top ranked' sites may not be those which contain the most relevant
materials.
The conclusion we may draw from all of these problems is that it would
be unrealistic to expect general purpose internet-wide search engines to
provide a very effective method for a task as specialised and 'non-popular'
as internet legal research.
2.4. Research problems with individual Web sites
Many significant law sites can't be searched. When you do find a
site containing valuable legal information it will often not have a search
engine at all, so searching at word level is not possible. Of the more
than 30 internet sites around the world containing significant quantities
of legislation, less than half have any search engine[23].
It requires considerably greater technical ability to run a search engine
than it does to simply put pages of legal material onto the internet where
they can be browsed.
Using different search engines can be confusing. Even if a law
site does have its own search engine, users who wish to find legal materials
on different sites can also be easily confused by the need to use different
search engines with different search commands.
2.5. An unsatisfactory result
So the problems of finding legal materials world-wide are that it is both
difficult to find which useful sites exist for a particular country or
subject, and also difficult to find what is on such sites as are known.
These research problems are very substantial even for the most expert `internet
savvy' lawyers and law librarians. They are much worse for inexperienced
users.
3. AustLII's new solution - World Law and DIAL
The challenge is to find a new approach to legal research on the internet
which will provide an internet-wide (which means world-wide) method of
effectively providing access to legal materials available on the Internet,
no matter where they are located.
The answer we propose is in part a technical solution, a limited area
search engine for law, but to a large extent the success of the technical
component will depend on an organisational element, the creation of a multi-national
group of collaborators who are willing to make joint use of the technical
tools we have developed in order to create and sustain a world-wide legal
research facility.
3.1. A new technical solution - A limited area
search engine for law
Our approach to reducing the problems of internet legal research rests
on these propositions:
-
A catalog created by intellectual indexing effort is essential to identify
high value law sites and legal resources, but cannot and should not aim
to be comprehensive in its depth of indexing particular sites once they
have been identified.
-
Web robot indexing of remote law sites, and a sufficiently powerful search
engine, are necessary to provide the depth of search capacity that intellectual
indexing cannot provide.
-
This is particularly so when many significant law sites do not have search
engines at all, and where there is no consistency among the search engines
used by those that do.
-
Searching robot indexed sites will work much better if (i) only law sites
are indexed (to remove non-legal `noise' and improve precision); and (ii)
such sites are indexed comprehensively (to improve recall). We call the
combination of such a Web spider (a `targeted' Web spider) and search engine
dedicated to legal materials only a 'limited area search engine'.
-
Significant law sites which normally exclude robots may allow a `targeted'
law-oriented Web spider to make them searchable, by request. The number
of requests may be manageable.
-
A comprehensive catalog is needed to identify the law sites worth indexing,
and therefore to `target' the robot. The intellectual index therefore serves
the double function of a useful resource in itself, and the essential means
of `feeding' the search engine.
-
A versatile method of effectively limiting the scope of searches to particular
types of subject matter (legislation, case law, laws from a particular
country, laws on a particular subject) is needed to improve the precision
of legal research, as the amount of legal materials over which searches
will be conducted is still very large.
-
Once a law-oriented Web spider has created a searchable index of key law
sites, specific searches over that index for various types of subject matter
can be `embedded' in the intellectual index, thereby making the intellectual
index `self-updating' to a certain extent, and so reducing its maintenance
costs. Such `embedded searches' also cater for inexperienced users who
have difficulty in formulating searches.
-
Maintenance of both the catalog and the validity of the web spider targeting
(both of which can degrade due to page movements) must be automated as
far as possible.
The key to effective legal research on the internet may therefore be a
tight integration of an intellectually created catalog and a search engine
based on a Web spider, a symbiotic relationship in which each builds on
the features provided by the other[24].
Intellectual indexing and automated indexing can feed off each other.
The following diagram explains this relationship, from the perspective
of the indexer and the user.
Interactions in the use of the World Law / DIAL limited area search
engine
3.2. AustLII's tools - Web indexing, Web spider
and search engine software
AustLII personnel[25]
have developed software and systems to implement this approach:
-
internet indexing software (`Feathers') which allows remote updating
by multiple contributors to an index, full search facilities over the index
entries, and a facility to `target' a robot to fully index specific sites
identified in the index[26];
-
a robot or `Web spider' (called `Gromit'), and a `harness' or means
of controlling it (called `Wallace'[27], http://www.wallaceandgromit.com/),
which only indexes those sites or parts of sites to which it is sent (by
the indexing software), and does not stray 'off site' to index unwanted
material[28];
-
a search engine (SINO), which has the full range of boolean and
proximity search commands, optional relevance ranking of search results,
and a facility for limiting the scope of searches to specific databases
or collections of databases[29].
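The essential behaviour of a targeted Web spider, following links only within the sites or parts of sites to which it has been sent, can be sketched in a few lines of Python. This is an illustration of the idea only and is not the Gromit or Wallace code: the function names, the target prefix and the links are all hypothetical.

```python
from urllib.parse import urljoin, urlparse


def in_scope(url, targets):
    """Return True if url lies under one of the targeted sites or site parts."""
    parsed = urlparse(url)
    for target in targets:
        t = urlparse(target)
        same_host = parsed.netloc.lower() == t.netloc.lower()
        if same_host and parsed.path.startswith(t.path):
            return True
    return False


def links_to_follow(page_url, hrefs, targets):
    """Resolve the links found on a page and keep only those that stay in scope."""
    absolute = (urljoin(page_url, href) for href in hrefs)
    return [url for url in absolute if in_scope(url, targets)]


# Hypothetical target, as a catalog editor might have marked it.
targets = ["http://laws.example.gov/legislation/"]
hrefs = [
    "act2.html",                  # relative link, stays in the targeted part
    "/legislation/regs/r1.html",  # deeper in the targeted part
    "/tourism/",                  # same server, but outside the target
    "http://other.example.com/",  # off site
]
print(links_to_follow("http://laws.example.gov/legislation/index.html", hrefs, targets))
```

Only the first two links survive the filter, so the spider indexes the legislation pages in depth without wandering into unrelated material on the same server or elsewhere.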
3.3. International indexing partnerships - a necessary
element
The approach we are taking requires the development and maintenance of
a world-wide catalog created by intellectual indexing effort, which at
least catalogs significant national law sites in all countries world-wide
and the major subject-oriented resources. A free access law facility such
as AustLII, even if it does obtain significant funding support for the
task, is unlikely to be able to support more than a couple of legal indexing
staff to undertake the task of adding content. The human languages that
such staff are conversant with will be necessarily limited. While it is
possible to provide a basic level of world-wide coverage with such resources,
as can already be seen in the World Law / DIAL 'Countries' page[30],
this will not be fully satisfactory unless regional, national or subject
experts are also involved in the indexing process.
Consequently, the way in which we are now envisaging the World
Law / DIAL project is that, while most of the organisation of the content
indexing and much of the substantive indexing will be carried out by AustLII
staff, wherever possible we will invite appropriate authorities from other
countries, and other language and subject specialities, to participate
in the cataloging process as 'indexing partners' for particular parts of
the catalog.
There are a number of ways in which such partnerships can proceed,
ranging from a partner periodically emailing lists of proposed URLs to
be added to our pages to AustLII's indexers (who would add them to the
relevant pages and send the Web spider); to our indexers checking a partner
site regularly for any additional links which should be added to our pages;
to a remotely located indexer being provided with password access to the
editing interface to World Law such that they can add links and change
sub-categories in 'their' parts of the index but not elsewhere. The editing
interface of World Law is via the Web, so contributing editors can be located
anywhere with Web access.
Wherever collaboration occurs, it will be acknowledged on those pages
of the catalog where the collaboration is based, by inclusion under the
'Contributor' heading of the logo (or name) of the contributing organisation
or person and a link to their Web site. For example, Australia's Department
of Foreign Affairs and Trade contributes content to the creation of our
'Treaties and International Agreements' pages, which is acknowledged as
shown below.
Some of the advantages of this approach for our indexing partners
are that they will obtain additional users through referrals from the World
Law pages, and if they wish they will also be able to place CGI search
interfaces on their own pages which will search over the full texts indexed
by our Web spider, but only over the part of AustLII's index relevant to
them.
We are only now starting this process of creating partnerships, so it
is too early to report on its success. Our first international partnership
is being established between Ralph Amissah's Lex Mercatoria site
in Norway[31]
and World Law's International Trade pages[32].
Other partnerships are being negotiated.
3.4. A brief history of World Law and Project DIAL
AustLII had been developing a catalogue of Australian law sites since July
1995, with some international indexing. The index was maintained as hand-created
HTML pages until 1997, at which point it was moved into a mSQL database,
and it became possible to search for individual catalog entries. At this
point the appearance and functionality was similar to Yahoo!'s approach.
In early 1997 Web spider software was customised, and the internet indexing
software rewritten so that it could be used to 'target' the Web spider.
The first opportunity for extensive testing of the targeted Web spider
was in Project DIAL[33] (http://www.austlii.edu.au/au/dial/),
a project for the Asian Development Bank which aims to increase the accessibility
of legislation on the internet. The Web-spider search facility, DIAL Search[34] (http://www.austlii.edu.au/au/special/dial/DIALsearch.html),
was released on the Web in August 1997 and allowed searches of legislation
and legislation-related material from many countries. The World Law / DIAL
facilities were demonstrated at the Australian Society of Indexers' Conference
in 1997[35].
The more general version of the targeted Web spider was made available
from AustLII's home page as `World Law Search' in February 1998, when
non-legislative material started to be added by the Web-spider. Two test
'Libraries' were added to allow searches to be restricted to legislative
materials and to other internet law indexes - the precursor of the search
limitation facilities which have now been added.
In June 1999 the Asian Development Bank entered into a three year
agreement for AustLII to develop and host the full-scale Project DIAL facility,
as a Regional Technical Assistance (RETA) of the Bank. The value of the
RETA is approximately A$1 M, with nearly a half of those funds being used
to assist and train users in the Bank's Developing Member Countries (DMCs).
During 1999 there has been a major redevelopment of the search
facilities in World Law / DIAL, so that the scope of searches can be limited
by the location in the catalog from where the user does the search. This
enables users to limit their search scope (thus increasing precision) without
having to understand any search commands in order to do so. It is the major
single technical innovation in World Law / DIAL.
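The effect of limiting search scope by catalog location can be sketched as follows. This Python fragment is a simplified illustration under our own assumptions, not a description of how SINO or the World Law software is implemented: the documents, the catalog paths and the function `search_in_scope` are hypothetical, and a simple substring test stands in for a real search.

```python
def search_in_scope(documents, catalog_location, term):
    """Return URLs of documents filed at or below catalog_location that contain term.

    documents maps a URL to a (catalog_path, text) pair, where catalog_path
    is a tuple such as ("World", "Countries", "Vietnam", "Legislation").
    """
    location = tuple(catalog_location)
    depth = len(location)
    term = term.lower()
    return sorted(
        url
        for url, (path, text) in documents.items()
        if path[:depth] == location and term in text.lower()
    )


# Hypothetical documents gathered by a Web spider.
documents = {
    "http://example.vn/law1": (
        ("World", "Countries", "Vietnam", "Legislation"),
        "Law on Foreign Investment",
    ),
    "http://example.mn/law9": (
        ("World", "Countries", "Mongolia", "Legislation"),
        "Foreign Investment Law",
    ),
}

# A search launched from the Vietnam page is limited to Vietnamese material.
print(search_in_scope(documents, ("World", "Countries", "Vietnam"), "investment"))
# The same search launched from the World page covers both countries.
print(search_in_scope(documents, ("World",), "investment"))
```

The user types the same word in both cases; only the page from which the search is launched differs, which is what removes the need to learn any scope-limiting commands.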
3.5. Other limited area search engines
AustLII's World Law Search is one of the first and one of the few examples
of what we call `targeted Web spiders' and others have called `Limited
Area Search Engines'. There are a couple of examples of this technology
being used in the field of law.
JURIST[36]
launched in January 1998 what it calls a `Limited Area Search Engine (LASE)'
to make searchable all home pages of law Professors and course pages of
law subjects in its index.
JURIST Search page - http://jurist.law.pitt.edu/search.htm
JURIST uses the Argos LASE which, according to its developers, was `the
first peer-reviewed, limited area search engine (LASE) on the World-Wide
Web'[37] (http://argos.evansville.edu/about.htm)
at the time of its release in October 1996. Argos was developed to provide
a more precise means of searching for scholarly literature on the ancient
and medieval world. Its development was prompted by the same shortcomings
with Internet-wide search engines as we have identified in the preceding
discussion[38] (http://argos.evansville.edu/about.htm):
At the time of this writing, a search for "Plato" on the Internet
search engine, Infoseek, returned 1,506 responses. Of the first ten of
these, only five had anything to do with the Plato that lived in ancient
Greece, and one of these was a popular piece on the lost city of Atlantis.
... Add to this broad range of responses the fact that Infoseek returns
ten entries per page, making it necessary to examine one hundred and fifty
one pages of entries, many of which are irrelevant to a scholarly search
of "Plato," and the result is a process that is frustrating and inefficient.
They identify advantages of the alternative, `targeted' approach:
By limiting the range of the search engine, a LASE strips out
many unwanted references... The result is a higher quality index built
for a specific purpose and for a smaller audience. Furthermore, the quality
of the index, its purpose and the level of specialization expected of its
intended audience are variables that can be manipulated with LASE
technology.
They also note that, because of its limited scope, it is possible for Argos
to update weekly the indexes to all the sites it covers, rather than the
couple of months that (in their experience) are taken by Internet-wide
search engines to update sites. They estimate that this means that 98%
of all links found by their search engine work at any given time.
Another example is Legal Search New Zealand[39] (http://202.37.88.18/search/legsearch.html),
released in December 1997 by Knowledge Basket, the New Zealand Internet
content provider. It uses the Verity Web robot and Topic search engine,
and makes searchable 25 New Zealand law sites at present. The advantages
they claim for this approach are similar to those described above[40] (http://www.knowledge-basket.co.nz/kete/nzsearch.html).
4. Catalog structure of World Law/DIAL
World Law Index, the intellectual index or catalog aspect of World Law,
has a conventional `Yahoo-like' interface. The 'World' root page of the
index is shown below, and all other sub-categories are located in a hierarchy
under 'World'.
4.1. Content in the catalog
Approximately 4,000 law sites are indexed in the catalog at present, with
most sites indexed under a number of sub-categories.
Some features of the catalog structure are:
-
There is a page for every country in the world[41],
which at the very least will contain an embedded search (see later) which
will find at least 190 legal documents referring to even the smallest country
(for example, 193 for the Faroe Islands; 239 for Niue; 385 for the Cayman
Islands; 771 for Kyrgyzstan; 907 for Burkina Faso).
-
The hierarchical structure of the catalog is organised to a large extent
around indexing by country, with each 'country page' having a fairly standard
structure of sub-categories ('Legislation', 'Courts', 'Education', 'Other
Indexes' etc). Other cross-indexing from the 'World' page (for example,
on the 'World >> Legislation' page) is essentially by cross-references
to these sub-categories of the country pages. Such cross-references are
marked by the '@' symbol following a sub-category.
-
There is legislation from more than 60 countries accessible directly from
the 'World >> Legislation' page, as illustrated in the extract below. Each
link goes directly to the 'Legislation' sub-node of that country's page.
A similar cross indexing structure is provided for Courts and Case Law,
Education, Law Reform etc.
-
Most sites and parts of sites indexed in the catalog are subject-indexed
(where relevant) as well as indexed by type or source of content. The 'World
>> Subject Index' page indexes sites by over 40 subject categories, and
the 'Australia >> Subject Index' (which has been under development longer)
has over 70 categories.
Extract of categories from the 'Australia >> Subject Index' page
4.2. Navigating to other places in the catalog
(the hierarchy)
Every catalog page lists its hierarchical location in the catalog. Click
on any point in the hierarchy to go back to that catalog page.
You can always get back to the start of the catalog by clicking
on '>>
World >>' (or on 'Australia' in the Australian part of the
catalog). If you are in the 'World' part of the catalog and you want to
get to a particular country page (eg Vietnam), click on '>> World
>>' then '>> Countries >>' and then select 'Vietnam@'.
4.3. The `New Additions' page
Users can see at any time what content has been added recently to World
Law Index (and to World Law Search) by checking the `New Additions' page
from the World Law home page or from the [New] button on the button bars
in the system.
Extract from the `New Additions' page - http://www.austlii.edu.au/links/new.html
4.4. Targeting the Web spider from the catalog
Editing entries in the catalog also involves the editor deciding whether
to send the Gromit Web spider to index every word on the site which has
been `targeted'. The harness program (Wallace) reads the list of instructions
from the Web indexing software, and then sends off multiple instances of
the Web spider program, each to download the content of a particular Web
site. The harness program ensures that only one instance of the Web spider
software is ever downloading from a particular site, to avoid saturating
that site with spider requests and denying access to other users. The harness
ensures that the Web spider is 'well behaved', causing minimum impact on
the sites from which it downloads Web pages. We call this a targeted Web
spider, as it is not designed to traverse the Web generally: its downloading
is limited to the site specified in the original URL which 'targets' it.
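The 'one spider per site' rule described above can be illustrated with a short sketch. This is not the Wallace code: the function name `schedule`, the idea of grouping work into 'rounds', and the example URLs are all assumptions made for illustration only.

```python
from collections import deque
from urllib.parse import urlparse


def schedule(targets, max_workers):
    """Group target URLs into rounds so that no round contains two URLs
    from the same host and no round exceeds max_workers spiders."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    pending = deque(targets)
    rounds = []
    while pending:
        busy_hosts = set()      # hosts already being downloaded in this round
        this_round = []
        deferred = deque()      # URLs that must wait for a later round
        while pending:
            url = pending.popleft()
            host = urlparse(url).netloc.lower()
            if host in busy_hosts or len(this_round) >= max_workers:
                deferred.append(url)
            else:
                busy_hosts.add(host)
                this_round.append(url)
        rounds.append(this_round)
        pending = deferred
    return rounds


# Hypothetical targets: two start pages share the host a.example.
targets = [
    "http://a.example/laws/",
    "http://a.example/cases/",
    "http://b.example/",
    "http://c.example/",
]
rounds = schedule(targets, max_workers=2)
```

With these inputs the two `a.example` start pages fall into different rounds, so that site never receives requests from more than one spider at a time.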
Targeting the Web spider is a complex task. The spider indexes every page
to which its start page is directly or indirectly linked, provided that
page is at or below the start page in the server's file hierarchy, so the
start page must be chosen so that the spider indexes all and only the
desired pages. Some desired sets of data cannot be indexed because
of the `noise' they will bring with them. For others, it is impossible
to find an appropriately located `table of contents' page to use as the
`start page'. Other `problems of targeting' have also been identified[42] (http://www2.austlii.edu.au/~graham/Futureproof/indexers-6.html#Heading22).
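The scope rule for a targeted spider (follow links only to pages at or below the start page in the server's file hierarchy) might be sketched as follows. The function names, the in-memory link table and the URLs are hypothetical; the real spider downloads pages rather than reading a dictionary.

```python
import posixpath
from urllib.parse import urlparse


def in_scope(start_url, candidate_url):
    """True if the candidate is on the same host as the start page and
    at or below the start page's directory in the path hierarchy."""
    start, cand = urlparse(start_url), urlparse(candidate_url)
    if start.netloc.lower() != cand.netloc.lower():
        return False
    if start.path.endswith("/"):
        base = start.path
    else:
        # Directory containing the start page, normalised to end in "/".
        base = posixpath.dirname(start.path).rstrip("/") + "/"
    return (cand.path or "/").startswith(base)


def crawl(start_url, links):
    """Breadth-first walk of a link table, keeping only in-scope pages."""
    seen, queue = {start_url}, [start_url]
    while queue:
        page = queue.pop(0)
        for target in links.get(page, []):
            if target not in seen and in_scope(start_url, target):
                seen.add(target)
                queue.append(target)
    return sorted(seen)


# Hypothetical site: the start page links to an Act, a news page and another host.
links = {
    "http://gov.example/laws/index.html": [
        "http://gov.example/laws/act1.html",
        "http://gov.example/news/today.html",
        "http://other.example/laws/x.html",
    ],
    "http://gov.example/laws/act1.html": ["http://gov.example/laws/regs/r1.html"],
}
pages = crawl("http://gov.example/laws/index.html", links)
```

Here the news page and the page on the other host are never followed. Had the start page been `http://gov.example/index.html` instead, the news page would have been indexed as well, which is the `noise' problem described above.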
4.5. Maintenance of the catalog and the Web-spider
targeting
Catalog links (URLs) go out of date as pages are moved on remote sites,
or the sites cease to exist. If the URL is also the 'target' for the Web
spider to start its work, the spider will be unable to do so next time
it returns to the site to update its download of pages. To help the indexers
maintain the catalog, we have a program called Comet which checks all URLs
in the catalog every day and sends a report to the indexers indicating
which links appear to be broken.
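A daily link check of this kind could look roughly like the sketch below. It is not Comet itself: the function names and the rule that any status of 400 or above counts as 'apparently broken' are assumptions, and the network call is replaced by a stand-in so the example runs without Internet access.

```python
def find_broken(urls, fetch_status):
    """Return (url, reason) pairs for links that appear to be broken.

    fetch_status is any callable that returns an HTTP status code for a
    URL, or raises OSError when the host cannot be reached.
    """
    broken = []
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError as err:
            broken.append((url, f"unreachable: {err}"))
            continue
        if status >= 400:   # assumption: client and server errors both count
            broken.append((url, f"HTTP {status}"))
    return broken


def fake_fetch(url):
    """Stand-in for a real HTTP request, using invented URLs."""
    table = {
        "http://ok.example/": 200,
        "http://moved.example/old.html": 404,
    }
    if url not in table:
        raise OSError("no such host")
    return table[url]


report = find_broken(
    ["http://ok.example/", "http://moved.example/old.html", "http://gone.example/"],
    fake_fetch,
)
```

The report lists the moved page and the vanished site but not the working link. The output is a list of links that 'appear' broken, since a site may simply be down temporarily, so a human indexer still has to confirm each one.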
Another maintenance problem is that the Web spider is sometimes
sent to a URL which does exist, but from which it cannot, for some reason,
download the intended pages. Conversely, sometimes an editor
does not notice that starting the Web spider from a particular page
will download far more pages than are intended. To assist in addressing
these problems, the program that controls the sending of Web spiders ('Wallace')
sends a report to the indexers if the Web spider downloads only a couple
of pages from a URL, or downloads more pages than a pre-set limit (currently
1,000).
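The two warning conditions can be expressed in a few lines. In this sketch the upper limit of 1,000 comes from the text, while the lower threshold (fewer than three pages standing for 'only a couple') and the message wording are assumptions.

```python
MIN_PAGES = 3       # assumption: "only a couple of pages" means fewer than 3
MAX_PAGES = 1000    # the pre-set limit stated in the text


def download_warnings(page_counts, min_pages=MIN_PAGES, max_pages=MAX_PAGES):
    """Return one warning line for each target whose page count is suspicious."""
    warnings = []
    for url, count in sorted(page_counts.items()):
        if count < min_pages:
            warnings.append(f"{url}: only {count} page(s) downloaded")
        elif count > max_pages:
            warnings.append(f"{url}: {count} pages exceeds limit of {max_pages}")
    return warnings


# Hypothetical results from one spider run.
counts = {
    "http://a.example/laws/": 412,
    "http://b.example/index.html": 1,
    "http://c.example/": 5230,
}
warnings = download_warnings(counts)
```

Only the second and third targets are reported: the first is within both limits.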
5. Search facilities of World Law/DIAL
The search facilities in World Law/DIAL search over two types of content:
(i) the contents of the catalog; and (ii) the full text of all the remote
pages indexed by the targeted Web spider.
There is now one interface to both the catalog and the search engine,
with a search form located at the top of each catalog page (illustrated
below). The very unusual feature of the search facilities in World Law
/ DIAL is that the scope of the search is (in default) limited by the location
in the catalog from which the user carries out the search. In other words,
if the search is from the 'World >> Legislation' page, it will search over
all legislation from any country (but only legislation); if it is from
the 'World >> Countries >> Germany' page it will search only over sites
related to German law; and if it is from the 'World >> Countries >> Germany
>> Legislation' page it will search only German legislation.
Search results are displayed with catalog pages ('categories') listed
first, and then with the full texts of remote documents ('documents') listed.
Both lists are sorted into likely order of relevance to the search query
(relevance ranking).
5.1. Contents of the search facility
About 10 GB of text from targeted sites has been indexed to date, providing
searches over about 1M pages of legal information. This is a search space
about 50% larger than AustLII's Australian databases (approximately 6 GB).
To date, the countries from which the largest components come are the .us
domain of the USA (over 2 GB - mainly State legislation), non-AustLII Australian
sites (over 1.5 GB), the .edu and .org domains and Canada (each about 1
GB), followed by another 3 GB or so drawn from 57 other countries. The
emphasis is on legislation, law reform reports and law journals to date
(because of Project DIAL), but major components of case law, law school
sites and the like are being added.
5.2. Interface and search scope
The scope of a search is limited by where you search from (the context).
As in the example below, if a user is at the 'World >> Law Reform' page,
the default search scope will be 'Law Reform'. A search from this point
will search over all Web sites listed on the Law Reform page, or those
on any page which are sub-categories or cross-references from the Law Reform
page.
To put it very simply, the way in which the search restriction
mechanism works is that the search is first done over all pages retrieved
by the Web spider, but before the results are displayed to the user they
are 'filtered' by the URLs listed on the page indexed and its related pages,
so that only relevant pages are displayed.
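A much simplified sketch of this 'search everything, then filter by catalog location' approach follows. The catalog layout (a dictionary of pages with `sites` and `children`), the function names and the URLs are all hypothetical, and prefix matching on URLs is an assumption about how the filtering could be done.

```python
def scope_urls(catalog, page):
    """Collect the site URLs listed on a catalog page and on every page
    reachable from it as a sub-category or cross-reference."""
    urls, stack, seen = [], [page], set()
    while stack:
        name = stack.pop()
        if name in seen:        # cross-references may create loops
            continue
        seen.add(name)
        entry = catalog[name]
        urls.extend(entry.get("sites", []))
        stack.extend(entry.get("children", []))
    return urls


def filter_by_scope(hits, urls):
    """Keep only hits whose URL starts with one of the scope URLs."""
    prefixes = tuple(urls)
    return [hit for hit in hits if hit["url"].startswith(prefixes)]


catalog = {
    "World >> Law Reform": {
        "sites": [],
        "children": ["Australia >> Law Reform", "Canada >> Law Reform"],
    },
    "Australia >> Law Reform": {"sites": ["http://alrc.example/"]},
    "Canada >> Law Reform": {"sites": ["http://lrc.example/reports/"]},
}

# Results of a search over everything the spider has retrieved.
hits = [
    {"url": "http://alrc.example/report1.html", "score": 0.9},
    {"url": "http://courts.example/case1.html", "score": 0.8},
    {"url": "http://lrc.example/reports/r2.html", "score": 0.5},
]
visible = filter_by_scope(hits, scope_urls(catalog, "World >> Law Reform"))
```

The court decision is found by the underlying search but is filtered out before display, because no page at or under 'World >> Law Reform' lists its site. Filtering after the full search would also be consistent with the observation that unrestricted searches are currently the fastest.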
To broaden or narrow your search scope, you go to a more appropriate
page in the catalog. Context determines search scope.
To search over everything available, the user could go back to
the 'World' page (by clicking on 'World') and search from there. Alternatively,
no matter where you are in the catalog, you can search everything ('World')
by selecting 'All World Law' (from the 'in' option) instead of the default
option limiting the search scope, as shown above. At present, the
'All World Law' option is by far the fastest way to search, but the more
restricted search scopes may give greater precision. Until the restricted-scope
searches become faster, users should test which approach works best for them.
5.3. Search options
World Law uses AustLII's SINO search engine, so all of the search facilities
available for searching over AustLII's Australian databases can be used,
although they are presented through new interface options. The following
options are available:
The default option is 'any of these words'. If you want
to do any other type of search, you must change this default.
-
any of these words - equivalent to a Boolean OR between each separate
word. This is the 'simple search' option, but often as effective as any
other. Normally, users will simply enter a few words to indicate the main
concepts for which they are looking (eg 'pollution river' or 'program patent').
Users can also enter text in any form (eg `I want laws on tax and
bankruptcy' or `tax taxation bankrupt bankruptcy') without knowing about
search connectors, truncation etc. Results are ranked according to likely
relevance, so the user will (usually) obtain a useful list of search results,
although often not as complete as a Boolean search can provide. This is
the search method recommended for inexperienced users, and is the default
option. (This is AustLII's Freeform (Ranked Results) search method.)
-
all of these words - equivalent to a Boolean AND between each separate
word.
-
this phrase - the words entered will be treated as a literal phrase
even if they contain terms which would normally otherwise be treated as
Boolean connectors (such as 'and', 'or' , 'near'). There is no need to
put inverted commas around the phrase.
-
this document title - only the titles of Web pages are searched,
not the text.
-
this Boolean query - any Boolean search may be entered, using AustLII's
logical and proximity operators. This is the most powerful form of searching,
because the results of the Boolean search are then also ranked in likely
order of relevance. This allows reasonably broad searches (to aid
completeness) with the relevance ranking then providing more precision.
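The relationship between the interface options and Boolean queries listed above can be illustrated by a small translation function. This is only a sketch: the function name is hypothetical, the use of double quotation marks to mark a literal phrase is an assumption about query syntax, and the 'this document title' option is left out because it restricts where the search looks rather than how the terms are combined.

```python
def to_boolean(option, text):
    """Rewrite the text typed by the user as an equivalent Boolean query."""
    words = text.split()
    if option == "any of these words":
        return " or ".join(words)           # OR between each separate word
    if option == "all of these words":
        return " and ".join(words)          # AND between each separate word
    if option == "this phrase":
        # Assumption: quoting stops 'and', 'or', 'near' acting as connectors.
        return '"' + " ".join(words) + '"'
    if option == "this Boolean query":
        return text                         # already a Boolean query
    raise ValueError(f"option not handled in this sketch: {option}")


any_query = to_boolean("any of these words", "pollution river")
all_query = to_boolean("all of these words", "program patent")
phrase_query = to_boolean("this phrase", "passing off and fraud")
```

The first two calls produce `pollution or river` and `program and patent`. The third keeps 'and' inside the quoted phrase instead of treating it as a connector.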
5.4. 'Search this site'
Where it has been possible to send the World Law / DIAL Web spider to a
site, a 'search site' icon appears next to the listing
in the catalog. If you click on the words 'search site' or the
icon, then you are taken to a 'Search Site' page which automatically limits
the scope of the search to the one site selected. This function allows
World Law Search to be used to search specific sites which have no search
engine of their own, or have a search engine which does not have the same
features as the SINO search engine used for DIAL Search. It also overcomes
the need for users to learn to use the features of multiple search engines.
In the following, two of the sites may be searched, but the other
two may not:
When a site is selected, a 'Search selected site' page is presented
and searches from that page are limited to the site selected:
5.5. Display of search results
Results are displayed as shown below. Items are ranked in order of likely
relevance. Catalog categories are listed first ('World Law Categories');
the search covers the full text of the catalog pages, not just the titles
of the pages.
The following example shows the results of a search for the phrase 'financial
intelligence' over the whole of World Law.
The percentages given against each item found are based on the
most relevant item being given a score of 100% (provided it contains all
search terms - otherwise less), and all subsequent items being given a
percentage ranking proportional to that, according to their likely relevance.
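The proportional scoring can be illustrated as follows. The text does not say how much 'less' than 100% the top item receives when it lacks some search terms, so this sketch assumes the ceiling falls in proportion to the fraction of terms matched. That rule, the function name and the raw scores are invented for illustration and are not taken from SINO.

```python
def rank_percentages(items, n_terms):
    """items: list of (title, raw_score, terms_matched) tuples.
    Returns (title, percent) pairs, most relevant first."""
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    _, top_score, top_matched = ranked[0]
    if top_score <= 0:
        return [(title, 0) for title, _, _ in ranked]
    # Assumption: the top item gets 100% only if it contains every term;
    # otherwise the ceiling is scaled by the fraction of terms it contains.
    ceiling = 100.0 * top_matched / n_terms
    return [(title, round(ceiling * score / top_score)) for title, score, _ in ranked]


# Hypothetical raw scores for a two-term search.
full = rank_percentages([("B", 2.0, 2), ("A", 4.0, 2), ("C", 1.0, 1)], n_terms=2)
partial = rank_percentages([("C", 1.0, 1)], n_terms=2)
```

In the first call the best item contains both terms and scores 100%, with the others at 50% and 25%. In the second the only item contains one of the two terms, so under the assumed rule it scores 50%.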
The search form containing the executed search is displayed at
the top of each search results page, so any search can be modified easily
in light of the results obtained.
The present method of displaying results relies upon the remote
site attaching informative titles to its HTML pages, as it is these titles
that are displayed. While most sites do achieve this to a reasonable degree
(as can be seen from the above example), some sites fail to provide any
title at all, so that only the URL of the site can be displayed by default.
This is not informative enough. A number of alternatives are under consideration.
One is to display the first 50 words or so of the document, as Alta Vista
does. A second is to include the name of the site
from which each document comes (as described in the catalog), or perhaps
the name of the catalog page, after the display of the title. This would
at least make it easier to recognise which countries and systems particular
items are from. These choices have significant processing overheads, and
we are concerned not to unduly slow the delivery of search results.
6. Storing searches to create a self-maintaining
index
One of our main tactics in creating a sustainable World Law catalog is
the use of `stored search links' (or 'stored searches') in the catalog.
For example, on the Intellectual Property Subject Index page (below)
there are various stored searches for different aspects of intellectual
property. The hypertext links that appear on the page are each a search
of World Law that a legal indexer has created in order to find a general
set of documents which relate to the topics of the searches. For example,
the link entitled 'World Law search: trade marks and related laws' is in
fact a stored boolean search for 'trade mark or trademark or unfair competition
or passing off'. These examples are relatively simple searches, but searches
of any complexity can be stored.
http://beta.austlii.edu.au/links/World/Subject_Index/Intellectual_Property/Stored_Searches/index.html
The significance of these `stored searches' of DIAL Search in the DIAL
Index is fourfold:
-
Stored searches are by experts. The legal indexers responsible for
developing and maintaining the catalog have expertise in search techniques,
and know what types of searches are most effective over World Law. By creating
stored searches at sensible locations in the index they make this expertise
available to users of the index who are unlikely to have the same level
of search expertise. For example, the search for `patent law' above is
actually a search for "patent* or brevet* or octrooi*", utilising both
the French and Dutch terms, and truncation, to search effectively for the
relevant terms in Spanish, Portuguese, German and Italian as well.
-
Another example is the above link for 'World Law Search: integrated circuits
protection (semiconductor chip or circuit layout)', which is in fact the
stored search listed below. Most users would not be aware that 'integrated
circuit', 'semiconductor chip' and 'circuit layout' are all expressions
used in legislation to refer to the same type of legal protection, and
yet all three terms are used in the six most relevant documents found.
An unassisted user would be unlikely to find the 620 documents relating
to this topic. A user who only searches for 'integrated circuit' will miss
over 200 documents.
-
A stored search rarely needs to be updated. An expert who creates
a stored search only has to do so once. When more data is added to the
World Law by the Web spider, the expert does not have to change the stored
search, because it will now find relevant new legislation and other materials
as well as the old materials (assuming the search was well-constructed
in the first place). The Intellectual Property page will to some extent
update itself without editors adding new links to it. In contrast, sets
of ordinary hypertext links to legislation must be updated constantly when
legislation changes, or other new material is available. Stored searches
can to some extent create a `self-maintaining catalog', partly overcoming
the impossibility of detailed subject indexing of world-wide legal information.
-
Users can leverage more precise searches off general stored searches.
The results display of a stored search allows the search to be modified
by the user. For example, the following search form appears on the search
results page for the integrated circuits stored search above. A user who
wishes to search for the relationship between copyright law and integrated
circuits law need only add 'copyright near ...' to the previous (stored)
search, and press 'Refine Search', to carry out this more precise and expert
search. In this way, World Law users can 'leverage off' the expertise of
the legal indexers.
-
Poor searches can find better searches. Inclusion of stored search
links means that, since the catalog is searchable, a search can find other
searches. For example, if a user does a World Law Search for `breach of
confidence', only one category is found, the Intellectual Property Subject
Index page above, which contains the stored search `DIAL Search for trade
secrets or confidential information'. However, if that stored search is
then selected, over 150 items are found and ranked, because the stored
search was for "trade secret or segredo comercial or breach of confidence
or confidential information". The user may not have been aware that in
most commercial contexts, `trade secret' is conceptually the same as
`breach of confidence', but the invitation to repeat a search, and the
stored search links, will assist the user to `find' this information.
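The 'self-maintaining' behaviour of a stored search can be demonstrated with a toy matcher. The sketch below handles only the `or` connector and trailing `*` truncation seen in the examples above, and uses a regular expression in place of the real search engine. The function names and the sample documents are invented for illustration.

```python
import re


def compile_stored_search(query):
    """Turn a query such as 'patent* or brevet*' into a regular expression.
    Only ' or ' and trailing '*' truncation are supported in this sketch."""
    alternatives = []
    for term in query.lower().split(" or "):
        term = term.strip()
        truncated = term.endswith("*")
        words = term.rstrip("*").split()
        # Multi-word terms such as 'passing off' match as a phrase.
        pattern = r"\s+".join(re.escape(word) for word in words)
        if truncated:
            pattern += r"\w*"
        alternatives.append(pattern)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def run(stored_query, documents):
    """Return the titles of documents whose text matches the stored search."""
    matcher = compile_stored_search(stored_query)
    return [title for title, text in documents if matcher.search(text)]


STORED = "patent* or brevet* or octrooi*"   # the patent search quoted above
documents = [
    ("English statute", "An Act relating to patents"),
    ("French statute", "Loi sur les brevets d'invention"),
    ("Tax statute", "An Act relating to income tax"),
]
before = run(STORED, documents)

# Later the spider adds a new document; the stored search itself is unchanged.
documents.append(("Dutch statute", "Regels met betrekking tot octrooien"))
after = run(STORED, documents)
```

The second run finds the newly added Dutch document without anyone editing the catalog page, which is the point made above about stored searches rarely needing to be updated.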
6.1. Finding law about a country
One particularly effective use of stored searches in World Law Search is
to use them to enable users to find starting points for research concerning
the laws of a named country. In the example below, the stored search is
for 'fiji* or fidji* or fiyi* or fidschi* or figi*' , so as to find entries
in most common European languages. The titles of the first few results
show the effectiveness of the multi-lingual search. The total of 1366 items
shows how much is available even for a small country like Fiji.
Because the relevance ranking tends to favour short documents and
documents that use a search term in a title, many of the internet law indexes
that have a separate page for that country will appear near the top of
the list, so the user can quickly review existing intellectual law
indexes for that country. Here, the first 15 items found are a mixture
of the 'Fiji' pages in other internet law indexes (JurWeb, ICL and ILRG),
documents about human rights compliance, and Asian Development Bank law
and development project notices.
6.2. Future development of stored searches
The Intellectual Property stored searches are an example of how the current
'World >> Subject Index' pages will be developed so that each subject category
has a basic set of stored searches that will keep that part of the subject
index reasonably current. Resources will be available for more intensive
intellectual indexing of some subject index pages, but this will not always
be possible, so the use of stored searches will allow a moderate cost 'across
the board' extension of the subject index.
In the longer term, work will be done on the use of legal thesauri
to create large scale sets of stored searches and their distribution through
the catalog. Good thesauri are difficult to find on the Web.
7. Multi-lingual law indexing and searching of
the Web
The resources available on the Web for legal research are biased in favour
of English, but there is a very large quantity of non-English language
materials if only they can be located. Apart from the inherent value, the
availability on the Web of the laws of a person's own country is more likely
than most other factors to encourage that person to undertake legal research
using other countries' laws on the Web.
The key to the development of a genuinely world-wide free access
catalog and search facility for law is the formation of 'indexing
partnerships' with legal institutions with expertise in other key languages
for materials on the Web (Chinese, French, Spanish, German, Russian). Development
of the technical capacity to handle different character sets is also required.
As the range of non-English materials searchable in World Law
increases, it is likely to become valuable to be able to limit searches
to materials in a particular language. This will probably be implemented
by the indexer indicating the language of non-English materials at the
time of adding them to World Law Search, with an option for users to exclude
or include materials in particular languages.
7.1. Automated translation of pages - useful but
limited
All pages in World Law have a [Translate] button that takes the user to
Alta Vista's automated translation service, which is provided by Systran
translation software. The button ensures that the Systran page has inserted
in it the correct URL for the DIAL page that the user was just viewing (in
the example below, the World page). The user then only has to select the
language into which the DIAL page is to be translated and press the
`Translate' button, and is then returned to the DIAL page translated into
the language of choice.
The resulting translation seems adequate to convey the meaning
of most of the items on the page.
The World page and search options translated automatically to
French by Systran
The Alta Vista/Systran translation facility is at present limited to
translations from English to French, Spanish, Portuguese, Italian, or German,
and vice-versa. This translation facility is also only a prototype, and
sometimes has inadequate processing power to translate very long pages.
It is also not recommended to use it to translate documents with complex
grammar, or where accuracy is vital (such as legislation). However, for
pages such as menus, or lists of search results, it is usually extremely
helpful.
The translation facility has uses beyond translations of catalog pages:
-
A World Law user can use the translation facility to translate their proposed
search terms into another language before conducting a search, so as to
obtain multi-language synonyms.
-
Search results pages may be translated, so that if a search finds documents
in another language, their titles can be translated into a language understood
by the searcher.
-
By initiating the translation facility before leaving the World Law pages
(either on a catalog page or the search results page), and then selecting
a remote document in a foreign language to browse, the Alta Vista translation
facility will 'follow' the user as he or she browses from page to page,
asking each time if this page is also to be translated between the two
languages.
The combined result of these translation features for a world index is
revolutionary: instead of being an `English only' facility, it is now effectively
available in six of the most pervasive European languages.
7.2. Embedding multi-lingual searches
Where embedded searches are included on the catalog pages, they are
constructed as multi-lingual searches wherever possible. For example, on the
Mozambique page[43],
the embedded search for 'Search World Law: Mozambique' is in fact a search
for 'mozambi* or mocambiq* or mosambi*'. Similar multi-lingual searches
by country names have been placed as stored searches on all country pages
(more than 200[44])
in World Law.
At present the languages used cover most common European languages,
with translations constructed using the Eurodicautom[45]
service.
Translations of country names in a number of Asian languages will
be added next to the embedded searches. As these searches are permanent,
broad searches of all the documents indexed from time to time, an investment
of time in the construction of such searches is probably more effective
than equivalent time spent indexing sites in detail.
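Building such a country-name search from a list of translated name stems is mechanical, as the following sketch suggests. The function name and the rule for dropping redundant stems are assumptions. The Mozambique stems are those quoted above, while the extra stem 'fijian' in the second call is invented to show the pruning.

```python
def country_query(stems):
    """Join country-name stems from several languages into one truncated
    OR search, dropping duplicates and stems already covered by a shorter one."""
    cleaned = []
    for stem in stems:
        stem = stem.strip().lower()
        if stem and stem not in cleaned:
            cleaned.append(stem)
    # With truncation, 'fiji*' already finds everything 'fijian*' would find.
    kept = [
        stem for stem in cleaned
        if not any(stem != other and stem.startswith(other) for other in cleaned)
    ]
    return " or ".join(stem + "*" for stem in kept)


mozambique = country_query(["mozambi", "mocambiq", "mosambi"])
fiji = country_query(["fiji", "fijian", "fidji"])
```

The first call reproduces the Mozambique search given in the text; the second drops the redundant stem and yields `fiji* or fidji*`.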
8. Future directions
8.1. Building a perverse portal
Internet 'portal sites' have been described as designed to perform two
functions[46]:
(i) providing users with the range of tools they need to find the content
they want; and (ii) obtaining large audiences so as to generate revenue,
typically through advertising (and the user surveillance it usually involves).
World Law aims to be a portal for legal information, but it is a perverse
one, because it is based around the idea of free access to Web resources
and does not depend on advertising or individual subscriptions to sustain
its development and maintenance.
If World Law succeeds it will be because the technical infrastructure
that we are creating, and the collaborative environment within which it
is used, is capable of attracting the interest and cooperation of others
around the world with a commitment to the provision of free access to legal
information via the Internet, and an expertise in some part of the Internet's
rapidly expanding legal content.
8.2. Building Internet legal research in Asia -
DIAL training
One aim of World Law is to build an audience for legal research resources
on the Internet which goes beyond the normal users in law firms and law
schools of the developed world, and provides a facility which is also valued
and used in the developing countries of the world as an affordable and genuinely
international resource for legal development.
First DIAL training session for Mongolian teachers from the Legal
Retraining Centre, Ulaan Baatar, held at the University of Melbourne on
9 July 1999.
The Asian Development Bank's funding of the training component of Project
DIAL is a significant experiment in achieving this broader goal. DIAL involves
in-country training of government lawyers (and others as resources permit)
in seven Developing Member Countries (DMCs) of the Bank: Vietnam, Philippines,
China, India, Pakistan, Indonesia and Mongolia. The DIAL training team
will be drawn from eight countries: local trainers in
each of these DMCs, a Regional Training Coordinator from the Philippines[47],
and the DIAL lead consultants at AustLII. 'Train the trainer' courses are
planned to begin in Mongolia, the Philippines and Vietnam during 1999,
and in the other countries in 2000. In addition there will be on-line training
materials that anyone can access, and a DIAL-user email list through which
anyone trained in DIAL use will be able to receive online support from the DIAL
training team in any aspect of Internet legal research.
9. References
-
Graham Greenleaf (1998) Developing the Internet for Asian Law - Project
DIAL: A feasibility study and prototype (Asian Development Bank, January
1998, 156 pgs) - at http://www2.austlii.edu.au/~graham/DIAL_Report/
-
Graham Greenleaf, Geoffrey King, Andrew Mowbray, Daniel Austin & Jill
Matthews (1997) `Future-proofing a global internet index by a targeted
Web spider and embedded searches' Australian Society of Indexers Annual
Conference 'The Futureproof Indexer' 27-28 September 1997, Katoomba,
Australia - at http://www2.austlii.edu.au/~graham/Futureproof/indexers.html
-
Greenleaf G, Mowbray A and van Dijk P (1995) 'Representing and using legal
knowledge in integrated decision support systems - DataLex WorkStations'
Artificial Intelligence and Law, Kluwer, Vol 3, Nos 1-2, 1995, 97-124
-
Sam Hinton (1999) Portal Sites: Emerging structures for Internet control
Research Report No 6 La Trobe University Online Media Program, January
1999
-
Steve Lawrence and C. Lee Giles (1999) 'Accessibility of information on
the Web' Nature, Vol 400, 8 July 1999, pgs 107-9 (Macmillan, UK)
-
[*] Graham
Greenleaf is project leader; Daniel Austin, Philip Chung and Andrew Mowbray
are the software authors and system developers. Madeleine Davis and Jill
Matthews are the legal indexers for the project.
[1]
For more details see part '2.1. The potential 'world law library' on the
Internet' - at http://www2.austlii.edu.au/~graham/DIAL_Report/Report-2.1.html
- in Greenleaf (1998)
[2]
[3]
[4] See
http://beta.austlii.edu.au/links/World/Countries/Mongolia/Legislation/index.html
for examples
[5] `Universal
Resource Locator' or internet address of a Web page.
[6]
[7] Lawrence
and Giles 1999
[8] http://www.excite.com.au/
[9] For
the detailed arguments see '2.2. Finding law on the net - Why is it so
difficult?' - at http://www2.austlii.edu.au/~graham/DIAL_Report/Report-2.2.html
- in Greenleaf 1998.
[10]
This section summarises part 2.3. 'Intellectual indexes (directories) for
law' - http://www2.austlii.edu.au/~graham/DIAL_Report/Report-2.3.html -
in Greenleaf 1998.
[11]
See http://beta.austlii.edu.au/links/World/Other_Indexes_and_Search_Engines/
for many examples.
[12]
http://dir.yahoo.com/Government/Law/
[13]
This part draws on part 2.4. 'Automated indexes and Internet-wide search
engines' - at http://www2.austlii.edu.au/~graham/DIAL_Report/Report-2.4.html
- in Greenleaf 1998
[14]
Lawrence and Giles 1999
[15]
See Lawrence and Giles 1999, and `How big are the search engines?' (Search
Engine Watch) at and references linked therefrom, for detailed discussion
of all these matters.
[16]
In 1996 it was claimed that Alta Vista only indexed about 10% of
the pages of moderately large Web sites (600 to 6,000 pages in the example
cited), and not denied by Alta Vista. Alta Vista now claims to index sites
without any limit on pages. The claim by John Pike, webmaster of the Federation
of American Scientists, and the reply by Alta Vista are available at
< >and discussed in `The Alta Vista Size Controversy' on Search Engine
Watch.
[17]
See Martin Koster `The Web Robots Pages' at <> for details of the operation
of Web robots.
[18]
See the `Robots Exclusion' page, dealing with both the standard and the
Meta Tag for robot exclusion at <>
[19]
Lawrence and Giles 1999. Earlier studies in 1996 had estimated that the
best search engine covered about 20% of the then 150 M Web pages.
[20]
For example, on Alta Vista, a search for Vietnamese legal materials requires
a search which is limited to materials which are located on a server in
Vietnam (the `domain:vn' delimiter) or contain `Vietnam or Viet Nam' -
and this is still somewhat hit or miss.
[21]
Lawrence and Giles 1999
[22]
Examples are Google - <http://www.google.com> - and Direct Hit.
[23]
As at January 1998 - see Greenleaf 1998
[24]
Aspects of such an approach, in a pre-Internet context, are explored in
Greenleaf, Mowbray and van Dijk 1995
[25]
Geoffrey King (to 1998), Daniel Austin, Philip Chung and Andrew Mowbray
have all worked on the software and systems side of this project.
[26]
Developed by Daniel Austin. Geoffrey King developed a previous version.
[27]
The names `Wendolyn' and `Wensleydale' have been reserved for future software
developed for this project. For further details see
[28]
Developed by Daniel Austin
[29]
Developed by Andrew Mowbray
[30]
http://beta.austlii.edu.au/links/World/Countries/index.html
[31]
http://lexmercatoria.net/
[32]
http://beta.austlii.edu.au/links/World/Subject_Index/International_Trade/index.html
[33]
[34]
[35]
Greenleaf et al 1997
[36]
http://jurist.law.pitt.edu/
[37]
[38]
[39]
[40]
[41]
http://beta.austlii.edu.au/links/World/Countries/index.html
[42]
[43]
http://beta.austlii.edu.au/links/World/Countries/Mozambique/index.html
[44]
http://beta.austlii.edu.au/links/World/Countries/index.html
[45]
http://eurodic.echo.lu/cgi-bin/edicbin/EuroDicWWW.pl
[46]
Hinton 1999, Chapter 3
[47]
Rachel Romana of CD Asia and her colleagues