3 Improved retrieval and storage techniques

Criticisms of existing retrieval techniques
Concordance-based retrieval

Precision and recall
Blair and Maron's experiment
Automatic or intellectual indexing?

A `user hostile' interface?
Identity or nearness functions?

STATUS With IQ
Vector-based retrieval

Conceptual retrieval systems
Other Retrieval Technologies

A revival of conceptual indexing
Enhanced secondary databases

New Storage Media

CD-ROM
`The Electronic Book'

Expert Systems

The Social Security Enquiry System
The DataLex project
CCH Solvware -- taxation software
Legal document generators
Availability of expert systems

Today's `state of the art' will be tomorrow's antique.

Existing commercial retrieval systems still use the first generation of a technology that is thirty years old. While that technology may have been impressive before the invention of the microcomputer, most full text retrieval systems would today be classified as `user hostile'. Few lawyers have found the benefit of these techniques so compelling as to want to use them constantly. In part this may be due to the price of the services, and to the lack of sufficiently extensive databases, but there is also a constant refrain, even among experienced users, that the software is too difficult to use.

However, there is no consensus as to what form the `second generation' of retrieval software should take. This Chapter outlines some criticisms of existing retrieval systems and improvements that have been suggested.

Other technologies which are related to but go beyond `information retrieval' are also mentioned. New retrieval technologies, expert systems, `smart books', CD-ROM and electronic mail will change computerised legal information beyond recognition.

Criticisms of existing retrieval techniques

Search languages based on logical and positional connectors are capable of being used to construct complex and sophisticated search requests. However, it has been argued that their proper use is beyond the abilities of most users.

Bing comments that `Boolean retrieval is by far the most widely used strategy', and continues:

The strength of the retrieval strategy is its flexibility, and the possibility it offers for experienced users to construct complicated and well performing requests. In principle, high retrieval performance is always possible.

The drawback ... is the high demands posed to the user ... An inexperienced user will have difficulties in exploiting the possibilities, and a request may therefore easily have a structure which is different from the one intended by the user (for instance the user mixes up the ORs and ANDs, or is unaware of the sequence in which they are executed). (J Bing (Ed) Handbook of Legal Information Retrieval , North Holland 1984, p163)

A recent Australian commentator on the use of STATUS has commented on the simple searches carried out by most users of a Hansard database in similar but more colourful terms:

... at training courses we often see eyes start to glaze over at the mere mention of Boolean logic. The concepts of algebraic precedence and the use of parentheses to enforce it then finishes the job. So, when asking questions such users tend to avoid the issue by adopting a stepping-stone approach to their enquiry ... (J Gouldstone, `How users really use databases' Information Online 87, Proceedings of the Second Australian Online Information Conference, Library Association of Australia 1987)
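The precedence problem that Bing and Gouldstone describe can be illustrated with a minimal sketch. It uses Python's own `and` and `or` operators, in which `and` is applied before `or`, as a stand-in for a retrieval language; the evaluation order of any particular retrieval system is not stated in this chapter, and the documents and search terms below are invented for illustration.

```python
# Hypothetical mini-database: document name -> set of words it contains.
documents = {
    "doc1": {"negligence", "solicitor"},
    "doc2": {"negligence", "damages"},
    "doc3": {"barrister", "damages"},
    "doc4": {"solicitor", "damages"},
}

def search(predicate):
    """Return the sorted names of documents whose word set satisfies predicate."""
    return sorted(name for name, words in documents.items() if predicate(words))

# Intended request: (solicitor OR barrister) AND negligence
intended = search(
    lambda w: ("solicitor" in w or "barrister" in w) and "negligence" in w
)

# The same request typed without parentheses: AND is applied before OR,
# so it is read as solicitor OR (barrister AND negligence).
unparenthesised = search(
    lambda w: "solicitor" in w or "barrister" in w and "negligence" in w
)

print(intended)         # ['doc1']
print(unparenthesised)  # ['doc1', 'doc4']
```

The second request silently retrieves a document that has nothing to do with negligence, which is the kind of unintended structure Bing warns of.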

Concordance-based retrieval

The types of retrieval systems discussed in the previous chapter are variously called concordance-based, `boolean', `automatic indexing', or `word occurrence indexing'. Research into their performance has been inconclusive. In Bing's view, this is partly because there is as yet no comprehensive theoretical foundation for information retrieval, and in particular because there are no generally accepted measurements of the relevance of documents retrieved (Bing Handbook, 1984, p 204).

Some research claims that existing computerised retrieval systems are superior to manual research (See, for example, M Iosipescu and J Yogis A Comparison of Automated and Manual Legal Research, Canadian Law Information Council, Ottawa, 1981), whereas other researchers have been very critical (See, for example, D Blair and M Maron, `An evaluation of retrieval effectiveness for a full-text document-retrieval system', (1985) Vol 28 No 3 Communications of the ACM).

Precision and recall

The effectiveness of full text retrieval is often measured in terms of precision and recall. The two measures derive from the following table.

	Relevant	Not relevant
Retrieved	a	b
Not retrieved	c	d

`Precision' is the ratio of a to (a + b), or the proportion of relevant material retrieved to all material retrieved. Retrieval of irrelevant material (ie b) lowers this ratio.

`Recall' is the ratio of a to (a + c), or the proportion of relevant material retrieved to all relevant material retrievable. Failure to retrieve relevant material (ie c) lowers this ratio.

High recall and high precision (ie each as close to 1:1 as possible) are both desirable. There seems to be no optimum relationship between precision and recall. However, it is often observed that there is an inverse relationship between the two, with high recall resulting in low precision, and vice-versa (see Blair and Maron op cit and references cited therein).
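The two ratios, and the inverse relationship often observed between them, can be shown with a short sketch. The function names and the counts are invented for illustration; a, b and c have the meanings given in the table above.

```python
def precision(a, b):
    """Relevant retrieved (a) as a share of everything retrieved (a + b)."""
    return a / (a + b)

def recall(a, c):
    """Relevant retrieved (a) as a share of everything relevant (a + c)."""
    return a / (a + c)

# Invented counts: a narrow and a broad search over the same database,
# which is assumed to contain 50 relevant documents in total.
narrow = {"a": 10, "b": 2, "c": 40}
broad = {"a": 40, "b": 160, "c": 10}

for name, t in (("narrow", narrow), ("broad", broad)):
    p = precision(t["a"], t["b"])
    r = recall(t["a"], t["c"])
    print(f"{name}: precision {p:.2f}, recall {r:.2f}")
```

In this invented case the narrow search is precise but misses most of the relevant material, while the broad search finds most of it at the cost of a great deal of irrelevant material.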

Blair and Maron's experiment

Blair and Maron tested retrieval effectiveness of forty queries used to retrieve documents from a full text database of over 40,000 documents used in a litigation support database concerning one piece of litigation undertaken by a United States legal firm.

IBM's STAIRS retrieval system was used. Experienced para-legals constructed the queries in conjunction with the lawyers working on the case, often refining a query through a number of attempts. The lawyers had specified that, for them, an acceptable level of recall required that at least 75% of all relevant documents must be retrieved. They were less concerned with precision. Their subjective assessment of the results of the queries was that this level of recall had been achieved -- that is, they were happy with what they retrieved.

All of the documents were then analysed according to relevance, and the actual precision and recall measurements made. To the great surprise of the lawyers concerned, it was found that the value of recall was only 0.2 (20%), whereas precision was 0.75 (75%). They believed that they were retrieving more than 75% of all relevant documents, but in fact were only retrieving 20%.
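What these two figures imply can be worked out from the definitions of precision and recall, using a, b and c as in the table above. This is only arithmetic on the reported ratios, not a reconstruction of Blair and Maron's actual document counts.

```latex
\[
\text{precision} = \frac{a}{a+b} = 0.75 \;\Rightarrow\; b = \frac{a}{3},
\qquad
\text{recall} = \frac{a}{a+c} = 0.2 \;\Rightarrow\; c = 4a .
\]
```

On these ratios, for every three relevant documents retrieved, one irrelevant document was also retrieved and twelve relevant documents were missed.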

Blair and Maron concluded that the main assumption underlying automatic indexing was incorrect, namely that the likely users of these systems can `foresee the exact words and phrases that will be used in the documents they will find useful, and only in those documents'. In other words, they question whether, in practice, likely users can capture concepts in word-occurrence combinations. If not, it is irrelevant whether it is possible to do so in theory. They suggest that if high recall is desired, manually assigned index terms (`intellectual indexing') are necessary.

There are a number of reasons why we should exercise caution before extrapolating this result into a general condemnation of full text retrieval systems.

First, the experiment was on a litigation support application, where one would expect to find a database consisting of a wide variety of document types, ranging from letters between parties, affidavits of witnesses and technical documents to formal court documents. Such documents will have no consistent structure, so that search functions based on specific sections of documents will be unavailable. Nor will such documents have any regularity in style or precision in use of language, such as one can reasonably expect to find in the formal style of reported cases or, a fortiori, in legislation. Such formality must make it easier for a lawyer who is sensitive to judicial or statutory language to predict what words will be used in relation to which concepts in these texts (See P Flass's letter in 28(11) Comm ACM, 1985 for a similar point).

It is also questionable whether to have paralegals translating lawyers' requests is the best model of legal information retrieval, whether a manually-constructed index would have behaved any better, and whether at least some of the fault lies with the lack of adequate ranking functions in STAIRS, rather than with automatic indexing per se.

Automatic or intellectual indexing?

Despite these reservations, the Blair and Maron experiment does give good reason to doubt that automatic indexing is a panacea. Bing is probably correct in suggesting that this debate is really `a thing of the past', at least insofar as commercial retrieval systems are concerned. T Fjeldvig's experiments concerning a database of decisions of the Norwegian Central Tax Authority found that abstracts alone had the lowest retrieval performance (including both precision and recall), full text the next best, and a combination of both abstracts and full text the highest performance (discussed in J.Bing `The law of the books and the law of the files' (Part 1) (1987) 54 Computers and Law 31).

The sensible but unsurprising conclusion is that, if it is possible on pragmatic grounds such as cost to include both full text (with automatic indexing) and abstracts, catchwords or indices (intellectual indexing), then retrieval performance will be enhanced. Both are needed to overcome the limitations of each.

A `user hostile' interface?

Whatever the relative merits of automatic and intellectual indexing are, Bing claims that many of the deficiencies of existing retrieval systems may be remedied without altering the reliance upon automatic indexing (See J.Bing `The text retrieval system as a conversation partner' (1986) 2 Yearbook of Law, Computers & Technology). A major problem is the `user hostile' interface of most systems, coupled with their passive `help' facilities.

Most online retrieval systems are command-driven, and present the user with a prompt (such as >) when another command is expected, but no suggestion as to what commands are possible or desirable at that point. This was the norm for all programs until the mass-marketing of computer programs from the late 1970s. However, most microcomputer users now expect a greater degree of `user friendliness', such as on-screen menus of commands which indicate which commands are possible at that time. As Bing says, the relative user-friendliness of text retrieval systems has deteriorated. In part this is due to the restrictions imposed by communications protocols and speeds, but these problems could be overcome to some degree by local interface software running on the user's computer. The LEXIS communications software goes some way toward this goal.

Help facilities are usually passive, waiting for the user to invoke them. Error messages in most systems are extremely cryptic, of the `Command not valid' variety. Bing points out that what is needed is `active intervention of the system. The system should monitor the dialogue, and butt in with advice when detecting something is wrong.' LEXIS does this to some extent.

Other problems (Discussed in J.Bing `Legal text retrieval systems -- the unsatisfactory state of the art' (1987) Vol 2 No 1 Journal of Law & Information Science) which could be lessened by such active system intervention include:

Misspelt words -- As many as 9% of words in search requests are misspelt, with most systems responding only with an unhelpful `No documents retrieved'. An automatic spelling checker could at least check all words and require the user to confirm any word which does not appear in its dictionary.
Missing synonyms -- Users are often required to remember all possible synonyms for search terms in order to obtain adequate recall. Despite the fact that some retrieval systems such as STATUS allow a user to call for the system to substitute automatic lists of synonyms for search terms (the & operator), system operators such as CLIRS and SCALE have not developed the necessary synonym lists. A user should be able to accept or reject synonyms suggested by the system.
Missing grammatical variants -- Accurate truncation of search terms is also necessary for effective searching, but existing systems do not prompt the user to truncate, nor suggest possible truncations and allow the user to accept or reject them.

Such active system intervention does not require advances in artificial intelligence technology. If it allowed the user to accept, reject or modify the search improvements suggested by the system, the retrieval abilities of most users could be improved. Such a system would be the `intelligent pre-processor' of search requests that Bing has suggested, but not an inflexible one.
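A minimal sketch of such a pre-processor shows that only simple techniques are needed. The vocabulary, the synonym list, the six-letter stem and the function name `suggest` are all assumptions made for illustration; a real system would draw its vocabulary from the concordance and would need a maintained synonym list. The system only proposes; the user accepts or rejects each suggestion.

```python
import difflib

# Hypothetical data standing in for the concordance and a synonym list.
VOCABULARY = ["negligence", "negligent", "negligently",
              "solicitor", "lawyer", "damages"]
SYNONYMS = {"solicitor": ["lawyer", "barrister"]}

def suggest(term):
    """Return suggestions for one search term, for the user to accept or reject."""
    term = term.lower()
    if term not in VOCABULARY:
        # Possible misspelling: offer the closest words in the vocabulary.
        return {"spelling": difflib.get_close_matches(term, VOCABULARY, n=3)}
    # Grammatical variants: other vocabulary words sharing the first six letters
    # (a crude stand-in for prompting the user to truncate).
    stem = term[:6]
    variants = [w for w in VOCABULARY if w.startswith(stem) and w != term]
    return {"synonyms": SYNONYMS.get(term, []), "variants": variants}

print(suggest("neglegence"))   # misspelt: closest vocabulary words are offered
print(suggest("solicitor"))    # synonyms are offered
print(suggest("negligence"))   # grammatical variants are offered
```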

Identity or nearness functions?

One of the main criticisms of retrieval systems based on logical and positional connectors is that they are very strict: they only retrieve documents which meet all the requirements of the search request. A document which contains all the specified search terms except one, or contains all of them but not quite as close together as is specified, will be excluded.

A by-product of this `identity' function, as Bing calls it, is that logical and positional connectors provide no method of ranking those documents which do satisfy a search request in terms of how likely it is that they will be relevant to the user's request. A document either satisfies the request or it does not.

A number of alternative retrieval methods are being developed based on `nearness' functions, rather than `identity', in order to counter both of the above problems (For the Norwegian approach of `conceptor based retrieval', implemented in NOVA*STATUS and SIFT, see J.Bing `The law of the books and the law of the files' (Part 1) (1987) 54 Computers and Law 31). Fairly simple word-frequency ranking was included in the Canadian QL system, and is included in the STAIRS retrieval system, but has not proved to be sophisticated enough.
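The difference between the two kinds of function can be sketched as follows. The documents and the request are invented, and the `nearness' measure used here (the fraction of requested terms present) is only the simplest possible stand-in, not the method of any system mentioned above.

```python
# Invented documents and request, for illustration only.
documents = {
    "doc1": "the solicitor was negligent in drafting the will",
    "doc2": "the solicitor drafted the will",
    "doc3": "damages for negligent driving",
}
request = ["solicitor", "negligent", "will"]

def identity_match(text, terms):
    """Strict Boolean AND: every requested term must be present."""
    words = set(text.split())
    return all(term in words for term in terms)

def nearness_score(text, terms):
    """Fraction of the requested terms that the document contains."""
    words = set(text.split())
    return sum(term in words for term in terms) / len(terms)

exact = [name for name, text in documents.items() if identity_match(text, request)]
ranked = sorted(
    documents,
    key=lambda name: nearness_score(documents[name], request),
    reverse=True,
)

print(exact)   # ['doc1']
print(ranked)  # ['doc1', 'doc2', 'doc3']
```

The identity function discards doc2 although it lacks only one term, and gives no ordering among the documents it does retrieve; the nearness function retrieves everything but presents the most likely documents first.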

STATUS With IQ

An example of an Australian retrieval system which uses this approach is an enhancement to STATUS, STATUS with IQ.

The approach taken by STATUS with IQ may be summarised briefly as follows (See D L Pape Status with IQ User's Guide, Computer Power Applied Research and Development Division, Canberra, 1985).

The user provides a list of `tokens', a token consisting of a word or group of synonyms representing one concept which the user considers relevant to the search. Up to ten tokens may be suggested, and some logical and positional connectors may be used within tokens.
IQ computes a `target density' for each token, essentially a measure of how frequently that token must appear in some form in an article before the occurrence is considered statistically significant or `dense' enough.
IQ then retrieves all articles that contain one or more of the tokens, and ranks the articles on a score of 0-100 according to how many tokens an article contains, whether those tokens exceed their target densities, and a number of bonus factors such as whether tokens appear in the same or contiguous paragraphs, or in particularly significant sections of an article (such as a title, abstract or headnote).
IQ then reports the ranking and density information to the user. The user can view the articles in ranked order until he or she decides that their level of relevance is too low. Alternatively, the user can amend the tokens in light of the information given, which might show, for example, that a particular token never appears at greater than its target density, due perhaps to a poor choice of search terms.
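The steps above can be sketched in outline. The weighting below is invented and is not IQ's own formula, which is not given here: a single fixed target density is assumed for all tokens (IQ computes one for each token), density is measured as occurrences per 100 words, and the bonus factors are omitted. The function names and the articles are hypothetical.

```python
def density(words, token):
    """Occurrences of any synonym in the token, per 100 words of the article."""
    hits = sum(1 for w in words if w in token)
    return 100.0 * hits / len(words)

def score(article, tokens, target_density):
    """Score an article from 0 to 100 using an invented weighting.

    A token that is present earns half of its share of the marks;
    a token that reaches the target density earns the other half.
    """
    words = article.lower().split()
    share = 100.0 / len(tokens)
    total = 0.0
    for token in tokens:
        d = density(words, token)
        if d > 0:
            total += share / 2
        if d >= target_density:
            total += share / 2
    return round(total)

# Each token is a group of synonyms representing one concept.
tokens = [{"lease", "tenancy"}, {"landlord", "lessor"}, {"forfeiture"}]
articles = {
    "A": "the lease was forfeited and the landlord claimed forfeiture of the lease",
    "B": "the landlord wrote to the tenant",
    "C": "damages for negligent driving",
}

scores = {name: score(text, tokens, target_density=10.0)
          for name, text in articles.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'A': 67, 'B': 33, 'C': 0}
print(ranking)  # ['A', 'B', 'C']
```

Reporting the per-token densities alongside the scores would give the user the kind of feedback described in the last step, for example that a token never reaches its target density.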

STATUS with IQ appears to be a powerful tool for the sophisticated user, giving a great degree of `relevance feedback'. It could also be useful for inexperienced users because of the reduced stress on positional and logical operators.

STATUS with IQ has not yet been included as an alternative search method on the CLIRS or SCALE databases. It has been tested with a database of High Court cases.

Vector-based retrieval

Vector retrieval is another nearness function, where both documents and search requests are represented as mathematical vectors, and the system calculates the proximity between documents and the request, and ranks the documents accordingly (Bing `The law of the books and the law of the files' (Part 1), op cit; see also C Tapper `The use of citation vectors for legal information research' (1982) Vol 1 No 2 Journal of Law & Information Science). Vector retrieval systems have not been implemented commercially.
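A minimal sketch of the idea follows. It assumes word-count vectors and the cosine of the angle between vectors as the measure of proximity; the sources cited above may use other representations and measures, and the documents and request are invented.

```python
import math
from collections import Counter

def vectorise(text):
    """Represent a text as a word-count vector (a Counter keyed by word)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two word-count vectors (1.0 = same direction)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented documents and request, for illustration only.
documents = {
    "doc1": "negligent solicitor will",
    "doc2": "solicitor drafted the will",
    "doc3": "damages for driving",
}
request = vectorise("negligent solicitor")

ranked = sorted(
    documents,
    key=lambda name: cosine(vectorise(documents[name]), request),
    reverse=True,
)
print(ranked)  # ['doc1', 'doc2', 'doc3']
```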

Conceptual retrieval systems

More radical approaches to improving retrieval seek to utilise techniques related to artificial intelligence and legal expert systems. They seek to represent legal documents in terms of the legal concepts for which they may be significant, and therefore change the emphasis of retrieval techniques from combinations of word occurrences to combinations of concept occurrences.

Such an approach could require intellectual or `conceptual' indexing of legal materials, although Norwegian research suggests that it may be possible instead to construct a `norm-based thesaurus', a conceptual model of an area of law, with groups of terms associated with each conceptual node in the structure, but with no intellectual indexing of the actual documents. The norm-based thesaurus would suggest additional search terms to users and, by utilising user feedback, could be self-maintaining (See Bing `The text retrieval system as a conversation partner' op cit; J Bing `Designing text retrieval systems for "conceptual searching"' Proceedings of the First International Conference on Artificial Intelligence and Law, ACM, Boston, 1988).

There are many other research projects in this area at present (See J Bing `Designing text retrieval systems for "conceptual searching"' ibid for a summary of some of the most important; see also the papers by Tong et al, Hafner, Dick and Belew in those Proceedings), but none involve commercial implementation with the exception of the Italian ITALGIURE system.

Other Retrieval Technologies

A revival of conceptual indexing

Another approach is to abandon retrieval based on word occurrence searches, and to develop systems based on subject indexing and menus, similar to an on-line book. This approach is being taken, at least to some extent, by MAYNELAW in its development of an `online legal encyclopaedia', discussed in Chapter 29. Videotext systems such as Viatel, discussed in Chapter 31, rely on such a `page-based' approach.

Enhanced secondary databases

CCH in the United States is developing a method of presenting its loose-leaf reporting services online which could be called, for want of a better expression, `enhanced secondary databases'. It is conceptually similar to what is being called `hypertext' or non-linear text presentation.

The CCH ACCESS system will be run by CCH on its own network, and will use a combination of retrieval software on the host computer, and interface software on the user's own PC. The retrieval software will allow both Boolean retrieval and menus based on the normal CCH numbered tables of contents and paragraphs. The key to the system, however, will be that all paragraphs will be coded to cross-refer to the statutory provisions, caselaw and other commentary referred to in that paragraph (See Sperling, 1987, pp 3-4).

It will therefore be possible to have immediate and automatic access to the primary materials to which a commentary refers, and vice versa. This approach will constitute an integration of primary and secondary databases on a topic. A similar approach has already been taken in the development of some legal expert systems.
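The two-way access described here amounts to keeping the coded cross-references and their inverse. The following is a minimal sketch under that assumption; the paragraph numbers, source names and function name are invented and do not represent the CCH coding scheme.

```python
from collections import defaultdict

# Hypothetical mark-up: commentary paragraph -> primary materials it refers to.
commentary_refs = {
    "para 10-100": ["Act s 5", "Case A"],
    "para 10-200": ["Act s 5", "Act s 7"],
}

def build_reverse_index(refs):
    """Invert the mapping so each primary source lists the paragraphs citing it."""
    reverse = defaultdict(list)
    for paragraph, sources in refs.items():
        for source in sources:
            reverse[source].append(paragraph)
    return dict(reverse)

cited_by = build_reverse_index(commentary_refs)

print(commentary_refs["para 10-100"])  # from commentary to primary materials
print(cited_by["Act s 5"])             # from a primary source back to commentary
```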

Such an approach differs from existing information retrieval techniques in that it requires considerably more complex and labour-intensive coding or `marking up' of the data to be included in the retrieval system.

The second element in the CCH approach will be the interface software on the user's PC. It will make extensive use of user-manipulable windows to allow different categories of data (eg commentary and statutes) to be viewed simultaneously, and for note-taking while on line. In other words, it will apply what by now are conventional microcomputer techniques to the creation of a more friendly and responsive information retrieval interface than remote systems can yet provide.

CCH Australia does not expect to offer an online CCH ACCESS system `in the immediate future', but considers that the software being developed is what is needed for searching CD-ROM disks, and that it may be offered in this way in future (id p 8).

CCH's proposals do not seem to involve any startling computing innovations, but appear to be very sound.

New Storage Media

Distribution of computerised legal information has, until now, been achieved principally through online searching of remote databases. There has been relatively little distribution on computer storage media such as floppy disks. The large storage quantities required for useful databases, combined with the limited capacity and considerable costs of disk storage, made such distribution uneconomical. Suitable retrieval software for microcomputers either did not exist or was prohibitively expensive. These problems were exacerbated by the need to keep such data up to date as the law changed, at a multitude of distributed sites.

The only notable exception in Australia has been the sale of `precedent packages', collections of useful forms and precedent documents in such areas as conveyancing and litigation (S Lewis Australasian Legal Software Directory, Legal Management Consultancy Services, Sydney, 1987, contains details of some packages). Precedents require relatively small disk storage, they only require word processing software, which lawyers require in any event, and they only require intermittent updating.

New storage technologies are likely to change completely the economics of such distribution of computerised legal information.

CD-ROM

Compact Disk Read-Only Memory (CD-ROM) is a storage technology essentially the same as that used for music Compact Disks. One CD-ROM can hold up to 600 megabytes of data, the equivalent of nearly 2,000 low-density floppy disks. In order for a computer to read information stored on a CD-ROM, a separate disk drive is needed, but these are now priced below $2,000.

Reference works are starting to become available on CD-ROM. The twenty volume Grolier Encyclopedia is available on a CD-ROM for under $400.

The Index to Legal Periodicals is available as a database on the LEXIS, Westlaw and Wilsonline online retrieval services in the United States, and is therefore searchable by LEXIS users in Australia. It is also available on CD-ROM from the company operating Wilsonline. The data is kept current in two ways. The CD-ROM is replaced quarterly by an updated cumulative disk. Purchasers of the CD-ROM are offered free search time on the online version of the database on Wilsonline, although they must still pay telecommunications charges.

The Legal Resources Index, the other principal legal bibliographic reference work, is also available as a database on LEXIS (LGLIND), Westlaw and other online services. It is also available as a videodisc called LEGALTRAC, available from Information Access Company, the publishers of the Legal Resources Index. Videodisc is a technology requiring more complex equipment than CD-ROM, but comparable in its storage capacity. The LEGALTRAC system does not use a full text retrieval system. Searches are based on a subject index (L Scott Rawnsley `Making Tracs: Road Testing the INFOTRAC and LEGALTRAC video-Disk Databases' (1986) Vol 6 Nos 3/4 Legal Reference Services Quarterly 168).

No Australian legal materials are yet available on CD-ROM.

`The Electronic Book'

Online retrieval systems and CD-ROMs both use conventional computer terminals for access, and both store data on disks of varying types. The electronic book, or `smart book', stores data on a chip rather than a disk. Because of the elimination of any need for the moving parts of a disk drive, a `reader' need only be the size and shape of a book, and may be hand-held.

The Weldon-Hardie Group of companies, publishers of the Macquarie Dictionary, have announced that a subsidiary company, Megaword, is to market a Book Reader. It is to be battery powered, book size, with a backlit screen the size of a page, and six keys to control retrieval functions. Each `book' comes on a credit card sized storage device. The retrieval system involves a concordance, so search techniques similar to those in online retrieval systems will be possible. (see Online Currents Vol 2(6), 1987, p1).

A prototype smart book using -- what else but? -- the Bible was demonstrated in 1987. CCH Australia has announced that the second smart book to be released will be its Master Tax Guide 1989. Many would see this as an appropriate encore.

Online Services or optical discs or CD-ROM will never replace books... because as yet neither form is portable. You are tied to a computer and/or telephone. You can't take one to bed, or read on the train. (Editorial Online Currents Vol 2(6), 1987).

Expert Systems

Databases and full text retrieval systems, whether online or on CD-ROM, represent the present `state of the art' in computerised legal information. `Knowledge-bases' and the `expert systems' of which they are part, may represent the future direction.

A legal expert system is a computer program that gives legal advice. Whereas a database program merely retrieves information which is potentially relevant to an individual legal problem, leaving its application to specific problems to the user, an expert system applies information to the specific problem. Databases store the `raw material' on which legal advice and decisions are based: cases, statutes and `textbook law'. `Knowledge bases' store legal information in a more highly structured form which also represents the interrelation between the different items of information, and allows it to be applied to individual problems. A legal `knowledge base' is an attempt to represent the structure of an area of law.
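The distinction can be made concrete with a toy sketch. The rules below concern a fictitious benefit and state no real law; the rule format and the function name `advise` are invented for illustration and are not those of any system described in this chapter. A database would merely retrieve the provisions; the sketch applies them to the facts of an individual case and cites its authority.

```python
# Invented rules for a fictitious benefit; they do not state any real law.
RULES = [
    # (conclusion, conditions that must all hold, authority cited for the rule)
    ("eligible", {"resident": True, "income_below_limit": True}, "Fictitious Act s 1"),
    ("not eligible", {"resident": False}, "Fictitious Act s 2"),
]

def advise(facts):
    """Apply the first rule whose conditions are all satisfied by the facts."""
    for conclusion, conditions, authority in RULES:
        if all(facts.get(key) == value for key, value in conditions.items()):
            return conclusion, authority
    return "no conclusion", None

print(advise({"resident": True, "income_below_limit": True}))
print(advise({"resident": False}))
print(advise({"resident": True, "income_below_limit": False}))
```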

The Social Security Enquiry System

The Social Security Enquiry System (SSES) is the most complete and complex legal expert system yet developed in Australia. It provides advice on eligibility for any type of pension or social security benefit available in Australia. Each conclusion reached by the system is supported by references to statutory provisions and relevant caselaw, and the user may view these provisions or summaries of the cases in a window on screen on request. A Manual on social security law is also incorporated in the system. SSES assists the user to draft submissions to administrative bodies such as the Social Security Appeals Tribunal and the Administrative Appeals Tribunal. It also drafts appeal documents, Freedom of Information Act applications, and other documents.

SSES was developed by P Johnson (a lawyer) and D Mead (a programmer), initially for the A.C.T. Welfare Rights Centre. Distribution to other welfare agencies is now being organised. The SSES application was developed using a `shell' called Knowledge Base System Management System (KBMS-1) written as part of the project in the logic programming language, Prolog. The shell may be used to produce applications in other areas of law. See D Mead and P Johnson The Social Security Enquiry System -- Background Paper, the authors, January 1988.

The DataLex project

The `DataLex project', a research project involving academics from three Universities in Sydney, A Mowbray, A Tyree and G Greenleaf, has developed the expert system `shells' LES (decision networks), XSH (primarily rule-based) and PANNDA (nearest-neighbour discriminant analysis), and a number of demonstration applications on the law of intestacy, copyright, and the finding of chattels. See Greenleaf, Mowbray and Tyree `Legal Expert Systems -- Words, Words, Words...?' (1987) 3 Yearbook of Law Computers and Technology 119, for a discussion of LES and PANNDA, but not XSH.

Both the SSES and DataLex projects received substantial assistance from the Law Foundation of New South Wales, which has shown a consistent interest in the development of computerised legal information services.

CCH Solvware -- taxation software

CCH Australia Limited has published four taxation programs under its `Solvware' title. These programs are `expert systems' insofar as they contain knowledge concerning the application of taxation law and apply it to the specific tax problems of the user of the program to give advice. However, they have been written using conventional programming techniques rather than the use of expert systems shells or languages.

The Tax Breakdown of Termination Payments program calculates taxable amounts of termination payments. The Tax Return Package prepares tax returns in a form approved by the Australian Taxation Office. The Assets Register and Depreciation Package records details of an organisation's assets and produces an assets register listing those assets and schedules showing depreciation expenses against individual assets. The Fringe Benefits Tax Package prepares FBT returns and answers questions on whether or not a benefit is subject to FBT.

CCH provides a `Hotline', a telephone advice service to assist with the use of Solvware.

Legal document generators

Programs that guide users through the creation of various types of standard legal documents, fitting the document generated to the precise details and legal requirements of the transaction at hand, are a specialised variety of legal expert system. The Document Modeler program is probably the most successful example of a `shell' with which such document `templates', or `intelligent precedents', may be created.

Once created by an expert in a particular field, these `templates' may be used by less expert users to create documents for their own transactions. The marketing of templates has recently commenced in Australia, but few good templates are yet available.
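The idea of a template which fits the document to the details of the transaction can be sketched with Python's standard `string.Template`. The wording of the letter, the optional clause and the function name `generate` are invented; this is not how the Document Modeler program itself works.

```python
from string import Template

# Hypothetical template; the clause wording is invented for illustration.
LETTER = Template("Dear $client,\nWe act for you in the $matter.$extra\n")
GUARANTOR_CLAUSE = Template("\nThe guarantor, $guarantor, must also sign.")

def generate(details):
    """Fill the template, including the optional clause only when it applies."""
    extra = ""
    if details.get("guarantor"):
        extra = GUARANTOR_CLAUSE.substitute(guarantor=details["guarantor"])
    return LETTER.substitute(
        client=details["client"], matter=details["matter"], extra=extra
    )

print(generate({"client": "Ms Jones", "matter": "purchase of land"}))
print(generate({"client": "Mr Lee", "matter": "lease", "guarantor": "J Smith"}))
```

The expert's knowledge lies in the conditions attached to each clause; the less expert user supplies only the details of the transaction.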

Availability of expert systems

Legal expert systems can be marketed as online services, on CD-ROM, or (for smaller applications) on floppy disk. The problems of keeping them up to date are not a great deal different from those concerning databases. The need for more sophisticated screen handling than is available in many online systems makes some expert systems unsuitable for online delivery. The very limited range of research on legal expert systems in Australia makes it unlikely that many examples will be available commercially in the near future.
