Reading Guide:
Hypertext and Retrieval
3. Text retrieval - principles and evaluation
'Text retrieval' (also sometimes known as 'full text searching' or 'free text searching')
refers to computerised systems which allow users to find particular combinations
of words in large bodies of text, whether or not those texts have any uniform structure.
Howard Turtle (see below) gives a more technical (and accurate) definition:
'a retrieval system applies some matching function to the representation of
the information needs and the representation of each document to determine which
documents to retrieve'.
For those unfamiliar with text retrieval, the following glossaries may be helpful,
either as a starting point for reading, or to come back to later:
- Search Engine Watch's Search Engine Glossary - simple but useful.
- I-Search Digest Search Engine Terms - a very comprehensive glossary with
links to many of the resources it discusses. Includes brief analyses of all
the main web search engines, as well as technical terms. If you read this
glossary you will know just about every bit of jargon there is relating to
web searching.
- Scott Weiss' Glossary for Information Retrieval - a much more technical
glossary, oriented toward information retrieval research. Not for the
faint-hearted.
In the late 1950s, the basic technique underlying most computerised searching
of large bodies of text was developed, variously known as the 'concordance', 'inverted
file' or 'word occurrence index'. In summary, it involves the construction of
an alphabetical list of every different word in each document in a set of documents,
with the locations of each occurrence of that word recorded next to it. 'Searches'
of the documents are in fact carried out over this 'concordance', not over the
actual texts. (See a short
history of text retrieval in law for more details.)
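To make the mechanism concrete, here is a minimal sketch in Python of building
and searching such a concordance. It is an illustration only: the documents,
the tokenisation, and the choice to record (document id, word position) pairs
are all assumptions, not any particular system's format.

    import re
    from collections import defaultdict

    def build_concordance(documents):
        """Build an inverted file: for every distinct word, record the
        location (document id, word position) of each occurrence."""
        concordance = defaultdict(list)
        for doc_id, text in documents.items():
            for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
                concordance[word].append((doc_id, position))
        return concordance

    # Illustrative document set (hypothetical)
    docs = {
        1: "The tenant may terminate the lease.",
        2: "The landlord may not terminate without notice.",
    }
    index = build_concordance(docs)
    # Searches run over this index, not the raw text:
    print(index["terminate"])  # [(1, 3), (2, 4)]

A fuller system would also record locations such as paragraph and sentence
numbers (compare the 'five-place' concordance mentioned below), while a
simpler one might record only document numbers.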
There seem to be few explanations of concordances or the underlying operation
of search engines available on the web. The following reference, although it
is old (1988), does explain the essentials of concordance-based searching:
Graham Greenleaf, Andrew Mowbray and David Lewis,
Retrieval Techniques (extract from Chapter 2, 'Basic principles of legal
information retrieval', Australasian Computerised Legal Information Handbook,
Butterworths, 1988).
The following points should be noted:
- The text retrieval system described in this book uses a relatively complex
five-place concordance. Many text retrieval systems in use on the internet use
far less complex concordances, sometimes recording only the number of a document
in which a word is found, but not the location of the word within the document
(much less the location of paragraphs or sentences).
- For discussion of some basic concepts of text retrieval, and concepts used
in indexing, see Dabney 'II. A Basic Model for Document Retrieval
Systems' (in Dabney, cited below) [No longer on the web]
- Howard Turtle 'Text retrieval in the legal world' - 5.1 Exact match models
Artificial Intelligence & Law, (1995) Vol 3 Nos 1-2 (see LAWS 4609
Course Materials #3 at the Law Reserve Desk).
How do you determine the effectiveness of a text retrieval system? Although Jon
Bing says 'the science of information retrieval lacks a comprehensive theoretical
foundation', a lot of experimental effort has been made, with 'recall' and 'precision'
as the two main measurements of the quality of retrieval results.
The effectiveness of full text retrieval is often measured in terms of precision
and recall. The following table illustrates this. Assume that there are 100 documents
in a collection being searched. The search retrieves 10 documents, only 4 of which
(after inspection) are found to be relevant. However, after inspecting all the
other 90 documents, we find that there are 2 relevant documents not found by the
search.
                 | Relevant        | Not relevant         | Total
Retrieved        | Hits: a = 4     | False drops: b = 6   | a + b = 10
Not retrieved    | Misses: c = 2   | Dodged: d = 88       | c + d = 90
Total            | a + c = 6       | b + d = 94           | a + b + c + d = 100

Precision and recall table (derived from Miranda Pao - reference below)
- 'Precision' is the ratio of a to (a + b), or the proportion of relevant
material retrieved to all material retrieved. Retrieval of irrelevant material
(ie b) lowers this ratio. Here, the ratio is 4 / 10 = 40%.
- 'Recall' is the ratio of a to (a + c), or the proportion of relevant
material retrieved to all relevant material retrievable. Failure to retrieve
relevant material (ie c) lowers this ratio. Here, the ratio is 4 / 6 ≈ 67%.
Ideally, both precision and recall should be as close to 1/1 (ie 1, or 100%)
as possible. We want to retrieve all relevant documents (perfect recall) and no
irrelevant documents (perfect precision).
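As a quick worked check of these figures, here is a minimal sketch in Python
(the variable names a, b and c follow the cells of the table above):

    def precision_recall(a, b, c):
        """a = relevant retrieved (hits), b = irrelevant retrieved (false
        drops), c = relevant not retrieved (misses)."""
        precision = a / (a + b)
        recall = a / (a + c)
        return precision, recall

    p, r = precision_recall(a=4, b=6, c=2)
    print(f"precision = {p:.0%}, recall = {r:.0%}")  # precision = 40%, recall = 67%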
See Precision and recall (in Greenleaf, Mowbray and Lewis, cited above).
For a lengthier discussion, see Dabney 'IV. Key Ideas in the Evaluation of
Document Retrieval Systems' (in Dabney, cited below).
The principal problem in the science of information retrieval is caused by the
simple fact that no information retrieval system delivers both perfect recall
and perfect precision.
Many who have studied information retrieval have alleged that there is an inverse
relationship between precision and recall: the more we do to improve precision,
the more recall drops; the more we do to increase recall, the more precision drops.
Among others, this is asserted by Pao (reference below, p12), and suggested by
Blair and Maron (reference below). Summarising 30 years of information retrieval
research, Pao says 'recall and precision are bound to lie within the 40 to 60
percent range', but says that the inverse relationship is often noted but not proven.
This leads to two main issues:
- How do we define the most desirable 'balance' between precision and recall
(if you can't have 100% of each)? Does the answer depend on the type of
documents, and the purpose of the search task?
- What can we do to change retrieval systems so that both precision and
recall improve?
Bing classifies the search features of text retrieval systems as 'recall devices'
or 'precision devices' (Jon Bing in an extract in LAWS 4609 Course Materials 1996
#3 p37 at the Law Reserve Desk), depending on whether their use enhances recall
or precision. For example, truncation and thesauri enhance recall, as does use
of synonyms (and the OR connector), whereas use of the AND connector, and limiting
a search to specified databases or parts of databases (eg headnotes), improves
precision.
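The effect of the two connectors can be sketched against a toy inverted index
(the index contents here are invented for illustration): OR takes the union of
the documents containing each term, while AND takes the intersection.

    def search_or(index, terms):
        """Recall device: documents containing ANY of the terms (union)."""
        return set().union(*(index.get(term, set()) for term in terms))

    def search_and(index, terms):
        """Precision device: documents containing ALL of the terms (intersection)."""
        sets = [index.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()

    # Toy index: word -> set of document numbers (illustrative)
    index = {"lease": {1, 2, 3}, "tenancy": {2, 4}}
    print(search_or(index, ["lease", "tenancy"]))   # {1, 2, 3, 4} - broader set, higher recall
    print(search_and(index, ["lease", "tenancy"]))  # {2} - narrower set, higher precision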
The following materials are not available on the internet but provide valuable
discussion:
- Howard Turtle 'Text retrieval in the legal world' - 3. Evaluation in IR
Artificial Intelligence & Law, (1995) Vol 3 Nos 1-2 (see LAWS 4609
Course Materials 1996 #3 p46 at the Law Reserve Desk).
- Miranda Pao `Evaluation and Measurement' extracts from Chapter 11, Concepts
of Information Retrieval, Libraries Unlimited, Colorado, 1989 (see LAWS
4609 Course Materials 1996 #3 p34 at the Law Reserve Desk)
D Blair and M E Maron 'An evaluation of retrieval effectiveness for a full-text
document retrieval system' (1985) 28(3) Communications of the ACM 289 (see
LAWS 4609 Course Materials 1996 #3 p38 at the Law Reserve Desk for an extract)
is the most famous and controversial article in the field of text retrieval and
law. This is simply because it is one of very few large-scale empirical studies
in text retrieval, and just about the only one related to legal information.
Blair and Maron attacked conventional wisdom by arguing that recall was much
lower than the 75% recall that users of this system (lawyers) demanded and thought
they were getting (because that is when they stopped searching). They were in
fact only getting 20% recall, but with 75% precision. This gives some support
for the alleged inverse precision/recall relationship.
Some additional comments on the experiment:
- It is one of few experiments in a very large real-life legal database
(40,000 documents, 350,000 pages) and shows the care and expense involved in
serious information retrieval experiments (which is why there are relatively few).
- Some aspects of the experiment are questionable. Relevance [both
A and C] was determined by the users themselves (blind controls on samples)
(see p29). To determine recall they couldn't test the relevance of all
non-retrieved documents, so they only tested samples; it is unclear whether they
only compared what was retrieved from the samples with what wasn't (p15),
but whatever they did, this approach may have been flawed because (i) the
'sample frames' that were tested were those 'rich in relevant documents',
which must skew the result against recall if searches are over the whole
database; and (ii) if they only compared part of a database they are using
'unrealistically small sample sizes', like others they criticise. However,
they insist that their experiment 'gives a maximum value for recall for each
result' (p15) - but we might question whether in fact it minimises recall.
- Were the searches a fair test? They seem to assume that the lawyers
/ paralegals were being rational (from experience) in keeping their queries
as they did. The searchers often used 5 concept terms connected by AND, because
they thought that to broaden the search would risk 'output overload'. However,
if this is how they searched, then unless they used synonyms extensively,
recall must be low. Blair and Maron's defence of the searchers used is essentially
that these were the best and most experienced searchers available - if they
can't use free text retrieval, who could?
- They assert that recall also 'decreases as the size of the database
increases' (p31), but don't explain why. They refer to 'output overload'
as a 'frequent problem of full-text retrieval systems' (though that is the
exact reverse of their results!) and say that users counter it by using the
AND operator to limit the number of documents retrieved.
- They conclude that the assumptions underlying full text retrieval are
wrong, and that earlier studies supporting the effectiveness of full
text retrieval were wrong because they were not based on large databases.
It is possible that Blair & Maron's experiment is mainly significant in showing
that Boolean retrieval or 'exactness' retrieval (as Bing calls it) without
a relevance ranking method gives psychological encouragement to attempt to reduce
large potentially relevant sets by inappropriate means. The use of relevance ranking
to overcome the deficiencies of Boolean retrieval will be discussed in later parts
of this topic.
There is considerable dispute, prompted to a large extent by Blair and Maron's
experiment, over what lawyers actually want and demand from computerised
retrieval: do they prefer to maximise recall or precision? Dabney and Burson
have contributed to this debate.
Daniel Dabney 'The Curse of Thamus' - see 'Ramifications for the Users
of CALR Systems' in 'The Curse of Thamus: An Analysis of Full-Text Legal
Document Retrieval' (1986) 78(5) Law Library Journal 5-40 (was on Yale Law
School web site - no longer on the web)
Some notes on Dabney's approach (GG):
- He accepts (like Blair & Maron) that lawyers want recall above all else,
but unlike B&M he is talking about case research; he considers they will
accept relatively low precision.
- He gives reasons why case/statute systems may behave differently from litigation
support systems in text retrieval performance, but the factors he gives go
both ways. The most sensible conclusion is that another experiment is needed
to prove anything significant about legal research systems (as opposed to
litigation support systems).
- He identifies 4 types of queries where computerised retrieval is most effective,
particularly fact retrieval and citation retrieval.
- He makes 5 suggestions as to how retrieval performance may be improved, but he
takes a somewhat defeatist approach, and seems to largely accept B&M's
conclusions, even for research systems.
Scott F Burson 'A reconstruction of Thamus: comments on the evaluation
of legal information retrieval systems' (1987) 79 Law Library Journal 134 (see
LAWS 4609 Course Materials 1996 #3 p94 at the Law Reserve Desk)
Some notes on Burson's approach (GG):
- He starts by interpreting Dabney as saying that Lexis and Westlaw probably
only provide 20% recall - and that this 'conjecture' is 'almost certainly
correct'. In doing so he accepts the similarity of the use of text retrieval
for litigation support (as by Blair and Maron) and for legal research - and
this is a dubious assumption.
- He says Dabney is mistaken in arguing for improved recall, as this will degrade
performance. He argues that lawyers want precision, not recall, in computerised
tools - but since these are not the only tools available, they can obtain
recall by other means.
- He argues high recall was necessary in the B&M study, as an information
retrieval system was the only method of recall available. He then argues this
isn't necessary in case/statute systems, simply because other (manual)
research tools are also available which can complement the computerised tools.
- He asserts that changing tools to increase recall will lower precision,
but without evidence, and is against such changes. Is he just assuming an
immutable inverse relationship?
- His opinion is that lawyers prefer high precision to high recall in computerised
research tools, because they then have other tools available to obtain high
recall (eg following up citations). In doing so he simply avoids the question
of whether they think they are getting high recall, as in B&M's study,
or any evidence of whether they know what they can effectively get.
- Again, this is a defeatist attitude, with no discussion of possible improvements
to tools so as to improve recall without degrading precision, as this is assumed
to be an impossible 'will o' the wisp'. Burson simply accepts low recall in
computer-assisted legal research [cf litigation support] but isn't concerned.
As a result of these discussions, it is clear that the supposed inverse relationship
between precision and recall poses a very difficult dilemma for the use of text
retrieval systems in law. It may well be that the use of relevance ranking systems,
either in conjunction with boolean retrieval or separately from it, provides the
way out of this dilemma. This will be discussed in the subsequent readings on
legal research via the internet, where the greatest use of relevance ranking systems
has been made.
- Daniel Dabney 'A Reply to West Publishing Company and Mead Data Central
on The Curse of Thamus' (1986) 78 Law Library Journal 349 (was on Yale Law
School web site - no longer on the web)
One of the most important innovations in text retrieval, which can be used both
as an alternative and an enhancement to boolean retrieval, is relevance ranking.
Such systems are also sometimes called 'best match' or 'probability' models
(Turtle) or a 'statistical interface' (Feldman). Relevance ranking is also discussed
in Part 5 of this Reading Guide in the context of internet legal research, where
it has had the most effect.
The common feature of the various approaches to relevance ranking is the attempt
to rank documents by the probability that they will be relevant to the query,
such that the most relevant document is first in the list of retrieved documents,
the next most relevant is displayed second, and so on.
The simplest relevance ranking system would be to count the number of times
each search term occurred in a document, sum those counts across all the search
terms, and list the document with the highest total as 'most relevant'. This
would be a very crude measure.
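As a sketch of that crude measure (documents shown as pre-tokenised word
lists, an illustrative simplification):

    def crude_rank(documents, query_terms):
        """Rank by the total number of occurrences of the query terms:
        the 'most occurrences = most relevant' measure described above."""
        scores = {
            doc_id: sum(words.count(term) for term in query_terms)
            for doc_id, words in documents.items()
        }
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    docs = {
        "A": ["lease", "lease", "tenant", "notice"],
        "B": ["lease", "landlord"],
    }
    print(crude_rank(docs, ["lease", "tenant"]))  # [('A', 3), ('B', 1)]

Its crudeness is obvious: long documents and very common words dominate the
score, which is what the weightings discussed below attempt to correct.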
Howard Turtle 'Text retrieval in the legal world' Parts 5.5 and 5.8 in Artificial
Intelligence & Law, (1995) Vol 3 Nos 1-2 discusses the basic elements
of relevance ranking systems.
The main question is how to calculate the 'significance' of a particular occurrence
of a word in a document, as a means of assessing the relative likely relevance
of the document in which it occurs. The second question is how to compound the
measures of significance of each occurrence of each search term so as to give
an overall measure of significance of a document.
Turtle says that two measures of word occurrence significance are used in
most relevance ranking systems:
(i) Inverse document frequency = a measure based on the proportion of
documents in the whole database in which the term occurs, inverted so that
the rarer the term, the higher the measure. This means (very roughly) that
terms which occur in relatively fewer documents in the whole collection are
given greater weight (also called the 'discrimination value' of the term).
(ii) Within document frequency = the number of occurrences of a term
in a document divided by the number of words in the whole document. Therefore
(again in rough terms) a word that occurs a lot in a short document gets a high
score on this criterion.
One approach is then that the weighting to be given to an occurrence of a
term in a particular document is the product of these measures for that term.
As a result, for example, a term that occurs a lot in a short document (and
so has high within document frequency), but doesn't occur very often in the
whole database (and so has a high inverse document frequency) will get a very
high overall weighting.
Finally, each occurrence of a search term in a document is multiplied
by its appropriate weighting. The measure of relevance of the whole document
would then be the sum of the weights of each occurrence of each search term
in the document. So, a document which has many occurrences of search terms which
have high overall weightings will be regarded as a document high in relevance.
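A minimal sketch of this scheme, under stated assumptions: documents are word
lists, the inverse document frequency is given the common logarithmic form
log(N / df) (one of several possible formulations; Turtle does not prescribe
a particular formula here), and a document's score is the sum, over the query
terms, of within-document frequency multiplied by inverse document frequency.

    import math

    def score(document, query_terms, all_documents):
        """Sum over the query terms of (within-document frequency) x
        (inverse document frequency)."""
        n_docs = len(all_documents)
        total = 0.0
        for term in query_terms:
            tf = document.count(term) / len(document)       # within-document frequency
            df = sum(term in doc for doc in all_documents)  # documents containing the term
            if df:
                total += tf * math.log(n_docs / df)         # rarer terms weigh more
        return total

    docs = [
        ["lease", "tenant", "lease"],
        ["landlord", "notice"],
        ["lease", "notice", "landlord", "tenant"],
    ]
    query = ["lease", "tenant"]
    for doc in sorted(docs, key=lambda d: score(d, query, docs), reverse=True):
        print(round(score(doc, query, docs), 3), doc)

As the text predicts, the short document in which the query terms occur most
densely ranks first.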
Many search engines implement measures of relevance more complex than this
simple one.
AustLII implements relevance ranking in its SINO
search engine in two ways (selected from the 'Find' setting), as both an alternative
to boolean searching and as an enhancement to it:
- 'Any of these words' searches, which are based solely on
the relevance ranking of search terms, with no boolean or proximity operators
able to be used. The relevance ranking algorithm is described briefly in
Freeform Search Help ('any of these words' used to be called 'Freeform'
on AustLII).
- 'Boolean query' searches, which are in fact not just boolean queries,
but have results ordered by likely relevance. A boolean search is first carried
out, and only those documents which satisfy the boolean search (ie an 'exact
match' search) are then ranked into likely order of relevance.
The availability of relevance ranking can alter a user's search strategy using
boolean retrieval. It is often wise to use a broad boolean search in order to
maximise recall, relying upon the relevance ranking to supply the precision that
the boolean search lacks.
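A sketch of this combined strategy (an illustration of the general approach
only, not AustLII's actual SINO algorithm): a broad boolean AND filter supplies
the exact-match set, and a relevance score, here a deliberately naive one,
orders it.

    def naive_score(doc, terms):
        # Naive relevance: fraction of the document made up of query terms
        return sum(doc.count(term) for term in terms) / len(doc)

    def boolean_then_rank(documents, required_terms):
        """'Boolean query' style: exact-match filter first, then order
        the matching documents by likely relevance."""
        matches = [doc for doc in documents
                   if all(term in doc for term in required_terms)]
        return sorted(matches, key=lambda d: naive_score(d, required_terms),
                      reverse=True)

    docs = [
        ["lease", "tenant"],
        ["lease", "tenant", "notice", "lease"],
        ["landlord", "notice"],
    ]
    print(boolean_then_rank(docs, ["lease", "tenant"]))
    # [['lease', 'tenant'], ['lease', 'tenant', 'notice', 'lease']]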
There is a brief discussion of how popular internet search engines rank web
pages in
How Search Engines Rank Web Pages (on Search Engine Watch).
If a search engine uses relevance ranking in any way, it is no longer possible
to evaluate its results using simple measures of recall and precision. For example,
if the first 50 documents retrieved by a relevance-ranked search are all or most
of the relevant documents in a database, it does not much matter that the search
retrieves 100 documents and the last 50 are not very relevant, because the user
can just keep working down the list until the degree of relevance is no longer
high enough to warrant reading additional items.
Turtle (see 3.2 in his article) suggests methods of constructing a precision
/ recall curve, which shows precision values at pre-defined recall points (eg from
0.1 to 1.0 of all relevant documents). One way of thinking about such a curve
is that it measures such things as 'if a user stops browsing when precision
drops below 50% (ie only one in two documents is regarded as relevant), does
this occur when recall of only 30% of relevant documents has occurred, or not
until 90% of relevant documents have been recalled?'
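One way to construct such a curve, sketched under the assumption that the
relevance of each ranked document is known (as it would be in an evaluation
experiment), is to record precision each time a further relevant document is
found while working down the ranked list:

    def precision_recall_curve(ranked_relevance, total_relevant):
        """ranked_relevance: booleans, True where the i-th ranked document
        is relevant. Returns (recall, precision) pairs, one per relevant
        document found."""
        curve, found = [], 0
        for rank, is_relevant in enumerate(ranked_relevance, start=1):
            if is_relevant:
                found += 1
                curve.append((found / total_relevant, found / rank))
        return curve

    # Illustrative ranked result over a collection with 6 relevant documents
    ranking = [True, True, False, True, False, False, True, False, True, True]
    for recall, precision in precision_recall_curve(ranking, total_relevant=6):
        print(f"recall {recall:.2f}  precision {precision:.2f}")

In this invented example precision falls as the user reads further down the
list in pursuit of higher recall, which is the trade-off such curves make
visible.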
Westlaw's WIN and Lexis' 'Freestyle' were two pre-internet examples of so-called
'natural language' searching, which were early applications of relevance ranking
to legal documents. In the LAWS 4609 Course Materials #3 at the UNSW Law Reserve
Desk there are extracts from search manuals: WestLaw 'Retrieving documents with
WIN - WestLaw is natural' (Chapter 5 of Introduction to WestLaw (1993),
WestLaw) at p105, and Lexis 'Freestyle' searching (information from Lexis help
screens) at p110.
For discussion of evaluations, see:
- Sheilla Desert, 'Westlaw's Natural v. Boolean Searching: A Performance
Study' 85 Law Library Journal 713 (was on Yale Law School web site
- no longer on the web)
The purpose of these extracts is to act as an introduction to how hypertext and
text retrieval may be combined in unusual ways to produce a result where 'the
sum is greater than the parts'. In section 5 following, dealing with legal research
on the internet, there is a similar theme: it is necessary to use 'intellectual
indexing' (essentially structures created using hypertext) in addition to full
text retrieval, in order to get the best results.
The main reading for this section is extracts from Graham Greenleaf, Andrew Mowbray
and Peter van Dijk 'Representing
and using legal knowledge in integrated decision support systems: DataLex WorkStations'
Artificial Intelligence and Law, Kluwer, Vol 3, Nos 1-2, 1995, 97-124.
This pre-AustLII paper outlines an approach to the integration of hypertext and
text retrieval in legal applications, much but not all of which has subsequently
been implemented on AustLII. In a later reading guide, the integration of both
these technologies with inferencing systems is discussed.
It is only necessary to read the parts listed below at this stage.
Paquin, Blanchard and Thomasset 'Loge-expert: From a legal expert system to
an information system for non-lawyers' Proceedings of the 4th International
Conference on Artificial Intelligence and Law, ACM Press 1991, p254 (copy
available at the UNSW Law Reserve desk in LAWS 4609 Materials #3 (1996) Theories
of computerising law: hypertext and text retrieval at p20 - ignore the expert
systems aspects for the moment).
This article contains one of the few attempts to compare the benefits and
costs of using text retrieval, hypertext and expert systems to computerise legal
information.
Diagram of relative costs and effectiveness of different forms of computerisation
of law (following Paquin, Blanchard, and Thomasset 1991)
- By 'noise' they mean the opposite of precision (items retrieved but not
relevant). By 'silence' they mean the opposite of recall (items relevant
but not retrieved). (See the small sketch after this list.)
- These authors developed their theory (in pre-internet times) before the widespread
use of relevance ranking (which makes arguments about the high level of 'noise'
in text retrieval much harder to sustain), and before it was clear that large-scale
automation of the creation of hypertext links was possible (which dramatically
reduces the costs of creating hypertexts).
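In terms of the contingency table used earlier (a hits, b false drops, c
misses), 'noise' and 'silence' are simply the complements of precision and
recall, as this small sketch shows:

    def noise(a, b):
        """Items retrieved but not relevant, as a fraction of all retrieved (= 1 - precision)."""
        return b / (a + b)

    def silence(a, c):
        """Relevant items not retrieved, as a fraction of all relevant (= 1 - recall)."""
        return c / (a + c)

    print(noise(4, 6), silence(4, 2))  # 0.6 and 0.333..., for the earlier worked example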
There are many enhancements to basic boolean ('exact match') retrieval systems,
and quite a few alternative approaches to boolean retrieval. Some have been tested
for legal information retrieval.
Two surveys of the various approaches that have been taken to text retrieval
in law are:
Howard Turtle 'Text retrieval in the legal world' Artificial Intelligence &
Law, (1995) Vol 3 Nos 1-2 (see LAWS 4609 Course Materials #3 at the UNSW Law
Reserve Desk) is one of the most systematic surveys of pre-internet approaches
to legal information retrieval. It is often quite technical but is recommended
highly. Howard Turtle played an important role in developing the WIN (WestLaw
is Natural) relevance ranking retrieval system for WestLaw.
Eric Schweighofer
'The Revolution in Legal Information Retrieval or: The Empire Strikes Back'
- 1999 (1) The Journal of Information, Law and Technology (JILT). Schweighofer's
paper is for the most part less technical than Turtle's. Among the many projects
summarised and discussed, Schweighofer includes summaries of: