"Cheshire" Steering Group Meeting

18 April 2001

The British Library

2.00 p.m.


Chair: Derek Law

  1. Introductions

  2. Terms of Reference

  3. Summary Overview

  4. Progress Report

  5. Current Strategic Relationships

  6. Strategic Developments and Funding Strategies

  7. AOB

Terms of Reference for the Cheshire Steering Committee

  1. To advise on the project on behalf of the Higher Education and Further Education sectors and the various non-HE sectors which hold a stakeholding interest in the project, and to advise the Joint Information Systems Committee.

  2. To provide a reporting framework for the JISC, the DNER office, the National Science Foundation and related organizations.

  3. To receive periodic reports from the Principal Investigators of the "Cheshire" project.

  4. To represent the best interests of the Higher and Further Education sectors.

  5. To represent the broader interests of museums, libraries, and archival repositories, all of which have a stakeholding interest in the project.

  6. To act as an advocate for the project and its aims.

  7. To increase visibility of the project within the UK HE and FE sectors as well as the wider communities of museums, archival repositories, libraries, etc.

  8. To read and comment on the annual report and any interim reports.

The Steering Committee will meet periodically and will be chaired by Derek Law.

Current Members:

Derek Law

Reg Carr

Julia Chruszcz

John Perkins

Sheila Anderson

David Dawson

Lynne Brindley

Paul Miller

Andy Powell

Catherine Grout

Frances Thomson

Principal Investigators:

Ray Larson

Paul Watry


Summary Overview

This is an update to the first annual report for the JISC/NSF funded project: "Cross-Domain Resource Discovery: Integrated Discovery and use of Textual, Numeric, and Spatial Data", also known as the Cheshire project. The second annual report is due August 2001, and this update is being presented to the Steering Group for consideration at the meeting on 18 April 2001.

The primary objective of the project is to produce a next-generation online information retrieval system based on international standards. We originally proposed to expand the current system ("Cheshire") and reimplement the basic components as CORBA distributed objects using both Java and C++. This proposal was modified to take account of new technological developments, as outlined in the first annual report: in particular, it was decided not to implement CORBA (since server-to-server interoperability can be provided through other means) and to develop the client as a browser rather than as a Java application.

The result will be a distributed architecture in which the different components of the system can be recomposed on demand to different configurations. This will permit the dynamic implementation of different processing and retrieval methods as appropriate to given domains.

Current Strategic Relationships

Since the annual report was issued (August 2000), the Cheshire system has been adopted by a number of large-scale services in the United Kingdom as well as the rest of Europe. These include: the DIEPER project (IST), NESSTAR (IST), MASTER, ZETOC, the Archives Hub, the RSLP Palaeography project, the Natural History Museum, and the metadata database of JISC data services hosted by MIMAS.

We have also set up a number of Cheshire servers and clients to support the national infrastructure for various thesauri and name authority control access mechanisms. These include support for the Library of Congress Subject Headings (LCSH), the UNESCO thesaurus, and the NRA authority files for Personal Names, Corporate Names, and Family Names. This infrastructure is maintained for the general benefit of the library, museum, and archive domain, is free for all to use, and has already been incorporated into online data input forms (e.g. the Archives Hub).

During 2001 we expect to use the system to host all the data of the Online Archive of California, as that service migrates from proprietary software (DynaWeb) to standards-based software (Cheshire); and we also expect to use Cheshire to host the MELVYL bibliographic database.

We are also developing new uses for Cheshire to demonstrate how the Z39.50 protocol can be extended to support applications in the non-bibliographic domain. During 2001 we released "Quick Step", a fully implemented Z39.50 email/archiving tool. This is free for anyone to use for non-commercial purposes. During 2001/2 we expect to be releasing another Z39.50 utility which can be used to index web pages.

To provide further information about Cheshire and its use in the United Kingdom (and the rest of Europe) we have set up a "Cheshire Resources" page which provides links to the services and tools enumerated above (this is in addition to the site home page). This page is available at http://gondolin.hist.liv.ac.uk/~cheshire and has improved documentation for use of the Cheshire software. In addition, we have set up and are maintaining a Cheshire mailing list for developers and users of the system. This has been archived using the "QuickStep" tool; the archive and the tool are publicly available via the "Cheshire Resources" page.

It may be appreciated that Cheshire is being deployed by many more sites than originally envisaged when the JISC/NSF application was submitted; but we believe strongly that such wide use is a vindication of the entirely standards-based approach that we have taken. We have tried to balance the development tasks with select support for use within the HE sector, as we believe that only through real applications of the Cheshire system can we come across problems needing to be solved or development tasks that need to be refined.

Recent Publicity Activities

In March 2001 we gave a seminar about the Cheshire project at UKOLN, attended also by staff of the Institute for Learning and Research Technologies (ILRT, Bristol). This proved to be a very interesting event, since it gave us a chance to find out more about the activities of UKOLN and to determine more fully any role that the Cheshire project could play in the national infrastructure of digital library services (which is the ultimate aim of the project). It was interesting to note developments of the Open Archives Initiative (OAI) and, more broadly, infrastructural developments with various RSLP projects, particularly BACKSTAGE. It remains an open question whether the Open Archives Initiative, which is layered on top of HTTP, will be adequate for supporting the national infrastructure in the ways already possible via use of Z39.50. This was debated by seminar participants, who expressed various points of view.

We will be closely involved with UKOLN-related activities and have recently lent our support to the UKOLN application for "A National Focus for Collection Level Description Work in the UK". Similarly, we expect to have increased involvement with ILRT, who will be trialling the Cheshire system to see how it compares with other Z39.50-compliant systems and other forms of data distribution. We have suggested the development of a Z39.50 attribute set for the RSLP Collection Schema which could be used to support RSLP services and resources. This may be used for the RSLP project BACKSTAGE. As a result of the UKOLN seminar, we are corresponding with Peter Cliff from UKOLN, who is the Systems Developer for the Resource Discovery Network, about possible uses of Cheshire for the RDN.

Strategic Developments

The changes to the planned development of the Cheshire client were outlined in the first annual report, which also contains a fuller discussion of the development issues. We expect to have a beta version of the new Cheshire client ready for use by October 2001, as per our deliverable list. This is based on an open source Mozilla browser and, when complete, the client should be layered seamlessly over versions of Netscape. EXPLAIN handling is now implemented, and a basic search interface is generated either from strobe or EXPLAIN. At present, the Cheshire browser displays "raw" unformatted records, but we will be introducing capabilities for style sheets to be downloaded from the XML/SGML data. The Cheshire browser is cross-platform, and works under Solaris, Linux, and Windows. (Support for Macintosh OS X has not been prioritized, but is expected to be operational before the end of the project.)

Due to the increased use of the Cheshire system (see above) we have developed (and are developing) a number of client-side records management utilities which were not originally specified in the JISC/NSF proposal. These include, for example, the capabilities of re-editing submitted records, etc., which users have requested.

We have also implemented retrieval and display of "components" of large XML or SGML documents. This will enable us to deliver either entire documents, or only selected components of those documents, ranked in order of relevance. This was a very large development task, but one which is necessary for optimal retrieval of SGML/XML documents. To our knowledge, this is the first time this capability has been implemented within a Z39.50-based system.
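The idea of component-level retrieval can be illustrated with a short sketch. This is not the Cheshire implementation: the function names, the sample record, the choice of `c01` as the component tag, and the simple term-count scoring are all assumptions made for illustration, and Python's standard XML parser stands in for the project's own SGML/XML handling.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_components(xml_text, component_tag, query):
    """Return (score, text) pairs for matching components, best first.

    Scoring is a plain count of query-term occurrences (an assumption;
    a production system would use a proper ranking model).
    """
    root = ET.fromstring(xml_text)
    query_terms = set(tokenize(query))
    ranked = []
    for elem in root.iter(component_tag):
        text = " ".join(" ".join(elem.itertext()).split())
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            ranked.append((score, text))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Hypothetical finding aid with three components.
SAMPLE = """<ead>
  <c01><unittitle>Letters on shipping</unittitle><p>Shipping ledgers and shipping letters.</p></c01>
  <c01><unittitle>Estate maps</unittitle><p>Maps of the estate.</p></c01>
  <c01><unittitle>Port records</unittitle><p>Records of shipping at the port.</p></c01>
</ead>"""

for score, text in rank_components(SAMPLE, "c01", "shipping"):
    print(score, text)
```

Only the components that match the query are returned, so a client could display the best-matching parts of a long document rather than the whole record.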

In testing the new client, we have found that a number of the commercially based Z39.50 systems lack many elements of the protocol, at least to a minimum version 3 standard. For example, the EXPLAIN database capability, which would normally be expected in any implementation, is often not supported. Similarly, there is often no support for SCAN, SORT, etc., even though we view these as a fundamental prerequisite for the establishment of a national distributed library infrastructure.

As a result, we strongly believe that the standard procurement process for digital libraries in the United Kingdom should be the subject of guidelines, for example based on the original procurement document for the AHDS. There also needs to be some way of determining whether these requested capabilities are actually being supported in a non-proprietary manner by the vendor. There seems to be no national validation service, even though considerable sums of money are being expended on Z39.50 systems. It may be that some form of "trip test" is required to ensure that commercial vendors are actually supporting the Z39.50 protocol in a non-proprietary manner, as is required for interoperable digital library services.
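The simplest check that such a test might begin with can be sketched as follows. This is an illustration only, not a specification for a validation service: the function name and the assumption that a server's supported services are available as a plain list of names are hypothetical, and only the service names EXPLAIN, SCAN, and SORT are taken from the discussion above.

```python
# Services named above as prerequisites; a real test suite would cover far more.
REQUIRED_SERVICES = ("EXPLAIN", "SCAN", "SORT")


def missing_services(supported):
    """Return the required services absent from a server's reported list."""
    reported = {name.strip().upper() for name in supported}
    return [name for name in REQUIRED_SERVICES if name not in reported]


# Hypothetical server that reports only search, present, and scan.
print(missing_services(["search", "present", "scan"]))  # ['EXPLAIN', 'SORT']
```

A list of claimed services is of course not proof of conformance; a fuller test would exercise each service against known data and compare the responses.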

Progress has been somewhat slower for the development of the Java-based Cheshire III server, due in part to the resignation of a staff member but, more importantly, to a number of unanticipated technical problems with the JavaSpaces technology. The technical problem encountered was one of performance: JavaSpaces transactions take considerably longer to complete than indicated by the literature. The solution is to retain SDLIP and Z39.50 as the primary interaction mechanisms for the distributed elements of the system. This will provide the performance needed for the system.

Currently we are experimenting with distributed search and retrieval, including the construction of collection "surrogates" derived from the Z39.50 SCAN services. This work is being reported on at the Joint Conference on Digital Libraries in June, and presented at the SIGIR meeting in New Orleans this September.

In addition to this work on the Cheshire III server, we have implemented a number of additions and changes to the existing Cheshire II server which will be rolled into the Cheshire III development. These will be circulated to the Cheshire Steering Group.

The project so far has focused on the design and performance objectives. We intend to test the system against datasets in the testbed during the final year of the project. The goal will be to provide a working prototype allowing the user to search across different data types, including EAD, MARC, CODEBOOK, etc. We intend to be able to provide cross-searching capabilities for the different metadata implied by these various data types.

In addition to satisfying the immediate design objectives of the project, the base architecture could also provide a platform for the resolution of some real (and neglected) digital library problems involving metadata reuse. These include:

  1. Methods of accessing text and numeric information through use of its geographic focus and extent

  2. Improved access to geo-data/timeline via document and database entry

  3. Techniques for authority control of events, for time-lines, and for the generation and display of ad hoc time-lines of events relating to any given theme

  4. Research, development, and implementation of support for a number of neglected digital library problems: time adjacency searching; place-name disambiguation (relating place names to gazetteer entries); geographical adjacency searching; graphic display by time and place

Such development tasks are beyond the scope of the present funding, but we are developing the software with the idea that in future it could be extended to support such tasks.

The first Annual Report outlines what may be the most important design element of the Cheshire project with regard to the use of national (and international) digital library services for information discovery:

Our initial work in Cross-Domain Resource Discovery has concentrated on using the facilities of the Z39.50 protocol to implement what we are calling a "Meta-Search" capability using existing Z39.50 servers and resources. We are taking a new approach to building this meta-search based on Z39.50. Many existing attempts at distributed search and resource discovery (including the current AHDS implementation) have relied on broadcasting of search requests to all servers making up the distributed resources to be searched. There are a number of practical problems with this approach. One of the chief drawbacks is that all systems must be searched, before the user or the search controller can determine which systems are most likely to provide the results that the user is seeking.

Instead of using broadcast search we are using the SCAN service of Z39.50 servers to build combined indexes containing information "harvested" from the individual servers. The SCAN service permits Z39.50 requests directly to server indexes, and returns results containing index information, including the words or keys in the index along with their frequency of occurrence in the database. With this information, index data from many servers and databases can be combined, and statistical ranking methods can be used to rank those servers and databases according to the probability that they contain relevant information for a given user query.
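A minimal sketch of this approach is given below. It makes several assumptions: the SCAN results are simulated as in-memory dictionaries of term frequencies rather than fetched over Z39.50, the server names and figures are invented, and the weighting formula is a simple stand-in chosen for illustration, since the statistical ranking method actually used is not specified here.

```python
import math
from collections import defaultdict


def build_meta_index(scan_results):
    """Invert per-server SCAN output into term -> {server: frequency}."""
    meta_index = defaultdict(dict)
    for server, term_freqs in scan_results.items():
        for term, freq in term_freqs.items():
            meta_index[term][server] = freq
    return meta_index


def rank_servers(meta_index, n_servers, query_terms):
    """Rank servers by a simple frequency-based score, best first."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = meta_index.get(term, {})
        if not postings:
            continue
        # Terms held by fewer servers discriminate better between servers.
        weight = math.log(1 + n_servers / len(postings))
        for server, freq in postings.items():
            scores[server] += weight * math.log(1 + freq)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical harvested SCAN data: term frequencies per server.
scan_results = {
    "server-a": {"charter": 40, "manuscript": 120, "census": 2},
    "server-b": {"census": 300, "survey": 150},
    "server-c": {"manuscript": 5, "fossil": 80},
}

meta_index = build_meta_index(scan_results)
for server, score in rank_servers(meta_index, len(scan_results), ["census", "survey"]):
    print(server, round(score, 2))
```

Only the servers that hold at least one query term appear in the ranking, so the query need be sent only to the most promising servers rather than broadcast to all of them.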

The Z39.50 SCAN service is included in all Cheshire servers, permitting us to test this method. We are currently implementing a special indexing mode for the Cheshire II system, which will take a list of servers and use the Explain and SCAN services to build combined "Meta-Indexes" for those servers. Once this capability comes online it will be easy for any Cheshire server to function as a Meta-Search server for some group of other servers. We plan to make this facility one that can be recursively executed, so that hierarchies of Meta-Search servers can be constructed (opening the possibility for layers of topically-oriented Meta-Search servers, as well as global servers that summarize index information from each of the lower layers in the search hierarchy).
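The recursive arrangement described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Cheshire II indexing mode: the class names, server names, and term frequencies are hypothetical, index summaries are held in memory rather than harvested via Explain and SCAN, and routing simply follows the child with the highest summed term frequency.

```python
from collections import Counter


class LeafServer:
    """An ordinary server whose index summary is its own term frequencies."""

    def __init__(self, name, term_freqs):
        self.name = name
        self.term_freqs = Counter(term_freqs)

    def summary(self):
        return self.term_freqs

    def route(self, query_terms):
        return self.name


class MetaServer:
    """A server that summarizes the indexes of the servers beneath it."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def summary(self):
        # Merge child summaries; children may themselves be MetaServers.
        total = Counter()
        for child in self.children:
            total.update(child.summary())
        return total

    def route(self, query_terms):
        # Descend into the child whose summary best matches the query.
        best = max(
            self.children,
            key=lambda child: sum(child.summary()[t] for t in query_terms),
        )
        return best.route(query_terms)


# Hypothetical two-level hierarchy: topical servers under a global server.
topic_one = MetaServer("topic-one", [
    LeafServer("server-a", {"charter": 40, "letter": 90}),
    LeafServer("server-b", {"manuscript": 70, "charter": 10}),
])
topic_two = MetaServer("topic-two", [
    LeafServer("server-c", {"fossil": 80, "specimen": 200}),
])
global_server = MetaServer("global", [topic_one, topic_two])

print(global_server.route(["charter"]))
print(global_server.route(["fossil"]))
```

Because a MetaServer exposes the same `summary` and `route` operations as a leaf, hierarchies of any depth can be built from the same two classes.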

Developing this capability will radically improve the performance of existing distributed library resources, will enable users to access any indexed element of any database, and will foster more focused discovery of this material (even with distributed libraries containing hundreds or thousands of nodes), particularly within the archives, museum, and library domains.