Cross-Domain Resource Discovery: Integrated Discovery and Use of Textual, Numeric, and Spatial Data – Annual Report 1 October 2000 – 30 September 2001
This is the second annual report for the JISC/NSF funded project: "Cross-Domain Resource Discovery: Integrated Discovery and use of Textual, Numeric, and Spatial Data", also known as the Cheshire project.
One of the primary goals of the project has been to develop (and implement with real databases) a search and retrieval system based on international standards that is capable of searching cross-domain resources held in multiple locations. This goal has been achieved, although even more resources and services have yet to be added to the system, and truly seamless integration of these services will require further development of additional “middleware”. We have developed and demonstrated technology for “harvesting” distributed databases and integrating the information in them into a hierarchical set of database selection servers that can be scaled to very large size without loss of performance.
The prioritized objective of the project has been to produce a next generation online information retrieval system based on international standards. We originally proposed to expand the current system ("Cheshire") and reimplement the basic components as CORBA distributed objects using both Java and C++. This proposition was modified to take account of new technological developments, as outlined in the first annual report: in particular it was decided not to implement CORBA (since server-to-server interoperability can be provided through other means) and it was decided to develop the client as a browser rather than as a Java application.
The result will be a distributed architecture in which the different components of the system can be recomposed on demand to different configurations. This will permit the dynamic implementation of different processing and retrieval methods as appropriate to given domains.
Current Strategic Relationships
Since the first annual report was issued (August 2000), the Cheshire system has been adopted by a number of large scale services in the United Kingdom as well as the rest of Europe. These include: the DIEPER Digitized European Periodicals project. See: http://gdz.sub.uni-goettingen.de/dieper/ ), NESSTAR (Networked Social Science Tools and Resources. See http://www.nesstar.org/, Now being continued as FASTER – Flexible Access to Statistics Tables and Electronic Resources. See http://www.faster-data.org/ ), MASTER (Manuscript Access through Standards for Electronic Records. See http://www.cta.dmu.ac.uk/projects/master/ ), ZETOC (Prototype for a new version of the Electronic Table of Contents from the British Library. See http://zetoc.mimas.ac.uk/ ), the Archives Hub (See http://www.archiveshub.ac.uk/ ), and the RSLP Palaeography project (See http://www.palaeography.ac.uk/). The system is also being used by the British Natural History Museum in London to develop new web-based services and as a prototype implementation for the metadata database of JISC data services hosted by MIMAS (Manchester Information and Associated Services).
Cheshire is currently being implemented as a service for the Resource Discovery Network (RDN), where it will be used to harvest RDN records from the various hubs using OAI and make them available for the DNER through the Cheshire Z39.50 target. (See below "Integrating OAI and RSS metadata protocols into Z39.50 for the DNER" for further discussion.)
The Arts and Humanities Data Service (AHDS) is now seriously considering the use of Cheshire for all of its related services. The research and development programme taken forward as part of the JISC/NSF initiative has provided both of these services with the basis for seamlessly searching across datasets and using advanced information retrieval techniques to get the users to the information they require.
We have also set up a number of Cheshire servers and clients to support the national infrastructure (in the U.K.) for various thesauri and name authority control access mechanisms. These include: support for the Library of Congress Subject Headings (LCSH), the UNESCO thesaurus, and the NRA authority files for Personal Names, Corporate Names, and Family Names. This is maintained for the general benefit of the library, museum, and archive domain, is free for all to use, and has been incorporated already into online data input forms (e.g. the Archives Hub).
During late 2001 and 2002 we expect to use the system to host all the data of the Online Archive of California, as that service migrates from proprietary software (DynaWeb) to standards-based software (Cheshire); and we also expect to use Cheshire to host a version of the MELVYL bibliographic database and part of the COPAC database. We have currently a testbed comprising OAC and COPAC datasets, and the MELVYL database has been copied to a server in Berkeley.
We are also developing new uses for Cheshire to demonstrate how the Z39.50 protocol can be extended to support applications in the non-bibliographic domain. During 2001 we released "Quick Step", a fully implemented Z39.50 email archiving tool. This is free for non-commercial or academic use.
To provide further information about Cheshire and its use in the United Kingdom (and elsewhere) we have set up a "Cheshire Resources" page which provides links to the services and tools enumerated above (this is in addition to the site home page). This page is available on: http://gondolin.hist.liv.ac.uk/~cheshire and has improved documentation for use of the Cheshire software. In addition, we have set up and are maintaining a Cheshire list-serve for developers and users of the system. This has been archived using the "QuickStep" tool; the archive and tool is publicly available via the "Cheshire Resources" page.
It may be appreciated that Cheshire is being deployed by many more sites than originally envisaged when the JISC/NSF application was submitted; but we believe strongly that such wide use is a vindication of the entirely standards-based approach that we have taken. We have tried to balance the development tasks with select support for use within the HE sector, as we believe that only through real applications of the Cheshire system can we come across problems needing to be solved or development tasks that need to be refined.
Recent Publicity Activities
In March 2001 we gave a seminar about the Cheshire project at UKOLN, attended also by staff of the Institute for Learning and Research Technologies ILRT (Bristol). This proved to be a very interesting event, since it gave us a chance to find out more about the activities of UKOLN and to determine more fully any role that the Cheshire project could play in the national infrastructure of digital library services (which is the ultimate aim of the project). It was interesting to note developments of the Open Archives Initiative (OAI) and, more broadly, infrastructural developments with various RSLP projects, particularly BACKSTAGE.
We will be closely involved with UKOLN related activities and have recently lent our support to the UKOLN application for "A National Focus for Collection Level Description Work in the UK". Similarly we expect to have increased involvement with ILRT who will be trailing the Cheshire system to see how it compares with other Z39.50 compliant systems and other forms of data distribution. We have suggested the development of a Z39.50 attribute set for the RSLP Collection Schema which could be used to support RSLP services and resources. This may be used for the RSLP project BACKSTAGE. As a result of the UKOLN seminar, we are corresponding with Peter Cliff from UKOLN, who is the Systems Developer for the Resource Discovery Network, about possible uses of Cheshire for the RDN.
Information on the database extraction and resource discovery method developed for this project was presented at the IEEE/ACM Joint Conference on Digital Libraries in Roanoke, Virginia in late June 2001 (Ray R. Larson, “Distributed Resource Discovery: Using Z39.50 to Build Cross-Domain Information Servers” See http://cheshire.berkeley.edu/papers.html ).
The changes to the planned development of the Cheshire client and server were outlined in the first annual report, which contains also a fuller discussion of the development issues. We expect to have a beta-version of the new Cheshire client ready for use by October 2001, as per our deliverable list. This is based on an open source Mozilla browser and, when complete, the client should be layered seamlessly over versions of Netscape. EXPLAIN handling is now implemented; and a basic search interface is generated either from strobe or EXPLAIN. At present, the Cheshire browser displays "raw" unformatted records, but we will be introducing capabilities for style sheets to be downloaded from the XML/SGML data. The Cheshire browser is cross-platform, and worked under Solaris, Linux, Windows OS, and Macintosh OS X.
Due to the increased use of the Cheshire system (see above) we have developed (and are developing) a number of client-side records management utilities which were not originally specified in the JISC/NSF proposal. These include, for example, the capabilities of re-editing submitted records, etc., which users have requested.
We have also implemented retrieval and display of "components" of large XML or SGML documents. This will enable us to deliver either entire documents, or only selected components of those documents based in order of relevance. This was a very large development task, but one which is necessary for optimal retrieval of SGML/XML documents. To our knowledge, this is the first time this capability has been implemented within a Z39.50 based system.
In testing the new client, we have found a number of the commercially based Z39.50 systems lack many elements of the protocol, as least to a minimum version 3 standard. For example, the EXPLAIN database capability, which would normally be expected in any implementation, is often not supported. Similarly, there is often no support for SCAN, SORT, etc., even though we view these as a fundamental prerequisite for the establishment of a national distributed library infrastructure.
As a result, we strongly believe that the standard procurement process of digital libraries in the United Kingdom should be the subject of guidelines, for example based on the original procurement document for the AHDS. There also needs to be some way of determining whether these requested capabilities are actually being supported in a non-proprietary manner by the vendor. There seems to be no national validation service, even though considerable sums of money are being expended on Z39.50 systems. It may be that some form of "trip test" may be required to ensure that commercial vendors are actually supporting the Z39.50 protocol in a non-proprietary manner, as is required for interoperable digital library services.
Progress has been somewhat slower for the development of the java-based Cheshire III server due in part to a resignation of a staff member; but, more importantly, to a number of unanticipated technical problems with the JavaSpaces technology. The technical problems encountered were largely issues of performance: JavaSpaces transactions take considerably longer to complete than indicated by the literature. The solution for this is to retain SDLIP and Z39.50 as the primary interaction mechanisms for the distributed elements of the system. This will provide the performance needed for the system while ensuring continued compatibility with legacy systems supporting these protocols. However, we are making progress on development of the “next-generation” java-based server and expect to have a basic working version by the end of 2001. In addition to this work on the Cheshire III server, we have implemented a number of additions and changes to the existing Cheshire II server which will be rolled into the Cheshire III development. This work has focussed on adding additional features needed for the various services currently (and prospectively) using the system. The features added since the last report include:
These and future additions to the existing c-based Cheshire II server are now being used in services such as the Archives Hub and new requests for enhancements are being circulated to the Cheshire Steering Group for prioritization of the development process.
Currently we are experimenting with distributed search and retrieval, including the construction of collection "surrogates" derived from the Z39.50 SCAN services. This work was reported on at the Joint Conference on Digital Libraries in June, and will be demonstrated at the SIGIR meeting in New Orleans this September. A paper describing the efficiency (very good) and effectiveness (apparently also very good) of this method is being prepared for publication in the journal literature.
The project so far has focused on the design and performance objectives. We intend to test the system against datasets in the testbed during the final year of the project, and are currently testing against serveral large scale datasets, including COPAC (portion), Archives Hub, Online Archive of California, History Data Service, etc.. The goal will be to provide a working prototype allowing the user can search across different data, including EAD, MARC, CODEBOOK, etc. We intend to be able to provide cross-searching capabilities for the different metadata implied by these various data types.
Development of the Cheshire III Client:
The development of the Cheshire III client as a browser is progressing and is expected to be completed by the end of October 2001. We are currently using the open source third party project called "Protozilla", which has created a set of handlers to allow further protocols to be embedded within Netscape/Mozilla. We expect these handlers will be incorporated within the base Netscape/Mozilla distribution in the near future.
We are exploiting this package to allow Netscape/Mozilla and the existing Cheshire code to be glued together in a standardized way, rather than to implement a proprietary series of linking routines and duplicating the work of others.
The use of Cheshire and Protozilla within the Netscape/Mozilla framework permits us to maintain the state of multiple Z39.50 sessions at once, rather than the usual "fire and forget" method of http, which has resulted in the invention of proprietary, inefficient techniques such as Cookies and large server side activities databases.
Although the browser client will be the primary Cheshire III client, our strategy dictates that we maintain a number of different clients for different types of users. These include the existing Cheshire web client, a tcl client, and (in future) a Cheshire layer of MVD (Multi-Valent Document), an extensible document structure written as a Java application. However, the Cheshire III browser client is expected to be the most efficient and functional of all these and the one which we expect will be most used.
The Microsoft Corporation has recently dropped support for Java in its new products, including the new Windows XP operating system. We feel this validates our original decision to develop and implement the Cheshire III client as a browser in a way which conforms completely to international standards. Although we expect to develop support for Cheshire within a Java application (MVD), this will be a slower, less functional version.
Related JISC Funded Projects
Cheshire now fully supports all of the development aims of the Join-Up and Z-Gate Projects funded through the JISC 5/99 call. In many cases, the system can now extend the capabilities originally envisaged for Join-Up and Z-Gate projects and is incorporating these capabilities for national services throughout the United Kingdome. Although Cheshire's current funding from JISC/NSF is primarily purely research led, its widespread adoption by these and other services represents considerable return on investment.
In addition, the research and development aspects of Cheshire are expected to make an original contribution to the JISC funded HILT (High-Level Thesaurus) project, which aims to research, report, and make recommendations on the problems of cross-searching and browsing by subject across a range of services and data types. In particular, our JISC/NSF funding has enabled us to prototype enhanced support for metadata and vocabulary in a cross-domain context without the need for handcrafting expensive cross-walks between different thesauri. We are about to implement this as part of the Archives Hub service and plan to implement these techniques for the Resource Discovery Network, the Arts and Humanities Data Service, and various other JISC datasets. It is expected that the techniques evolved from our pure research project will, practically speaking, leverage existing invements in metadata by making existing metadata more accessible; by generating new metadata through automatic metadata assignments; and by transferring existing metadata to apply to other, additional datasets. (See section below on Entry Vocabulary Modules.)
Enabling the DNER: The Role of the "Cheshire" Information Retrieval System.
On 14 June 2001 we responded to the document "The DNER Technical Architecture" by issuing a statement defining a role of the "Cheshire" information retrieval system within the emerging architecture of the DNER. This has been published on our web site and outlines how the research and development objectives funded through JISC/NSF could make a contribution to the emerging information landscape within the United Kingdom.
Combined, these capabilities will greatly simplify the process of access and discovery for end users.
Effective searching of multiple Z39.50 targets within the DNER
One of the primary outcomes of the research funded through JISC/NSF has been the development of techniques enabling effective searching of multiple Z39.50 targets. To date, service providers have relied upon simultaneous "broadcasting" of search requests to all servers (or "targets") making up the distributed sources to be searched.
One of the chief drawbacks of such "broadcast searches is that all systems must be searched before the user or search controller can determine which systems are most likely to provide the results that the user is seeking. This is highly ineffecient and particularly impractical for large-scale distributed datasets.
The extent of the DNER requires that a technical solution needs to be found and implemented as a matter of priority.
We have examined this as a research objective within the context of the JISC/NSF project and have come up with what we believe to be an original solution. Instead of using "broadcast" searches, we have instead used the SCAN service of Z39.50 servers to build combined indexes containing information "harvested" from individual servers. This is a highly efficient, standards-based way of searching across any number of distributed datasets and could easily be implemented for the whole of the DNER.
Integrating OAI and RSS metadata protocols into Z39.50 for the DNER
In future some of the JISC-funded collections may be non-Z39.50 available datasets, but support the OAI (Open Archives Initiative) or RSS (RDF Site Summary) metadata protocols (assuming that RSS is expressed in the XML format). Neither OAI nor RSS are information retrieval protocols. This means that, in order to be included within the DNER, there must be some method of "harvesting" these OAI and RSS datasets and providing search and retrieval of the metadata they contain.
We have developed the research objective above to use the Cheshire Z39.50 system to "harvest" high level metadata within the DNER, using the SCAN service of Z39.50. This metadata can then be used to quickly locate appropriate repositories which could form the basis of a selective Z39.50 cross-search which could retrieve the full text, etc., of the relevant individual objects.
We have implemented this already for the datasets comprising TREC (Text Retrieval Evaluation Conference) and various JISC datasets with results which were surprisingly accurate and fast.
The above technique can be extended further to enable the whole of the DNER to conform to the Z39.50 information retrieval protocol, with Z39.50 targets seamlessly interoperating with each other. Therefore, far from having an unsustainable number of different protocols to support, the end user would only have to interact with a single Z39.50 based interface which would efficiently knit together the entire range of DNER data resources.
This does not mean that the Cheshire system has to be used by all of the information resource providers, since the Z39.50 services already have the capability of interoperating with each other. The new client will be able to search an independently maintained database of proxy Z39.50 Explain records to compensate for those services which do not fully support the Explain standard.
The above scenario would make it simple for the DNER office to maintain a list of Z39.50 servers (including OAI and RSS harvesters). It would reduce substantially the managing and maintenance cost of the DNER, while increasing functionality for the end-user.
Such an approach is interesting because it is based on advances in pure research into information retrieval issues, rather than the use of conventional Z39.50 toolkits.
Metadata Reuse: Entry Vocabulary Modules (EVMs)
One primary research objective of the JISC/NSF project is to enable the enhanced retreival of unfamiliary metadata using what we call "Entry Vocabulary Modules", or EVMs. This capability, growing out of the Cheshire project, is really a method of constructing linkages between natural language expressions of topical information and controlled vocabularies automatically.
One of the more common challenges facing any end user is in navigating various data sources which might use different thesauri. The Archives Hub is a case in point: data contributors follow either the LCSH (Library of Congress Subject Headings) or UNESCO thesauri. How do users unaccustomed to using either thesauri find out the information of interest to them? One solution, currently examined by HILT, is to create higher level cross-walks between the various thesauri. This is a very expensive solution which, in many cases, proves to be unsatisfactory as well as being expensive to maintain.
A key objective of the project is to discard this methodology and instead develop more research-oriented methods of providing access to these subject headings, no matter how unfamiliar and bewildering they may be to the end user, by automating the process of association between natural languages and their subject headings.
To facilitate this, we have exploited the capabilities of the Cheshire system's support for probabilistic information retrieval on any indexed element of the datase(s). This means that we can use a natural language query (for example, plain English) to extract the most relevant entries in one or more databases. From this information the server can automatically present to the user a cluster of subject headings which might be relevant to their inquiry. The user then can select the subject heading or combination which is most appropriate and then use this as a basis for a more effective subject search across the different databases.
This capability has been remarkably effective in enabling users to map their query to the controlled vocabularies (subject headings) used in descriptive metadata; much more so than traditional boolean methods. But a greater (and unanticipated) benefit may be that we are now able to cross-search different thesauri and automate associations between them and the user's inquiry.
We are now experimenting use of this to facilitate automatic subject retrieval across any number of thesauri supported by a number of distributed datasets. Although there is some more development to be done in this area, the initial results appear to bear out the initial promise of cross-searching different thesuari without the usual overheads of handcrafting links of and between different vocabularies. This could form a substantial contribution to the DNER infrastructure.
As a result, we have now incorporated this as part of the Archives Hub service and plan to extend it to other JISC datasets comprising the DNER.
Future Research Issues: Metadata Reuse and Geospatial Information
The initial findings suggest that further research in this area will greatly facilitate access to metadata, particularly those describing geospatial datasets. Although the project is currently seeking to integrate access to numerical, textual, and geodata we are now in a position to take forward a number of research issues which build upon our present findings:
Specifically, we could now examine methods of mapping geographic place names in text (natural languages) to probable geographic coordinates; and for mapping geographic coordinates to sets of nearby named places at different levels of geographic or political detail and of different place name types (e.g. city, country, state or province, country).
This would require the further development of techniques and standards for authority control of events, for time-lines, and for the generation and display of ad hoc time lines of events relating to any given theme, all of which can be used to facilitate 1) Time adjacency searching ("within 10 years of 1537"); 2) Place-name disambiguation (relating place names to gazetteer entries); 3) Geographical adjacency searching ("within 50km of Paris"); and 4) Graphic display by time and place.
Arguably, such metadata reuse in the context of geospatial searches could, in future, result in dramatically enhanced retrieval of unfamiliar metadata, thus producing an increased return on the existing large scale investments in digital libraries.
As part of the mandated JISC activities for this project we formed an steering committee consisting primarily of people concerned and involved in the development of distributed information services in the U.K. The members of the steering committee are:
Sheila Anderson (Director of the Arts and Humanities Data Service)
Lynne Brindley (Executive Directory, The British Library)
Reg Carr (CEI Chair and Head Bodliean Library, Oxford University)
Julia Chruszcz (Director of National Data Services, Manchester Computing)
David Dawson (ICT Advisor, Re:Source)
Catherine Grout (Assistant Director, Development, JISC DNER Office)
Derek Law (Chair) (Librarian of Strathclyde University Library)
Ray Lester (Department of Library and Information Services, Natural History Museum)
Paul Miller (Interoperability focus, UKOLN)
John Perkins (Executive Director of the Consortium for Museum Intelligence -- CIMI)
Andy Powell (Assistant Director and Team Leader – Distributed Systems and Services,
Frances Thomson (University Librarian, University of Liverpool)
Ray R. Larson (PI)
Paul B. Watry (Co-PI)
The first meeting of the steering group was held at the British Library in London on 18 April 2001. The “terms of reference” for the steering committee are:
Terms of Reference for the Cheshire Steering Committee
1. To advise on the project on behalf of the Higher Education and Further Education Sectors, the various non-HE Sectors which hold a stakeholding interest in the project, and to the Joint Information Systems Committee.
2. To provide a reporting framework for the JISC, the DNER office, the National Science Foundation and related organizations.
3. To receive periodic reports from the Principal Investigators of the "Cheshire" project.
4. To represent the best interests of the Higher and Further Education sectors.
5. To represent the broader interests of museums, libraries, archival repositories, all of which have a stakeholding interest in the project
6. To act as an advocate for the project and its aims
7. To increase visibility of the project within the UK HE and FE sectors as well as the wider communities of museums, archival repositories, libraries, etc.
8. To read and comment on the annual report and any interim reports.
The meeting saw a need for additional support for adopters of the Cheshire technology, either through commercialization of the software and services or through further support from JISC.