JISC/NSF Digital Library Initiative

Final Project Report: Cross-Domain Resource Discovery Project ("Cheshire")

 

1. Background

This is the final project report for the JISC/NSF funded Cross-Domain Resource Discovery Project ("Cheshire"). This project addresses the need to develop and implement the advanced networking technologies relating to the integration of digital and internet-based services and digital content. To date, it has sought to do so by setting out a high-level systems component framework based on the Z39.50 information retrieval protocol; the development of a standards-based software system which is extensible for the accommodation of radically different architectural models; and the distribution and support of this system on an open source basis throughout the Joint Information Systems Committee (JISC) and US communities. Over its three-year life, the project has attempted to integrate a number of innovations in the access, analysis, and display of cross-domain resources including text, numeric, geo-spatial and bibliographic resources. The system currently serves over 100 datasets and is used extensively in higher and further education institutions as part of national and international services.

 

This project was originally initiated to address the growing need to develop standards-compliant software supporting access to cross-domain resources which are distributed across locations, platforms, protocols, and projects. Our research agenda has been developed to enable users to work on a distributed basis to produce and distribute digital library materials. We feel this type of research is likely to be required if students, researchers, and teachers are to take full advantage of the infrastructure of information resources which have been funded by the JISC and various agencies in the United States. At our current stage, we are able to link a variety of resources and extend the capability of software and services which may allow educationalists to store and retrieve data objects for reuse in the portal environment.

 

The design of the system incorporates a client/server architecture based entirely on national and international standards for document description and information retrieval protocols, including SGML/XML, Z39.50, HTTP/CGI, with support for other protocols such as RDF, OAI, SOAP, etc., developed as part of the three year project. The system has been redesigned and implemented to use probabilistic information retrieval methods and to support Object-Relational Database Management Systems; it incorporates advanced retrieval algorithms which can form the basis of a sophisticated ontology engineering environment (which may be incorporated into managed learning environments).

 

The Cheshire software and tools link a wide variety of JISC resources and extend the capabilities of software and services characteristic of the digital library environment. In particular, the system allows users to retrieve and reuse data objects from the wide variety of protocols supported for use in learning systems such as portals, Virtual Learning Environments, etc.

 

1.1 Why Cheshire is needed to support the national infrastructure

The Cheshire system has been designed to address the growing need to develop open-source, standards-compliant software which will support portal architectures catering for the widest variety of underlying databases and information retrieval protocols. The objective of the project has been to develop and implement a system which will support large-scale digital library services on a distributed basis, extending to non-bibliographic data.

 

As a result of research and development funded by the JISC/NSF grant, Cheshire now has the following capabilities. It will:

 

1.Support searches across multiple JISC services and datasets (digital libraries). This includes cross-protocol harvesting and serving (e.g. Z39.50, OAI, SOAP).

2.Support the indexing and delivery of non-bibliographic (e.g. full-text and numeric) databases as well as the linking of bibliographic references with corresponding full-text services.

3.Support advanced information retrieval algorithms to enable users effectively to discover information including cross-thesauri and trans-lingual information management.

4.Support searching across unfamiliar metadata vocabularies, together with data filtering.

5.Allow users to request and deliver information from appropriate resources.

6.Support a fully distributed architecture, in which repositories can host and mediate their own data.

 

Cheshire's support for a distributed architecture means that:

1.Individuals, institutions, and projects are able to serve data; to allow that data to be easily used by national services; to give enhanced access to that data; and to manage that data at a greatly reduced cost.

2.Projects and services can be flexible in terms of the services they include (e.g. it is easy for individuals, services, and projects to mix and match data resources for particular purposes).

3.Management costs for the national services are reduced, and there is an emphasis for individuals on best practice in creating, managing, and serving information.

4.The architecture is scalable to an indefinite extent with little or no loss of performance and functionality.

 

The Cheshire software and tools developed in response to JISC/NSF link a wide variety of JISC resources and extend the capabilities of software and services characteristic of the digital library environment. In particular, the system allows users to retrieve and reuse data objects from the wide variety of protocols supported for use in learning systems such as portals, Virtual Learning Environments, etc. As far as we are aware, there are no other products or systems, either commercial or open source, which have all the capabilities of the Cheshire system available in one package.

 

Compared with other information retrieval systems, Cheshire offers a much greater specification as well as cross-protocol support.  As an example of its cross-protocol functionality, Cheshire's support for SRW ("Search and Retrieve Web Service" http://www.loc.gov/z3950/agency/zing/srw) is the most complete package available and the Cheshire project has been active in promoting and developing the protocol for wider use in the digital library environment.

 

1.2 How the Cheshire system supports the JISC Information Environment architecture

The system ("Cheshire") has been designed to support the infrastructural issues as set out in the DNER Architecture Document. This sets out an architecture based on the concept of portals querying or harvesting any number of information providers who may be serving information via OAI, Z39.50, or web service protocols not currently defined. This has brought about a need for users to access these distributed information resources efficiently, even when protocols not anticipated in the DNER Architecture Document are in use. In particular, there is a need to define some method of enabling managed learning systems to integrate with digital library services supporting a number of different protocols.

 

To address this situation, we have developed Cheshire's potential to act as a cross-protocol data harvester (for OAI, Z39.50, OGDI, etc.), providing both Z39.50 and HTTP interfaces to descriptions. As an outcome of the project, we have extended Cheshire's capabilities to support any number of protocols which might be used in the JISC information environment, particularly for teaching and learning architectures. These include Open Archives Initiative (OAI), Simple Object Access Protocol (SOAP), Search/Retrieve Web Service (SRW), Simple Digital Library Interoperability Protocol (SDLIP), and web service protocols such as Universal Description, Discovery and Integration (UDDI), Web Services Description Language (WSDL), etc. We intend the Cheshire system to act as a transformation engine for managed learning systems, making existing resource descriptions, currently served by any number of protocols, available to teachers and learners via Cheshire harvesters. This will create a large testbed of items for which descriptions already exist. It will result in the assimilation of an existing network of resources into managed learning environments.

 

The current stage of the system's development, as completed over the past three years, has focused on creating a distributed data object retrieval system, based on Z39.50 and SGML/XML, which offers advanced retrieval and discovery capabilities. This includes such features as:

 

1.The system may be extended to accommodate any number of other protocols, for example OAI, SOAP, ADLP, and SDLIP, using the embedded scripting languages Tcl and Python.

2.Indexing and searching capabilities include the ability to extract geo-spatial and temporal information, extraction of single tags as pseudo-records, improved stemming and relevance ranking algorithms and the ability to dynamically adjust the indexing process based on processing instructions within the XML data.

3.Retrieval of individual known XML elements or strings from within a record on the fly.

4.Support for the Attribute Architecture specification for Z39.50 enabling use of multiple attribute sets at once, as well as more fundamental transactions such as sorting.

5.Virtual databases that provide integrated access to multiple datasets on a single server.

6.Storing XML in a preparsed form allowing for much faster access to complex record structures.

7.A Z39.50 client integrated within the open source Mozilla (Netscape) web browser.

8.The system was extended to support Mac OS X as well as Linux, with additional support for Windows platforms; the entire source tree is available on an open source basis.

9.Many XML and SGML parsing issues were corrected, enabling the correct validation of even very complex DTDs and Schemas.
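As a rough illustration of items 2 and 3 above (extracting single tags as pseudo-records, and retrieving individual known elements from within a record), the following Python sketch uses only the standard library. It is not Cheshire code: the sample record, its element names, and the helper functions `extract_pseudo_records` and `element_text` are hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical finding-aid style record; element names are illustrative only.
SAMPLE_RECORD = """
<record id="r1">
  <title>Papers of a Liverpool shipping firm</title>
  <item><unittitle>Letter book</unittitle><date>1851</date></item>
  <item><unittitle>Cargo ledger</unittitle><date>1853</date></item>
</record>
"""

def extract_pseudo_records(xml_text, tag):
    """Return every <tag> element as a standalone pseudo-record.

    Each result is a (parent record id, serialized element) pair, so the
    fragment can be indexed on its own while still pointing at its parent.
    """
    root = ET.fromstring(xml_text)
    parent_id = root.get("id")
    return [
        (parent_id, ET.tostring(el, encoding="unicode").strip())
        for el in root.iter(tag)
    ]

def element_text(xml_text, path):
    """Retrieve the text of the elements matching a simple ElementPath."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(path)]

pseudo = extract_pseudo_records(SAMPLE_RECORD, "item")
titles = element_text(SAMPLE_RECORD, "./item/unittitle")
```

Here `pseudo` holds two fragments that both refer back to record `r1`, and `titles` holds the text of each `unittitle` element without the rest of the record being returned.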

 

The JISC/NSF funding has resulted in a stable version of the Cheshire software which is intended to fulfill the development and redesign objectives of that funding, in particular the expansion of software support for Z39.50 and XML technologies. These capacities, now enabled, permit the dynamic implementation of different processing and retrieval methods, as appropriate to a given domain whether statistical, full-text, geospatial, or multimedia. The source code for the software is available for compilation on an open-source basis and has been progressively released throughout the grant period. We are currently compiling and releasing binaries for this version of the system which will enable support for Windows 2000, Macintosh OS X, Linux, and Solaris 8 operating systems (the source code is available and may be independently compiled for these operating systems). This version has been set up as a "turn-key" search environment for Digital Library services.

 

1.2 Relationship to other projects and services

1.2.1 Use of Software and Tools

The Cheshire system is now being used by all the national services in the United Kingdom. Our model is to work within the framework of the services and collaborate with programmers to develop systems within their own areas of expertise. We believe this type of collaborative design process encourages cross-fertilization in the development of the software, access systems, and national services. Working with the national services, we have assembled a number of tailor-made implementation solutions based on established models. The project has been able to draw upon existing service infrastructures, already in place, which have organizational and support staff to maintain the software and its use over time.

 

One rewarding aspect of the project is to see how research and technical advances are incorporated into production services. These have been subject to formal evaluation and service review procedures taking place outside the project but which inform the development and success of the project's objectives. We have enjoyed working with the national services in disseminating the system more widely and investigating with them strategic support for the use of digital library activities, particularly in teaching and learning activities. In particular the work with Manchester Computing (MIMAS) has yielded valuable results which we hope to deploy with them (and UKOLN) for the JISC Information Environment Service Registry (IESR).

 

The most ambitious use of the software to date has been for the Archives Hub and Manchester Computing. This service was established in 1999 with the aim of managing and serving archival finding aids throughout the United Kingdom for the Higher Education sector. In its current phase, the Archives Hub is becoming a distributed service with each of the repositories responsible for hosting their own data. This architecture is derived directly from the research findings of the JISC/NSF project, based on the use of Z39.50 SCAN to harvest indexes of distributed servers, as discussed above. In order to enable the use of the Cheshire software by non-specialists, we have provided an easy installation package which is in current use by archivists in universities throughout the country. A key outcome of the Archives Hub development will be the deployment of distributed technologies at a national level and their impact on the daily use by non-specialists.

 

1.2.2 Exploitation Plan and Dissemination

Rather than seeing this as a pure "research" project with limited local use, we have sought the widest possible use and dissemination of the system architecture as an outcome. Since Cheshire is being used by a number of national services, we are able to target a large number of potential audiences, including students and teachers, librarians, gateway and broker services, commercially provided portals, etc. To date the dissemination activities have provided general information about the project and its potential benefits; and information about research outcomes such as methods of metadata reuse. These have been disseminated in the following ways:

1.Papers on the design and development of the project and its components (see bibliography)

2.Papers documenting the project and significant results (see bibliography)

3.Technical reports made available via the project's web site(s)

4.The software and tools for the project are made available on an open source basis

 

As far as possible we have incorporated the architecture and methodology of the project more widely into existing services funded by the JISC. One objective is to ensure that support and further development of the system can be sustained for use by individuals as well as by digital library services on an ongoing basis, thereby ensuring optimal take-up of the software and the tools described.

 

1.2.3 System Test, Evaluation, and Quality Control

The stakeholders in the development and dissemination of the project's objectives include the national services using the software as well as students and teachers. We have paid particular attention to the needs of individual users as well as those of the national services, and to the suitability of the software for production.

 

The process of evaluation for this project is associated with the testing of the system to ensure that it adheres to national and international standards and supports the optimal methods for integrated information discovery across domains. As part of this, we have conducted interoperability tests with other servers (e.g. COPAC) and with a variety of datasets. We have, additionally, tested interoperability with portals as well as with other digital library systems and services. An additional component has been an investigation of the use of standardized metadata elements across domains and support of these using the new Z39.50 attribute architecture for cross-domain searching, semantic interoperability, etc.

 

The Cheshire project has undergone a number of summative evaluation procedures, both in its use for national services and as an experimental prototype. For production services, the primary evaluation procedure has been conducted for services hosted by MIMAS (Manchester Computing), available via http://www.archiveshub.ac.uk/introduciton.shtml. This reflected a series of single-focus user trials with candidates from a variety of institutions in order to determine typical user-type transitions and modes of search behaviour using the system.

 

2. Methodology

Our initial strategy was to build an information access system that would provide an original methodology for information discovery and retrieval by exploiting the fundamental interconnections between the diverse information resources which comprise the JISC Information Environment. These are:

 

1.Document databases which describe information about various topics ranging from news reports and library catalogue entries to full-text articles from academic journals

2.Numerical statistical databases which assemble facts about a wide variety of social, economic, and natural phenomena

3.Spatial and temporal databases associated with geographic information systems (GIS) which facilitate map-based display of geographic attributes.

 

The development work undertaken for this project focused on exploiting the fundamental interconnections between these three types of data in the digital library context. The original aim was to develop and implement statistical association methods for mapping a query to appropriate metadata classifications in the various types of databases, while at the same time extending the system's capabilities and redesigning it with a new system architecture. These have been executed as follows:

 

Extension 1: Client Technologies. The integration of the Z39.50 information retrieval protocol into the existing code base of a browser (Netscape/Mozilla). We have achieved this through the release of a Z39.50 browser client, currently available as a download with an XPI (cross-platform installer) for ease of use. The new client is able to locate a server via a Z39.50 URL and allows the functionality of the Z39.50 protocol to be accessible via XPConnect, a library linking JavaScript with underlying C++ code. Thus, the Z39.50 session objects are able to support and intelligently process Z39.50 v.3 functions such as SEARCH, SCAN, etc. The client also has an interface which allows for improved accessibility to Z39.50 servers. This interface is easy to use, but has added functionality. The interface is accessible via Mozilla's automatic install procedure, so that the client may be seamlessly layered on top of Netscape.

 

Extension 2: Metadata Reuse. The development of tools which make it easier for users to discover information, even though they may be unfamiliar with the classification, categorization, and indexing schemes characteristic of the databases being searched. To achieve this, we investigated and implemented methods of reusing pre-categorized and pre-classified records for improving cross-domain searching, information management, and resource discovery.  We exploited Cheshire's ability to harvest and index data using an advanced "clustering" technique which will enable common terms to be interlinked automatically and retrieved quickly.

 

Extension 3: The creation of an infrastructure of Z39.50 databases and harvesters which may be used by the national services and which may be integrated into commercially available portals or Virtual Learning Environments (VLEs). Cheshire exploits the Z39.50 information retrieval protocol to link any information resource to any other information resource without necessarily any direct interchange between the two. As part of the project, we developed additional support for use of Cheshire as a cross-protocol data harvester and transformation engine, which may be used to convert existing Z39.50 resource descriptions into other standardized metadata formats, e.g. EAD, MARC, IMS/SCORM/IEEE LOM. The objective is to exploit this feature in a way which will permit services and datasets to interoperate in new and effective ways.

 

This work underpinned a number of related research and development projects based on the Cheshire infrastructure, as follows:

 

1.The Bandalino suite of research projects relating to user interfaces for search, text data mining, empirical computational linguistics, and automated web site evaluation. Within this context, Cheshire is being used as a search interface for heterogeneous web intranets, such as those found at large universities, corporations, and government sites. It is being used to organize and group search results over large intranets into coherent structures.

2.The Biothreat Reduction Program at the Los Alamos National Laboratory. Within this context, Cheshire is being used to search for and synthesize information about textual descriptions supporting biomedical research, in particular for cross-domain cataloguing, statistical analysis and strain/species identification based on forensic and attribution information. The Cheshire infrastructure is being used to support the development and deployment of statistical approaches to natural language processing, which will identify entities and relations between them in bioscience texts. This will in turn facilitate more effective search and synthesis.

3.The Metadata Research Programme, which uses Cheshire to explore information retrieval in a networked environment. Within this context, the Cheshire software is being used to design, build, and experiment with front-end prototypes, strategic search commands, entry vocabulary modules, and multi-database navigation.

4.The "Search Support for Unfamiliar Metadata Vocabularies" project. Within this context, Cheshire is being used to develop prototypes to assist with searching across various classification, categorizing, and indexing (metadata) schemes.

5.The Seamless Searching of Numeric and Textual Resources Programme. Within this context, the Cheshire software is being used by the Institute for Museum and Library Services in a development project to improve access to written material and numerical data on the same topic when searching very different types of databases and numerical data.

6.Translingual Information Management Programme. Within this context, the Cheshire system is being used to prototype new methods of cross-lingual searching, information management, and resources of language engineering.

7.The Cheshire Record and User Management project, which uses Cheshire to provide a web based interface for the creation and maintenance of SGML/XML records within a Cheshire database. By using the standard web authentication system to validate users in conjunction with a Cheshire based user database, very sophisticated levels of distributed management are possible.

 

The methodology behind the Cheshire development programme has, in many ways, been driven by its use within these types of research and development programmes.  Our strategy has been to develop Cheshire capabilities for use within R&D projects, work with researchers to formalize the software capabilities to meet their needs, and then release a production version of the Cheshire software with these capabilities which may be used in a production environment for the national services. In this respect, we have adopted a collaborative methodology, involving other researchers and services. These capabilities are covered in greater detail in Section 3.

 

2.1 Project Workplan and Deliverables

We have executed the project workpackages in accordance with the Cheshire Project plan:

1.The base client and server building was executed within the first 18 months, leading to a release of the browser-client and an incrementally developed server side implementation.

2.The metadata and format handling phase was executed within the next 10 months, which led to the use of the SCAN service of Z39.50 servers to build combined indexes containing information "harvested" from individual servers. This is now in place and is used in production services such as the Archives Hub and MerseyLibraries.org.

3.The Entry Vocabulary Module support was executed also during this 10 month period. This aspect of the project enabled intuitive mapping from natural languages to controlled vocabularies. It is now used in production services such as the Archives Hub and MerseyLibraries.org (under the guise of "Subject Resolver").

4.Support for Geographic Information Retrieval (GIR) applications was executed near the end of the project, which resulted in the capability of extracting geo-spatial and temporal information.

5.The support for non-Z39.50 protocols has been added within the last 12 months, including support for SOAP, ADLP, OAI, SDLIP, and web service protocols. Cross-protocol search and retrieve is now available as part of the Resource Discovery Network (which provides Z39.50 access to harvested OAI data) and MerseyLibraries.org (which supports the cross-searching of Z39.50 and SOAP protocols to include Amazon and Google results alongside bibliographic results).

6.We have not only embedded Cheshire in testbed applications such as those cited above (e.g. Bandolino), but have also used Cheshire to underpin a number of national services in a production environment.
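The cross-protocol harvesting described in item 5 can be pictured with a small Python sketch of the OAI side: building an OAI-PMH `ListRecords` request and flattening the Dublin Core payload into simple records that a search server could then index. This is only a sketch under stated assumptions, not the project's harvester. The base URL, the sample response, and the function names are hypothetical; the `verb` and `metadataPrefix` parameters and the XML namespaces are those of OAI-PMH 2.0. No network access is performed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# Hypothetical response body, trimmed to the parts used below.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:example.org:1</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Survey of coastal erosion</dc:title>
     <dc:subject>Geology</dc:subject>
     <dc:subject>Coasts</dc:subject>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>"""

def parse_records(xml_text):
    """Flatten harvested records into dictionaries ready for indexing."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "subjects": [s.text for s in rec.iterfind(".//dc:subject", NS)],
        })
    return records

url = list_records_url("http://example.org/oai")
harvested = parse_records(SAMPLE_RESPONSE)
```

Once records are reduced to a common shape like this, the same index can in principle be exposed over a different protocol from the one used to harvest it, which is the essence of the cross-protocol arrangement described above.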

 

In addition to the original workpackage objectives stated above, we have developed a number of new support tools for cross-protocols, as follows:

 

1.Cheshire/Python integration, including Python ZOOM; appropriate SOAP toolkits (ZSI); Creation of SOAP object and message bindings; Mapping from ZeeRex files to autogenerate SRW configurations; Result and authentication handling in a stateful server environment; Redesign for integration into JISC projects and services.

2.The development of the system's scripting capabilities, including Changes to the display code for distributed resolving; MetaClusters for distributed cluster harvesting; Scripts for automating updates; ZeeRex (creation, standardization, processing, automation, dissemination); Postgres database for maintaining harvested state; Postgres databases for user information, including result sets; Scripts and databases for harvesting; More efficient indexing routines.

 

3. Activities

This section surveys the activities of the project during its three year period. Further detailed information is included in the annual reports, available on http://sca.lib.liv.ac.uk/cheshire or http://cheshire.berkeley.edu

 

3.1 The development and implementation of an open source Z39.50 browser client (Cheshire Extension 1)

Before starting work on this extension, we conducted an initial assessment exercise to review the design objectives in light of new software technologies which had evolved since the proposal was submitted in 1999. Although the project proposal implied a Java based system for the client, we subsequently decided to utilize the Mozilla framework being developed on an open source basis. By extending the browser (Netscape/Mozilla) to handle the Z39.50 protocol, we judged that the client functionality would be familiar to the general user and that, additionally, we could take advantage of the modular structure of the Netscape/Mozilla framework. This has ensured that every time Netscape/Mozilla is updated, the changes needed to keep the Z39.50 component synchronized are minimized. Because Netscape/Mozilla also implements XPI (the Cross-Platform Installer), new components or user interface modifications can be installed automatically. The new architecture has meant that, with a single XPI installation, users are able to get a much speedier response in comparison to a Java client (which would have a comparatively slower start-up time). The Z39.50 browser client is entirely standards based: it supports the Document Object Model level 1 (DOM1), HTML version 4.0, ECMAScript (standardized JavaScript), Resource Description Framework (RDF), eXtensible Markup Language (XML), Cascading Style Sheets (CSS), and so forth. We regard adherence to such standards as essential to ensure that the application remains usable with different servers and data types.

 

The construction of the client base was carried out in the following phases:

 

1.Z39.50 URL Phase. This was to ensure that the client is able to locate a server via a Z39.50 URL. This capability is used to retrieve a single document as well as to search the server and display the results. The URL scheme was integrated into Netscape/Mozilla's existing network code structure, reusing existing functions and objects. (The URL scheme was not that proposed in RFC2056, as this did not incorporate user authentication, searching, or other Z39.50 commands.) The retrieved data can then be handled by the HTML, text, or other MIME type handlers already in Netscape/Mozilla. The work consisted of preparing the draft specifications for the URL scheme, and the start of development work for the same.

2.Z39.50 Scripting Phase. This was to engineer a client which is scriptable via JavaScript and XUL. This required the Z39.50 functions to be accessible via XPConnect (Netscape/Mozilla's library to link JavaScript and the underlying C++ code). As such, it needed to be compatible with XPCOM (Netscape/Mozilla's cross-platform component object model). This has ensured that Z39.50 session objects are available to the JavaScript components, which can call CONNECT, DISCONNECT, SEARCH, SCAN, and so forth, enabling the client to process intelligently the information available via the Z39.50 information retrieval protocol.

3.Interface Building Stage. This phase built upon the Z39.50 scripting phase, outlined above. Using these scripting capabilities and XUL, we built and tested a new interface that allows for improved accessibility to the target servers. The interface is familiar to the end user, but has additional Z39.50 functionality. This interface is accessible via Netscape/Mozilla's automatic install procedures, so that the client may be seamlessly layered on top of Netscape.

 

Although the client and server are designed to work together in a seamless fashion, they must also interoperate with any other Z39.50 compliant system, provided such systems support SCAN, EXPLAIN, SEARCH, etc., to a minimum of the version 3 standard.

 

3.2 Enhanced Support for Metadata and Vocabulary (Cheshire Extension 2)

A primary aim of the project was to enable the enhanced retrieval of unfamiliar metadata across domains, e.g. constructing linkages between natural language expressions of topical information and controlled vocabularies for geospatial, textual, and statistical resources. To this end, we developed methods of using Z39.50 to automatically "cluster" together topics which may be semantically related for digital library projects, and have incorporated this technology in a number of national services, some of them cross-domain. The techniques of "Classification Clustering" use natural language parsing software to identify phrases in the language of the users of bibliographic databases, taken from the titles and abstracts in the literature to be searched, and then apply statistical association techniques to associate these words and phrases with the metadata terms of the target.

 

In this way, we sought to develop a research-oriented method of providing access to subject headings, no matter how unfamiliar they may be to the end user, by automating the process of association between natural language expressions and the corresponding subject headings. This capability appears to have been effective in enabling users to map their query to the controlled vocabularies (subject headings) used in descriptive metadata; it may be used to cross-search different thesauri and automate associations between them and the user's inquiry.
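The idea of associating a user's natural language with controlled subject headings can be shown in miniature as follows. This Python sketch is not the Classification Clustering implementation: it replaces natural language parsing with whitespace tokenization and uses a deliberately simple association weight (the share of a word's occurrences that fall under each heading). The training pairs, headings, and function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (title text, assigned subject heading) pairs taken from
# already-catalogued records.
TRAINING = [
    ("heart attack treatment outcomes", "Myocardial Infarction"),
    ("heart surgery recovery", "Cardiac Surgical Procedures"),
    ("treatment of heart attack in elderly patients", "Myocardial Infarction"),
    ("river flood maps", "Floods"),
]

def build_associations(pairs):
    """Count, for every word, how often it co-occurs with each heading."""
    word_heading = defaultdict(Counter)
    for text, heading in pairs:
        for word in set(text.lower().split()):
            word_heading[word][heading] += 1
    return word_heading

def suggest_headings(query, word_heading, top_n=3):
    """Rank headings by summed per-word association weight."""
    scores = Counter()
    for word in set(query.lower().split()):
        counts = word_heading.get(word)
        if not counts:
            continue  # word never seen in the training records
        total = sum(counts.values())
        for heading, n in counts.items():
            scores[heading] += n / total
    return [heading for heading, _ in scores.most_common(top_n)]

associations = build_associations(TRAINING)
suggestions = suggest_headings("heart attack", associations)
```

A user who types "heart attack" is thus led to the heading "Myocardial Infarction" first, even though neither query word appears in the heading itself.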

 

We are currently using this to facilitate automatic subject retrieval across any number of thesauri supported by a number of distributed datasets. The initial findings suggest that this functionality may facilitate access to metadata describing geospatial datasets. Specifically, this involves methods of mapping geographic place names in text (natural language) to probable geographic coordinates, and of mapping geographic coordinates to sets of nearby named places at different levels of geographic or political detail and of different place name types (e.g. city, state or province, country). This would require further development of techniques and standards for authority control of events, for time-lines, and for the generation and display of ad hoc time lines of events relating to any given theme.
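The two mappings mentioned here (place names in text to coordinates, and coordinates to nearby named places of a given type) might be sketched as below. This is an assumption-laden illustration rather than the project's method: the tiny gazetteer, its approximate coordinates, and the function names are hypothetical, and distance is computed with the standard haversine formula on a spherical Earth.

```python
import math
import re

# Hypothetical gazetteer with approximate coordinates (decimal degrees).
GAZETTEER = [
    {"name": "Liverpool", "type": "city", "lat": 53.41, "lon": -2.99},
    {"name": "Manchester", "type": "city", "lat": 53.48, "lon": -2.24},
    {"name": "Berkeley", "type": "city", "lat": 37.87, "lon": -122.27},
    {"name": "England", "type": "country", "lat": 52.5, "lon": -1.5},
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def places_in_text(text):
    """Map place names found in free text to their gazetteer entries."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [p for p in GAZETTEER if p["name"].lower() in words]

def nearby(lat, lon, radius_km, place_type=None):
    """Return names of gazetteer places within radius_km, nearest first."""
    hits = []
    for p in GAZETTEER:
        if place_type and p["type"] != place_type:
            continue
        d = haversine_km(lat, lon, p["lat"], p["lon"])
        if d <= radius_km:
            hits.append((d, p["name"]))
    return [name for _, name in sorted(hits)]

found = places_in_text("Letters sent from Liverpool to Manchester, 1851")
close_cities = nearby(53.41, -2.99, 60, place_type="city")
```

A real gazetteer would also need to handle multi-word names and ambiguous names, which is where the word "probable" in the paragraph above becomes important.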

 

3.3 Distributed Search Support ("Meta-Search" capability) (Extension 3)

 

One of the most important research advances of the project is the development of an architecture enabling efficient searching across hundreds or thousands of distributed network nodes using the Z39.50 SCAN service. This capability will be required for the DNER and similar digital library projects to function effectively.

 

Our initial work in Cross-Domain Resource Discovery concentrated on using the facilities of the Z39.50 protocol to implement what we are calling a "Meta-Search" capability using existing Z39.50 servers and resources, extending also to non-Z39.50 protocols such as SOAP and OAI. Many existing attempts at distributed search and resource discovery have relied on the broadcasting of search requests to all servers making up the distributed resources to be searched. There are a number of practical problems with this approach. One of the chief drawbacks is that all systems must be searched before the user or the search controller can determine which systems are most likely to provide the results that the user is seeking.

 

Instead of using such broadcast searches, we are using the SCAN service of Z39.50 servers to build combined indexes containing information "harvested" from individual servers. The SCAN service permits requests to be made directly against server indexes and returns index information, including the words or keys in the index along with their frequency of occurrence in the database. With this information, indexes combining information from many servers and databases can be built, and statistical ranking methods can be used to rank those servers and databases according to the probability that they contain relevant information for a given user query.
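As a rough illustration of the harvesting and ranking idea, the sketch below inverts per-server term/frequency lists into a combined index and ranks servers for a query. The server names and frequencies are invented, no Z39.50 traffic is involved, and the TF-IDF-style score is a simple stand-in assumed here, not the probabilistic ranking used by Cheshire.

```python
import math
from collections import defaultdict

def build_meta_index(scan_results):
    """Invert per-server term frequencies into term -> {server: frequency}."""
    meta_index = defaultdict(dict)
    for server, term_freqs in scan_results.items():
        for term, freq in term_freqs.items():
            meta_index[term][server] = freq
    return meta_index

def rank_servers(query_terms, meta_index, n_servers):
    """Score each server by a TF-IDF-style sum over the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = meta_index.get(term, {})
        if not postings:
            continue
        # Terms held by few servers discriminate better between servers.
        idf = math.log(1 + n_servers / len(postings))
        for server, freq in postings.items():
            scores[server] += math.log(1 + freq) * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda s: (-scores[s], s))

# Hypothetical term/frequency lists standing in for harvested SCAN responses.
SCAN_RESULTS = {
    "archives.example.org": {"railway": 120, "canal": 15},
    "maps.example.org": {"canal": 200, "estuary": 40},
    "census.example.org": {"population": 900},
}

META_INDEX = build_meta_index(SCAN_RESULTS)
```

Only the servers returned by `rank_servers` need to be searched, which is the advantage over broadcasting the query to every server.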

 

The Z39.50 SCAN service is included in all the Cheshire servers. We have implemented a special indexing mode for the Cheshire system, which will take a list of servers and use the Explain and SCAN services to build combined "Meta-Indexes" for those servers. This functionality is included in all Cheshire servers, making it easy for any Cheshire server to function as a Meta-Search server for some group of other servers. This facility can be executed recursively, so that hierarchies of Meta-Search servers can be constructed. This has meant the construction of topically-oriented Meta-Search servers as well as "global" servers that summarize index information from each of the lower layers in the search hierarchy. This architecture is operating in production for the Archives Hub service, MerseyLibraries.org, etc.
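The recursive arrangement can be pictured with a small tree sketch. The `Node` class, its methods, and the server names are hypothetical; the sketch only shows how a higher-level server could summarize the indexes beneath it and route a term to the servers that hold it.

```python
from collections import Counter

class Node:
    """A server in the hierarchy: a leaf with its own index, or a Meta-Search server."""

    def __init__(self, name, term_freqs=None, children=()):
        self.name = name
        self.term_freqs = Counter(term_freqs or {})
        self.children = list(children)

    def summary(self):
        """Return term frequencies for this node plus everything beneath it."""
        total = Counter(self.term_freqs)
        for child in self.children:
            total.update(child.summary())
        return total

    def route(self, term):
        """Return names of servers holding `term`, skipping subtrees that lack it."""
        if term not in self.summary():
            return []
        found = [self.name] if term in self.term_freqs else []
        for child in self.children:
            found.extend(child.route(term))
        return found

# Hypothetical two-level hierarchy: a topical meta-server under a global one.
hub_a = Node("hub-a", {"railway": 120})
hub_b = Node("hub-b", {"canal": 15})
maps = Node("maps", {"canal": 200})
topical = Node("archives-meta", children=[hub_a, hub_b])
root = Node("global-meta", children=[topical, maps])
```

For simplicity the summary is recomputed on every call; a real server would store the harvested summary rather than rebuild it per query.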

 

The support for SRW (Search/Retrieve Web Service) and SRU has extended this capability beyond the Z39.50 protocol, providing a common usage framework for SOAP, OAI, Z39.50, WSDL, and other protocols which may be used in the future. Cheshire's support for SRW/SRU is part of the Information Environment Service Registry (IESR) and MerseyLibraries.org, to name two implementations.
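To give a sense of what an SRU request looks like, the following sketch builds a searchRetrieve URL from a CQL query. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`, `recordSchema`) come from the SRU specification; the base URL and the helper function are assumptions for illustration and do not refer to any actual Cheshire endpoint.

```python
from urllib.parse import urlencode

def sru_search_url(base_url, cql_query, max_records=10, start_record=1,
                   record_schema=None, version="1.1"):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": max_records,
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint, used only to show the shape of the request.
EXAMPLE_URL = sru_search_url("http://sru.example.org/archives",
                             'dc.title = "railway"', max_records=5)
```

Because the request is an ordinary HTTP GET, the same search can be issued from a browser or a portal without a Z39.50 client.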

 

4. Outputs

 

The five major outputs resulting from the three-year project are:

 

1. The system capabilities have been extended to allow users to access different domains and information resources (text and document retrieval, numeric databases, and geographic information systems) through the support for transverse searching (in which data found in a text database can be used to find related data in a numeric or geo-spatial database). This functionality may be used for accessing, retrieving, and rendering the different types of metadata characteristic of geo-spatial, textual, and statistical resources. This, in turn, will allow users to conduct searches in each type of database: the system accepts a query in the users' own terms and then suggests specialized categorization terms to search for in the information resource through statistical associative techniques.

2. The project has extended development of these associative techniques to provide support for "subdomain" vocabularies, e.g. association dictionaries which will lead searchers to the appropriate term or cluster of subject access terms that are likely to satisfy their information needs for specialized topics ("subdomains") which may be non-textual or include cross-thesauri and trans-lingual support. The development and implementation of these techniques have enabled the system to develop automatically a "likelihood ratio weighting" associated with each searching term and each metadata value which may lead the searcher more quickly to required information.

3. We have used these developments to research, develop, and implement support for a number of related, neglected digital library problems involving metadata reuse. These include: graphic display by time and place; support for searching unfamiliar metadata; automated cross-language searching; and improved methods of retrieving information with inadequate metadata.

4. The project has extended support for geo-temporal analysis, providing a means of making text, images, hyperlinks, etc. available on a map interface. These capabilities may be built on to give students and teachers in a variety of disciplines the ability to draw arbitrary selections of both local and internet-accessible resources and enable them to mix, match, and plot relationships of data using dynamic map displays. This accompanies the more general project objectives of metadata reuse which may be deployed in pedagogically inventive ways. The architecture is extensible and has been extended to cater to new web service protocols as and when they may be adopted.

5. The project has resulted in the development of a high performance, scalable, and extensible platform for information retrieval. Scalable performance has allowed us to explore the more resource-intensive information retrieval techniques described above.
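The "likelihood ratio weighting" mentioned in output 2 is not specified further in this report. As one plausible reading, the sketch below computes Dunning's log-likelihood ratio (G²) for the association between a search term and a metadata value from a 2x2 table of record counts; the choice of statistic and the function names are assumptions.

```python
import math

def _xlogx(x):
    """Return x * ln(x), taking the value 0 when x is 0."""
    return x * math.log(x) if x > 0 else 0.0

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table.

    k11: records containing both the word and the metadata value
    k12: records containing the word but not the value
    k21: records containing the value but not the word
    k22: records containing neither
    """
    n = k11 + k12 + k21 + k22
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    # Equivalent to 2 * sum(observed * ln(observed / expected)).
    return 2.0 * (cells + _xlogx(n) - rows - cols)
```

A value near zero indicates that the word and the metadata value occur independently; larger values indicate a stronger association and so a higher weight for suggesting that value to the searcher.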

 

More recent developments to the software are as follows:

 

1. The automatic transformation of metadata. We have extended the system so that it is able to transform existing data formats automatically. While this is presently being used to transform data types such as EAD (Encoded Archival Description) into MARC, in the future we could use this capability to transform existing data into the IMS standards and to build descriptions of full records which will enable access to existing intellectual resources through portals or Virtual Learning Environments.

2. The extension of the system to harvesters and non-Z39.50 services. This objective is a prerequisite for relating information resources in a "protocol neutral" fashion so that web service protocols can be integrated seamlessly with Z-services. This has included extending the system's support for non-Z protocols (e.g. SOAP, OAI) as well as providing support for XMLSchemas, which may be used for the integration of Cheshire-served datasets with commercially available portals. As an example of its cross-protocol functionality, Cheshire's support for SRW ("Search and Retrieve Web service" http://www.loc.gov/z3950/agency/zing/srw) is the most complete package available and Cheshire has been active in promoting and developing the protocol for wider use in the digital library environment. Cheshire already supports OAI (as part of the RDN) and SOAP.

3. The creation of end-user tools to visualize the information retrieved. We have developed capabilities to allow users to search across databases of unknown scope, to navigate, visualize, and interpret their results. This has included a number of visualization tools for spatial and temporal data.

4. Design and prototype implementation of an extensible architecture for distributed collection search and retrieval. We have developed a new design that accommodates distributed components of the system and paves the way for providing search and retrieval capabilities across a wide range of systems, from very large central servers and distributed databases using a multitude of federated servers to personal systems managing collections for individuals. The design also provides expanded support for a variety of search protocols and capabilities as discussed in section 2 above.
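The kind of format transformation described in item 1 (EAD into MARC) can be illustrated with a simplified crosswalk. The mapping table, the sample record, and the output shape (a list of tag, subfield, value tuples) are assumptions chosen for brevity; they do not reproduce Cheshire's transformation engine or a complete EAD-to-MARC crosswalk.

```python
import xml.etree.ElementTree as ET

# Illustrative crosswalk only: EAD element path -> (MARC tag, subfield code).
CROSSWALK = [
    (".//did/unittitle", "245", "a"),
    (".//did/unitdate", "245", "f"),
    (".//did/origination/persname", "100", "a"),
    (".//did/physdesc/extent", "300", "a"),
]

def ead_to_marc_fields(ead_xml):
    """Extract (tag, subfield, value) tuples from an EAD document string."""
    root = ET.fromstring(ead_xml.strip())
    fields = []
    for path, tag, code in CROSSWALK:
        for element in root.findall(path):
            # Flatten nested markup and normalize whitespace.
            value = " ".join("".join(element.itertext()).split())
            if value:
                fields.append((tag, code, value))
    return fields

# Hypothetical EAD fragment, invented for this example.
SAMPLE_EAD = """
<ead><archdesc level="fonds"><did>
  <unittitle>Papers of a hypothetical society</unittitle>
  <unitdate>1890-1920</unitdate>
  <origination><persname>Doe, Jane</persname></origination>
  <physdesc><extent>12 boxes</extent></physdesc>
</did></archdesc></ead>
"""
```

Keeping the crosswalk as data rather than code means that a different target (for example RDF or an IMS profile) could be supported by swapping the table.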

 

As outlined in Section 2, the research and development of the system has achieved all the objectives outlined in the workpackages and stated in the deliverables. In particular:

1. We enable distributed queries through an inter-server discovery and search interface.

2. We have designed for multiple interfaces and extensibility from the ground up, with clean plug-ins for interoperability.

3. We have produced maintainable, concurrent, high performance code.

 

 

5. Impacts

 

The project's primary impact has been the development of a system and framework ("Cheshire") which will support distributed information retrieval and which can be applied to the entire range of information retrieval needs for digital libraries across the entire spectrum of sizes and architectures currently in use or proposed (from very large distributed systems to handheld devices). Within this context, we have developed and demonstrated a solution to distributed search that is neutral with respect to machine architecture and protocol: that is, a system that not only supports a multitude of distributed databases, but also provides a framework which enables support for virtually any of the current search and discovery protocols, operating on machines ranging from large multiprocessor servers to PDAs. A central objective of our research, therefore, has been to make such search capabilities and features a ubiquitous resource for digital libraries which can be exploited by systems and users.

 

Intellectually, the project has aimed to develop fundamental new approaches to information storage and retrieval, as well as to pioneer new approaches to information sharing and metadata reuse across systems and types of information. The result has the potential to change fundamentally the way that information systems and services are used by individuals and how these systems interact with each other.

 

Our work has sought the broader impact of transforming the way that JISC services currently use and disseminate information. The developed systems and infrastructure comprising Cheshire will support not only the current generation of computing hardware, but will also support the "next generation" of computing devices embedded in everyday objects and so have the potential to impact all aspects of higher education services. The international nature of the collaboration has encouraged the greatest possible distribution and use of the systems and framework developed as part of JISC/NSF. This impact has been broadened by the open source nature of the software, which has encouraged independent development and use by Manchester Computing, UKOLN, etc.

 

The project's research innovations have had a number of specific impacts demonstrating the integration of access across domains. These include:

1. Management of vocabulary control in a cross-domain context. The Cheshire system is now able to map the searcher's notion of a topic to the terms or subject headings actually used to describe that topic in the database. The classification clustering techniques developed as part of the project have been combined with probabilistic document retrieval algorithms to provide a cost-effective remedy which allows users to access unfamiliar metadata, including support for cross-thesauri and translingual retrieval. This is an alternative to the approach currently being investigated by HILT. The outcome has enabled a more direct connection between ordinary language queries ("query vocabularies") and indexing terms ("entry vocabularies") actually used to organize information in a variety of databases. These innovations are now implemented in a production environment as part of the Archives Hub, MerseyLibraries.org, etc., all of which support cross-thesauri retrieval without the expense associated with the development and maintenance of higher level thesauri. We are planning to implement this innovation as part of the JISC funded Information Environment Service Registry (IESR) which will be extended across all JISC datasets.

2. Distributed access to existing metadata resources. The Cheshire system has had a definite impact on the development and implementation of distributed information servers, most notably on the distributed version of the Archives Hub operating as a national service in the United Kingdom. This service uses the project's intellectual advances to enable individual archival repositories to host and maintain their own data, while at the same time becoming part of a distributed national service. The current JISC project to implement EAC (Encoded Archival Context) on a national basis relies entirely on the distributed infrastructure which resulted from the project outcomes. The MerseyLibraries.org service uses these same advances applied to bibliographic, archival, and museum object information held at distributed repositories. The Resource Discovery Network supports an architecture which relies on distributed OAI repositories which are harvested and served via Cheshire. The Los Alamos Biothreat Reduction Program also relies on the Cheshire distributed architecture to construct virtual information resources for forensics and information analysis.

3. Navigating collections. The data transformation capabilities (GRS-1) of the Cheshire system combined with the Python and Tcl scripting capabilities have made an impact on the ways that collections are visualized and navigated. For example, the Archives Hub service formats multilevel EAD documents in ways which support a "drilling down" approach, permitting users to move from generic to specific description information. The impact of facilitating such access from collections to item level information was assessed in the RSLG report (paragraph 93d).

4. Support for cross-domain clumps to facilitate resource discovery. The project has made an impact on true and effective cross-domain resource discovery, most notably for the MerseyLibraries.org service which supports cross-searching of distributed datasets in different formats (EAD and MARC). This model is being investigated by a number of national services with a view to implementation on a production basis. The project has shown how to deliver entire SGML/XML resources, whether they be bibliographic, full-text, or multimedia, while at the same time supporting Z39.50 and SOAP protocols. This outcome has enabled a much broader range of integrated and complete information resources to be delivered to the user's desktop.

 

6. Future Developments

 

Future developments are, therefore, likely to focus on Cheshire's support for performing "ubiquitous" search, so that:

 

1. Information resources that are currently inaccessible (e.g., the "invisible" or "deep" web) can be made accessible and interoperable with other information resources.

2. Information resources that currently require metadata for effective search (e.g. multimedia information) can automatically acquire metadata through metadata capture, sharing, and reuse processes that leverage ubiquitous search.

 

Future research priorities will aim to make a number of additional search capabilities possible, such as:

 

1. Finding appropriate and timely information to aid in user tasks without explicit query formulation (and perhaps without the user even being aware that a search is being performed).

2. Automatic discovery and application of metadata from available sources to enable users to describe, more fully and accurately, their own work.

3. Exploiting and automatically combining multiple sources of information related to the task at hand.

4. Removing the need for information users to know about the structure and content of the databases and information services that they use.

 

The next stage of the project is to create a high performance, scalable, and extensible platform for information retrieval support. Scalable performance will support more resource-intensive information retrieval techniques. Over the next twelve months, we have proposed and will be implementing the following:

 

1. Distributed queries: transparent support for effective search across many diverse databases and resources

2. Interoperability, including support for existing protocols and simple extensions of communication protocols (such as using the existing, but typically unsupported "search" request in HTTP and the service search capabilities of GRID computing protocols).

3. Concurrency, scalability, and robustness: We want the search to be ubiquitous, so that search becomes a virtually invisible part of any digital library system.

4. Dynamic databases

5. Structured, maintainable code

6. Unprivileged deployment

7. Application of shared and sharable metadata to the description of databases and database contents, including currently "opaque" content such as multimedia information resources.

 

In future we hope to extend the system's capability for supporting NLP (natural language processing) techniques, which may be used to identify discussion of related or associated information and the use of information extraction techniques for populating metadata databases with that discovered information. This is beyond the scope of the JISC/NSF grant allocation, but could extend from the existing Cheshire support for extracting geo-temporal references from texts and associating these with appropriate geographic coordinates and events.

 

Future developments could include adding system support for context sensitive reference linking via the OpenURL standard. This will enable users to create open links which are context sensitive and may be dynamically configured within portals or managed learning environments in a fully integrated manner. If possible, this would assist in the cost-effective integration of digital collections into institutional frameworks.
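As an indication of what such reference linking involves, the sketch below builds an OpenURL 1.0 (Z39.88-2004) link in key/encoded-value form for a journal article. The key names follow the OpenURL journal metadata format; the resolver address, the citation values, and the helper function are hypothetical.

```python
from urllib.parse import urlencode

def openurl_for_article(resolver_base, atitle, jtitle, issn, volume, spage, date):
    """Build an OpenURL 1.0 (KEV) link for a journal article citation."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
        "rft.issn": issn,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.date": date,
    }
    return resolver_base + "?" + urlencode(params)

# Hypothetical resolver and citation, for illustration only.
EXAMPLE_LINK = openurl_for_article(
    "http://resolver.example.edu/openurl",
    atitle="An example article", jtitle="Journal of Examples",
    issn="1234-5678", volume="12", spage="34", date="2003",
)
```

The link is context sensitive because only `resolver_base` changes from one institution to another; the citation metadata stays the same, and each institution's resolver decides where the user is sent.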

 

The project's capabilities for transforming data will enable existing metadata to be transformed to RDF. This could be used to combine information from multiple sources, e.g. extending support for RSS feeds for the display of events; combining, storing, and optimizing feeds using different modules such as bookmarks and learning objects; and enabling users to personalize resources or interactions, typically by profiling this information.

 

The planned technologies described above may enable students and teachers to deal more effectively with digital library and web based information resources, including new ways to visualize content, query appropriate collections, and organize results, as well as exploratory analysis tools.

 

We plan to extend support for enabling users to find quickly the relevant metadata and information resources themselves to satisfy their queries, wherever these may be, e.g.:

 

1. How to select the appropriate databases or collections for search from a large number of distributed databases;

2. How to perform parallel or sequential distributed search over the selected databases, possibly using different query structures or search formulations, in a networked environment where not all resources are always available; and

3. How to merge results from the different search engines and collections, with differing record contents and structures (sometimes referred to as the "collection fusion" problem).
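The "collection fusion" problem in the third question can be made concrete with a small sketch. It assumes each server returns (record identifier, score) pairs on its own scale, rescales them with min-max normalization, and keeps the best score per record. This is one common baseline, named here as an assumption; it is not a method this report commits to, and the server names and scores are invented.

```python
def normalise(results):
    """Min-max scale one server's (record_id, score) list into [0, 1]."""
    if not results:
        return []  # an unavailable server simply contributes nothing
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low
    return [(rid, 1.0 if span == 0 else (score - low) / span)
            for rid, score in results]

def fuse(result_sets, server_weights=None):
    """Merge per-server rankings into one list of (record_id, score, server)."""
    server_weights = server_weights or {}
    best = {}
    for server, results in result_sets.items():
        weight = server_weights.get(server, 1.0)
        for rid, score in normalise(results):
            weighted = weight * score
            # Keep only the highest weighted score seen for each record.
            if weighted > best.get(rid, (-1.0, None))[0]:
                best[rid] = (weighted, server)
    merged = [(rid, score, server) for rid, (score, server) in best.items()]
    return sorted(merged, key=lambda item: (-item[1], item[0]))

# Hypothetical result sets on different score scales.
RESULTS = {
    "server-a": [("rec1", 12.0), ("rec2", 6.0), ("rec3", 3.0)],
    "server-b": [("rec2", 0.9), ("rec4", 0.5)],
    "server-c": [],  # did not respond
}
```

The optional `server_weights` argument is where a server-selection score (the first question above) could feed into the merged ranking.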

 

The goal of our future research is to make distributed retrieval an effective part of the entire network and computing infrastructure which is automatically invoked as needed, both explicitly by user interaction and, even more commonly, as part of the "invisible" computing infrastructure.

 

7. Research Innovations

The research component has been developed with a view to advancing our basic understanding of information.

 

1. By enabling users to cross-search different data formats, protocols, etc., we have been able to devise a system which can be used to bridge the current classification of information into categories of text, numeric data, and geospatial information. This outcome has strengthened plain-language access to otherwise hidden information.

2. The project has derived an advanced paradigm for information discovery and retrieval that exploits the fundamental interconnections between diverse information resources, including textual and bibliographic information, numeric databases, and geospatial information resources.

3. The systems architecture permits advances in the design and evaluation of information retrieval systems through its distributed component architecture and protocols, which allow researchers to combine and test different components of information retrieval systems easily.

4. The software and the "bridges" developed in the project have provided a tool to foster the serendipitous discovery of new interdisciplinary knowledge by users of the system.

 

The project set out to achieve as its primary innovation the merging of five areas of theory and practice which, until now, have remained separate. These are:

 

1. Text indexing and searching using an advanced probabilistic and Boolean search engine, and distributed search and retrieval from heterogeneous databases using the Z39.50 information retrieval protocol.

2. Effective management of different metadata vocabularies, including support for cross-thesauri retrieval.

3. Production of tools for knowledge discovery and intelligent filtering.

4. The automatic transformation of data, e.g. from MARC to RDF/IMS for existing large scale datasets.

5. Interoperability and access to geospatial information, and to numerical databases and their metadata encoded in the codebook DTD.

 

The following are innovation outcomes in research issues and design which support the text discovery and retrieval elements of the project:

 

1. Tools for knowledge discovery. These enable intelligent metadata reuse within the digital library (and managed learning) environments, and allow natural language processing to increase the semantic usefulness of this data. The project has resulted in innovative ways of optimally mapping user and document vocabularies to the controlled vocabularies used in descriptive metadata, including advances in natural language processing and devising statistical associations between human and metadata vocabularies. This has improved the effectiveness of existing metadata resources and will reduce the reliance on expensive "handcrafted" links between metadata vocabularies.

2. Access, Retrieval, and Filtering Information: The research component of the project has focused on advanced information retrieval techniques, built on Z39.50 technology, which permit users to achieve the depth and diversity of information formerly limited to highly skilled experts.

3. Advanced methods for Services Related to Access: The outcomes of the project have resulted in extending the software capabilities for protocols other than Z39.50 to facilitate true interoperability among diverse distributed databases. We have developed the Z39.50 system to act as a transformation engine, turning existing Z39.50 and other resource descriptions into RDF and IMS specifications. This technique may in future contribute to the development of teaching and learning resources, integrating geospatial, bibliographic, and other domains within that framework. The Cheshire system may now in effect locate, harvest, and index descriptions using a Z39.50 server, providing Z39.50, SOAP, HTTP, SRW, etc., interfaces which can enhance the information interchange required by portals, MLEs, etc. We have implemented partial XMLSchema support within the Z39.50 framework so that it can fulfill its potential to assimilate a vast network of resources into portals, managed learning environments, etc.

4. The development and integration of cross-media standards and metadata. We have developed a new method of metadata reuse which is based on the technological advances resulting from the systems architecture: in particular, the probabilistic retrieval capabilities of the system to "cluster together" metadata elements, making it easier for the user to retrieve the most relevant entries in one or more datasets using pseudo-relevance feedback techniques. These techniques will facilitate access to metadata describing numerical datasets, geospatial datasets, text and bibliographic datasets. They may be used to extend the existing capabilities for mapping multiple languages to terms used in topical metadata. We have provided preliminary client support for making visual representations of temporal and spatial information using a GIS viewer (TimeMap). The architecture itself is standards-based with support for SGML/XML (including XMLSchemas) and the Z39.50 information retrieval protocol among others (e.g. OAI, SOAP, SRW, etc.) which may further encourage the archiving and preservation of this metadata in a standards-based framework.

5. Generic research in content technologies. The project has also extended natural language processing to support the automatic generation of search terms, which may be extended to foreign language search terms. This may in future be extended to develop and implement a multilingual query translator which, when combined with the Z39.50 search and retrieve protocol, could be programmed to broadcast the translated query in many languages to search appropriate foreign language catalogues. This could include the language of special communities and dialects.

6. The integration of geographic information into the web-based service environment (e.g. Information Visualization). The project has integrated techniques of natural language processing with a GIS information visualization system (TimeMap).

7. Access to Digital Resources. An outcome of the Cheshire project is to exploit the content of distributed repositories in an interoperable, standards-based framework. This will be the basis for extending the technology to support generic information brokers or gateways already funded, accommodating a variety of cultural and scientific datasets. The integration of scientific and cultural resources into VLEs/MLEs may result in improved access for teachers and learners while at the same time resulting in cost-effective operations of services which can remain totally distributed, rather than federated or hybrid.
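The pseudo-relevance feedback mentioned in item 4 above can be illustrated in a few lines. The sketch assumes the top-ranked records of an initial search are treated as relevant and that their most frequent subject terms are added to the query; the record structure, field name, and parameters are hypothetical and do not describe Cheshire's internal feedback method.

```python
from collections import Counter

def expand_query(query_terms, ranked_records, top_k=2, n_new_terms=2):
    """Add the subject terms most common in the top_k records to the query."""
    seen = set(query_terms)
    counts = Counter()
    for record in ranked_records[:top_k]:
        counts.update(t for t in record["subjects"] if t not in seen)
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ranked[:n_new_terms]

# Hypothetical ranked result list from an initial search.
RANKED = [
    {"id": "r1", "subjects": ["railroads", "lancashire"]},
    {"id": "r2", "subjects": ["railroads", "timetables"]},
    {"id": "r3", "subjects": ["canals"]},
]
```

Running the expanded query pulls in records that share metadata terms with the initial top results, which is how a cluster of related metadata elements can help a user who started from unfamiliar vocabulary.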

 

A primary research objective of this project has been to bring complex discovery techniques to bear on narrowing, organizing, and rectifying the resources that follow a first-level search. The development and implementation of this as part of the Cheshire system through the techniques cited above has been undertaken to ensure that users can create the most accurate and precise views of information possible based on their interest, and the project itself is intended to achieve the highest degree of interoperability.