1. Introduction

Cheshire is a client/server information retrieval system that brings modern information retrieval techniques to a wide array of data domains. Cheshire provides uniform document storage in the form of SGML, supports probabilistic search, and supports Z39.50 interoperability with dozens of library information systems around the world.

Cheshire is now several years old. Its program logic is coded entirely in C, and most of its user interface is written in Tcl/Tk. Further development is inhibited by system complexity that has outgrown the original design and by dated software technologies. New technologies are now available that can dramatically reduce coding effort and enhance robustness, maintainability, and interoperability. This paper proposes to re-engineer Cheshire into a modern software system, with the hope of ensuring its future viability as a platform for information retrieval research. The next section outlines the system objectives of a next-generation system. Section 3 discusses the technologies available for meeting those objectives. Section 4 proposes a modular reorganization of Cheshire functionality into packages. Section 5 discusses the directory structure for deployment. Section 6 examines issues in migrating the existing system to the new design. Section 7 concludes.

2. Design Objectives

To continue Cheshire's viability as a research platform, the system must appeal to users and developers alike. It must satisfy the information needs of users, and it must also make it easy for developers to modify and experiment with the system. I identify below seven sets of features I would like to see in a next-generation Cheshire.


3. New Technologies

A number of new software technologies exist today that can help achieve the objectives outlined in the previous section.

All of the technologies listed above are free. Source code is available for most of them.

After some examination, I propose not to adopt the following technologies.


4. Source Code Organization

Java provides a package-based naming system in which the location of a file in the directory tree, in either source or object form, corresponds to its package name. The idea is that classes with like functionality are grouped together in a package. Classes in the same package have privileged access to each other's non-private fields and methods, whereas classes in other packages can access only public members (and, in the case of subclasses, protected members). Thus the package provides a level of encapsulation above the class. Packages themselves can be grouped into hierarchies.
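The access rules can be illustrated with a minimal sketch. The class and field names below (Posting, PostingReader, and so on) are hypothetical and are not part of Cheshire. Because a single source file belongs to exactly one package, only the same-package case is executed here; the cross-package case is noted in comments.

```java
// Hypothetical classes for illustration only. All classes in one source
// file share a package, so they see each other's non-private members.
class Posting {
    private int internalId = 7;          // visible only inside Posting
    int termFrequency = 3;               // package-private: visible within the package
    protected String field = "title";    // package, plus subclasses in other packages
    public String docId = "doc-1";       // visible to every package
}

class PostingReader {
    // Same package as Posting: package-private, protected, and public
    // members are all reachable.
    static String describe(Posting p) {
        // p.internalId would not compile here: private stays inside Posting.
        // From a different package, only p.docId would compile (and p.field
        // only when accessed through a subclass of Posting).
        return p.docId + ":" + p.field + ":" + p.termFrequency;
    }
}

class AccessDemo {
    public static void main(String[] args) {
        System.out.println(PostingReader.describe(new Posting()));
    }
}
```

Moving PostingReader into a different package would turn the references to termFrequency and field into compile-time errors, which is the encapsulation boundary a package provides.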

For a system as complex as Cheshire, a package-oriented organization can be very useful. It makes the broader purpose of a class self-documenting, while encouraging the developer not to combine in a single class functionality that belongs in different classes. The diagram below gives the proposed modular view of the new Cheshire, with the modules named by the packages that will implement them. A module contained in another module is a subpackage of the container package. Arrows indicate data flow.


Cheshire packages should be rooted at edu.berkeley.cheshire to ensure their uniqueness across the Internet. Symbolic links and aliases can be used to get around any inconveniences caused by long package names. The edu.berkeley prefix was omitted in the above diagram for display considerations. The packages are described below.
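As a small sketch of how a package name determines a file's location, the helper below converts a package name into a relative directory path. The subpackage name "search" and the class name "QueryParser" are assumptions chosen for illustration; only the edu.berkeley.cheshire root comes from the proposal.

```java
// Hypothetical helper: maps a Java package name to the relative directory
// in which its source (or class) files are expected to live.
class PackagePaths {
    // Each dot-separated component becomes one directory level.
    static String toDirectory(String packageName) {
        return packageName.replace('.', '/');
    }

    // Full relative path of a source file for a class in the given package.
    static String toSourceFile(String packageName, String className) {
        return toDirectory(packageName) + "/" + className + ".java";
    }

    public static void main(String[] args) {
        // "search" and "QueryParser" are illustrative names, not Cheshire's.
        System.out.println(toSourceFile("edu.berkeley.cheshire.search", "QueryParser"));
        // prints edu/berkeley/cheshire/search/QueryParser.java
    }
}
```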

5. Deployment Organization

Using standard UNIX naming conventions for directories, I propose the following directory tree for deployment.

This deployment scheme does not require root access.

6. Migration Path

Having outlined a redesign of the Cheshire system using completely new technologies, I now turn to some important questions. What is the right implementation strategy? Do we integrate the new technologies and new source code with the existing system one piece at a time, or do we build the system completely from scratch all at once?

The integration approach would proceed from the top down: first we implement the glue that holds the system together by building interfaces to legacy components, then we replace those components one at a time. The opposite, bottom-up approach makes little sense for our situation; we would see none of the benefits of the new technology at the front end until the entire system is complete. The integration approach has the advantage of quickly delivering modern server technologies in a deployable system. It also forces us to consider the full set of supporting interfaces as each new interface is implemented, reducing the chance that important functionality becomes incompatible with the new interface.
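To make the top-down approach concrete, here is a minimal sketch of what "an interface to a legacy component" might look like. All names (SearchService, LegacyEngine, LegacySearchAdapter) and the pipe-delimited result format are assumptions for illustration; they do not describe the actual Cheshire code, and a real adapter would reach the C code through something like JNI or a socket rather than an in-process stub.

```java
import java.util.ArrayList;
import java.util.List;

// New-style interface that the rest of the redesigned system programs against.
interface SearchService {
    List<String> search(String query);
}

// Stand-in for an existing legacy component. The pipe-delimited result
// string is a made-up format used only for this sketch.
class LegacyEngine {
    String rawSearch(String query) {
        return "doc1|doc2";
    }
}

// Glue layer: exposes the legacy component through the new interface.
// Later it can be swapped for a native implementation of SearchService
// without touching any caller.
class LegacySearchAdapter implements SearchService {
    private final LegacyEngine engine;

    LegacySearchAdapter(LegacyEngine engine) {
        this.engine = engine;
    }

    @Override
    public List<String> search(String query) {
        List<String> ids = new ArrayList<>();
        for (String id : engine.rawSearch(query).split("\\|")) {
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }
}

class IntegrationDemo {
    public static void main(String[] args) {
        SearchService service = new LegacySearchAdapter(new LegacyEngine());
        System.out.println(service.search("information retrieval"));
    }
}
```

Every legacy component would need its own adapter of this kind, each of which must be devised, implemented, and tested before it is eventually discarded.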

The integration approach has two disadvantages. The first is that integrating legacy components at each step forces us to deal with the old architecture, so the system will not fully benefit from the cleanness and consistency of a new design. The second is the enormous amount of programming effort required to devise, implement, and test these interfaces. In my opinion, these two disadvantages far outweigh the benefits of the integration approach. The existing Cheshire system is usable, and there is no immediate need for a new server. An interface that doesn't work can always be redesigned, and such a redesigned interface will still be more self-consistent, and will have required less implementation effort, than an interface built on integration with legacy components at every step of the way.

Rebuilding the system from scratch therefore seems the more sensible approach. Due to the complexity of the Cheshire system, however, it will be difficult to sustain a happy development effort if we set out to build a massive system and see nothing working whatsoever for three months. I propose that we take a mini-system-evolves-to-full-system approach. We start by building a simple multi-threaded, Boolean search, SDLIP service, written entirely from scratch. From there we expand to where Cheshire is today, and beyond. Under this approach, it is important to have a carefully designed skeleton structure established up front, as I believe I have done in the previous sections. Each piece of new code may be solving a tiny piece of the puzzle, but with the skeleton for reference, we always know where we are and what our purpose is.
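As a rough sketch of how small such a starting point could be, the following shows an in-memory Boolean index queried from multiple threads. It is an illustration under stated assumptions, not the proposed implementation: the class names, the sample documents, and the tokenization rule are all hypothetical, and the SDLIP service layer is omitted entirely.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical in-memory inverted index supporting Boolean AND and OR.
class BooleanIndex {
    private final Map<String, Set<Integer>> postings = new ConcurrentHashMap<>();

    // Splits text on non-word characters and records docId under each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> ConcurrentHashMap.newKeySet()).add(docId);
            }
        }
    }

    Set<Integer> lookup(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase());
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    Set<Integer> and(String a, String b) {
        Set<Integer> result = new TreeSet<>(lookup(a));
        result.retainAll(lookup(b));   // set intersection
        return result;
    }

    Set<Integer> or(String a, String b) {
        Set<Integer> result = new TreeSet<>(lookup(a));
        result.addAll(lookup(b));      // set union
        return result;
    }
}

class MiniSearchDemo {
    // Made-up sample documents.
    static BooleanIndex sampleIndex() {
        BooleanIndex index = new BooleanIndex();
        index.add(1, "Probabilistic information retrieval");
        index.add(2, "SGML document storage");
        index.add(3, "Information retrieval over SGML");
        return index;
    }

    // Runs two queries on a small thread pool and collects both results.
    static List<Set<Integer>> runConcurrently(BooleanIndex index) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<Set<Integer>> q1 = () -> index.and("information", "retrieval");
            Callable<Set<Integer>> q2 = () -> index.or("sgml", "probabilistic");
            Future<Set<Integer>> f1 = pool.submit(q1);
            Future<Set<Integer>> f2 = pool.submit(q2);
            return Arrays.asList(f1.get(), f2.get());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runConcurrently(sampleIndex()));
        // prints [[1, 3], [1, 2, 3]]
    }
}
```

Probabilistic ranking, persistent storage, and the protocol front ends would then be added behind the same kind of narrow interface as the system grows toward the full design.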

7. Conclusion

I have proposed a redesign of the Cheshire information retrieval system. While the design does not in itself achieve the system objectives outlined in Section 2, it does achieve the architectural objective of accommodating all seven of those system objectives. We enable distributed queries through a clean inter-server discovery and search interface. We design for multiple interfaces and extensibility from the ground up, with clean plug-ins for interoperability. We take advantage of Java and Java tools to write maintainable, concurrent, high-performance code. Java web technologies enable a broad base of clients and allow Cheshire to be searched and administered from anywhere, without having to install software locally. The groundwork has been laid for a next-generation platform for information retrieval.