• Cheshire3: A focus on high-performance, network-of-workstations-style operation. A scalable, extensible platform for information retrieval research. Scalable performance lets us explore more resource-intensive IR techniques.
  • JavaSpaces: a high-level coordination mechanism for distributed systems. It provides a lightweight publish/subscribe distributed programming model.
    • "Messages" in a JavaSpace should be small. Use direct, point-to-point mechanisms for bulk data transfers.
    • Cheshire will have low-volume transactions.
    • Distributed transactions, leasing, and events come for "free".
  • Use JavaSpaces to govern distributed transactions
    • search
    • display
    • update
  • A single operational model for Cheshire that encompasses single node installations, uniformly administered clusters, as well as independently administered federations.
    • every operation is a distributed operation
    • an operation is applied over a set of collections
    • collections
      • single node or cluster: can be partitions of other collections
      • federation: can be partitions or subsets of other collections. In other words, collections in a loosely coupled federation may have overlapping records.
      • Virtual Collection: the external interface (or view) to collections. A VC may present only part of the underlying real collection in its interface. A VC may grow or shrink dynamically within the bounds of the real collection. A search only needs to be done over the documents in the VC, not all documents in the collection. This gives us a way to logically partition a collection across a number of machines for a performance increase, with built-in redundancy in case of node failures. When a node fails, its VC is simply distributed (logically) to other nodes in the cluster.
      • Cheshire servers can be organized into server groups. A server group can be thought of as an administrative unit.
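The virtual collection idea above can be illustrated with a minimal sketch. It assumes, beyond what the notes state, that every node in the cluster holds a replica of the real collection, so that reassigning records is a purely logical operation. The class name `VirtualCollectionSketch`, the integer record ids, and the round-robin assignment are all hypothetical choices made for illustration, not part of Cheshire.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: each node exposes a virtual collection (a set of record ids). */
final class VirtualCollectionSketch {
    // node name -> record ids currently visible through that node's virtual collection
    private final Map<String, SortedSet<Integer>> views = new TreeMap<>();

    /** Partitions record ids 0..recordCount-1 across the nodes round-robin (an assumed policy). */
    VirtualCollectionSketch(List<String> nodes, int recordCount) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        for (String node : nodes) {
            views.put(node, new TreeSet<>());
        }
        for (int id = 0; id < recordCount; id++) {
            views.get(nodes.get(id % nodes.size())).add(id);
        }
    }

    /** Read-only view of the records a node must search; empty if the node is unknown or failed. */
    SortedSet<Integer> viewOf(String node) {
        SortedSet<Integer> view = views.get(node);
        if (view == null) {
            return Collections.<Integer>emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(view);
    }

    /** Logically redistributes a failed node's virtual collection over the surviving nodes. */
    void failNode(String node) {
        if (!views.containsKey(node)) {
            return; // unknown or already failed
        }
        if (views.size() == 1) {
            throw new IllegalStateException("no surviving node can take over");
        }
        SortedSet<Integer> orphaned = views.remove(node);
        List<String> survivors = new ArrayList<>(views.keySet());
        int next = 0;
        for (int id : orphaned) {
            views.get(survivors.get(next++ % survivors.size())).add(id);
        }
    }

    /** Total number of records visible across all live virtual collections. */
    int totalVisible() {
        int total = 0;
        for (SortedSet<Integer> view : views.values()) {
            total += view.size();
        }
        return total;
    }
}
```

In this sketch no data moves when a node fails; only the bounds of the surviving views grow, which is the property the notes rely on for redundancy.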
  • Designing for parallelism and scalability
    • One I/O worker per disk (conceptually). The philosophy here is that we don't want the OS to attempt more parallelism in software than actually exists in hardware. If Berkeley DB is intelligent about concurrent accesses to disk, we already have this effect and don't need to do anything special here.
    • One data worker for each data resource. Here, a data resource should not straddle multiple disks. The philosophy is that there is an optimal way to access each resource concurrently, and the data worker is responsible for scheduling this access.
    • One compute worker for each CPU (conceptually). The OS can be thought of as exactly this kind of worker: it knows about its CPU resources and schedules them accordingly.
    • One task worker for each task. A task may be protocol translation, search, display, update, etc.
    • A distinction should be made between task and data workers in the same address space and those in different address spaces. In general, data-intensive transfers should happen only between data and task workers in the same address space. Task workers have direct read access to data. Writes are handled by background data workers.
    • Use profiling tools for performance tuning.
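The split between task workers (direct reads) and a background data worker (scheduled writes) might look roughly like the sketch below. It assumes one single-threaded executor per data resource and uses an in-memory map as a stand-in for the real store (the notes mention Berkeley DB, which is not modeled here). The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: one data worker owns all writes to one data resource. */
final class DataWorkerSketch implements AutoCloseable {
    // Stand-in for the data resource; a real system would wrap an on-disk store.
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // A single background thread serializes writes, so the worker alone
    // decides the order in which the resource is modified.
    private final ExecutorService writer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "data-worker");
        t.setDaemon(true); // do not keep the JVM alive for this sketch
        return t;
    });

    /** Task workers in the same address space read directly, without queuing. */
    String read(String key) {
        return store.get(key);
    }

    /** Writes are handed to the background data worker and complete asynchronously. */
    CompletableFuture<Void> write(String key, String value) {
        return CompletableFuture.runAsync(() -> store.put(key, value), writer);
    }

    /** Stops accepting new writes; writes already queued still run. */
    @Override
    public void close() {
        writer.shutdown();
    }
}
```

A task worker would call `write("doc1", "v1")` and continue, or call `join()` on the returned future when it needs to observe the write before reading.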
  • Building an open system
    • Loosely coupled object architecture
    • distinguish between structural and behavioral patterns in an object architecture
    • Superclasses capture the core contract, while interfaces capture ancillary contracts.
    • objects can be created on demand. If this is done (could be an attractive option for Document objects), it should be hidden from clients.
    • object composition (black box reuse) is more reusable than class inheritance (white box reuse)
    • Reduce the proliferation of types with "unsupported operation" exceptions. That is, define broad, general types and leave it to implementation to support or refuse operations.
    • Override uncaughtException in the root ThreadGroup for error handling. All threads should be descendants of the root group.
    • Follow these code conventions where practical.
    • Packages contain functionally related classes and do not necessarily reflect structural patterns. For example, although index and retrieval objects are aggregated under (managed by) document objects, the index and retrieval packages are not subpackages of the document package. Using packages to document structural patterns is both unnecessary and insufficient. They are, however, a useful way to define behavioral groups: the document package deals with documents, the index package deals with indexing, etc.
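The error-handling point above (overriding uncaughtException in a root thread group) can be sketched as follows. `ThreadGroup.uncaughtException(Thread, Throwable)` is a standard Java method; the group name "cheshire-root", the `spawn` helper, and the choice to record failures in a list rather than log them are assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch: a root group that centralizes handling of uncaught exceptions. */
final class RootGroupSketch extends ThreadGroup {
    // Failures are collected here; a real server would more likely log or alert.
    private final List<String> failures = new CopyOnWriteArrayList<>();

    RootGroupSketch() {
        super("cheshire-root"); // assumed name
    }

    /** Called by the JVM when a thread in this group (or a subgroup) dies from an exception. */
    @Override
    public void uncaughtException(Thread t, Throwable e) {
        failures.add(t.getName() + ": " + e.getMessage());
    }

    /** Creates a thread that belongs to this group, so its failures are routed here. */
    Thread spawn(String name, Runnable task) {
        return new Thread(this, task, name);
    }

    List<String> failures() {
        return failures;
    }
}
```

Threads created through `spawn` are members of the group, and threads they create inherit the group by default, which is one way to keep every thread a descendant of the root. A thread that sets its own uncaught-exception handler would bypass the group.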