Cheshire II Commands

CONFIGURATION FILE CONTENTS


Configuration File Tags for the Main File Definition

Tags for the Main File Definition (Continued)

Tags for the Index File Definition

Tags for the Index File Definition (Continued)

Sub-Tags for Component Definition

Sub-Tags for Cluster Definition

Sub-Tags for Display Specification

Sub-Tags for Z39.50 Explain Information

ACCESS attribute Codes and Their Meanings

EXTRACT attribute Codes and Their Meanings

NORMAL attribute Codes and Their Meanings

Date Format Patterns and Their Interpretations

LAT_LONG and BOUNDING_BOX Format Patterns and Their Interpretations

MARC 008 Field Sub-Elements

Sub-Tags for Z39.50 Index Mapping Definition

Z39.50 BIB-1 USE Attributes

Z39.50 GILS USE Attributes

Z39.50 BIB-1 Relation Attributes

Z39.50 BIB-1 Position Attributes

Z39.50 BIB-1 Structure Attributes

Z39.50 BIB-1 Truncation Attributes

Z39.50 BIB-1 Completeness Attributes

SGML Catalog Entry formats

SGML Catalog Entry formats (Continued)

Sub-Tags for Index Extraction Elements Definition (tagspec)

Tag Specifications

Configuration File DTD

Sample Configuration File

Cheshire Processing Instructions in SGML/XML files


This document describes the format used in the Cheshire II configuration files and their contents.

Configuration files provide the specifications for indexing of SGML elements and sub-elements by the Cheshire II server and search engine. There may be multiple configuration files for each "file" maintained by the server, or the configuration specifications for multiple files may be included in a single configuration file.

Configuration files are themselves SGML documents (the DTD is appended). The following tables describe each element (tag) of the configuration file and its contents. There are matching end tags (beginning with "</") for each of the tags described below. Everything between a begin tag and an end tag is processed as the value of the tag; all "white space" (blanks, tabs, carriage returns, newlines) and SGML comments (anything beginning with "<!--" and ending with "-->") are ignored in processing. The exception is blank spaces in QUOTED file names under Windows, which are kept to permit multiword file and directory names.
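
As a small illustration of the white space and comment handling just described, the two entries in the following sketch should be read identically when the file is processed. The nickname "bibfile" is a made-up value, and the two entries are alternatives shown side by side for comparison, not a pair meant to appear in one file.

```sgml
<!-- An SGML comment: ignored when the configuration file is processed -->
<FILETAG>
    bibfile
</FILETAG>

<!-- Alternative spelling of the same entry: the surrounding
     white space is not part of the tag's value -->
<FILETAG>bibfile</FILETAG>
```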


Configuration File Tags for the Main File Definition

TAG Data Type Meaning
<dbconfig> tag This tag begins a configuration file. It is required.
<dbenv> directory name This tag is optional and may be used to set the (required) database environment directory for indexes defined in the config file. Alternatively, if the CHESHIRE_DB_HOME environment variable is set, then it overrides values specified here. As of version 2.20, one or the other must be set. Unlike the environment variable, the DBENV option lets you have a separate environment for the indexes defined in each config file. For further information on database environments, see the BerkeleyDB documentation included in the distribution.
<filedef TYPE= ?> tag This tag introduces a new file definition. All tags until the matching end tag are part of this file definition. The tag includes the following attributes:
ATTR: TYPE = filetype This attribute specifies the type of file. The current recognized types are SGML, XML, MARC, MARCSGML, AUTH, CLUSTER, LCTREE, MAPPED (these are actually ALL SGML files, but filetypes other than the default (SGML) have special processing routines in the server) and SQL | DBMS | RDBMS (which are external relational database files that are accessible via Z39.50, but are not managed or indexed directly by Cheshire) and finally SGML_DATASTORE, XML_DATASTORE, MARC_DATASTORE and MARCSGML_DATASTORE (which use the Cheshire DataStore facility to store pre-parsed versions of the SGML, XML, or MARC_SGML data). VIRTUAL may be used for virtual databases. Note that MAPPED files assume that the SGML structure is just a definition of how to access the data from non-SGML documents stored in one or more files.
<ConfigInclude> filepath This tag can be used to reference the contents of other files containing configuration file information. It can be used in place of a filedef so that the named file is read and the filedef in that file is included. It is possible to use this for other parts of a configfile, but not in all situations. Note that the configfile using ConfigInclude must AT LEAST include a DBCONFIG tag and DBENV tag in addition to the ConfigInclude tags. DBCONFIG and DBENV tags are ignored in files included with ConfigInclude, so it is possible for multiple standalone configfiles to be combined, for example, for a virtual database definition where the individual databases are added to the virtual DB configfile by ConfigInclude.
<filename> filepath This tag precedes the full path name of the main file for this file when the database is a single file of records. It should contain only the directory name containing the files (not really used) if multiple continuation files are being used. It should contain the name of the external database when the filetype is DBMS. For DataStore files this is the actual Berkeley DB database file in which the pre-parsed SGML/XML data is stored.
<filetag> string This tag precedes a nickname to use for the file. The nickname can be used in place of the full filename whenever the file name needs to be specified. (It MUST BE UNIQUE across all files, especially since the server can support multiple config files for different databases.)
<filecont ID=? MIN=? MAX=? DB=?> filepath This tag precedes the full path name of a continuation file for very large files that, for example, must be split over different devices, or are stored as multiple separate files in a directory subtree (see the "-r" option of the buildassoc utility that can automatically generate all filecont entries for a given directory subtree).
ATTR: ID=number The id attribute of filecont gives the sequence number of this particular continuation file (there may be multiple continuation files).
ATTR: MIN=number This provides the smallest document ID number contained in the continuation file.
ATTR: MAX=number This provides the highest document ID number contained in the continuation file.
ATTR: DB=number If DB=1 (and ID=0, MIN=0 and MAX=0), the filecont content is the path of a BerkeleyDB file that contains an indexed version of the FILECONT data. This was created because some databases have very large numbers of records, each of which is a single file in a file tree, and thus there is a single filecont entry for each of them. DB= should be omitted from conventional filecont entries. The utility programs BuildContDB and DumpContDB are included to create and display filecont databases. A filedef using this would include something like:
<FILENAME> DATA/en </FILENAME>
<FILECONT ID=0 MIN=0 MAX=0 DB=1>DATA/en.cont_DB</FILECONT>
The filecont data is usually built using buildassoc before being indexed by BuildContDB.
<continclude> filepath This tag precedes the full path name of a file that contains ONLY filecont definitions as described above. The file is read and processed as if it were included in the configuration file directly. This tag may be interspersed with filecont tags.
<filedtd TYPE= ?> filepath This tag precedes the full path name of the SGML/XML Document Type Definition (DTD) file for this file. OR, if the TYPE=XMLSCHEMA attribute is set, the name is the name of the XMLSchema DTD file.
ATTR: TYPE = dtdtype This attribute specifies the type of DTD or Schema. The current recognized types are SGML, XML, and XMLSchema. This attribute is OPTIONAL and will default to SGML if not supplied.
<XMLSchema> filepath This tag precedes the full path name of the XMLSchema Definition for this file. If the FILEDTD TYPE=XMLSCHEMA attribute is set, then it is an error not to include this tag.
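
The ConfigInclude description above can be sketched as a small top-level configuration file that pulls in two standalone configfiles. This is only an illustration under stated assumptions: the directory and file names are hypothetical, and the included files are assumed to each contain a complete filedef of their own.

```sgml
<!-- Top-level configfile: DBCONFIG and DBENV must be present here,
     because those tags are ignored in the included files. -->
<DBCONFIG>
<DBENV> /usr/local/cheshire/dbenv </DBENV>

<!-- Hypothetical paths; each named file contributes its filedef -->
<CONFIGINCLUDE> /usr/local/cheshire/config/db_one.config </CONFIGINCLUDE>
<CONFIGINCLUDE> /usr/local/cheshire/config/db_two.config </CONFIGINCLUDE>
</DBCONFIG>
```

Because the DBCONFIG and DBENV tags of the included files are skipped, each of them can remain usable as a standalone configuration file.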


Configuration File Tags for the Main File Definition (Continued)

TAG Data Type Meaning
<SGMLcat> filepath This tag precedes the full path name of the SGML catalog file used to resolve PUBLIC name references (and potentially other references) for this DTD. This provides the basis for an "Entity Manager" that generates a system identifier for every external entity using catalog entry files in the format defined by SGML Open Technical Resolution TR9401:1995 (http://www.sgmlopen.org/sgml/docs/library/9401.htm). See below for catalog entry formats.
<Explain> explain defs This tag precedes a set of Z39.50 explain definitions for the database (file). This information is used along with the rest of the configuration information to generate explain records for the file. See the Explain definitions table below.
<assocfil> filepath This tag precedes the full path name of the associator file for this file. This should be empty for DBMS filetypes.
<history> filepath This tag precedes the full path name of the history file for this file.
<indexes> comments This tag precedes the definitions of indexes for this file. Each index definition (described in the following table) is specified following this tag and preceding the matching </indexes> tag.
<components> comments This tag precedes the definitions of components for this file. Each component definition (ComponentDef: described in a following table) is specified following this tag and preceding the matching </components> tag.
<clusters> filename If the filedef type attribute is NOT CLUSTER, then this tag is used to introduce the definitions and names of cluster files based on elements extracted from this file and to indicate the cluster key element(s) for the clustering process. Each cluster name should have its own filedef definition with the type attribute set to CLUSTER. One cluster file name should be specified per occurrence of the tag. When the filedef type attribute IS CLUSTER, this tag contains the name of the file being clustered (not used for DBMS filetypes). See the following table on cluster specifications for the sub-elements used in the cluster tag.
<dispoptions> keywords This tag precedes one or more from a set of keywords relating to special options for conversion by the cheshire server when records are returned. The option keywords are: KEEP_AMP to retain any "&amp;"(&) entities, KEEP_LT to retain any "&lt;"(<) entities, KEEP_GT to retain any "&gt;"(>) entities, or KEEP_ALL to retain all three. The default operation of the server is to convert these to their single character form.
<displays> displayspec This tag precedes the definitions of display formats (or "element sets" in Z39.50 jargon) for this file/database. Each "Display Specification" definition (described in a subsequent table) is specified following this tag and preceding the matching </displays> tag. This is an optional tag. If not supplied, the entire record is always returned regardless of the "element set" requested in a present or search. The old form of this tag (<display>) will still work.
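
Putting the main file definition tags from the two tables above together, a file definition might be laid out roughly as follows. This is a hedged sketch, not a tested configuration: all paths and the "bibfile" nickname are hypothetical, the tag order simply follows the order of the tables, and the required order and optionality of each tag should be checked against the configuration file DTD.

```sgml
<DBCONFIG>
<DBENV> /usr/local/cheshire/dbenv </DBENV>

<FILEDEF TYPE=SGML>
<!-- Main data file and its nickname (hypothetical values) -->
<FILENAME> /usr/local/cheshire/data/records.sgml </FILENAME>
<FILETAG> bibfile </FILETAG>

<!-- DTD for the records; TYPE defaults to SGML when omitted -->
<FILEDTD> /usr/local/cheshire/data/records.dtd </FILEDTD>

<!-- Associator and history files for the data file -->
<ASSOCFIL> /usr/local/cheshire/data/records.sgml.assoc </ASSOCFIL>
<HISTORY> /usr/local/cheshire/data/records.sgml.history </HISTORY>

<!-- Keep "&amp;" entities instead of converting them to "&" -->
<DISPOPTIONS> KEEP_AMP </DISPOPTIONS>

<INDEXES>
<!-- one INDEXDEF entry per index goes here -->
</INDEXES>
</FILEDEF>
</DBCONFIG>
```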


Configuration File Tags for the Index File Definition

TAG Data Type Meaning
<indexdef ATTR...> comments This tag signals the beginning of one particular index definition. The attributes that should be included in each indexdef are:
ATTR: ACCESS= access type This is the access method to be used in this index. The methods are BTREE, HASH, VECTOR or BITMAPPED. The default is "BTREE". For DBMS (SQL or RDBMS) filetypes "DBMS" is used. See Table below for access types.
ATTR: EXTRACT=key type This attribute specifies the sort of extraction to be performed on the data. Currently "KEYWORD", "EXACTKEY", "FLD008_KEY", "FLD008_DATE", "FLD008_DATERANGE", "URL", "FILENAME", "DATE", "DATE_TIME", "DATE_RANGE", "DATE_TIME_RANGE", "GEOTEXT", "LAT_LONG", "BOUNDING_BOX", "GEOTEXT_LAT_LONG", "GEOTEXT_BOUNDING_BOX", "KEYWORD_EXTERNAL", "KEYWORD_PROXIMITY" and "KEYWORD_EXTERNAL_PROXIMITY" are supported. In DBMS files KEYWORD and EXACTKEY are valid and used for text or char fields. INTEGER_KEY (or INTEGER), DECIMAL_KEY (or DECIMAL), and FLOAT_KEY (or FLOAT) are also available for SINGLE numerical data elements and for DBMS file mapping. "KEYWORD" is the default. See the table below for all EXTRACT attribute codes and their meanings, along with aliases for the codes.
ATTR: NORMAL=normalization type This attribute specifies the sort of normalization to be performed on the keys extracted. The options are "STEM", "WORDNET", "CLASSCLUS", "BASIC", "DO_NOT_NORMALIZE", "REMOVE_TAGS_ONLY", "STEM_NOMAP", "WORDNET_NOMAP", "XKEY" or "EXACTKEY", "XKEY_NOMAP" or "EXACTKEY_NOMAP", "CLASSCLUS_NOMAP", "BASIC_NOMAP", "STEM_FREQ", "XKEY_FREQ", "BASIC_FREQ" and "REMOVE_TAGS_ONLY_FREQ". Recent additions include a simple plural-removing stemmer ("SSTEM" or "SSTEM_FREQ") and the Snowball stemmers for European languages: FRENCH_STEM, GERMAN_STEM, DUTCH_STEM, SPANISH_STEM, ITALIAN_STEM, SWEDISH_STEM, PORTUGUESE_STEM, RUSSIAN_STEM (or RUSSIAN_UTF8_STEM), RUSSIAN_KOI8_STEM, DANISH_STEM and NORWEGIAN_STEM. "BASIC" is the default. If the EXTRACT attribute is one of the date or date-time values, NORMAL must instead be the pattern to use in extracting the date from the records; see the table Date Format Patterns and Their Interpretations for the format patterns. If the EXTRACT attribute is one of the LAT_LONG or BOUNDING_BOX values, NORMAL must be the pattern to use in extracting the coordinate information from the records; see the table LAT_LONG and BOUNDING_BOX Format Patterns and Their Interpretations. See the table of normalization types below.
ATTR: PRIMARYKEY=Primary key options "PRIMARYKEY=IGNORE" or simply "PRIMARYKEY" in this attribute indicates that the index is the primary key for the data file, and any duplicate keys are to be ignored (REJECT is a synonym for IGNORE). "PRIMARYKEY=REPLACE" indicates that incoming records with duplicate primary keys are to replace any older existing record with that same primary key. "PRIMARYKEY=NO", "PRIMARYKEY=NONE", "NOTPRIMARYKEY" or omitting the attribute (the default) indicates a normal non-primary key index. Note that there should only be one primary key defined for any file; if more are defined, only the last one defined will be used as the primary key.
<indxtag> string This tag precedes a name for this index (such as "author", "title", etc.). For DBMS indexes this should be the DBMS COLUMN name from the table or view.
<indxname> filepath This tag precedes the full path name for the actual index file created by DBOPEN. Note that this file now contains all index and postings information. For DBMS indexes this should be the DBMS table or View name.


Configuration File Tags for the Index File Definition (Continued)

TAG Data Type Meaning
<indxmap ATTR...> map entries indxmap entries are used to indicate which Z39.50 attributes should be mapped to this index. Any attributes NOT specified imply that ANY value for that element should be mapped to this index. Note that these are examined sequentially, so the order in the configfile matters. The sub-elements of indxmap are described in a later table.
ATTR: ATTRIBUTESET=Z39.50 Attribute Set This is the attribute set to be used in matching Z39.50 queries to this index. Symbolic names for the supported attribute sets are "BIB-1", "EXP-1", "EXT-1", "CCL-1", "GILS", "STAS", "COLLECTIONS-1", "CIMI-1", "GEO-1", "ZBIG", "UTIL", "XD-1", "ZTHES", "FIN-1", "DAN-1" and "HOLDINGS". Either these symbolic names (and many variants) or the OIDs may be specified (OIDs of unlisted attribute sets can also be specified).
Default is BIB-1.
<indxcont ID= ???> filepath This tag precedes the full path name of a continuation index file for very large index files. (NOT IMPLEMENTED IN THIS VERSION)
ATTR: ID = number The id attribute of indxcont gives the sequence number of this particular index continuation file.
<stoplist> filepath This tag precedes the full path name for the stopword list to use in indexing this set of fields.
<RANK_PARAMS TYPE= ???> PARAMS list This tag encloses a set of parameters to be passed to the appropriate ranking method specified in the TYPE attribute.
ATTR: TYPE = ranking type name The TYPE attribute of RANK_PARAMS gives the type of ranking to which the subsequent <PARAM> tags apply. The possible values are: "Logistic_Regression" (or "LR" or "LOGREG"), "OKAPI" (or "OK" or "BM25") or "Language_Modeling" (or "LANGMOD" or "LM"). Note, however, that this last is not available yet (version 40).
<PARAM ID=???> parameter value This tag represents the value of a parameter to be passed to the appropriate ranking method specified in the TYPE attribute of the enclosing RANK_PARAMS tag.
ATTR: ID=integer The ID attribute of PARAM identifies the particular parameter that the contents of the PARAM tag represents. For Logistic Regression there are 7 possible values (with ID's numbered 0-6), and for Okapi there are 3 possible values (with ID's numbered 1-3). For Logistic Regression the parameters are:
ID="0": the LR intercept coefficient
ID="1": the coefficient for average query term frequency (over matching terms)
ID="2": the coefficient for query length
ID="3": the coefficient for average document term frequency (over matching terms)
ID="4": the coefficient for document length
ID="5": the coefficient for average term Inverse Document Frequency (over matching terms)
ID="6": the coefficient for the number of matching terms

For Okapi ranking the parameters are:
ID="1": the k1 constant
ID="2": the b or beta constant
ID="3": the k3 constant
For further information on the meanings of these parameters see the published papers on Cheshire as used for information retrieval evaluations such as INEX and TREC, or the papers of Stephen Robertson on the Okapi algorithm.
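As a sketch of how the Okapi parameters listed above might be written out, the following fragment sets all three. The numeric values are placeholders chosen only for illustration (1.2 and 0.75 are values commonly seen for k1 and b in the BM25 literature); they are not Cheshire defaults and would normally be tuned for the collection.

```sgml
<RANK_PARAMS TYPE="OKAPI">
<PARAM ID="1"> 1.2 </PARAM>   <!-- k1 constant (placeholder value) -->
<PARAM ID="2"> 0.75 </PARAM>  <!-- b or beta constant (placeholder value) -->
<PARAM ID="3"> 7.0 </PARAM>   <!-- k3 constant (placeholder value) -->
</RANK_PARAMS>
```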
<extern_app> External indexing application This tag has (currently) two different uses.
First, it can introduce the command name and arguments for external indexing of URLs. The string "%~URL~%" should be used in the place where the URL in the data should be substituted in order to fetch the external data so that it can be indexed. A temporary copy must be made of each item during indexing but NO filename should be specified for the application (i.e., the output should be to stdout). The recommended application to use here is curl (http://curl.haxx.se) which is available on Linux, some other Unixes and Mac OS X. Using curl the tag might look like:
"<extern_app> curl --silent %~URL~% </extern_app>"
The second use of this tag is to specify the Cheshire database to be used as a gazetteer to look up names of places for GEOTEXT extraction methods. In this case the contents of the tag (with no spaces or returns in it) should look like:
"<extern_app>GAZETTEER:/path/to/config/file:database_name:index_to_search_names:exact_tagname_for_element_to_index</extern_app>"
The colons must be used to separate elements. The "/path/to/config/file" should be just that - the full path name of the config file. "database_name" should be the name of the database in the config file. "index_to_search_names" should be the index name (indxtag) for the index to use in matching. "exact_tagname_for_element_to_index" is the name of the tag (exactly as it appears in the gazetteer records) in the gazetteer data file that should be extracted and used for the index entries in the new index.
<indxexc> tag specifications This tag introduces the list of record elements that are to be excluded from indexing for this index file. Any elements included here (and all of their sub-elements) will be ignored during the indexing process. See table below for indxkey (i.e., tagspec) specifications.
<indxkey> tag specifications This tag introduces the list of record elements and sub-elements to be indexed in this index file. See table below for indxkey specifications.
</indxkey> tag End of the code list for this index.
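
The index definition tags from the two tables above combine roughly as in the following sketch of a stemmed keyword index on a title element. The file paths, the index name and the element names "titlestmt" and "title" are hypothetical. INDXMAP entries are left out because their sub-elements are described in a later table, and the tag order should be checked against the configuration file DTD.

```sgml
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<!-- Index file and the name used to refer to the index -->
<INDXNAME> /usr/local/cheshire/index/title.index </INDXNAME>
<INDXTAG> title </INDXTAG>

<!-- Stopword list applied while extracting keywords -->
<STOPLIST> /usr/local/cheshire/index/title.stoplist </STOPLIST>

<!-- Elements to index: the "title" sub-element of "titlestmt" -->
<INDXKEY>
<TAGSPEC>
<FTAG>titlestmt</FTAG><S>title</S>
</TAGSPEC>
</INDXKEY>
</INDEXDEF>
```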


Configuration File Tags for Component Definition

TAG Data Type Meaning
<ComponentDef> Definition This element introduces a new component definition.
<ComponentName> name This element is the name for the component. This should be a file name (full file path) where the component-level information for each document of the current filedef is to be kept.
<ComponentNorm> normalization This optional element specifies normalization to be applied to the component during indexing. Options currently are "NONE", "COMPRESS", and "NOCOMPRESS". NONE is the same as omitting the element entirely. COMPRESS means to flatten out all markup within the bounds of the component. NOCOMPRESS means to retain the markup (currently NOCOMPRESS and NONE function the same).
<ComponentStartTag> tag definition This element introduces a TAGSPEC (see below) that defines the beginning of the component (or the whole component, with the end assumed to be the matching end tag, if no ComponentEndTag is supplied). Note that if multiple FTAGs are specified within the TAGSPEC, only the FIRST will be used.
<ComponentEndTag> tag definition This element introduces a TAGSPEC (see below) that defines the end of the component. The system assumes that if both a ComponentStartTag and a ComponentEndTag are supplied then the component spans the data from the start tag to the first occurrence of the end tag found in the following data. The span is from the ComponentStartTag beginning UP TO the ComponentEndTag start. Note that if multiple FTAGs are specified within the TAGSPEC, only the FIRST will be used.
<ComponentIndexes> index definition This element introduces a set of indexdefs (see above) that define the access to the component. Indexdefs are identical to those for the file as a whole, but apply only to the components defined by this ComponentDef and not to the whole document.
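
A component definition built from the tags in this table might look like the following sketch, which treats each "section" element of a document as a separately indexed component. The element name, paths and index name are hypothetical, and since no ComponentEndTag is given the component is assumed to end at the matching end tag, as described above.

```sgml
<COMPONENTS>
<COMPONENTDEF>
<!-- File holding the component-level information -->
<COMPONENTNAME> /usr/local/cheshire/index/COMPONENT_SECTION </COMPONENTNAME>
<COMPONENTNORM> NONE </COMPONENTNORM>

<!-- Each "section" element starts a new component -->
<COMPONENTSTARTTAG>
<TAGSPEC>
<FTAG>section</FTAG>
</TAGSPEC>
</COMPONENTSTARTTAG>

<!-- Indexes that apply to the components only -->
<COMPONENTINDEXES>
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=BASIC>
<INDXNAME> /usr/local/cheshire/index/section.index </INDXNAME>
<INDXTAG> section_text </INDXTAG>
<INDXKEY>
<TAGSPEC>
<FTAG>section</FTAG>
</TAGSPEC>
</INDXKEY>
</INDEXDEF>
</COMPONENTINDEXES>
</COMPONENTDEF>
</COMPONENTS>
```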


Configuration File Sub-Tags for Cluster Definition

TAG Sub-tags/ATTR Data Type Meaning
<clusterdef> see below cluster definition This tag is used to introduce a cluster definition. In earlier config files the <CLUSTER> tag was used for this purpose. The preferred form is now to use <CLUSTERS><CLUSTERDEF> ... </CLUSTERDEF> <CLUSTERDEF> ... </CLUSTERDEF></CLUSTERS>, where the ... is composed of the tags below. For compatibility the old form (<CLUSTER>) will still work.
<clusbase> none filename or filetag This tag is used when the filedef type attribute is CLUSTER. It indicates which file is clustered by this file.
<clustag> none filename or filetag This tag is used when the filedef type attribute is NOT CLUSTER. It indicates which file contains the clusters specified by the following cluster definition information.
<clusname> none filename or filetag This tag is used when the filedef type attribute is NOT CLUSTER. It indicates which file contains the clusters specified by the following cluster definition information. (clusname and clustag are aliases for each other).
<cluskey NORMAL=???> tagspec This tag introduces the list of record element(s) that are to be used to cluster this file. These are specified like the indxkey tag specifications. See table below for indxkey specifications.
ATTR: NORMAL=??? normalization type This attribute specifies the sort of normalization to be performed on the keys extracted. The options are "STEM", "WORDNET", "CLASSCLUS", "BASIC", "STEM_NOMAP", "WORDNET_NOMAP", "XKEY" or "EXACTKEY", "XKEY_NOMAP" or "EXACTKEY_NOMAP", "CLASSCLUS_NOMAP", "BASIC_NOMAP", "STEM_FREQ", "XKEY_FREQ" and "BASIC_FREQ". "BASIC" is the default (formerly NONE was used for this option, and it is still a synonym). See table of normalization types below.
<stoplist> none filepath This tag precedes the full path name for the stopword list to use in extracting cluster keys for the cluster file.
<clusmap> clusmap entries clusmap entries are used to indicate which elements in the base file are to be inserted in the cluster file. Each clusmap entry should contain two subtags, <from> and <to>, and an optional third tag <summarize>. Note that the same clusmap structure is used to define conversions of display formats as well (see Sub-Tags for Display Specification).
SUBTAG <from> tagspec This should contain a tagspec, (i.e., it can contain multiple tag definitions from the base file).
SUBTAG <to> tagspec This should be a simple tagspec name that corresponds to a tag in the cluster file. Patterns in the <to> tagspec are not allowed.
SUBTAG <summarize> tagspec This is used to specify that summary information be created for this clusmap entry and stored in the specified tagspec. <summarize> should contain the following tags:
SUBTAG SUBTAG <maxnum> This tag, which gives the maximum number of summary entries to generate, is followed by a simple tagspec.
SUBTAG SUBTAG <tagspec> (see below) This includes a name that corresponds to a tag in the cluster file. Patterns in the <summarize> tagspec are also not allowed. When a <summarize> tag is given, each unique data item from the <from> tag is collected and the occurrences counted. The <maxnum> most frequently occurring data items are output as the contents of the <tagspec> specified.
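
The cluster sub-tags above might be combined as in the following sketch for a base file (one whose filedef type is NOT CLUSTER). The cluster file nickname, stoplist path and all element names are hypothetical. The single clusmap entry copies subject data into the cluster record and also keeps the five most frequent subject values as a summary.

```sgml
<CLUSTERS>
<CLUSTERDEF>
<!-- File (or filetag) that will contain the clusters -->
<CLUSNAME> classcluster </CLUSNAME>

<!-- Records are clustered on the "classnum" element -->
<CLUSKEY NORMAL=CLASSCLUS>
<TAGSPEC>
<FTAG>classnum</FTAG>
</TAGSPEC>
</CLUSKEY>
<STOPLIST> /usr/local/cheshire/index/cluster.stoplist </STOPLIST>

<CLUSMAP>
<!-- Data taken from the base file ... -->
<FROM>
<TAGSPEC>
<FTAG>subject</FTAG>
</TAGSPEC>
</FROM>
<!-- ... is stored under this tag in the cluster file -->
<TO>
<TAGSPEC>
<FTAG>clussubj</FTAG>
</TAGSPEC>
</TO>
<!-- Keep the 5 most frequent "subject" values as a summary -->
<SUMMARIZE>
<MAXNUM> 5 </MAXNUM>
<TAGSPEC>
<FTAG>topsubj</FTAG>
</TAGSPEC>
</SUMMARIZE>
</CLUSMAP>
</CLUSTERDEF>
</CLUSTERS>
```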


Configuration File Sub-Tags for Display Specification

TAG Sub-tags/ATTR Data Type Meaning
<displaydef ATTR...> format spec The displaydef tag introduces a named display format definition. It is treated the same as the format tag below (they are aliases for each other).
<format ATTR...> format spec The format tag introduces a named format definition. It is treated the same as the displaydef tag above (they are aliases for each other).
ATTR: NAME=??? format name This attribute specifies the name of the format that will be matched against the specified elementset name in a Z39.50 present or search. Some names are reserved for special purposes. The special format name "XML_ELEMENT_???" (where an XML or SGML element name is substituted for "???") is used when the document is an XML or SGML document and you only want to extract the named element. In this case the convert operation is used with the function name "XML_ELEMENT" and the FROM specification has the special FTAG name "SUBST_ELEMENT". When this combination is found in the DISPLAY specification (along with the XML or SGML OID), the named element is extracted and returned as the result of the display operation (wrapped in a <RESULT_DATA DOCID="1"> tag with the appropriate source docid). Note that only a single element will be substituted, though other special elements (such as "#RANK#", etc.) can be included in the results.

In addition to the above, the ELEMENTSET syntax STRING_SEGMENT_s_e (where s and e are numbers giving the starting and ending character positions in a document) can be used in scripts to extract strings from the record without full parsing. (The form STRING_SEGMENT_e can also be used to get everything from the beginning of the record up to character position "e".) This elementset name does NOT need a displaydef/format to operate, but is included here because its operation is similar to that of XML_ELEMENT_ elementsets. See the client documentation for more details.

The special format name "PAGED_DEFAULT" is used when the document has been indexed and retrieved using a KEYWORD_EXTERNAL index that uses the "PAGED_DIRECTORY_REF" attribute for extraction (see index definition information below). In this case "pseudo-records" will be generated representing each matching page of the page collection. They will include any tags not "excluded" from the base document.

ATTR: OID=??? recsyntax OID This attribute specifies the OID of the record syntax that will be matched against the specified record syntax oid in a Z39.50 present or search. This is typically used for "convert" specifications when converting from XML/SGML to some other record syntax like GRS-1.
ATTR: MARC_DTD=??? filename This attribute specifies the filename of the USMARC DTD that will be used in the MARC conversion for this format. The DTD should be one of the USMARC DTDs supplied in the doc/install directory of the Cheshire distribution. MARC conversion for a NON-MARC file will cause an error if this filename is not supplied. DEFAULTPATH specifications are prepended to the DTD name if it is not a full pathname.
ATTR: DEFAULT optional If the DEFAULT keyword attribute is used, then the associated format is used whenever no explicit format is specified in a present or search.

Format specifications are used to indicate which elements in the base file are to be used in sending a record to a client via Z39.50. Each format entry can contain ONE of three subtags <include>, <exclude>, and <convert> (currently these cannot be combined in the same format). These are described below.

<include> tagspec This should contain a tagspec, (i.e., it can contain multiple tag definitions from the base file). It indicates which tags to INCLUDE in the records when this format is requested. This tag is optional and should be used when a particular format uses only some small set of tags from the base record. (NOT CURRENTLY SUPPORTED -- USE EXCLUDE)
<exclude ATTR...> tagspec This should contain a tagspec, (i.e., it can contain multiple tag definitions from the base file). It indicates which tags to EXCLUDE in the records when this format is requested. This tag is optional also and should be used when a particular format uses everything except some small set of tags.
ATTR: COMPRESS=??? exclusion compression This attribute indicates whether SGML elements that are listed in the exclude specification but are required by the DTD for the document are to be reduced to minimal forms (with all data replaced by an ellipsis (...) and only required tags retained). The options are "COMPRESS=YES", "COMPRESS=TRUE" or simply "COMPRESS" to enable this, and "COMPRESS=NO", "COMPRESS=FALSE" or "NOCOMPRESS" to disable it.
<convert ATTR...> clusmap This is used to specify a conversion function to be run for each record, or for specific tags in a record. The tags to be converted are specified exactly like the cluster mapping (clusmap) described above. There should be a "FROM" and "TO" tagspec for each record element to be converted. The SUMMARIZE clusmap tag may be used to specify a string that should be substituted for matching tags. The string to be substituted is put in the TAGSPEC part of the SUMMARIZE tag. The MAXNUM tag of the SUMMARIZE part is used to indicate the maximum number of conversions of matching tags to be performed (0 is all of them, and any number greater than 0 is the maximum number of conversions). Some special tokens can be used in the FROM part of a tagspec to indicate that metadata or processing information (such as internal document ID number, ranking value in retrieval, etc.) be included in the specified TO tag. See the following table for a description of these tokens.
ATTR: FUNCTION=??? function name or path This attribute is the name of the function to apply to either the specified tags, or to the entire record if no tags are specified. The "PAGE_PATH" function indicates that the complete page path name should be constructed from the indicated tag (which should contain only the directory path) and returned as the tag "<PAGE_PATH>". The "RECMAP" function maps to the requested record syntax specified in the OID attribute of the display tag. When the OID for the format requested is GRS-1, the keywords "TAGSET-M" and "TAGSET-G" can be used to map to the specified tags from these tagsets (by tag name or number). To use TAGSET-G numeric tags in conjunction with the SGML tag subelements of the mapped tags (i.e., to treat all subelements of the TAGSET-G mapped tags like RECMAP conversion) the keyword "MIXED" can be used for the function.

"XML-ELEMENT" can be used to extract a single specified tag, or a pattern that can match multiple tags (in FROM). The TO element is used to substitute a new name for the tag named in FROM. If the conversion record syntax OID is XML or SGML, the element and its subelements are returned. If the conversion record syntax OID is for GRS-1, the specified element is converted to GRS-1 syntax like a "MIXED" record.

"MARC" can be used to do XML/SGML to MARC conversions. It is up to the configfile creator to provide the appropriate mappings from XML/SGML elements to MARC fields and subfields. This requires that the "FROM" and "TO" definitions be structured correctly for MARC conversions. The "FROM" tagspecs should specify one or more tags/elements in the order that they are to appear in the subfields specified in the "TO" specifications. For example, in a MARC conversion from an EAD format finding aid the following might be specified:

        <from>
                <tagspec>
                <ftag>archdesc</ftag><s>repository</s><s>address</s>
                <ftag>archdesc</ftag><s>repository</s><s>corpname|name</s>
                <ftag>archdesc</ftag><s>unitdate</s>
                </tagspec></from>
        <to>
                <tagspec>
                <ftag>260</ftag>
                <ftag>a</ftag>
                <ftag>b</ftag>
                <ftag>c</ftag>
                </tagspec></to>

Each FROM/TO pair in the specification should correspond to a single MARC field. In the TO specification, the first FTAG should be the three digit MARC field number/name. The following FTAGS (if any) in the TO specification should be the subfields of the MARC field in the order they are to appear. The FROM fields specified in each FTAG line should have a matching TO ftag line (following the first field name line). Data from fields found will be put into the matching subfields of the field. However, if there are multiple occurrences of particular field or subfield values for a given record, sometimes the mapping will be inaccurate (there is currently no way to automatically detect which of the subfields is paired with each of the other subfields).

If the function is a full pathname (e.g. starting with a / in unix or Linux or a drive letter, colon, backslash in NT/Windows), it is taken to be an EXTERNAL conversion program. Such an external program should be set up to read a single record via STDIN and to put out the converted form via STDOUT.
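A minimal sketch of such an external conversion program, written in Python purely for illustration. The only part taken from the description above is the contract (one record in via STDIN, the converted form out via STDOUT); the names convert_record and run, and the tag upper-casing used as the "conversion", are hypothetical placeholders rather than anything supplied by Cheshire.

```python
#!/usr/bin/env python3
"""Hypothetical external conversion program: one record in, one record out."""
import re
import sys


def convert_record(record: str) -> str:
    """Placeholder conversion (an assumption for this sketch):
    upper-case every tag name and leave element content untouched."""
    return re.sub(
        r"<(/?)([A-Za-z][\w.-]*)",
        lambda m: "<" + m.group(1) + m.group(2).upper(),
        record,
    )


def run(instream, outstream) -> None:
    """Read the single record from instream and write the converted form."""
    record = instream.read()
    outstream.write(convert_record(record))


# When installed as the FUNCTION program, the script's entry point would be:
#     run(sys.stdin, sys.stdout)
```

Keeping the conversion in its own function means it can be exercised without the server, for example by calling run with in-memory streams.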

ATTR: ALL? optional If the ALL attribute is used, it indicates that the function is to be applied to the entire record, and no tags are specified. When doing RECMAP conversion the output tags are the same as the SGML record tags (The ALL option works in GRS-1 record syntax conversions ONLY).


Display Specification Tokens for Conversions

TOKEN Meaning
NOTE: The tokens in this table can be used in the FROM part of the tagspec in a FORMAT specification to indicate that metadata or processing information should be placed in the tag specified in the matching TO part of the format.
#DOCID# Insert the internal document id number for the record.
#COMPONENTID# Insert the internal component id number for the retrieved component.
#COMPID# Insert the internal component id number for the retrieved component.
#RANK# Insert the retrieval rank (i.e. the sequence in the ranking) for the record or component.
#SCORE# Insert the retrieval score (normalized to a range from 1000 to 0) for the record or component.
#RELEVANCE# Insert the retrieval score (normalized to a range from 1000 to 0) for the record or component.
#RAWSCORE# Insert the raw retrieval score (calculated probability of relevance) for the record or component.
#RAWRELEVANCE# Insert the raw retrieval score (calculated probability of relevance) for the record or component.
#FILENAME# Insert the full pathname for the document or collection of documents.
#DBNAME# Insert the (short) database name for the document or component.
#PARENT# For component displays, extract and include the tag (specified as the sub-elements of the from tag after the #PARENT# ftag). In the display the elements will have their names converted to PARENT-x where 'x' is the tag name in the parent document.
SUBST_ELEMENT Insert the element specified by the current elementsetname as "XML_ELEMENT_xxx" where xxx is the element name to be substituted.


Configuration File Sub-Tags for Z39.50 Explain Information
All Explain information is optional and need not be supplied.

TAG Sub-tags/ATTR Data Type Meaning
<TITLESTRING ATTR> ATTR text Title to use for this database.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<DESCRIPTION ATTR> ATTR text Description of the database and its contents.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<DISCLAIMERS ATTR> ATTR text Any disclaimers associated with the database.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<NEWS ATTR> ATTR text Any news about the database.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<HOURS ATTR> ATTR text Hours of access for the database.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<BESTTIME ATTR> ATTR text Best times to access the database.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<LASTUPDATE> none text Last time the database was updated.
<UPDATEINTERVAL> SUBTAGS text Frequency of updates of the database.
SUBTAG <VALUE> number The number of days, weeks, years, etc between updates.
SUBTAG <UNIT> time unit info The time unit used (day, week, month, year).
<COVERAGE ATTR> ATTR text Database coverage
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<PROPRIETARY> none boolean Is the database proprietary? Acceptable contents are "YES" or "TRUE", "NO" or "FALSE".
<COPYRIGHTTEXT ATTR> ATTR text Copyright text information
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<COPYRIGHTNOTICE ATTR> ATTR text Copyright notice
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<PRODUCERCONTACTINFO> none contactinfo Producer of the database -- contents are tagged using the CONTACT_... tags below.
<SUPPLIERCONTACTINFO> none contactinfo Supplier of the database -- contents are tagged using the CONTACT_... tags below.
<SUBMISSIONCONTACTINFO> none text Where to send submissions to the database -- contents are tagged using the CONTACT_... tags below.
<CONTACT_NAME> none text Name of the contact person.
<CONTACT_DESCRIPTION ATTR> ATTR text Description of contact information.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<CONTACT_ADDRESS ATTR> ATTR text Contact Address.
ATTR: LANGUAGE=code language code Use the USMARC three letter language codes
<CONTACT_EMAIL> none text Contact Email.
<CONTACT_PHONE> none text Contact Phone number.


ACCESS Codes and Their Meanings

Code Affects Meaning
BTREE access type Selects the DBOPEN BTREE format for the main index file.
HASH access type Selects the DBOPEN HASH format for the main index file.
BITMAPPED access type Selects a bitmapped index file. Bitmap indexes can ONLY be used for Boolean retrieval, and they are specifically designed to speed up access time on indexes that have only a few values (such as an attribute that appears in every record of the database and has only 3 or possible values), and a very large number of records (i.e., hundreds of thousands or millions of records). In a BITMAPPED index only a single bit is stored for (potentially) each record in the database, instead of the 64 bits per entry stored in conventional indexes. More information about bitmapped indexes is available here
VECTOR access type VECTOR files include a normal BTREE and an additional vector file, which must be created by running the index_vectors program after normal indexing. VECTOR files permit quick lookup by internal termids, and a listing of all termids with frequency information for each document (with lookup by internal document id). The VECTOR files are primarily used for blind feedback retrieval, to locate additional related terms to be added to an initial query.
DBMS access type Indicates that the "index" name is an external DBMS table and column specification for the database indicated in the index tag.


EXTRACT Codes and Their Meanings

Code Affects Meaning
KEYWORD key type Indicates that keywords are to be extracted from the elements specified for this index. (In external DBMS files this is used to match character or text fields using the "like" operator)
KEYWORD_EXTERNAL key type Indicates that keywords are to be extracted from the elements specified for this index some of which are indications of external non-SGML text files or of URLs of external HTML files.
KEYWORD_PROXIMITY key type Indicates that keywords are to be extracted from the elements specified for this index and that proximity (character position information) is to be maintained in the index. PROXIMITY and KEYWORD_PROX are synonyms for this specification. Note that indexes including proximity information will be much larger than simple keyword indexes.
KEYWORD_EXTERNAL_PROXIMITY key type Indicates that keywords are to be extracted from the elements specified for this index some of which are indications of external non-SGML text files or of URLs of external HTML files and that proximity (character position information) is to be maintained in the index. KEYWORD_PROXIMITY_EXTERNAL, KEYWORD_EXTERNAL_PROX, and KEYWORD_PROX_EXTERNAL are synonyms for this specification. Note that indexes including proximity information will be much larger than simple keyword indexes.
EXACTKEY key type Indicates that exact keys are to be extracted from the elements specified for this index. Nota Bene: if a pattern is used for the tags specified for exact key extraction all items that match that pattern at the same level of nesting will be extracted and concatenated to form the exact key -- see tag specifications below. (In external DBMS files this is used to match text or character fields using "=").
FLD008_KEY key type Indicates that sub-elements of a MARC 008 field are to be extracted from the field tag specified for this index. The particular sub-elements are listed in the "008 elements" table below. (Only applicable for SGML MARC records)
FLD008_DATE key type Indicates that sub-elements of a MARC 008 field are to be extracted from the field tag specified for this index. This is used in conjunction with one of the date formats for the NORMAL attribute (see below). The particular sub-elements are listed in the "008 elements" table below. (Only applicable for SGML MARC records)
FLD008_DATERANGE key type Indicates that sub-elements of a MARC 008 field are to be extracted from the field tag specified for this index. This is used in conjunction with one of the date range formats for the NORMAL attribute (see below). The particular sub-elements are listed in the "008 elements" table below. (Only applicable for SGML MARC records)
URL key type This is used to indicate that the contents to be extracted are URLs.
FILENAME key type This is used to indicate that the contents to be extracted are filenames (for example, full unix pathnames).
DATE key type This is to indicate that the extracted values should be parsed for date values based on the pattern provided in the NORMAL attribute (see below).
DATE_RANGE key type This is to indicate that the extracted values should be parsed for date range values based on the pattern provided in the NORMAL attribute (see below).
DATE_TIME key type This is to indicate that the extracted values should be parsed for date and time values based on the pattern provided in the NORMAL attribute (see below).
DATE_TIME_RANGE key type This is to indicate that the extracted values should be parsed for date and time range values based on the pattern provided in the NORMAL attribute (see below).
INTEGER_KEY key type This is used to extract a single integer value, which should be the first data within the tag/field being extracted (white space will be ignored, but non numerical characters will not be indexed). The index entry is currently a zero-padded string of 10 digits with an optional leading minus sign. It is also used with DBMS files to indicate that the external DBMS type for this field is an integer.
DECIMAL_KEY key type This is used to extract a single decimal value, which should be the first data within the tag/field being extracted (white space will be ignored, but non numerical characters will not be indexed). The number may include a decimal point and fractional part (for example "100.4356"). The index entry is currently a zero-padded string of 16 digits with an optional leading minus sign and 6 digits after the decimal point. This specification is also used with DBMS files to indicate that the external DBMS type for this field is a decimal number.
LAT_LONG key type This is to indicate that the extracted values should be parsed for Latitude and Longitude values based on the pattern provided in the NORMAL attribute (see below).
LATITUDE_LONGITUDE key type Same as LAT_LONG above.
LATITUDE/LONGITUDE key type Same as LAT_LONG above.
GEO_POINT key type Same as LAT_LONG above.
BOUNDING_BOX key type This is to indicate that the extracted values should be parsed for a pair of Latitude and Longitude values based on the pattern provided in the NORMAL attribute (see below).
GEO_BOX key type Same as BOUNDING_BOX above.
GEOTEXT key type This normalization relies on an external Cheshire database containing a gazetteer. Candidate terms in the document are compared against this gazetteer database, and only terms/phrases that match gazetteer entries are actually entered into the index. The EXTERN_APP tag (see above) must be used to specify the config file, database name and index to be used in matching. The NORMAL tag should indicate one of the text processing normalization methods below (BASIC is simplest). The generated terms are, by default, keywords derived from the Gazetteer entry names; using EXACTKEY or EXACTKEY_NOMAP uses the exact phrase from the Gazetteer. DO_NOT_NORMALIZE retains the exact capitalization, spacing, etc. from the Gazetteer entry.
GEOTEXT_LAT_LONG key type This normalization relies on an external Cheshire database containing a gazetteer. Candidate terms in the document are compared against this gazetteer database and only terms/phrases that match gazetteer entries are actually included, but the (point) coordinates from the gazetteer are used as the generated keys. The EXTERN_APP tag (see above) must be used to specify the config file, database name and index to be used in matching. Lat_long NORMAL specifications should be used to indicate how queries should be parsed when matching to the generated index.
GEOTEXT_BOUNDING_BOX key type This normalization relies on an external Cheshire database containing a gazetteer. Candidate terms in the document are compared against this gazetteer database and only terms/phrases that match gazetteer entries are actually included. In this case bounding box coordinates from the gazetteer are used directly, or generated from point data, for the generated keys entered into the index. The EXTERN_APP tag (see above) must be used to specify the config file, database name and index to be used in matching. Bounding_box NORMAL specifications should be used to indicate how queries should be parsed when matching to the generated index.
NGRAMS key type NGRAM indexes segment texts into ngrams of 3, 4, or 5 letters, for each non-stopword keyword. The ngrams are "shingled" or overlapping, so that, for example, the text:
"John Smith"
as 3grams would generate:
' jo':'joh':'ohn':'hn ':' sm':'smi':'mit':'ith':'th '.

Notice that a space is used to represent the beginning and end of words in the generated trigrams. Ngrams will remove any stopwords, and follow any other normalization specifications during indexing, although all ASCII punctuation, except for hyphens, is removed regardless of normalization (non-ASCII Unicode punctuation will remain). Unicode behavior has not been fully tested, but the segmenting is byte rather than character based, and so may have strange results.

The following aliases can be used for the EXTRACT parameter:

"NGRAM" "NGRAMS" "SHINGLE" "SHINGLES" for 3grams

"NGRAM4" "NGRAMS4" for 4grams

"NGRAM5" "NGRAMS5" for 5grams

ACCESS should be BTREE, and NORMAL can be any of the text normalization options (i.e., GEO or Date normalizations will not work).
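Two of the key formats in the table above are concrete enough to sketch. The Python below is an illustration only, not the Cheshire implementation: the function names shingle and integer_key are hypothetical, lower-casing is assumed to come from the chosen normalization, and the segmentation here works on characters whereas the text above says Cheshire segments on bytes.

```python
import re
import string
from typing import List, Optional

# ASCII punctuation other than the hyphen, which the NGRAMS entry says is kept.
_PUNCT = string.punctuation.replace("-", "")
_STRIP = str.maketrans("", "", _PUNCT)


def shingle(text: str, n: int = 3, stopwords=frozenset()) -> List[str]:
    """Overlapping n-grams per word, with a space marking each word boundary."""
    grams = []
    for word in text.lower().translate(_STRIP).split():
        if word in stopwords:          # stopwords produce no n-grams
            continue
        padded = " " + word + " "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def integer_key(field_text: str) -> Optional[str]:
    """Zero-pad the first integer in the field to 10 digits, keeping a minus sign.

    Returns None when the field does not start with an integer
    (leading white space is ignored)."""
    match = re.match(r"\s*(-?)(\d+)", field_text)
    if match is None:
        return None
    sign, digits = match.groups()
    return sign + digits.zfill(10)
```

With these definitions, shingle("John Smith") yields the nine trigrams listed in the NGRAMS entry, and integer_key("  42 pages") yields "0000000042". Zero-padding is what lets plain string comparison order non-negative keys numerically.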


NORMAL Codes and Their Meanings

Code Affects Meaning
STEM normalization Indicates that the Porter stemming algorithm should be used to normalize the keywords extracted.
STEM_FREQ normalization Indicates that the Porter stemming algorithm should be used to normalize the keywords extracted. This normalization type assumes that each keyword is paired with frequency information in the form "{word 20}", where 20 is the frequency. THIS IS PRIMARILY USED FOR COLLECTION-LEVEL DOCUMENTS IN DISTRIBUTED SEARCH.
SSTEM normalization Indicates that a simple plural-removal algorithm should be used to normalize the keywords extracted.
SSTEM_FREQ normalization Indicates that a simple plural-removal algorithm should be used to normalize the keywords extracted. This normalization type assumes that each keyword is paired with frequency information in the form "{word 20}", where 20 is the frequency. THIS IS PRIMARILY USED FOR COLLECTION-LEVEL DOCUMENTS IN DISTRIBUTED SEARCH.
FRENCH_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the French language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
GERMAN_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the German language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
ENGLISH_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the English language (porter2 stemmer) should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
PORTER_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the English language (original porter stemmer) should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
DUTCH_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Dutch language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
SPANISH_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Spanish language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
ITALIAN_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Italian language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
SWEDISH_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Swedish language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
PORTUGUESE_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Portuguese language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
RUSSIAN_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Russian language (in UTF-8 encoding) should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
RUSSIAN_UTF8_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Russian language (UTF-8 Encoding) should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ). (alias of the previous code)
RUSSIAN_KOI8_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Russian language (in KOI-8 encoding) should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
DANISH_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Danish language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
NORWEGIAN_STEM normalization Indicates that a Snowball-stemmer generated stemmer for the Norwegian language should be used to normalize the keywords extracted. (Note: FREQ form is also supplied as ..._STEMFREQ).
WORDNET normalization Indicates that WordNet "morphing" should be used to normalize the keywords extracted.
EXACTKEY normalization Indicates that spacing should be normalized, and punctuation and stopwords removed, from the key extracted. (XKEY is an alias for EXACTKEY).
XKEY_FREQ normalization Indicates that spacing should be normalized, and punctuation and stopwords removed, from the key extracted. This normalization type assumes that each exact key is paired with frequency information in the form "{exact key value 20}", where 20 is the frequency. THIS IS PRIMARILY USED FOR COLLECTION-LEVEL DOCUMENTS IN DISTRIBUTED SEARCH.
CLASSCLUS normalization Indicates that LC class number normalization for classification clustering should be used to normalize the key extracted.
BASIC_NOMAP normalization No term normalization will be done. _NOMAP indicates that the character mappings done during normalization in indexing and retrieval should NOT be used.
STEM_NOMAP normalization Indicates that the Porter stemming algorithm should be used to normalize the keywords extracted. _NOMAP indicates that the character mappings done during normalization in indexing and retrieval should NOT be used.
WORDNET_NOMAP normalization Indicates that WordNet "morphing" should be used to normalize the keywords extracted. _NOMAP indicates that the character mappings done during normalization in indexing and retrieval should NOT be used.
EXACTKEY_NOMAP normalization Indicates that spacing should be normalized, and punctuation and stopwords removed, from the key extracted. (XKEY_NOMAP is an alias for EXACTKEY_NOMAP).
CLASSCLUS_NOMAP normalization Indicates that LC class number normalization for classification clustering should be used to normalize the key extracted. _NOMAP indicates that the character mappings done during normalization in indexing and retrieval should NOT be used.
DATE PATTERN VALUE normalization Indicates that the key extracted will be a date (the EXTRACT attribute must be one of the date types) and the date pattern will be used in matching the elements of the dates that appear in the database records. See the table Date Format Patterns and Their Interpretations for DATE_TIME format patterns. Date parsing is fairly flexible, and as long as the order of elements is correct, dates should be correctly extracted. (NOTE: date patterns for cluster key normalization are not currently supported)
LAT_LONG PATTERN VALUE normalization Indicates that the key extracted will be a Latitude and Longitude value (the EXTRACT attribute must be one of the LAT_LONG types) and the LAT_LONG pattern will be used in matching the elements of the coordinates that appear in the database records. LAT_LONG parsing is fairly flexible, but the order of elements is important. See the table LAT_LONG and BOUNDING_BOX Format Patterns and Their Interpretations for the specific codes that can be used.
BOUNDING_BOX PATTERN VALUE normalization Indicates that the key extracted will be a pair of Latitude and Longitude values (the EXTRACT attribute must be one of the BOUNDING_BOX types) and the LAT_LONG pattern will be used in matching the elements of the coordinates that appear in the database records. Parsing is fairly flexible, but the order of elements is important. See the table LAT_LONG and BOUNDING_BOX Format Patterns and Their Interpretations for the specific codes that can be used.
BASIC normalization Indicates that the key extracted should simply be converted to lower case and have punctuation, extraneous spaces and stopwords removed.
NONE normalization Same as BASIC; this is the deprecated name for this option. Indicates that the key extracted should simply be converted to lower case and have punctuation, extraneous spaces and stopwords removed. NONE can be used just as BASIC is used (including in the compounds below).
BASIC_FREQ normalization Indicates that the key extracted should simply be converted to lower case and have punctuation and stopwords removed. This normalization type assumes that each keyword is paired with frequency information in the form "{word 20}", where 20 is the frequency. THIS IS PRIMARILY USED FOR COLLECTION-LEVEL DOCUMENTS IN DISTRIBUTED SEARCH.
DO_NOT_NORMALIZE normalization This means that NO normalization, including the simple normalization done by BASIC, is to be performed. The terms are put into the index exactly as they appear in the document with capitalization, etc. intact. NOTE that if SGML/XML tags appear in the extracted data, they WILL be included in the generated key.
REMOVE_TAGS_ONLY normalization This means that NO normalization, including the simple normalization done by BASIC, is to be performed, EXCEPT FOR REMOVAL OF EMBEDDED SGML/XML TAGS in the data. The terms are put into the index exactly as they appear in the document with capitalization, etc. intact.
REMOVE_TAGS_ONLY_FREQ normalization This means that NO normalization, including the simple normalization done by BASIC, is to be performed, EXCEPT FOR REMOVAL OF EMBEDDED SGML/XML TAGS in the data. The terms are put into the index exactly as they appear in the document with capitalization, etc. intact. This normalization type assumes that each key is paired with frequency information in the form "{exact key value 20}", where 20 is the frequency. THIS IS PRIMARILY USED FOR COLLECTION-LEVEL DOCUMENTS IN DISTRIBUTED SEARCH.
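The BASIC normalization and the "{word 20}" frequency form used by the ..._FREQ codes can be sketched as follows. This is an illustration under stated assumptions, not Cheshire's code: the names basic_normalize and parse_freq_pairs are hypothetical, punctuation is assumed to be replaced by a space before spacing is collapsed, and the stopword list is supplied by the caller.

```python
import re
import string

# Assumption: every ASCII punctuation character is turned into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

# One "{key text 20}" pair: key text, white space, then the frequency.
_FREQ_PAIR = re.compile(r"\{\s*(.+?)\s+(\d+)\s*\}")


def basic_normalize(key: str, stopwords=frozenset()) -> str:
    """Lower-case, drop punctuation, extra spaces and stopwords."""
    words = key.lower().translate(_PUNCT_TO_SPACE).split()
    return " ".join(w for w in words if w not in stopwords)


def parse_freq_pairs(text: str):
    """Split text such as '{word 20} {other 3}' into (key, frequency) pairs."""
    return [(m.group(1), int(m.group(2))) for m in _FREQ_PAIR.finditer(text)]
```

For example, parse_freq_pairs("{library 20} {exact key value 3}") separates each key from its trailing frequency, so the key text can then be normalized as usual while the frequency is carried alongside it.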


DATE FORMATS and Their Interpretation

FORMAT Type Meaning
YYMMDD date Fixed format dates, e.g.: 980223. Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
YYYYMMDD date Fixed format date, e.g.: 19980223
MM/DD/YY date slash style dates, e.g.: 2/23/98 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
MM/DD/YYYY date slash style dates with full year, e.g.: 2/23/1998.
DD/MM/YY date slash style dates, e.g.: 23/2/98 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
DD/MM/YYYY date slash style dates with full year, e.g.: 23/2/1998.
MM.DD.YY date dot style dates, e.g.: 2.23.98 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
MM.DD.YYYY date dot style dates with full year, e.g.: 2.23.1998.
DD.MM.YY date dot style dates, e.g.: 22.02.98 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
DD.MM.YYYY date dot style dates with full year, e.g.: 23.2.1998.
DD MMM YYYY date Dates with month abbreviations, e.g.: 23 Feb 1998
DD MMM YEAR date Dates with month abbreviations, e.g.: 23 Feb 1998. This form may be used if the data contain years with less (or more) than 4 digits, but are not in the Twentieth century. For example "13 Apr 856 AD", or "25 Dec 1 B.C." are legal dates for this format.
DD MMM YY date Dates with month abbreviations, e.g.: 23 Feb 98. Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
DD MONTH YY date Dates with month abbreviations or full month names, e.g.: 23 February 98. Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
DD MONTH YYYY date Dates with month abbreviations or full month names, e.g.: 23 February 1998.
DD MONTH YEAR date Dates with month abbreviations or full month names, e.g.: 23 February 1998. This form may be used if the data contain years with less (or more) than 4 digits, but are not in the Twentieth century. For example "13 Apr 856 AD", or "25 Dec 1 B.C." are legal dates for this format.
MONTH DD, YY date Dates with month abbreviations or full month names, e.g.: February 23, 98. Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
MONTH DD, YYYY date Dates with month abbreviations or full month names, e.g.: February 23, 1998.
YYMMDD HH:MM date_time Fixed format date and time, e.g.: 980223 12:04 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
YYYYMMDD HH:MM date_time Fixed format date and time, e.g.: 19980223 12:04
MM/DD/YY HH:MM date_time slash style dates with time of day, e.g.: 2/23/98 12:02 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
MM/DD/YYYY HH:MM date_time slash style dates with time of day, e.g.: 2/23/1998 12:02
DD/MM/YY HH:MM date_time slash style dates with time of day, e.g.: 23/2/98 12:02 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
DD/MM/YYYY HH:MM date_time slash style dates with time of day, e.g.: 23/2/1998 12:02
MM.DD.YY HH:MM date_time dot style dates with time of day, e.g.: 2.23.98 12:02 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
MM.DD.YYYY HH:MM date_time dot style dates with time of day, e.g.: 2.23.1998 12:02
DD.MM.YY HH:MM date_time dot style dates with time of day, e.g.: 23.2.98 12:02 Note that the system will assume that dates in this form are for the Twentieth century ONLY -- 98 becomes 1998
DD.MM.YYYY HH:MM date_time dot style dates with time of day, e.g.: 23.2.1998 12:02
YYYY date fixed format, year only, e.g.: 1998, 1754, 1292, 1056
YEAR date Variable length years, e.g.: "1998", "24 B.C.E.", "57 A.D.", "500000 B.C.", "1852 C.E."
YYDDD date Julian Dates (not currently supported)
DECADE date_range Decades within a century, e.g. "1940's" is interpreted as the range from 1940-1949, "20's" is interpreted as 1920-1929, 1870's is interpreted as 1870-1879.
CENTURY date_range Centuries, e.g.: 20th is interpreted as 1900-1999, "4th century b.c." is interpreted as 400B.C.-301B.C.
MILLENNIUM date_range Millennia, e.g.: "4th millenium B.C." is interpreted as 4000B.C.- 3001B.C.
MIXED YEAR date Extracts the year from mixed data which may include full dates and years. Assumes that the only 3 or 4 digit number in the string is the year, so it may produce errors if there are decade, century or millennium specifications. Checks for era also. Any text (except for era specifications following the year -- e.g.: 924 B.C.--) or non-numeric characters are ignored.
MIXED_YEAR date Same as above
UNIX_TIME date Dates as returned by the UNIX ctime function, for example "Wed Apr 12 10:18:21 2000 EDT"; the trailing zone information is ignored.
UNIX_CTIME date Same as above
DAY MONTH DD HH:MM:SS YEAR date Same as UNIX_TIME above
DAY MMM DD HH:MM:SS YYYY date Same as UNIX_TIME above
YYYY-YYYY date_range Range of years in fixed (4 digit) form, e.g.: 1920-1945
YYYY to YYYY date_range Range of years in fixed (4 digit) form, e.g.: "1920 to 1945"
YYYY through YYYY date_range Range of years in fixed (4 digit) form, e.g.: "1920 through 1945"
YEAR-YEAR date_range Range of years without fixed format, e.g.: "22 B.C - 12 A.D.", 1950-1990, "50000 BCE - 100 CE"
YEAR to YEAR date_range Range of years without fixed format, e.g.: "22 B.C to 12 A.D.", 1950 to 1990, "50000 BCE to 100 CE"
YEAR through YEAR date_range Range of years without fixed format, e.g.: "22 B.C through 12 A.D.", 1950 through 1990, "50000 BCE through 100 CE"
DECADE-DECADE date_range Range of decades, e.g.: "20's-50's" is interpreted as 1920-1959.
DECADE to DECADE date_range Range of decades, e.g.: "20's to 50's" is interpreted as 1920-1959.
DECADE through DECADE date_range Range of decades, e.g.: "20's through 50's" is interpreted as 1920-1959.
CENTURY-CENTURY date_range Range of Centuries, e.g.: "3rd cent bc - 8th century ad" is interpreted as 300B.C.-799A.D.
CENTURY to CENTURY date_range Range of Centuries, e.g.: "3rd cent bc to 8th century ad" is interpreted as 300B.C.-799A.D.
CENTURY through CENTURY date_range Range of Centuries, e.g.: "3rd cent bc through 8th century ad" is interpreted as 300B.C.-799A.D.
MILLENNIUM-MILLENNIUM date_range Range of Millennia, e.g.: "8th millennium bc - 1st millenium AD" is interpreted as 8000B.C.-999A.D.
MILLENNIUM to MILLENNIUM date_range Range of Millennia, e.g.: "8th millennium bc to 1st millenium AD" is interpreted as 8000B.C.-999A.D.
MILLENNIUM through MILLENNIUM date_range Range of Millennia, e.g.: "8th millennium bc through 1st millenium AD" is interpreted as 8000B.C.-999A.D.
MIXED YEAR RANGE date Extracts one or a pair of years from mixed data which may include full dates and years. Assumes that 3 and 4 digit numbers in the string are the year specifications. The first two 3 or 4 digit numbers are used to determine the date range, so it may produce errors if there are decade, century or millennium specifications. Checks for era also. If the data contains only a single (3 or 4 digit) year, the range is created for that year only. Any text (except for era specifications following the year -- e.g.: 924 B.C.--) or non-numeric characters are ignored.
MIXED_YEAR_RANGE date Same as above
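
The MIXED YEAR RANGE rule above can be illustrated with a short Python sketch. This is not the Cheshire II implementation; the function name mixed_year_range and the regular expression are assumptions made for this example, and era handling (B.C., BCE, etc.) is deliberately left out.

```python
import re

# A "year" here is any run of exactly 3 or 4 digits (assumption for this sketch;
# era markers such as B.C. are NOT handled).
_YEAR = re.compile(r"(?<!\d)\d{3,4}(?!\d)")


def mixed_year_range(text):
    """Return (start_year, end_year) from mixed data, or None if no year is found."""
    years = [int(y) for y in _YEAR.findall(text)]
    if not years:
        return None
    if len(years) == 1:
        # A single year yields a range covering that year only.
        return years[0], years[0]
    # Only the first two 3- or 4-digit numbers are used.
    return years[0], years[1]


print(mixed_year_range("Letters, 12 March 1950 to 4 July 1990"))
print(mixed_year_range("circa 924"))
```

Day numbers such as "12" and "4" are skipped because they have fewer than three digits, which mirrors the assumption stated in the table.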


LAT_LONG and BOUNDING_BOX FORMATS and Their Interpretation

FORMAT Type Meaning
DDoMM'SS''NS DDDoMM'SS''EW LAT_LONG Latitude and Longitude using "DD" (degrees), "MM" (minutes) and "SS" (seconds), separated by "o" (lower case O) for degrees, a single quote for minutes and two single quotes for seconds, followed by EITHER "N" or "S" for Latitudes or "E" or "W" for Longitudes. For example, Latitude and Longitude for San Francisco, California can be expressed as "37o45'53''N 122o24'36''W" and parsed using this format.
DD-MM-SS NS DDD-MM-SS EW LAT_LONG Latitude and Longitude using "DD" (degrees), "MM" (minutes) and "SS" (seconds), separated by hyphens, followed by one or more spaces and EITHER "N" or "S" for Latitudes or "E" or "W" for Longitudes. For example, Latitude and Longitude for San Francisco, California can be expressed as "37-45-53 N 122-24-36 W" and parsed using this format.
DD-MM-SS-NS DDD-MM-SS-EW LAT_LONG Latitude and Longitude using "DD" (degrees), "MM" (minutes) and "SS" (seconds), and EITHER "N" or "S" for Latitudes or "E" or "W" for Longitudes, separated by hyphens. For example, Latitude and Longitude for San Francisco, California can be expressed as "37-45-53-N 122-24-36-W" and parsed using this format.
DECIMAL_LAT_LONG LAT_LONG Latitude and Longitude using degrees and decimal fractions of degrees with positive numbers indicating NORTH latitudes or EAST longitudes, and negative numbers indicating SOUTH latitudes and WEST longitudes. For example, Latitude and Longitude for San Francisco, California can be expressed as "37.765 -122.41" and parsed using this format.
DECIMAL_LONG_LAT LAT_LONG Latitude and Longitude using degrees and decimal fractions of degrees with positive numbers indicating NORTH latitudes or EAST longitudes, and negative numbers indicating SOUTH latitudes and WEST longitudes. For example, Longitude and Latitude for San Francisco, California can be expressed as "-122.41 37.765" and parsed using this format.
DECIMAL_BOUNDING_BOX BOUNDING_BOX A pair of Latitudes and Longitudes using degrees and decimal fractions of degrees with positive numbers indicating NORTH latitudes or EAST longitudes, and negative numbers indicating SOUTH latitudes and WEST longitudes. The first latitude and longitude represent the North-West corner of the bounding box and the second latitude and longitude represent the South-East corner. For example, a bounding box around the peninsula of San Francisco, California can be expressed as "37.815834 -122.498886 37.601112 -122.318611" and parsed using this format.
FGDC_BOUNDING_BOX BOUNDING_BOX A set of Latitude and Longitude coordinates using degrees and decimal fractions of degrees with positive numbers indicating NORTH latitudes or EAST longitudes, and negative numbers indicating SOUTH latitudes and WEST longitudes. These coordinates follow the order for bounding boxes in the FGDC guidelines. That is, the first coordinate is the West bounding coordinate (BC), followed by the East BC, the North BC, and the South BC. For example, an FGDC bounding box around the peninsula of San Francisco, California can be expressed as "-122.498886 -122.318611 37.815834 37.601112" and parsed using this format. Note that whitespace and XML tags are ignored when extracting the data, so a recommended FGDC XML bounding box definition like:
<bounding>
<westbc>-122.71456201</westbc>
<eastbc>-122.19128542</eastbc>
<northbc>38.00205344</northbc>
<southbc>37.68397727</southbc>
</bounding>
can be parsed correctly without pre-processing.
DDoMM'SS''NS DDDoMM'SS''EW DDoMM'SS''NS DDDoMM'SS''EW BOUNDING_BOX A pair of Latitude and Longitude coordinates using "DD" (degrees), "MM" (minutes) and "SS" (seconds), separated by "o" (lower case O) for degrees, a single quote for minutes and two single quotes for seconds, followed by EITHER "N" or "S" for Latitudes or "E" or "W" for Longitudes. For example, a bounding box for the peninsula of San Francisco, California can be expressed as "37o48'57''N 122o29'56''W 37o36'4''N 122o19'7''W" and parsed using this format.
DD-MM-SS NS DDD-MM-SS EW DD-MM-SS NS DDD-MM-SS EW BOUNDING_BOX A pair of Latitude and Longitude coordinates using "DD" (degrees), "MM" (minutes) and "SS" (seconds), separated by hyphens, followed by whitespace and EITHER "N" or "S" for Latitudes or "E" or "W" for Longitudes. For example, a bounding box for the peninsula of San Francisco, California can be expressed as "37-48-57 N 122-29-56 W 37-36-4 N 122-19-7 W" and parsed using this format.
DD-MM-SS-NS DDD-MM-SS-EW DD-MM-SS-NS DDD-MM-SS-EW BOUNDING_BOX A pair of Latitude and Longitude coordinates using "DD" (degrees), "MM" (minutes) and "SS" (seconds), and EITHER "N" or "S" for Latitudes or "E" or "W" for Longitudes, all separated by hyphens. For example, a bounding box for the peninsula of San Francisco, California can be expressed as "37-48-57-N 122-29-56-W 37-36-4-N 122-19-7-W" and parsed using this format.
Note on Format Parsing: In the above described formats (other than format keywords "DECIMAL_LAT_LONG", "DECIMAL_LONG_LAT" and "DECIMAL_BOUNDING_BOX", which require data in an appropriate decimal form), the separators used in the DATA itself may be any of the separators that appear in any of the above formats. This means that, for example, the format specification "DDoMM'SS''NS DDDoMM'SS''EW" could be used to correctly parse data that looked like "37-48-57-N 122-29-56-W" or "37o48-57 N 122'29'56'W", or even " 37 48 57 n 122 29 56 w", etc. However, one of the formats, exactly as shown above, should appear in the EXTRACT attribute for the index for correct configuration file processing.
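
As a rough illustration of the lenient separator handling described in the note, the following Python sketch converts degree/minute/second data into signed decimal degrees. It is not the Cheshire II parser; the function name dms_to_decimal and the regular expression are assumptions made for this example.

```python
import re

# Degrees, minutes, seconds separated by any non-digit characters, then a
# hemisphere letter. The pattern is an assumption chosen for this sketch.
_DMS = re.compile(
    r"(\d{1,3})[^0-9]+?(\d{1,2})[^0-9]+?(\d{1,2})[^0-9NSEWnsew]*([NSEWnsew])"
)


def dms_to_decimal(text):
    """Return (latitude, longitude) in decimal degrees from DMS text."""
    coords = {}
    for deg, minutes, seconds, hemi in _DMS.findall(text):
        value = int(deg) + int(minutes) / 60 + int(seconds) / 3600
        hemi = hemi.upper()
        if hemi in "SW":
            value = -value  # South and West are negative
        coords["lat" if hemi in "NS" else "lon"] = value
    if "lat" not in coords or "lon" not in coords:
        raise ValueError("expected one latitude and one longitude")
    return coords["lat"], coords["lon"]


# The same point written with three different separator styles.
for sample in ("37o45'53''N 122o24'36''W",
               "37-45-53-N 122-24-36-W",
               " 37 45 53 n 122 24 36 w"):
    print(dms_to_decimal(sample))
```

All three inputs yield the same coordinate pair, close to the "37.765 -122.41" decimal example given for DECIMAL_LAT_LONG.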


MARC 008 Field Sub-Elements for FLD008_KEY Extraction
(See MARC 008 Documentation for full details of codes)

Name 008 position Meaning
008_entry_date pos 00-05 Date record created as yymmdd.
008_date_type pos 06 Type of date code; s = single, c = 2 dates actual and copyright, etc.
008_date1 pos 07-10 If date type is s or c this is the publication date.
008_date2 pos 11-14 If date type is c this is the copyright date.
008_daterange pos 11-14 If date type is i, k, m, q, c, or d then date1 is used as the starting date of a date range, and date2 is used as the ending date of the date range. If any other letter is used, the date1 value is assumed to be both the start and end date of the range.
008_country_code pos 15-17 Country of publication code.
008_illus_code pos 18-21 Types of illustration codes: a = illus; b = maps; c = portraits; etc.
008_intellectual_level pos 22 Intellectual Level j = juvenile, blank = adult.
008_form_of_reproduction pos 23 Form of reproduction: blank = paper or original, a = microfilm, etc.
008_nature_of_contents pos 24-27 Contents: b = is or has bibliography, i = is index, etc.
008_government_pub_code pos 28 Government Document code: f = federal/national, s = state/province; etc.
008_conference_indicator pos 29 Conference code: 0 or blank = not a conference; 1 = conference pub.
008_festschrift_indicator pos 30 0 = Not a Festschrift; 1 = Festschrift.
008_index_indicator pos 31 0 = No index; 1 = has index.
008_main_entry_in_body pos 32 0 = Main entry not in body of record; 1 = is in body of record.
008_fiction_indicator pos 33 0 = Not Fiction; 1 = is Fiction.
008_biography_indicator pos 34 a = autobiography; b = individual biography; etc.
008_language_code pos 35-37 Three character language code (e.g. eng, fre, rus).
008_modified_record_code pos 38 blank = Not modified record; m = modified record.
008_cataloging_source pos 39 Cataloging source code: blank = LC; a = NAL; d = Non-LC; etc.
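
The position table above amounts to fixed-width slicing of the 40-character 008 field. The Python sketch below shows the idea; it is only an illustration, the names FLD008_POSITIONS and split_008 are hypothetical, and the sample field value is made up for this example. The derived 008_daterange element is omitted because it depends on the date type code.

```python
# Inclusive (start, end) positions copied from the table above.
FLD008_POSITIONS = {
    "008_entry_date": (0, 5),
    "008_date_type": (6, 6),
    "008_date1": (7, 10),
    "008_date2": (11, 14),
    "008_country_code": (15, 17),
    "008_illus_code": (18, 21),
    "008_intellectual_level": (22, 22),
    "008_form_of_reproduction": (23, 23),
    "008_nature_of_contents": (24, 27),
    "008_government_pub_code": (28, 28),
    "008_conference_indicator": (29, 29),
    "008_festschrift_indicator": (30, 30),
    "008_index_indicator": (31, 31),
    "008_main_entry_in_body": (32, 32),
    "008_fiction_indicator": (33, 33),
    "008_biography_indicator": (34, 34),
    "008_language_code": (35, 37),
    "008_modified_record_code": (38, 38),
    "008_cataloging_source": (39, 39),
}


def split_008(field):
    """Slice a 40-character MARC 008 field into named sub-elements."""
    if len(field) < 40:
        raise ValueError("008 field must be at least 40 characters long")
    return {name: field[start:end + 1]
            for name, (start, end) in FLD008_POSITIONS.items()}


# Made-up sample value, built piece by piece so the positions are easy to check.
sample = ("850101" + "s" + "1984" + "    " + "nyu" + "a   " + " " + " " +
          "b   " + " " + "0" + "0" + "1" + " " + "0" + " " + "eng" + " " + "d")
fields = split_008(sample)
print(fields["008_date1"], fields["008_country_code"], fields["008_language_code"])
```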


Configuration File Sub-Tags for Z39.50 Index Mapping Definition

TAG Data Type Meaning
<use> USE attribute USE attribute numbers from the Z39.50 standard (defaults to BIB-1; see Table A-3-1 from Appendix 3 of the standard, reproduced below). If a USE value of "0" is given, it indicates that the index is the default index when no attributes are specified in the query.
<relation> RELATION attribute Relation attribute numbers from the Z39.50 standard (defaults to BIB-1; see Table A-3-2 from Appendix 3 of the standard, reproduced below).
<position> POSITION attribute Position attribute numbers from the Z39.50 standard (defaults to BIB-1; see Table A-3-3 from Appendix 3 of the standard, reproduced below).
<struct> STRUCTURE attribute Structure attribute numbers from the Z39.50 standard (defaults to BIB-1; see Table A-3-4 from Appendix 3 of the standard, reproduced below).
<trunc> TRUNCATION attribute Truncation attribute numbers from the Z39.50 standard (defaults to BIB-1; see Table A-3-5 from Appendix 3 of the standard, reproduced below).
<complet> COMPLETENESS attribute Completeness attribute numbers from the Z39.50 standard (defaults to BIB-1; see Table A-3-6 from Appendix 3 of the standard, reproduced below).
<access_point> Access Point attribute type Access Point attributes from the Z39.50 attribute architecture. Type number 1. (<access ... > can be used as an alias for this tag).
<semantic_qualifier> Semantic Qualifier attribute type Semantic Qualifier attributes from the Z39.50 attribute architecture. Type number 2. (<semantic ... > can be used as an alias for this tag).
<language> Language attribute type Language attributes from the Z39.50 attribute architecture. Type number 3.
<content_authority> Content Authority attribute type Content Authority attributes from the Z39.50 attribute architecture. Type number 4. (<authority ... > can be used as an alias for this tag).
<expansion> Expansion attribute type Expansion attributes from the Z39.50 attribute architecture. Type number 5.
<normalized_weight> Normalized Weight attribute type Normalized Weight attributes from the Z39.50 attribute architecture. Type number 6. (<weight ... > can be used as an alias for this tag).
<hit_count> Hit Count attribute type Hit Count attributes from the Z39.50 attribute architecture. Type number 7. (<hits ... > can be used as an alias for this tag).
<comparison> Comparison attribute type Comparison attributes from the Z39.50 attribute architecture. Type number 8.
<format> Format/Structure attribute type Format/Structure attributes from the Z39.50 attribute architecture. Type number 9.
<occurrence> Occurrence attribute type Occurrence attributes from the Z39.50 attribute architecture. Type number 10.
<indirection> Indirection attribute type Indirection attributes from the Z39.50 attribute architecture. Type number 11.
<functional_qualifier> Functional Qualifier attribute type Functional Qualifier attributes from the Z39.50 attribute architecture. Type number 12. (<functional ... > or <function ... > can be used as an alias for this tag).
<description> DESCRIPTION of USE attribute The description can be used to identify the attribute by a local name in the automatically generated EXPLAIN data; otherwise the USE value will be used to look up the typical name in an internal table. NOTE: The description applies only to the USE attribute and not to the entire combination. If multiple descriptions are given for the same USE attribute, only the FIRST will be used in generating explain records.
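
The attribute architecture tags and their aliases in the table above can be summarized as a simple lookup from tag name to type number. The Python sketch below is only a reading aid, not part of Cheshire II; the names ATTRIBUTE_TYPES, ALIASES and attribute_type_number are hypothetical, and treating tag names as case-insensitive is an assumption.

```python
# Tag name -> attribute architecture type number, as listed in the table.
ATTRIBUTE_TYPES = {
    "access_point": 1,
    "semantic_qualifier": 2,
    "language": 3,
    "content_authority": 4,
    "expansion": 5,
    "normalized_weight": 6,
    "hit_count": 7,
    "comparison": 8,
    "format": 9,
    "occurrence": 10,
    "indirection": 11,
    "functional_qualifier": 12,
}

# Alias -> full tag name, as listed in the table.
ALIASES = {
    "access": "access_point",
    "semantic": "semantic_qualifier",
    "authority": "content_authority",
    "weight": "normalized_weight",
    "hits": "hit_count",
    "functional": "functional_qualifier",
    "function": "functional_qualifier",
}


def attribute_type_number(tag):
    """Return the type number for a tag such as '<weight>' or 'hit_count'."""
    # Lower-casing is an assumption made for this sketch.
    name = tag.strip("<>/ ").lower()
    name = ALIASES.get(name, name)
    return ATTRIBUTE_TYPES[name]


print(attribute_type_number("<weight>"))
print(attribute_type_number("functional_qualifier"))
```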


Table A-3-1 : Z39.50 BIB-1 USE Attributes

Use Value Use Value Use Value
Personal name 1 Title collective 34 Author 1003
Corporate name 2 Title parallel 35 Author-name personal 1004
Conference name 3 Title cover 36 Author-name corporation 1005
Title 4 Title added title page 37 Author-name conference 1006
Title series 5 Title caption 38 Identifier -- standard 1007
Title uniform 6 Title running 39 Subject -- LC children's 1008
ISBN 7 Title spine 40 Subject name -- personal 1009
ISSN 8 Title other variant 41 Body of text 1010
LC card number 9 Title former 42 Date/time added to database 1011
BNB card no. 10 Title abbreviated 43 Date/time last modified 1012
BGF number 11 Title expanded 44 Authority/format identifier 1013
Local number 12 Subject precis 45 Concept-text 1014
Dewey classification 13 Subject rswk 46 Concept-reference 1015
UDC classification 14 Subject subdivision 47 Any 1016
Bliss classification 15 Number natl biblio 48 Default 1017
LC call number 16 Number legal deposit 49 Publisher 1018
NLM call number 17 Number govt publication 50 Record-source 1019
NAL call number 18 Number publisher for music 51 Editor 1020
MOS call number 19 Number db 52 Bib-level 1021
Local classification 20 Number local call 53 Geographic-class 1022
Subject heading 21 Code -- language 54 Indexed-by 1023
Subject Rameau 22 Code -- geographic area 55 Map-scale 1024
BDI index subject 23 Code -- institution 56 Music-key 1025
INSPEC subject 24 Name and title 57 Related-periodical 1026
MESH subject 25 Name geographic 58 Report-number 1027
PA subject 26 Place publication 59 Stock-number 1028
LC subject heading 27 CODEN 60 Thematic-number 1030
RVM subject heading 28 Microform generation 61 Material-type 1031
Local subject index 29 Abstract 62 Doc-id 1032
Date 30 Note 63 Host-item 1033
Date of publication 31 Author-title 1000 Content-type 1034
Date of acquisition 32 Record type 1001 Anywhere 1035
Title key 33 Name 1002

Approved Extensions to bib-1 Attribute Set


The attributes listed below are added to the bib-1 attribute set, 1.2.840.10003.3.1.


The following Relation attribute is added to the bib-1 attribute set, as a result of the August 1997 ZIG meeting in Copenhagen:


104: Within
When this relation attribute is supplied the intent is to select records where the value of the access point as determined by the Use attribute is within the bounds specified by the term (where the term itself may be composed of several terms). A profile or implementor agreement may specify precisely what form a term should take when this relation attribute is used, and how the term specifies bounds. As an example, see the implementor agreement Linear Range Searching.

The following Truncation attribute is added to the bib-1 attribute set, as a result of the October, 1996 ZIG meeting in Brussels:

104: Z39.58
Truncation as defined in Z39.58-1992, section 7.7.2: "Character Masking" (page 9); including subsections:
  • 7.7.2.1: "Variable number of characters", and
  • 7.7.2.2: "Fixed number of characters".

The rules governing the usage of this attribute are explicitly articulated as follows (if this is inconsistent with Z39.58, these rules take precedence).

The character '?' (question mark) is used to mask a variable number of characters. It may be followed by a positive integer, i.e. one or more consecutive decimal digits (where the first is non-zero). The integer represented by the string of digits (beginning with the digit immediately following the '?', up to and not including the first non-digit character) indicates a range of characters to mask, from zero up to and including the specified integer.

When '?' is not immediately followed by a positive decimal digit, it indicates an arbitrary number of characters to mask (from zero to a system defined limit).

The character '#' (pound or number sign) is used to mask a single character. Multiple consecutive occurrences of '#' may be used to indicate a precise number of characters to mask.
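
The masking rules above map naturally onto regular expressions. The following Python sketch shows one possible translation; it is not taken from Cheshire II or from the standard, and the function name z3958_mask_to_regex is hypothetical. The "system defined limit" for a bare '?' is treated as unlimited here, which is an assumption.

```python
import re


def z3958_mask_to_regex(term):
    """Translate a Z39.58-style masked term into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "?":
            j = i + 1
            # '?' followed by a positive integer: mask 0..n characters.
            if j < len(term) and term[j] in "123456789":
                while j < len(term) and term[j] in "0123456789":
                    j += 1
                parts.append(".{0,%d}" % int(term[i + 1:j]))
                i = j
                continue
            # Bare '?': arbitrary number of characters (no limit in this sketch).
            parts.append(".*")
        elif ch == "#":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("^" + "".join(parts) + "$")


pattern = z3958_mask_to_regex("colo?1r")
print(bool(pattern.match("color")), bool(pattern.match("colour")))
print(bool(z3958_mask_to_regex("wom#n").match("women")))
```

Here "colo?1r" masks zero or one character between "colo" and "r", while "wom#n" requires exactly one character in place of the '#'.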


Bib-1 Use Attributes -- Extensions

Name Value Semantics
Doc-id(semantic definition change) 1032 The following semantic definition was adopted at the August 1997 ZIG meeting. An identifier or Doc-ID, assigned by a server, that uniquely identifies a document on that server. May or may not be persistent. May be, for example, a URL.
SICI 1037 The Serial Item and Contribution Identifier; based on NISO Z39.56, which provides an extensible mechanism for the creation of a code which uniquely identifies either an issue of a serial title or a contribution (e.g., article) within a serial regardless of distribution medium (paper, electronic, microform, etc.).
Following (thru 1083) submitted by German Library
Abstract-language 1038 The language code or the language name for the language of the abstract. Entries are made in accordance with ISO 639-1988 'Code for the representation of names of language'. The 2-digit letter code is used as language code, the English term is used as language name.
Application-kind 1039 Information (code or full text) about the patent application.
Classification 1040 The code or the section heading of a classification system. The attribute is a general (broad) attribute for codes which do not exist as a search key.
Classification-basic 1041 The code or the section heading of the Dutch classification system.
Classification-local-record 1042 The code or the section heading of a classification system assigned to a holding record.
Enzyme 1043 The enzyme code or the enzyme nomenclature. Codes are assigned by the 'Enzyme Commission' in the 'International Union of Biochemistry'.
Possessing-institution 1044 A code (library symbol or other code) of the institution which possesses the document. The codes are agreed upon between the partners.
Record-linking 1045 An alphabetical code which determines the type of linking.
Record-status 1046 Information about the status of the record, e.g. new, corrected, deleted, revised.
Treatment 1047 A statement (code or full text) describing subject aspects of the content.
Control-number-GKD 1048 The identification number of a corporate body name from the German authority file for corporate body names "Gemeinsame Koerperschaftsdatei" (GKD).
Control-number-linking 1049 The unique identification number of the linked record. Attribute 'code-record-type' contains a code for the type of record which contains the 'control-number-linking' (b - bibliographical record, e - record for copy-specific data, n - authority record, l - holding record, p - record for cross references, v - full text). Attribute 'code-record-linking' contains a code which determines the type of linking.
Control-number-PND 1050 The identification number of a personal name from the German authority file for personal names "Personennamendatei" (PND).
Control-number-SWD 1051 The identification number of a subject heading from the German authority file for subject headings "Schlagwortnormdatei" (SWD).
Control-number-ZDB 1052 The identification number of a document in the German database for serials "Zeitschriftendatenbank" (ZDB).
Country-publication (country of publication) 1053 The country code or the country name of the country where the document has been published. Entries are made according to ISO 3166 'Codes for the Representation of Names of Countries'. As country code a 2-digit letter code is used, as country name the English country name.
Date-conference (meeting date) 1054 The date of the conference or of another meeting.
Date-record-status 1055 The date on which the record status was assigned.
Dissertation-information 1056 Information about a dissertation thesis, or another publication connected with an academic degree.
Meeting-organizer 1057 The name of the organizer or the sponsor of a conference.
Note-availability 1058 Information about the availability of a document (delivery information).
Number-CAS-registry (CAS registry number) 1059 The 'Chemical Abstract Registry' number of the substance described in a document.
Number-document (document number) 1060 The publication number of the document (e.g. the number of the abstract in a secondary publication, or the number of the manuscript in a primary journal) provided that this number is not used as 'internal key' in the database. A publication number used as 'internal key' is entered in Attribute 12.
Number-local-accounting 1061 The account number of the document assigned by the accounting system.
Number-local-acquisition 1062 Document acquisition number assigned by the system.
Number-local-call-copy-specific 1063 The document's shelf number.
Number-of-reference (reference count) 1064 The number of literature references cited in a document.
Number-norm 1065 The number of a norm or standard.
Number-volume 1066 The number of single volumes of a multivolume publication, (year's) issues of serials, parts of multivolume publications, serials journals etc.
Place-conference (meeting location) 1067 The place where the conference or meeting was held.
Reference (references and footnotes) 1068 A literature reference from a document. The bibliographic data of a reference consists e.g. of a title, author, journal, publication year, volume and page information of the cited document.
Referenced-journal (reference work) 1069 The title of a journal cited in a document.
Section-code 1070 The section code of a subject classification.
Section-heading 1071 The section heading of a subject classification.
Subject-GOO 1072 A subject heading from the 'Gemeenschappelijke Onderwerpsontsluiting' (GOO).
Subject-name-conference 1073 A conference name used as subject heading.
Subject-name-corporate 1074 A corporate body name used as a subject heading.
Subject-name-form 1075 A formal topical subject heading, e.g. collection of articles, handbook, source.
Subject-name-geographical 1076 A geographical/ethnographical subject heading.
Subject-name-chronological 1077 A chronological subject heading.
Subject-name-title 1078 A title proper used as subject heading.
Subject-name-topical 1079 A topical subject heading.
Subject-uncontrolled 1080 An uncontrolled subject term. General (broad) attribute for subject terms not specified any further, existing as a search key.
Terminology-chemical (chemical name) 1081 The description of a chemical substance. This is either the name of the substance or a name from another classification system, e.g. enzyme code.
Title-translated 1082 The translation of a title.
Year-of-beginning 1083 The publication year of the first issue/volume of serial publications (journals, newspapers, etc.).
Following (thru 1096) submitted by Danish National Library Authority, approved at January 1998 ZIG meeting
Year-of-ending 1084 The publication year of the last issue/volume of serial publications (journals, newspapers, etc.).
Subject-AGROVOC 1085 A subject heading from the multilingual agricultural thesaurus from FAO.
Subject-COMPASS 1086 A subject heading from Computer Aided Subject System from British Library.
Subject-EPT 1087 A subject heading from European Pedagogical Thesaurus.
Subject-NAL 1088 A subject heading from National Agricultural Library.
Classification-BCM 1089 A classification number from British Catalogue of Music.
Classification-DB 1090 A classification number from Deutsche Bibliothek.
Identifier-ISRC 1091 ISRC. International standard recording code (ISO 3901).
Identifier-ISMN 1092 ISMN. International standard music number (ISO 10957).
Identifier-ISRN 1093 ISRN. International standard technical report number (ISO 10444).
Identifier-DOI 1094 Digital Object Identifier.
Code-language-original 1095 A code that indicates the original language of the item.
Title-later 1096 A later version of title.

The following 15 Use attributes correspond to the Dublin Core elements. Semantics are defined in Dublin Core Metadata Element Set: Reference Description.

Following (1097 thru 1111) are Dublin Core Use attributes, approved at June 1998 ZIG meeting. For detailed semantics see Description of Dublin Core Elements
Name Value
DC-Title 1097 
DC-Creator 1098 
DC-Subject 1099 
DC-Description 1100 
DC-Publisher 1101 
DC-Date 1102 
DC-ResourceType 1103 
DC-ResourceIdentifier 1104 
DC-Language 1105 
DC-OtherContributor 1106 
DC-Format 1107 
DC-Source 1108 
DC-Relation 1109 
DC-Coverage 1110 
DC-RightsManagement 1111 
The following (1112 thru 1184) are GILS Use attributes, approved at the June 1998 ZIG meeting. The GILS Use Attributes correspond semantically to GILS Core Elements, described in the GILS Z39.50 Specification Annex E.
Name Value
Controlled Subject Index 1112 
Subject Thesaurus 1113 
Index Terms -- Controlled 1114 
Controlled Term 1115 
Spatial Domain 1116 
Bounding Coordinates 1117 
West Bounding Coordinate 1118 
East Bounding Coordinate 1119 
North Bounding Coordinate 1120 
South Bounding Coordinate 1121 
Place 1122 
Place Keyword Thesaurus 1123 
Place Keyword 1124 
Time Period 1125 
Time Period Textual 1126 
Time Period Structured 1127 
Beginning Date 1128 
Ending Date 1129 
Availability 1130 
Distributor 1131 
Distributor Name 1132 
Distributor Organization 1133 
Distributor Street Address 1134 
Distributor City 1135 
Distributor State or Province 1136 
Distributor Zip or Postal Code 1137 
Distributor Country 1138 
Distributor Network Address 1139 
Distributor Hours of Service 1140 
Distributor Telephone 1141 
Distributor Fax 1142 
Resource Description 1143 
Order Process 1144 
Order Information 1145 
Cost 1146 
Cost Information 1147 
Technical Prerequisites 1148 
Available Time Period 1149 
Available Time Textual 1150 
Available Time Structured 1151 
Available Linkage 1152 
Linkage Type 1153 
Linkage 1154 
Sources of Data 1155 
Methodology 1156 
Access Constraints 1157 
General Access Constraints 1158 
Originator Dissemination Control 1159 
Security Classification Control 1160 
Use Constraints 1161 
Point of Contact 1162 
Contact Name 1163 
Contact Organization 1164 
Contact Street Address 1165 
Contact City 1166 
Contact State or Province 1167 
Contact Zip or Postal Code 1168 
Contact Country 1169 
Contact Network Address 1170 
Contact Hours of Service 1171 
Contact Telephone 1172 
Contact Fax 1173 
Supplemental Information 1174 
Purpose 1175 
Agency Program 1176 
Cross Reference 1177 
Cross Reference Title 1178 
Cross Reference Relationship 1179 
Cross Reference Linkage 1180 
Schedule Number 1181 
Original Control Identifier 1182 
Language of Record 1183 
Record Review Date 1184 


Table A-3-1b : Z39.50 GILS USE Attributes
(GILS (OID 1.2.840.10003.3.5) also includes all BIB-1 attributes)

Use Value Use Value
DistributorName 2001 Contact Street Address 2025
Index Terms -- Controlled 2002 Contact City 2026
Purpose 2003 Contact State 2027
Access Constraints 2004 Contact Zip Code 2028
Use Constraints 2005 Contact Country 2029
Distributor Organization 2006 Contact Network Address 2030
Distributor Street Address 2007 Contact Hours of Service 2031
Distributor City 2008 Contact Telephone 2032
Distributor State 2009 Contact Fax 2033
Distributor Zip Code 2010 Agency Program 2034
Distributor Country 2011 Sources of Data 2035
Distributor Network Address 2012 Thesaurus 2036
Distributor Hours of Service 2013 Methodology 2037
Distributor Telephone 2014 Bounding Rectangle -- Western-most 2038
Distributor Fax 2015 Bounding Rectangle -- Eastern-most 2039
Available Resource Description 2016 Bounding Rectangle -- Northern-most 2040
Available Order Process 2017 Bounding Rectangle -- Southern-most 2041
Available Technical Prerequisites 2018 Geographic Keyword Name 2042
Available Time Period -- Structured 2019 Geographic Keyword Type 2043
Available Time Period -- Textual 2020 Time Period - Structured 2044
Available Linkage 2021 Time Period - Textual 2045
Available Linkage Type 2022 Cross Reference Title 2046
Contact Name 2023 Cross Reference Linkage 2047
Contact Organization 2024 Cross Reference Type 2048
Original Control Identifier 2049
Supplemental Information 2050


Table A-3-2: Z39.50 BIB-1 Relation Attributes

Relation Value Relation Value Relation Value
less than 1 greater or equal 4 phonetic 100
less or equal 2 greater than 5 stem 101
equal 3 not equal 6 relevance 102
AlwaysMatches 103


Table A-3-3: Z39.50 BIB-1 Position Attributes

Position Value Position Value Position Value
first in field 1 first in subfield 2 any position in field 3


Table A-3-4: Z39.50 BIB-1 Structure Attributes

Structure Value Structure Value Structure Value
Phrase 1 word list 6 urx 104
word 2 date (un-normalized) 100 free-form-text 105
key 3 name (normalized) 101 document-text 106
year 4 name (un-normalized) 102 local number 107
date (normalized) 5 structure 103 string 108
numeric string 109


Table A-3-5: Z39.50 BIB-1 Truncation Attributes

Truncation Value Truncation Value Truncation Value
Right Truncation 1 Do not truncate 100 Glob (regExpr-1) 102
Left truncation 2 Process # ... 101 Regexp (regExpr-2) 103
Left and right 3


Table A-3-6: Z39.50 BIB-1 Completeness Attributes

Completeness Value Completeness Value Completeness Value
incomplete subfield 1 complete subfield 2 complete field 3


SGML Catalog Entry formats

TAG arg1 arg2 Meaning
PUBLIC pubid sysid This specifies that sysid should be used as the effective system identifier if the public identifier is pubid. Sysid is a system identifier as defined in ISO 8879 and pubid is a public identifier as defined in ISO 8879.
ENTITY name sysid This specifies that sysid should be used as the effective system identifier if the entity is a general entity whose name is name.
ENTITY %name sysid This specifies that sysid should be used as the effective system identifier if the entity is a parameter entity whose name is name. Note that there is no space between the % and the name.
DOCTYPE name sysid This specifies that sysid should be used as the effective system identifier if the entity is an entity declared in a document type declaration whose document type name is name.
LINKTYPE name sysid This specifies that sysid should be used as the effective system identifier if the entity is an entity declared in a link type declaration whose link type name is name.
NOTATION name sysid This specifies that sysid should be used as the effective system identifier for a notation whose name is name. This is an extension to the SGML Open format used for compatibility with SP 1.1.1.
OVERRIDE bool bool may be YES or NO. This sets the overriding mode for entries up to the next occurrence of OVERRIDE or the end of the catalog entry file. At the beginning of a catalog entry file the overriding mode will be NO. A PUBLIC, ENTITY, DOCTYPE, LINKTYPE or NOTATION entry with an overriding mode of YES will be used whether or not the external identifier has an explicit system identifier; those with an overriding mode of NO will be ignored if the external identifier has an explicit system identifier. This is an extension to the SGML Open format, and is not implemented in this version.
SYSTEM sysid1 sysid2 This specifies that sysid2 should be used as the effective system identifier if the system identifier specified in the external identifier was sysid1. This is an extension to the SGML Open format. sysid2 should always be quoted to ensure that it is not misinterpreted when parsed by a system that does not support this extension.
SGMLDECL sysid This specifies that if the document does not contain an SGML declaration, the SGML declaration in sysid should be implied.
DOCUMENT sysid This specifies that the document entity is sysid. (for compatibility with SP 1.1.1.)
CATALOG sysid This specifies that sysid is the system identifier of an additional catalog entry file to be read after this one. Multiple CATALOG entries are allowed and will be read in order. This is an extension to the SGML Open format.


SGML Catalog Entry formats (Continued)

TAG arg1 arg2 Meaning
BASE sysid This specifies that relative storage object identifiers in system identifiers in the catalog entry file following this entry should be resolved using the first storage object identifier in sysid as the base, instead of the storage object identifiers of the storage objects comprising the catalog entry file. This is an extension to the SGML Open format. This extension is proposed in "Using SGML Open Catalogs and MIME to Exchange SGML Documents" (ftp://ftp.internic.net/internet-drafts/draft-ietf-mimesgml-exch-02.txt). Note that the sysid must exist. This is not currently implemented.
DELEGATE pubid-prefix sysid This specifies that entities with a public identifier that has pubid-prefix as a prefix should be resolved using a catalog whose system identifier is sysid. For more details, see "A Proposal for Delegating SGML Open Catalogs" (http://www.entmp.org/fpi-urn/delegate.html). This is an extension to the SGML Open format. Not currently implemented.

Notes on SGML Catalog Entries.

The delimiters can be omitted from the sysid provided it does not contain any white space. Comments, delimited by -- as in SGML, are allowed between parameters.

The environment variable SGML_CATALOG_FILES contains a list of catalog entry files. The list is separated by colons under Unix and by semi-colons under MS-DOS and Windows. These files will be searched after any catalog entry file specified using the <SGMLCat> tag or, if no such file is specified, after the catalog entry file called catalog in the same place as the DTD.
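
The separator rule for SGML_CATALOG_FILES can be illustrated with a short Python sketch. This is not Cheshire code: the function name and the paths are hypothetical, and dropping empty entries is an assumption rather than documented behavior.

```python
import os

def catalog_files(env_value, sep=os.pathsep):
    """Split an SGML_CATALOG_FILES value into a list of catalog entry files.

    os.pathsep is ':' on Unix and ';' on Windows, which mirrors the
    separators described above. Empty entries are dropped (an assumption).
    """
    return [path for path in env_value.split(sep) if path]

# Unix-style list (colon-separated); the paths are made up for illustration
print(catalog_files("/usr/local/sgml/catalog:/home/me/catalog", sep=":"))

# Windows-style list (semicolon-separated, so drive-letter colons survive)
print(catalog_files(r"C:\sgml\catalog;D:\dtds\catalog", sep=";"))
```

The semicolon under MS-DOS and Windows matters because a colon already appears after the drive letter in a path.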


Configuration File Sub-Tags for Index Extraction Elements Definition

TAG Data Type Meaning
<tagspec> set of tags This tag introduces the list of record elements and sub-elements to be indexed (or excluded) in this index file, or that are used to specify cluster keys and cluster element mappings as sub-sub-elements of the <cluster> tag in a filedef.
<ftag> tag pattern This tag introduces a tag or tag pattern that is either the entire specification of the elements to be extracted, or is the "top level" of a sequence of nested or alternative sub-tags.
<s> tag pattern This tag introduces a tag or tag pattern that specifies the elements to be extracted at the current level of a sequence of nested sub-tags. <S> tags may be nested.
<attr> name This tag introduces an attribute name that is to be extracted at the current level of a sequence of nested sub-tags. The attribute must be associated with the last tag in the sequence of <ftag> and <s> tags. When this tag is used it implies that only the value of the specified attribute, and NOT the value(s) of the entire field, is to be extracted for indexing.
<value> name This tag introduces an attribute value OR a tag content value pattern that is to be used to select some subset of tags or records to be processed. This MUST be enclosed in the <attr> </attr> OR <s> </s> tags described above. Values are used primarily for selecting sub-document parts for processing. Another use is to index only fields that have a particular attribute value. When used with "S" tags this tag should be the last in a list of "S" tags (or the only "S" tag). If the index type is one of the "EXTERNAL" types, and the VALUE tag contains either "TEXT_FILE_REF" or "EXTERNAL_URL_REF", then the filename or URL of the external data is taken from the enclosing ATTR tag.
<replace> name This tag introduces a pair of values that are to be used to modify the contents of the record. This MUST be enclosed in the <attr> </attr> tags described above and is expected only to be used for the TEXT_FILE_REF attribute value. The two elements are a pattern string and a replacement string. Whenever the pattern string is found in the string specified in the record, it is replaced with the replacement string. This permits re-direction of the external text file location (or, for example, conversion of URLs to file paths).
<attr> 008_element_name This tag introduces an element name that is to be extracted at the current level of tags. This works only for records that are USMARC and in the BOOKS format, and when the EXTRACT attribute for the index is FLD008_KEY; anything else should be ignored in processing. Only the specified single value of the 008 field is extracted for indexing. (See the list of 008 tags above.)
<attr> TEXT_FILE_REF This tag introduces an element name that is to be extracted at the current level of tags. The EXTRACT attribute for the index should be KEYWORD_EXTERNAL. This combination indicates that if a matching tag pattern is found then the contents of that tag are to be treated as a full path name to a text (or html or sgml) file containing the text to be indexed. (To index the contents of a filename contained in an attribute instead of as the tag contents, use "EXTERNAL_URL_REF" as the content of a VALUE tag enclosed in the appropriate ATTR tag.)
<attr> TEXT_DIRECTORY_REF This tag introduces an element name that is to be extracted at the current level of tags. The EXTRACT attribute for the index should be KEYWORD_EXTERNAL. This combination indicates that if a matching tag pattern is found then the contents of that tag are to be treated as a full path name to a directory containing a collection of text (or html or sgml) files containing the texts to be indexed.
<attr> PAGED_DIRECTORY_REF This tag introduces an element name that is to be extracted at the current level of tags. This is treated the same as TEXT_DIRECTORY_REF above with the following additions. The PAGED_DIRECTORY_REF directory should contain separate files for each page of a text document. The file names should contain the page number (all non-numeric parts of names will be used for references to the file but ignored for internal identification). A special index for each page is created by generating a unique "document id" for each page, and also by creating some special mapping files to link each id to the "base document" being indexed. NOTE: THIS SHOULD BE THE ONLY TAG/ATTR COMBINATION FOR THE INDEX. Searches in a PAGED_DIRECTORY_REF index will return pseudo-records for each matching page constructed according to the Display Specification named "PAGED_DEFAULT".
<attr> EXTERNAL_URL_REF This tag introduces an element name that is to be extracted at the current level of tags. The EXTRACT attribute for the index should be KEYWORD_EXTERNAL. This combination indicates that if a matching tag pattern is found then the contents of that tag are to be treated as a URL indicating the text, html, or other file containing the text to be indexed. The URL will be accessed using the appropriate retrieval protocol (e.g.: http, ftp, gopher, etc.). The actual fetching of the URL is accomplished by an external application (see the <extern_app> tag above). (To index the contents of a URL contained in an attribute instead of as the tag content, use "EXTERNAL_URL_REF" as the content of a VALUE tag enclosed in the appropriate ATTR tag.)
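
The effect of the <replace> pattern and replacement strings described in the table above can be sketched in Python. This is only an illustration: the function name, the URL, and the mirror path are hypothetical, and treating the pattern as a literal string (rather than a regular expression) is an assumption, since the description above does not say which is used.

```python
def apply_replace(value, pattern, replacement):
    """Replace every occurrence of the pattern string found in value.

    Assumption: the pattern is matched literally, not as a regular expression.
    """
    return value.replace(pattern, replacement)

# A URL stored in the record is redirected to a local file path
# (both strings are invented for this example).
stored = "http://www.example.org/texts/chapter1.txt"
print(apply_replace(stored, "http://www.example.org/", "/data/mirror/"))
```

A value that does not contain the pattern string is passed through unchanged.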

Tag Specifications

Tag specifications are designed to allow any combination of tags (or regular expression patterns representing tags) to be expressed. The configfile parser accepts extended regular expressions in the same form as the grep command. If none of the characters used to create regular expressions is used, then the system will attempt to match the exact tag specified. Case is ignored in tag matching. In the following description `character' excludes new line:

The order of precedence of operators at the same parenthesis level is the following: [], then *+?, then concatenation, then | and new line.

The following examples indicate something of how these specifications are constructed.

<TAGSPEC>
<FTAG>FLD100  </FTAG>
<FTAG>FLD110 </FTAG>
<FTAG>FLD111 </FTAG>
</TAGSPEC> 

This specification indicates that the <FLD100>, <FLD110>, and <FLD111> elements are to be extracted for the index.

<TAGSPEC>
<FTAG> FLD100 </FTAG><S> ^a </S>
</TAGSPEC>

This asks for the <a> subelement of the <FLD100> tag. (The following examples assume the enclosing TAGSPEC tags.)

<FTAG> FLD1.. </FTAG><s> ^a </S>

This asks for subelement <a> if it is the first element of any tag whose name is "FLD1" followed by two additional characters.

<ftag> FLD[178]00 </ftag><s> [abd] </s>

This asks for any <a>, <b>, or <d> subelements of <FLD100>, <FLD700>, or <FLD800>.
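
The character-class example above can be checked with a small Python sketch. Python's re module is used here only as a stand-in for the grep-style expressions the configfile parser accepts; for patterns this simple the two should agree. The helper name and the tag lists are hypothetical, and stripping the surrounding blanks reflects the rule that white space in the configuration file is ignored.

```python
import re

def matching_tags(pattern, tags):
    """Return the tags matched by an unanchored, case-insensitive pattern."""
    rx = re.compile(pattern.strip(), re.IGNORECASE)
    return [tag for tag in tags if rx.search(tag)]

fields = ["FLD100", "FLD110", "FLD245", "FLD700", "FLD800", "fld100"]
subfields = ["a", "b", "c", "d"]

# Field tags selected by <ftag> FLD[178]00 </ftag>
print(matching_tags(" FLD[178]00 ", fields))

# Subfields selected by <s> [abd] </s>
print(matching_tags(" [abd] ", subfields))
```

FLD110 and FLD245 are rejected because the class [178] must be followed by "00", and the lower-case tag is accepted because case is ignored in tag matching.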

<ftag> stuff </ftag> <s> suba </s> <s> subb </s>

This asks for the sub-elements <suba> and <subb> which are within any tag named "stuff".

<ftag> stuff.* </ftag> <s> sub1 <s> sub2 </s></s>

This asks for the sub-sub-element <sub2> only within tags <sub1> which are within any tag name beginning with the string "stuff" followed by any combination of characters. The nesting may be done as deeply as needed to specify the correct sub-element when element names may be ambiguous (e.g., specifying the tag <a> in a MARC record would get almost every field; in more deeply nested SGML, similar problems can arise).
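
As a rough model of nested <s> selection, the following Python sketch walks a small XML record and follows a chain of tag patterns downward. It is not the Cheshire extraction code: the select function and the sample record are hypothetical, and it assumes each nested pattern applies to direct children of the element matched at the previous level.

```python
import re
import xml.etree.ElementTree as ET

def select(elem, patterns):
    """Follow a chain of tag patterns through the direct children of elem."""
    if not patterns:
        return [elem]
    rx = re.compile(patterns[0], re.IGNORECASE)
    found = []
    for child in elem:
        if rx.search(child.tag):
            found.extend(select(child, patterns[1:]))
    return found

# Invented record: only one <sub2> sits inside <sub1> inside a "stuff..." tag
record = ET.fromstring(
    "<rec>"
    "<stuffing><sub1><sub2>wanted</sub2></sub1>"
    "<sub2>not nested in sub1</sub2></stuffing>"
    "<other><sub1><sub2>wrong parent</sub2></sub1></other>"
    "</rec>"
)

# Equivalent of <ftag> stuff.* </ftag> <s> sub1 <s> sub2 </s></s>
hits = select(record, ["stuff.*", "sub1", "sub2"])
print([hit.text for hit in hits])
```

Only the <sub2> reached through both a "stuff..." tag and a <sub1> tag is selected, which is the disambiguation the nesting is meant to provide.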

Obviously, these regular expressions may become fairly complex and a single expression may specify a large number of fields to be extracted by the indexing routines. Caution should be used (especially in short tags) to avoid matching OTHER tags. The '^' and '$' anchoring indicators are useful to avoid false matches. For example, a tag specification like:

<FTAG> [AB] </FTAG>

Would match not only "<A>" and "<B>", but also "<CAT>" or "<BAR>", because the pattern is not anchored with '^' or '$'. To correctly match only the "<A>" and "<B>" tags you would need to use:

<FTAG> ^[AB]$ </FTAG>
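
The difference between the unanchored and anchored patterns can be reproduced with Python's re module, used here as a stand-in for the grep-style matching described above. The helper name and the list of tag names are illustrative only.

```python
import re

tags = ["A", "B", "CAT", "BAR", "XYZ"]

def matches(pattern, names):
    """Return the names in which the case-insensitive pattern is found."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if rx.search(name)]

# Unanchored: any tag containing an A or a B is selected
print(matches("[AB]", tags))

# Anchored: the whole tag name must be exactly A or exactly B
print(matches("^[AB]$", tags))
```

The first call selects A, B, CAT and BAR; the second selects only A and B.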

As noted above, when used with the EXACTKEY extraction flag, any tags at the same level of nesting will be concatenated. For example, if the expression:

<FTAG> FLD6.. </FTAG>

is used to create an "exact subject" index, then all adjacent fields in a record that had matching tags would be concatenated into a single key. If exact keys for each individual 6xx field are wanted then some sub-elements must be specified, even if the subfield specification would match for all subfields. A specification such as:

<FTAG> FLD6.. </FTAG><s>.*</s>

Would match each individual field and extract a key made up of all of the subfields concatenated together.
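
The two EXACTKEY behaviours described above can be modelled with a short Python sketch. This is a simplified model and not the Cheshire indexing routines: the record layout, the function names, and the use of a single blank to join values are all assumptions made for illustration.

```python
import re
from itertools import groupby

# Hypothetical record: (field tag, [subfield values]) in record order
record = [
    ("FLD245", ["Alice in Wonderland"]),
    ("FLD650", ["Fantasy", "Fiction"]),
    ("FLD651", ["England"]),
    ("FLD700", ["Tenniel, John"]),
]

def field_level_keys(fields, pattern):
    """Model of <FTAG> alone: adjacent matching fields form one key."""
    rx = re.compile(pattern, re.IGNORECASE)
    keys = []
    for is_match, run in groupby(fields, key=lambda f: bool(rx.search(f[0]))):
        if is_match:
            keys.append(" ".join(v for _, subs in run for v in subs))
    return keys

def per_field_keys(fields, pattern):
    """Model of <FTAG> plus <s>.*</s>: one key per matching field."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [" ".join(subs) for tag, subs in fields if rx.search(tag)]

print(field_level_keys(record, "FLD6.."))
print(per_field_keys(record, "FLD6.."))
```

In this model the first call yields a single key covering both 6xx fields, while the second yields one key for each 6xx field.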


CONFIG FILE DTD

The following is the SGML Declaration and DTD for configuration files.

<!SGML  "ISO 8879:1986"
--
--

CHARSET
         BASESET  "ISO 646:1983//CHARSET
                   International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET  0   9   UNUSED
                  9   2   9
                  11  2   UNUSED
                  13  1   13
                  14  18  UNUSED
                  32  95  32
                  127 1   UNUSED
     BASESET   "ISO Registration Number 100//CHARSET
                ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
     DESCSET   128 32 UNUSED
               160 95 32
               255  1 UNUSED


CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
  
SCOPE    DOCUMENT
SYNTAX   
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
                           19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
         BASESET  "ISO 646:1983//CHARSET
                   International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET  0 128 0
         FUNCTION RE          13
                  RS          10
                  SPACE       32
                  TAB SEPCHAR  9
         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  NAMELEN  34
                  TAGLVL   100
                  LITLEN   1024
                  GRPGTCNT 150
                  GRPCNT   64                   

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  YES
    RANK     NO
    SHORTTAG YES
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO    NONE
>
<!doctype dbconfig [

<!-- This is the document type definition for the configuration files used
     in the Cheshire2 system. It spells out the acceptable markup for
     the configuration files and the standard codes used
  -->

<!ELEMENT dbconfig - - (DBENV?,(ConfigInclude|filedef)+)>

<!-- DBENV is a directory where the DB stores some working data -->
<!ELEMENT DBENV - - (#PCDATA)>

<!ELEMENT filedef - - (defaultpath?,(filename | filetag)+,(filecont | continclude)*, (explain | filedtd | sgmlcat | XMLSchema)* , assocfil?, history?, indexes?, clusters?, components?, dispoptions?, displays?)>
<!-- Note that not all of the types listed below are implemented directly -->
<!ATTLIST filedef TYPE (SGML | XML | MARC | MARCSGML | AUTH | CLUSTER | LCCTREE | MAPPED | SQL | DBMS | RDBMS | EXPLAIN | VIRTUAL | SGML_DATASTORE | XML_DATASTORE | MARC_DATASTORE | MARCSGML_DATASTORE) "SGML">

<!-- DefaultPath will be pre-pended to any filename that does not begin at root -->
<!ELEMENT defaultpath - - (#PCDATA)>

<!ELEMENT filename - - (#PCDATA)>

<!ELEMENT filetag   - - (#PCDATA)>

<!ELEMENT sgmlcat   - - (#PCDATA)>

<!-- optional continuation files for very large data files -->
<!ELEMENT filecont   - - (#PCDATA)>
<!ATTLIST filecont id NUMBER #REQUIRED
                   min NUMBER #REQUIRED
                   max NUMBER #REQUIRED>
<!ELEMENT continclude   - - (#PCDATA)>

<!ELEMENT explain   - - (TITLESTRING | DESCRIPTION | DISCLAIMERS | NEWS | HOURS |
			BESTTIME | LASTUPDATE | updateinterval | COVERAGE |
			PROPRIETARY | COPYRIGHTTEXT | COPYRIGHTNOTICE |
			producercontactinfo | suppliercontactinfo | submissioncontactinfo)*>

<!ELEMENT titlestring         - - (#PCDATA)>
<!ELEMENT description         - - (#PCDATA)>
<!ELEMENT disclaimers         - - (#PCDATA)>
<!ELEMENT news                - - (#PCDATA)>
<!ELEMENT hours               - - (#PCDATA)>
<!ELEMENT besttime            - - (#PCDATA)>
<!ELEMENT lastupdate          - - (#PCDATA)>
<!ELEMENT coverage            - - (#PCDATA)>
<!ELEMENT proprietary         - - (#PCDATA)>
<!ELEMENT copyrighttext       - - (#PCDATA)>
<!ELEMENT copyrightnotice     - - (#PCDATA)>


<!ELEMENT updateinterval - - (value, units)>
<!ELEMENT producercontactinfo - - (contact_name?, contact_description?,
		contact_address?, contact_email?, contact_phone?)>
<!ELEMENT suppliercontactinfo - - (contact_name?, contact_description?,
		contact_address?, contact_email?, contact_phone?)>
<!ELEMENT submissioncontactinfo - - (contact_name?, contact_description?,
		contact_address?, contact_email?, contact_phone?)>

<!ELEMENT contact_name         - - (#PCDATA)>
<!ELEMENT contact_description  - - (#PCDATA)>
<!ELEMENT contact_address      - - (#PCDATA)>
<!ELEMENT contact_email        - - (#PCDATA)>
<!ELEMENT contact_phone        - - (#PCDATA)>


<!ELEMENT value     - - (#PCDATA)>
<!ELEMENT units     - - (#PCDATA)>



<!ATTLIST (titlestring, 
          description,
	  disclaimers,
          news,
          hours,
          besttime,
          coverage,
          copyrighttext,
          copyrightnotice)  LANGUAGE CDATA #IMPLIED>


<!ELEMENT filedtd   - - (#PCDATA)>
<!ATTLIST filedtd TYPE (SGML | XML | XMLSCHEMA) "SGML" >
<!ELEMENT XMLSchema - - (#PCDATA)>
<!ELEMENT assocfil  - - (#PCDATA)>
<!ELEMENT history   - - (#PCDATA)>


<!-- The following elements provide the definition language for indexes
     to be generated for the particular file -->
<!ELEMENT indexes  - - (indexdef+)>
<!ELEMENT indexdef - - ((indxname | indxtag)+,indxmap*,indxcont*,stoplist?, indxexc? ,indxkey+)>

<!-- Note that the NORMAL Attribute below has the possible values of:
     "STEM", "WORDNET", "CLASSCLUS", "BASIC", "NONE", "DO_NOT_NORMALIZE", "REMOVE_TAGS_ONLY", 
     "STEM_NOMAP", 
     "WORDNET_NOMAP", "XKEY" or "EXACTKEY", "XKEY_NOMAP" or "EXACTKEY_NOMAP", 
     "CLASSCLUS_NOMAP", "BASIC_NOMAP", "NONE_NOMAP", "STEM_FREQ", "XKEY_FREQ", "BASIC_FREQ" and "NONE_FREQ",
     but it is also used for format specifications for Date and LAT_LONG or
     BOUNDING_BOX formats, so not all are enumerated -->
<!ATTLIST indexdef ACCESS (BTREE | HASH | RECNO | DBMS | BITMAPPED) "BTREE"
         EXTRACT (KEYWORD | EXACTKEY | FLD008KEY | 
		KEYWORD_SUBDOC | EXACTKEY_SUBDOC | URL | FILENAME |
		DATE | DATE_RANGE | DATE_TIME | DATE_TIME_RANGE |
		INTEGER_KEY | DECIMAL_KEY | FLOAT_KEY |
		INTEGER | DECIMAL | FLOAT | LAT_LONG | BOUNDING_BOX |
		KEYWORD_PROXIMITY | KEYWORD_EXTERNAL_PROXIMITY |
		KEYWORD_EXTERNAL) "KEYWORD"
         NORMAL CDATA #IMPLIED
         PRIMARYKEY (NO | REJECT | IGNORE | REPLACE) "NO" >

<!ELEMENT indxname - - (#PCDATA)>
<!ELEMENT indxtag  - - (#PCDATA)>

<!-- indxmap is used to indicate which Z39.50 attributes should be mapped
     to this index. Any attributes NOT specified imply that ANY value 
     for that element should be mapped to this index. Note that the
     order in the configfile doesn't matter (any longer). USE or ACCESS_POINT
     is required to match indexes -->
<!ELEMENT indxmap  - - ((use | relat | posit | struct | trunc | complet
			| access_point | semantic_qualifier | language 
			| content_authority | expansion | normalized_weight
			| hit_count | comparison | format | occurrence
			| indirection | functional_qualifier)+)>

<!-- possible values for the ATTRIBUTESET attribute include:
     "BIB-1", "EXP-1", "EXT-1", "CCL-1", "GILS", "STAS", "COLLECTIONS-1", 
     "CIMI-1", "GEO-1", "ZBIG", "UTIL", "XD-1", "ZTHES", "FIN-1", 
     "DAN-1" and "HOLDINGS", or any valid Z39.50 ATTRIBUTESET OID -->
<!ATTLIST indxmap ATTRIBUTESET CDATA #IMPLIED > 

<!-- optional continuation files for large index (NOT CURRENTLY USED) -->
<!ELEMENT indxcont - - (#PCDATA)>
<!ATTLIST indxcont id NUMBER #REQUIRED>

<!ELEMENT stoplist  - - (#PCDATA)>

<!-- The extern_app tag specifies the command name and arguments for 
     external indexing of URLs. The string "%~URL~%" should be used in 
     the place where the URL in the data should be substituted in order 
     to fetch the external data so that it can be indexed. A temporary 
     copy must be made of each item during indexing but NO filename 
     should be specified for the application (i.e., the output should be 
     to stdout). The recommended application to use here is curl 
     (http://curl.haxx.se) which is available on Linux, some other Unixes 
     and Mac OS X. Using curl the tag might look like: 
     "<extern_app> curl --silent %~URL~% </extern_app>" -->

<!ELEMENT extern_app - - (#PCDATA)>

<!-- Indxexc(s) contain the field specifications for areas where index
     keys are NOT to be extracted. This tag effectively masks off parts
     of the record and prohibits indexing subelements within those parts
-->
<!ELEMENT indxexc - - (tagspec+)>

<!-- Indxkey(s) contain the field specifications for index extraction
-->
<!ELEMENT indxkey - - (tagspec+)>

<!ELEMENT tagspec - - (ftag,s*,attr?)+>

<!-- ftag is a field tag: either the SGML or MARC tag generally
-->
<!ELEMENT ftag - - (#PCDATA)>

<!-- s is subfields or other selection criteria; may want to have a more
     detailed syntax for a variety of things. But right now just data 
-->
<!ELEMENT s - - (s* & #PCDATA)*>

<!-- attr is the name of the ATTRIBUTE associated with the tag that
     should be indexed (i.e. don't index the field contents, just 
     the value of that attribute).
-->
<!ELEMENT attr - - (#PCDATA | value)*>

<!-- cluster contains name of a "base" file from which a cluster file 
     was extracted, or a specification of the cluster file to be extracted
     from the current file -->
<!ELEMENT clusters - - (cluster | clusterdef)>
<!ELEMENT cluster - - (clusbase | ((clusname | clustag), cluskey, stoplist?, clusmap+))>
<!ELEMENT clusterdef - - (clusbase | (clusname, cluskey, stoplist?, clusmap+))>

<!-- clusbase is the name of the file clustered by this file -->
<!ELEMENT clusbase - - (#PCDATA)>

<!-- clusname is the name of the cluster file to be extracted from
     this file -->
<!ELEMENT clusname - - (#PCDATA)>
<!-- clustag is the name of the cluster file to be extracted from
     this file: it is an alias for clusname -->
<!ELEMENT clustag - - (#PCDATA)>

<!-- cluskey is the key on which the clustering is based -->
<!ELEMENT cluskey  - - (tagspec)>
<!-- the NORMAL attribute of cluskey indicate the normalization to be done -->
<!ATTLIST cluskey  NORMAL (STEM | WORDNET | CLASSCLUS | EXACTKEY | BASIC | NONE) "BASIC" >

<!-- clusmap indicates the mapping from base file elements to cluster file
     elements -->
<!ELEMENT clusmap  - - ((from,to,summarize?)+)>
<!ELEMENT from - - (tagspec)>
<!ELEMENT to   - - (tagspec)>
<!ELEMENT summarize - - (maxnum, tagspec)>
<!ELEMENT maxnum - - (#PCDATA)>

<!-- The following elements are used to specify a mapping between
     z39.50 attributes and a particular index definition
-->
<!ELEMENT use - - (#PCDATA)>
<!ATTLIST use ATTR CDATA #IMPLIED > 

<!ELEMENT relat - - (#PCDATA)>
<!ATTLIST relat ATTR CDATA #IMPLIED > 
<!ELEMENT posit - - (#PCDATA)>
<!ATTLIST posit ATTR CDATA #IMPLIED > 
<!ELEMENT struct - - (#PCDATA)>
<!ATTLIST struct ATTR CDATA #IMPLIED > 
<!ELEMENT trunc - - (#PCDATA)>
<!ATTLIST trunc ATTR CDATA #IMPLIED > 
<!ELEMENT complet - - (#PCDATA)>
<!ATTLIST complet ATTR CDATA #IMPLIED > 
<!ELEMENT access_point - - (#PCDATA)>
<!ATTLIST access_point ATTR CDATA #IMPLIED > 
<!ELEMENT semantic_qualifier - - (#PCDATA)>
<!ATTLIST semantic_qualifier ATTR CDATA #IMPLIED > 
<!ELEMENT language - - (#PCDATA)>
<!ATTLIST language ATTR CDATA #IMPLIED > 
<!ELEMENT content_authority - - (#PCDATA)>
<!ATTLIST content_authority ATTR CDATA #IMPLIED > 
<!ELEMENT expansion - - (#PCDATA)>
<!ATTLIST expansion ATTR CDATA #IMPLIED > 
<!ELEMENT normalized_weight - - (#PCDATA)>
<!ATTLIST normalized_weight ATTR CDATA #IMPLIED > 
<!ELEMENT hit_count - - (#PCDATA)>
<!ATTLIST hit_count ATTR CDATA #IMPLIED > 
<!ELEMENT comparison - - (#PCDATA)>
<!ATTLIST comparison ATTR CDATA #IMPLIED > 
<!ELEMENT format - - (#PCDATA)>
<!ATTLIST format ATTR CDATA #IMPLIED > 
<!ELEMENT occurrence - - (#PCDATA)>
<!ATTLIST occurrence ATTR CDATA #IMPLIED > 
<!ELEMENT indirection - - (#PCDATA)>
<!ATTLIST indirection ATTR CDATA #IMPLIED > 
<!ELEMENT functional_qualifier - - (#PCDATA)>
<!ATTLIST functional_qualifier ATTR CDATA #IMPLIED > 

<!-- display indicates the mapping from base file elements to
     the elements that should be returned in a given format-->
<!ELEMENT dispoptions - - (#PCDATA)>
<!ELEMENT displays  - - (format | displaydef)+>


<!-- format and displaydef are actually aliases for each other -->
<!ELEMENT format - - (include?,exclude?,convert?)>
<!-- the NAME attribute of format indicates the format name -->
<!-- the DEFAULT attribute indicates the default format -->
<!ATTLIST format  NAME NMTOKEN #REQUIRED 
                  OID  NMTOKEN #IMPLIED
                  DEFAULT (DEFAULT) #IMPLIED>


<!ELEMENT displaydef - - (include?,exclude?,convert?)>
<!-- the NAME attribute of displaydef indicates the format name -->
<!-- the DEFAULT attribute indicates the default format -->
<!ATTLIST displaydef  NAME NMTOKEN #REQUIRED 
                  OID  NMTOKEN #IMPLIED
                  DEFAULT (DEFAULT) #IMPLIED>

<!ELEMENT include - - (tagspec)>
<!ELEMENT exclude - - (tagspec)>

<!ATTLIST exclude COMPRESS (COMPRESS) #IMPLIED>

<!ELEMENT convert - - (clusmap)>

<!ATTLIST convert FUNCTION CDATA #IMPLIED>

<!ELEMENT components - - (componentdef*)>

<!ELEMENT componentdef - - (componentname, componentnorm*, 
                           (compstarttag | componentstarttag | componentstart),
                           (compendtag | componentendtag | componentend)?, 
                           componentindexes)>

<!ELEMENT componentname - - (#PCDATA)>
<!ELEMENT componentnorm  - - (#PCDATA)>

<!-- these are really just aliases for each other -->
<!ELEMENT compstarttag  - - (tagspec)>
<!ELEMENT componentstarttag  - - (tagspec)>
<!ELEMENT componentstart  - - (tagspec)>
<!-- these are really just aliases for each other -->
<!ELEMENT compendtag  - - (tagspec)>
<!ELEMENT componentendtag  - - (tagspec)>
<!ELEMENT componentend  - - (tagspec)>

<!ELEMENT componentindexes  - - (indexdef+)>

]>
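
The extern_app comment in the DTD above describes substituting each URL for the string "%~URL~%" in the configured command. A minimal Python sketch of that substitution follows. It only builds the argument list and does not run anything; the function name and the example URL are hypothetical, and splitting the command into arguments before substituting is a design choice of this sketch, not a description of how Cheshire invokes the application.

```python
import shlex

URL_PLACEHOLDER = "%~URL~%"

def build_command(extern_app, url):
    """Split the extern_app string into arguments, then substitute the URL.

    Substituting after splitting keeps a URL that contains blanks or shell
    metacharacters inside a single argument.
    """
    return [token.replace(URL_PLACEHOLDER, url)
            for token in shlex.split(extern_app)]

# The curl command line suggested in the DTD comment, with an invented URL
print(build_command("curl --silent %~URL~%", "http://www.example.org/doc.html"))
```

Because the application is expected to write to stdout, a caller could pass the resulting list to subprocess.run with capture_output=True to collect the fetched text.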




SAMPLE CONFIGURATION FILE

The following is an example of a configuration file (this is the one used for testing the indexing library).

<!-- This is a test config file for Cheshire II -->
<DBCONFIG>

<!-- The first filedef -->
<FILEDEF TYPE=SGML>
<!-- filetag is the "shorthand" name of the file -->
<FILETAG> bibfile </FILETAG>

<!-- filename is the full path name of the file -->
<FILENAME> /export/home/ray/Work/cheshire2/indexing/TESTDATA/morerecs.sgml 
</FILENAME>

<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> /usr3/cheshire2/new/sgml/USMARC07.DTD </FILEDTD>

<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> /export/home/ray/Work/cheshire2/indexing/TESTDATA/morerecs.sgml.asso
</ASSOCFIL>

<!-- history is the full path name of the file's history file -->
<HISTORY> /export/home/ray/Work/cheshire2/indexing/TESTDATA/morerecs.sgml.history 
</HISTORY>

<!-- The following are the index definitions for the file -->
<INDEXES>

<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=BASIC>
<INDXNAME> 
/export/home/ray/Work/cheshire2/indexing/TESTDATA/dictionary.author
</INDXNAME>
<INDXTAG> author </INDXTAG>

<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers      -->
<INDXMAP> <USE> 1 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 2 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 3 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1002 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1003 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1004 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1005 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1006 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<!-- The stoplist for this file -->
<STOPLIST> /export/home/ray/Work/cheshire2/indexing/TESTDATA/authorstoplist 
</STOPLIST>

<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index    -->
<INDXKEY>
<TAGSPEC>
<FTAG>FLD100  </FTAG>
<FTAG>FLD110 </FTAG>
<FTAG>FLD111 </FTAG>
<FTAG>FLD700 </FTAG>
<FTAG>FLD710 </FTAG>
<FTAG>FLD711 </FTAG> 
</TAGSPEC> 
</INDXKEY> 
</INDEXDEF>

<!-- The next index entry definition -->
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=BASIC>
<INDXNAME> /export/home/ray/Work/cheshire2/indexing/TESTDATA/dictionary.xauthor
</INDXNAME>
<INDXTAG> xauthor </INDXTAG>

<INDXMAP> 
<USE> 1 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 2 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 3 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 1002 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 1003 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 1004 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 1005 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 1006 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>

<STOPLIST> /export/home/ray/Work/cheshire2/indexing/TESTDATA/authorstoplist </STOPLIST>
<INDXKEY>
<TAGSPEC>
<!-- Notice the use of pattern matching in the following -->
<FTAG>FLD[178]00 </FTAG> <S> ^a </S>
<FTAG>FLD[178]10 </FTAG> <S> ^[abdc] </S>
<FTAG>FLD[178]11 </FTAG> <S> ^[abdc] </S>
<FTAG>FLD600 </FTAG> <S> ^a </S>
<FTAG>FLD610 </FTAG> <S> ^[ab] </S>
<FTAG>FLD611 </FTAG> <S> ^[ab] </S> </TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME> /export/home/ray/Work/cheshire2/indexing/TESTDATA/dictionary.title </INDXNAME>
<INDXTAG> title </INDXTAG>

<INDXMAP>
<USE> 4 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> 
<USE> 5 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> 
<USE> 6 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<STOPLIST> /export/home/ray/Work/cheshire2/indexing/TESTDATA/titlestoplist </STOPLIST>
<INDXKEY>
<TAGSPEC>
<FTAG>FLD245 </FTAG><S>[ab] </S>
<FTAG>FLD240 </FTAG> <S> [atp] </S>
<FTAG>FLD130 </FTAG>
<FTAG>FLD730 </FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=BASIC>
<INDXNAME> /export/home/ray/Work/cheshire2/indexing/TESTDATA/dictionary.xtitle </INDXNAME>
<INDXTAG> xtitle </INDXTAG>

<INDXMAP>
<USE> 4 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 5 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>
<INDXMAP> 
<USE> 6 </USE><POSIT> 1 </posit> <struct> 1 </struct> <trunc> 1 </trunc> </INDXMAP>

<STOPLIST> /export/home/ray/Work/cheshire2/indexing/TESTDATA/titlestoplist </STOPLIST>
<INDXKEY>
<TAGSPEC>
<FTAG>FLD245 </FTAG><S>[ab] </S>
<FTAG>FLD240 </FTAG> <S>[abtp] </S>
<FTAG>FLD130 </FTAG> <s>[ab] </s>
<FTAG>FLD730 </FTAG> <s>[ab] </s> </TAGSPEC> </INDXKEY> </INDEXDEF>

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME> /export/home/ray/Work/cheshire2/indexing/TESTDATA/dictionary.subject </INDXNAME>
<INDXTAG> subject </INDXTAG>

<INDXMAP>
<USE> 21 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP>
<USE> 26 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP>
<USE> 25 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP>
<USE> 27 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP>
<USE> 28 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<STOPLIST> /export/home/ray/Work/cheshire2/indexing/TESTDATA/titlestoplist </STOPLIST>
<INDXKEY>
<TAGSPEC>
<FTAG>FLD6.. </FTAG> </TAGSPEC> </INDXKEY> </INDEXDEF>


<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME> /export/home/ray/Work/cheshire2/indexing/TESTDATA/dictionary.topic </INDXNAME>
<INDXTAG> topic </INDXTAG>

<INDXMAP>
<USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>

<STOPLIST> /export/home/ray/Work/cheshire2/indexing/TESTDATA/titlestoplist </STOPLIST>
<INDXKEY>
<TAGSPEC>
<FTAG>FLD6.. </FTAG>
<FTAG>FLD245 </FTAG><S>a </S><S>b </S>
<FTAG>FLD240 </FTAG><S>a </S><S>t </S><S>p </S>
<FTAG>FLD4.. </FTAG>
<FTAG>FLD8.. </FTAG>
<FTAG>FLD130 </FTAG>
<FTAG>FLD730 </FTAG>
<FTAG>FLD740 </FTAG>
<FTAG>FLD1.. </FTAG><S>t </S>
<FTAG>FLD7.. </FTAG><S>t </S> 
</TAGSPEC> 
</INDXKEY> 
</INDEXDEF> 

<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=BASIC PRIMARYKEY=replace>
<INDXNAME> 
/usr/users/ray/Work/z3950_3/index/TESTDATA/dictionary.localnum
</INDXNAME>
<INDXTAG> localnum </INDXTAG>

<!-- The following INDXMAP items provide a mapping from the LOCALNUM tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers      -->
<INDXMAP> <USE> 12 </USE><STRUCT> 2 </STRUCT></INDXMAP>
<INDXMAP> <USE> 12 </USE><STRUCT> 1 </STRUCT></INDXMAP>
<INDXMAP> <USE> 12 </USE><STRUCT> 6 </STRUCT></INDXMAP>

<!-- The stoplist for this file -->


<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index    -->
<INDXKEY>
<TAGSPEC>
<FTAG>FLD001  </FTAG>
</TAGSPEC> 
</INDXKEY> 
</INDEXDEF>
</INDEXES> 

<!-- The following defines a cluster index to be extracted -->
<CLUSTER>
<clusname> mainclusfile </clusname>
<cluskey normal=CLASSCLUS> 
     <tagspec>
         <ftag>FLD090</ftag><s>[ab]</s>
     </tagspec>
</cluskey>
<STOPLIST> /export/home/ray/Work/cheshire2/indexing/TESTDATA/cluststoplist </STOPLIST>
<clusmap>
   <from>
       <tagspec>
          <ftag>FLD245</ftag><s>[ab]</s>
       </tagspec></from>
   <to>
       <tagspec>
          <ftag>titles</ftag>
       </tagspec></to>
   <from>
       <tagspec>
          <ftag>FLD650</ftag><s>[abcdz]</s>
       </tagspec></from>
   <to>
       <tagspec>
          <ftag>subjects</ftag>
       </tagspec></to>
   <summarize>
       <maxnum> 10 </maxnum> 
       <tagspec>
          <ftag>subjsum</ftag>
       </tagspec></summarize>

</clusmap>
</CLUSTER>
</FILEDEF> 

<!-- ******************************************************************* -->
<!-- ************************* CLUSTER FILE **************************** -->
<!-- ************************** DEFINITIONS **************************** -->
<!-- ******************************************************************* -->
<!-- The next filedef is for a cluster file implementing class clusters  -->
<FILEDEF TYPE=CLUSTER>
<!-- filetag is the "shorthand" name of the file -->
<Filetag>  mainclusfile
</FILETAG>
<!-- filename is the full path name of the file -->
<FILENAME>      /export/home/ray/Work/cheshire2/indexing/TEST2/classcluster.data
</FILENAME>
<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD>       /usr3/cheshire2/data2/classcluster.dtd
</FILEDTD>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL>      /export/home/ray/Work/cheshire2/indexing/TEST2/classcluster.assoc
</ASSOCFIL>
<!-- history is the full path name of the file's history file -->
<HISTORY>       /export/home/ray/Work/cheshire2/indexing/TEST2/classcluster.history
</HISTORY>

<!-- The following are the index definitions for the file -->

<INDEXES>
<!-- The following defines the cluster key index for the cluster -->
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=CLASSCLUS PRIMARYKEY=REPLACE>
<INDXNAME>      
                /export/home/ray/Work/cheshire2/indexing/TEST2/clasclus.lcckey.index
                
</INDXNAME>
<INDXTAG>  lckey
                
</INDXTAG>

<INDXMAP>
<USE> 16 </USE><POSIT> 1 </posit> <struct> 1 </struct> </INDXMAP>
<INDXMAP>
<USE> 16 </USE><POSIT> 1 </posit> <struct> 3 </struct> </INDXMAP>
<INDXMAP>
<USE> 16 </USE><POSIT> 3 </posit> <struct> 1 </struct> </INDXMAP>
<INDXMAP>
<USE> 16 </USE><POSIT> 3 </posit> <struct> 3 </struct> </INDXMAP>

<INDXKEY>
<TAGSPEC>
        <FTAG> CLUSKEY </FTAG>
</TAGSPEC> 
</INDXKEY> 
</INDEXDEF>

<!-- the following provides general term access to the cluster file -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME>
                /export/home/ray/Work/cheshire2/indexing/TEST2/clasclus.terms.index
                
</INDXNAME>
<INDXTAG>       terms
</INDXTAG>

<!-- this maps from z39.50 "local subject index" to this index -->
<INDXMAP>
<USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct>
</INDXMAP>
<INDXMAP>
<USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct>
</INDXMAP>

<STOPLIST> /usr3/cheshire2/data2/titlestoplist
</STOPLIST>
<INDXKEY>
<TAGSPEC>
        <FTAG>titles</FTAG>
        <FTAG>subjects</FTAG>
</TAGSPEC>
</INDXKEY>
</INDEXDEF>

</INDEXES>
<CLUSTER>
        <CLUSBASE>bibfile</CLUSBASE>
</CLUSTER>
</FILEDEF>

</DBCONFIG>
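
The INDXMAP entries in the cluster file definition above associate combinations of Z39.50 BIB-1 attributes with an index. As a rough illustration only, the following Python sketch models the entries for the "lckey" and "terms" indexes as a lookup table. The names INDEX_MAPS and resolve_index, the tuple layout, and the exact-match rule are assumptions made for this sketch; the matching logic actually used by the Cheshire II server is not described here.

```python
# Hypothetical model of the INDXMAP entries for the cluster file above.
# Keys are (USE, RELAT, POSIT, STRUCT); None means the attribute was not
# given in the INDXMAP entry. Values are the INDXTAG names from the sample.
INDEX_MAPS = {
    (16, None, 1, 1): "lckey",
    (16, None, 1, 3): "lckey",
    (16, None, 3, 1): "lckey",
    (16, None, 3, 3): "lckey",
    (29, None, 3, 6): "terms",
    (29, 102, 3, 6): "terms",
}


def resolve_index(use, relat=None, posit=None, struct=None):
    """Return the INDXTAG for an attribute combination, or None.

    Assumption: a query matches only when every attribute is identical
    to an INDXMAP entry; the real server may match more loosely.
    """
    return INDEX_MAPS.get((use, relat, posit, struct))


if __name__ == "__main__":
    print(resolve_index(29, relat=102, posit=3, struct=6))  # terms
    print(resolve_index(16, posit=1, struct=3))             # lckey
    print(resolve_index(4, posit=1, struct=1))              # None
```

In this sketch a query using USE attribute 4 finds no index because the cluster file definition above contains no INDXMAP entry with that USE value.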


Cheshire Processing Instructions in SGML/XML files

Although the following processing instructions are placed in the SGML or XML data files rather than in the configuration file, they affect how data is processed during indexing or display and are therefore related to the other information in the configuration files. Currently, all of the processing instructions affect the indexing of data. Eventually there will be similar instructions for display processing, using the CHESHIREDISPLAY keyword. There are currently six processing instructions: "IGNORE_TAG", "DELETE_TAG", "SUBSTITUTE_ATTR", "INDEX_IGNORE_TAG", "INDEX_DELETE_TAG", and "INDEX_SUBSTITUTE_ATTR".

Note that multiple index-specific processing instructions may be included for a given tag. However, if both non-index-specific and index-specific processing instructions are included for a particular tag, the non-index-specific instruction takes priority (and applies to all indexes using the tag).
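
The priority rule in the note above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tuple representation, the function name effective_instructions, and the tag and index names ("note", "emph", "title", "subject") are all hypothetical, and the sketch reads "priority" as meaning that index-specific instructions for a tag are disregarded whenever a non-index-specific instruction exists for it.

```python
# Hypothetical representation: each instruction is (name, tag, index).
# index is None for a non-index-specific instruction.
def effective_instructions(instructions, tag, index):
    """Return the instructions that apply to `tag` when building `index`.

    Assumption: a non-index-specific instruction for the tag takes
    priority and applies to every index, so index-specific instructions
    for the same tag are not consulted.
    """
    general = [i for i in instructions if i[1] == tag and i[2] is None]
    if general:
        return general
    # Several index-specific instructions may exist for the same tag.
    return [i for i in instructions if i[1] == tag and i[2] == index]


if __name__ == "__main__":
    instructions = [
        ("IGNORE_TAG", "note", None),
        ("INDEX_DELETE_TAG", "note", "title"),
        ("INDEX_IGNORE_TAG", "emph", "title"),
        ("INDEX_DELETE_TAG", "emph", "subject"),
    ]
    # The non-index-specific instruction wins for "note".
    print(effective_instructions(instructions, "note", "title"))
    # Only index-specific instructions exist for "emph".
    print(effective_instructions(instructions, "emph", "subject"))
```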


Cheshire II Project

Ray R. Larson

School of Information Management and Systems

University of California, Berkeley

Berkeley, California 94720-4600