Cheshire Configuration - Indexes

[Back to Explain ] [ Up to the contents ] [ On to Components ]

The index section of the configuration file is where you can set which tags and attributes can be searched by the system. Cheshire doesn't do a free text search of the SGML, the information must be extracted from each record in advance and the indexes section is where this is done. Searching then takes place on these indexes, specified by the name given to it in the <indxtag> field, or via the BIB1 Attributes as defined in the <indxmap> field.

The indexes tag encloses all of the indexes defined for the database, each of which is defined in an 'indexdef' tag described below.

This is the tag inside which is defined the individual index definitions. It has several attributes governing how the index is to be processed:
AccessThe format of database that the index will be recorded in. This has three possible values:
BTREEThe default binary tree database
HASHA Hash style database
DBMSAn SQL or (R)DBMS style database
ExtractThe sort of extraction to be done on the data. The possible values are:
KEYWORDExtract keywords. This is the most common type of index as it simply goes through and records all words that aren't stoplisted as being present. The words are split by the following list of punctuation: \t\n\r\a\b\v\f`~!@#$%^&*()_=|\\{[]};:'\",<.>?/ and space
KEYWORD_PROXIMITYExtract keywords and maintain the proximity information. This is to allow for proximity searching - for example searching for 'cat' within 4 words of 'hat' to look for 'cat in the hat'. These indexes are much larger than plain KEYWORD indexes.
KEYWORD_EXTERNALThe keywords extracted are indicators of external, non SGML files. These files are read in and processed ala KEYWORD. There is also a KEYWORD_EXTERNAL_PROXIMITY, which maintains proximity information as above. This can be used with tags that are not external as well, so for example you could index the contents of the subject tag, the topic tag and the file pointed to by the external_topic tag.
EXACTKEYExtract the exact contents of the tag specified to index. This is useful for things like ID Numbers or Exact Titles.
DATEThe contents are dates, and should be parsed so that they can be searched with non boolean operators (eg 'search date > 1600' ). The type of date format needs to be specified in the Normal attribute, as described below.
DATE_RANGEThe contents are ranges of dates, for example Jan 2000 - Mar 2000. Again, they need to have the format set in the Normal attribute.
DATE_TIMEThe contents are dates with times, for example 1pm Jan 1, 2000. There is also DATE_TIME_RANGE, which has ranges of dates and times. Again the format needs to be specified.
URLThe contents extracted are URLs, and hence will not be split up at punctuation like '.'s and '/'s. The full list of punctuation that will be split at is: \t\n\r\a\b\v\f<> and space; the same applies for FILENAME below.
FILENAMEThe contents of the index are filenames
FLD008Used only for (US)MARC records, the index is for special processing of the 008 field.
INTEGER, DECIMAL, FLOATThese extractions types are used for numeric fields. This will allow < and > (etc) operators to work properly on the extracted data. Only a single value can be included in such a field.
Normal What normalisation should be done on the data once extracted. Possible values:
STEMThe words should be stemmed - this allows you to find 'colours' with a search for 'colour' and so forth. Useful for KEYWORD style indexes.
NONEThe words should be left as is. Especially useful for EXACTKEY indexes (In fact they're pretty useless without this as the normalisation!)
DO_NOT_NORMALIZEThis is a further step from NONE in which case, punctuation etc are all left alone as well. This is especially useful for PRIMARYKEY indexes which need to maintain the difference between case (etc) or title indexes where case is important and not reconstructable.
CLASSCLUSThis is used for indexes of Library of Congress classification numbers. The index is then specially processed after extraction into a normalised form.
WORDNETThis indicates that the wordnet morphing algorithm should be used. If you don't know what this is, don't worry.

All of the above except DO_NOT_NORMALIZE can have '_NOMAP' appended to them to prevent character mapping from happening during the normalisation process.
The possible date formats are listed in the configuration file documentation and aren't gone over here as there are many and they're fairly straight forwards.

PrimaryKey Primary Key indexes are those in which the data should be unique. This attribute governs whether this is the case, and how duplicate data should be handled. If the index is not a primary key, then you can leave out the attribute. It has three possible values:
IGNOREAny duplicated keys should be ignored, keeping the first occurance.
REPLACEAny duplicated keys should replace the previous occurance.
NONEIndex is not a primary key index.

An example or two then. For a unique identifier index, we might use:
<indexdef access=btree extract=exactkey normal=do_not_normalize primarykey=replace>

Or for a topic keyword index:
<indexdef access=btree extract=keyword normal=stem>

This contains the name of the file in which the index database will be kept. For ease, we keep all indexes in the 'indexes' directory, and call them '(name).index'. The name is typically the name of the SGML tag to be indexed, and is the contents of the next field, indxtag. An example:
<indxname> /home/cheshire/cheshire/ead/indexes/subject.index </indxname>

This is the name of the index, and needs to be unique within the database. This is used when querying to tell which particular index we want to search. To continue the example above,
<indxtag> subject </indxtag>

The indxmap tag is used to provide mappings for the Z39.50 server. Queries typically come in the BIB-1 set, and this then needs to be mapped to where the appropriate information is for this database. Thankfully, for those of us who are not Z39.50/BIB1 gurus and just want to index our SGML, this tag is not necessary.

It has several subtags, which correspond to the various z39.50 attributes: use, relation, position, struct, trunc and complet. It would take many many pages to describe how to properly set all of these up. If you need to, there's good z39.50 documentation around -- if you have the time and found this document useful, perhaps you could contribute a Cheshire and Z39.50 page! See the configfiles documentation on this subject.

The stoplist tag contains the path to a file which has a list of words, one per line, to exclude from this index. Such words as 'and', 'the', 'us', 'I' etc are often put into these stoplists. Some default stoplists are provided, and should be put into the default/stoplists directory. As such an example might be:
<stoplist> /home/cheshire/cheshire/default/stoplists/basic.stoplist </stoplist>

This is where we specify which tags and attributes we want to process for this index. Each index can have multiple tag specifications to be included in it, which are recorded in a tagspec field, described below.

This field encloses a list of tags or attributes, which may be only one. This has good examples in the configfile documentation as well, but will be explained here again. All of the tagspec subfields may contain regular expressions, in the same manner as the 'grep' command. More about the regular expressions later. The names are not case sensitive, so subject will work the same as SUBJECT.

The ftag field gives the base tag to look for. If you wanted to index the contents of the subject tag, then this would be:
<ftag> subject </ftag>

By using the S tag, you can specify a subtag to index within the one given in ftag. So to index only the contents of the 'title' tags within a 'bibliography' tag, and no titles anywhere else:
<ftag> bibliography </ftag> <s> title </s>

If S tags are repeated one after the other, then it will index all of the sub tags listed. To index title and article tags in the bibliography tag:
<ftag> bibliography </ftag> <s> title </s> <s> article </s>

If the S tags are nested, then you can specify multiple levels down to index. For example to index the surname tag, within the author tag, within bibliography:
<ftag> bibliography </ftag> <s> author <s> surname </s> </s>

The attr tag allows us to index the contents of an attribute within a tag, rather than the main contents of the tag itself. For example, to create an index of all the URLs stored in the HREF attribute of the A tags:
<ftag> a </ftag> <attr> href </attr>

Used in combination with the attr tag above, the value tag implies: Index the contents of the given tag if the given attribute has this value. The value tag goes inside the attr tag, or the S tag as described below, and applies only to this attribute or tag. So to index all of the red text in a manuscript, you might use something like:
<ftag> hand </ftag> <attr> color <value> red </value> </attr>

This can also be used in conjunction with the S tag, to index elements which have a subtag with the given value. To index the fld550 tag, if it contains a subtag called w with the contents of just 'g', we would use:
<ftag> fld550 </ftag> <s> w <value> g </value> </s>

Regular Expressions
Rather than repeat documentation that's available everywhere, I'll only discuss regular expressions as they apply to Cheshire.

If there are no regular expression symbols in the tag name, then it will be treated as an exact match only. The A tags above would not match 'table' or 'clusmap'.

If regular expressions are used, then the tags should be anchored with the regexp indicators of beginning and end - ^ and $ respectively. Otherwise strange results may occur where the tag name occurs within another. This is especially true of short tags such as A from HTML or W from USMARC.

There are many examples of regular expressions in tags in the configfile documentation, as referenced above, but here's one quick example. To index all the tags that start with 'fld5' followed by two further numbers (for USMARC in particular), we could use:
<ftag> ^fld5[0-9][0-9]$ </ftag>

Regular expressions are particularly useful in the value tag as they allow you to specify a broad possibility of choices. For example to index all the Anchor tags in HTML that have the HTTP protocol but none of the other possibilities we might use:
<ftag> ^a$ </ftag> <attr> href <value> ^http://(.+)$ </value> </attr>
Though note well that this would index the contents of the A tag, not the contents of the attribute! Indexing the contents of an attribute that matches a regular expression cannot be done - simply index them all.