The indexes section of the configuration file is where you set which tags and attributes can be searched. Cheshire does not do a free-text search of the SGML: the information must be extracted from each record in advance, and the indexes section is where that extraction is defined. Searching then takes place on these indexes, each identified either by the name given in its <indxtag> field or via the BIB-1 attributes defined in its <indxmap> field.
indexes
The indexes tag encloses all of the indexes defined for the database, each of which is defined in an 'indexdef' tag described below.
indexdef
Each individual index is defined inside an indexdef tag. The tag has several attributes governing how the index is to be processed:
Access: The format of the database in which the index will be recorded. This has three possible values:
Extract: The sort of extraction to be done on the data. The possible values are:
Normal: What normalisation should be done on the data once extracted. Possible values:
All of the values except DO_NOT_NORMALIZE can have '_NOMAP' appended to them to prevent character mapping from happening during the normalisation process.
PrimaryKey: Primary Key indexes are those in which the data should be unique. This attribute governs whether this is the case, and how duplicate data should be handled. If the index is not a primary key, then you can leave out the attribute. It has three possible values:
An example or two then. For a unique identifier index, we might use:
<indexdef access=btree extract=exactkey normal=do_not_normalize primarykey=replace>
Or for a topic keyword index:
<indexdef access=btree extract=keyword normal=stem>
indxname
This contains the name of the file in which the index database will be kept. For ease, we keep all indexes in the 'indexes' directory and call them '(name).index'. The name is typically that of the SGML tag to be indexed, and is also used as the contents of the next field, indxtag. An example:
<indxname> /home/cheshire/cheshire/ead/indexes/subject.index </indxname>
indxtag
This is the name of the index, and it needs to be unique within the database. It is used when querying to say which particular index we want to search. To continue the example above:
<indxtag> subject </indxtag>
indxmap
The indxmap tag provides mappings for the Z39.50 server. Queries typically arrive using the BIB-1 attribute set, and these attributes need to be mapped to the index that holds the appropriate information for this database. Thankfully, for those of us who are not Z39.50/BIB-1 gurus and just want to index our SGML, this tag is not necessary.
It has several subtags, which correspond to the various Z39.50 attributes: use, relation, position, struct, trunc and complet. It would take many pages to describe how to set all of these up properly. If you need to, there is good Z39.50 documentation around, and the configfiles documentation also covers this subject. If you have the time and found this document useful, perhaps you could contribute a Cheshire and Z39.50 page!
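For orientation only, a minimal indxmap might look like the sketch below. It assumes that the subtags are written as ordinary nested fields, and that the numbers are the standard BIB-1 values (use attribute 21 for subject heading, structure attribute 6 for word list). Check both assumptions against the configfiles documentation before relying on them.

```sgml
<!-- Assumed values: BIB-1 use 21 = subject heading, structure 6 = word list -->
<indxmap> <use> 21 </use> <struct> 6 </struct> </indxmap>
```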
stoplist
The stoplist tag contains the path to a file holding a list of words, one per line, to exclude from this index. Words such as 'and', 'the', 'us' and 'I' are often put into these stoplists. Some default stoplists are provided, and should be put into the default/stoplists directory. An example might be:
<stoplist> /home/cheshire/cheshire/default/stoplists/basic.stoplist </stoplist>
indxkey
This is where we specify which tags and attributes we want to process for this index. Each index can have multiple tag specifications to be included in it, which are recorded in a tagspec field, described below.
tagspec
This field encloses a list of one or more tags or attributes. The configfile documentation has good examples as well, but the subfields are explained again here. All of the tagspec subfields may contain regular expressions, in the same manner as the 'grep' command; more about regular expressions later. The names are not case sensitive, so subject will work the same as SUBJECT.
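As a rough sketch of the "list of tags" idea, a tagspec that feeds two different tags into the same index might be laid out as follows. The tag names 'subject' and 'topic' are hypothetical, the layout is an assumption based on the nesting described here, and the ftag field itself is covered in the next section.

```sgml
<indxkey>
<tagspec>
<!-- 'subject' and 'topic' are hypothetical tag names -->
<ftag> subject </ftag>
<ftag> topic </ftag>
</tagspec>
</indxkey>
```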
ftag
The ftag field gives the base tag to look for. If you wanted to index the contents of the subject tag, then this would be:
<ftag> subject </ftag>
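Putting the fields described so far together, a complete definition for the subject keyword index might look something like the sketch below. It reuses the example values from the sections above. The ordering of the fields and the exact layout are assumptions, so compare it with a working configuration file before copying it.

```sgml
<indexes>
<!-- Sketch only: one keyword index on the 'subject' tag -->
<indexdef access=btree extract=keyword normal=stem>
<indxname> /home/cheshire/cheshire/ead/indexes/subject.index </indxname>
<indxtag> subject </indxtag>
<stoplist> /home/cheshire/cheshire/default/stoplists/basic.stoplist </stoplist>
<indxkey>
<tagspec>
<ftag> subject </ftag>
</tagspec>
</indxkey>
</indexdef>
</indexes>
```

Further indexes would each get their own indexdef inside the same indexes tag.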
s
By using the S tag, you can specify a subtag to index within the one given in ftag. So to index only the contents of the 'title' tags within a 'bibliography' tag, and no titles anywhere else:
<ftag> bibliography </ftag> <s> title </s>
If S tags are repeated one after the other, then all of the listed subtags will be indexed. To index title and article tags in the bibliography tag:
<ftag> bibliography </ftag> <s> title </s> <s> article </s>
If the S tags are nested, then you can specify multiple levels down to index. For example to index the surname tag, within the author tag, within bibliography:
<ftag> bibliography </ftag> <s> author <s> surname </s> </s>
attr
The attr tag allows us to index the contents of an attribute within a tag, rather than the main contents of the tag itself. For example, to create an index of all the URLs stored in the HREF attribute of the A tags:
<ftag> a </ftag> <attr> href </attr>
value
Used in combination with the attr tag above, the value tag means: index the contents of the given tag only if the given attribute has this value. The value tag goes inside the attr tag, or inside the S tag as described below, and applies only to that attribute or tag. So to index all of the red text in a manuscript, you might use something like:
<ftag> hand </ftag> <attr> color <value> red </value> </attr>
This can also be used in conjunction with the S tag, to index elements which have a subtag with the given value. To index the fld550 tag, if it contains a subtag called w with the contents of just 'g', we would use:
<ftag> fld550 </ftag> <s> w <value> g </value> </s>
Regular Expressions
Rather than repeat documentation that's available everywhere, I'll only discuss regular expressions as they apply to Cheshire.
If there are no regular expression symbols in the tag name, then it is treated as an exact match only. The 'a' given in the ftag example above would not match 'table' or 'clusmap'.
If regular expressions are used, then the tag names should be anchored with the regexp indicators of beginning and end, ^ and $ respectively. Otherwise strange results may occur where the tag name occurs within another. This is especially true of short tags such as A from HTML or W from USMARC.
There are many examples of regular expressions in tags in the configfile documentation, as referenced above, but here's one quick example. To index all the tags that start with 'fld5' followed by two further digits (for USMARC in particular), we could use:
<ftag> ^fld5[0-9][0-9]$ </ftag>
Regular expressions are particularly useful in the value tag, as they allow you to specify a broad range of choices. For example, to index all the anchor tags in HTML whose links use the HTTP protocol and none of the other possibilities, we might use:
<ftag> ^a$ </ftag> <attr> href <value> ^http://(.+)$ </value> </attr>
Note well, though, that this indexes the contents of the A tag, not the contents of the attribute! Indexing only those attribute values that match a regular expression cannot be done; simply index them all.
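To make the difference concrete, the two specifications can be set side by side. This is only a sketch reusing the examples from this page, and wrapping each one in its own tagspec is an assumption about layout.

```sgml
<!-- Indexes the text inside A tags whose href starts with http:// -->
<tagspec> <ftag> ^a$ </ftag> <attr> href <value> ^http://(.+)$ </value> </attr> </tagspec>

<!-- Indexes the contents of every href attribute, whatever the protocol -->
<tagspec> <ftag> a </ftag> <attr> href </attr> </tagspec>
```

If only the HTTP URLs are wanted from the second index, the filtering has to happen at search time or in whatever consumes the results.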