Exteca Categorisation

The Exteca Categorisation module can be used to find metadata for a document based on an ontology. The ontology provides a comprehensive description of a domain of knowledge, expressed as a set of interlinked concepts. These concepts can be embellished with rules that guide the categorisation process.

Using the API

The API is very simple yet flexible. Here is an example of typical usage.

First, let's create the objects we will need:

// The ontology and the metadata that will hold the results
Ontology ontology;
MetaData md;

Next we create the categorisation engine itself:

// Create the categorisation engine
CategorisationEngine engine = new CategorisationEngine();

Then we create a tokeniser. The tokeniser splits text into a sequence of tokens: words, punctuation, and so on. It is used in two places in the categorisation engine: first to interpret any rules within the ontology, and second to interpret the documents that will be processed. In this instance we use the supplied FsmTokeniser, but you can provide your own tokenisation code as long as it implements the Tokeniser interface.

// Create a tokeniser
Tokeniser tokeniser;
FsmTokeniserRules tokeniserRules;
tokeniserRules = new FsmTokeniserRules();
tokeniserRules.read("english.tok");
tokeniser = new FsmTokeniser(tokeniserRules);
engine.setTokeniser(tokeniser);
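
If the supplied tokeniser does not fit your text, you can plug in your own implementation of the Tokeniser interface. The sketch below is only illustrative: the tokenise() method name and signature are assumptions, so check the actual interface in the API documentation before writing one.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical custom tokeniser that splits on whitespace only.
// The tokenise() method name and signature are assumptions;
// the real Tokeniser interface may differ.
public class WhitespaceTokeniser implements Tokeniser {
    public List tokenise(String text) {
        List tokens = new ArrayList();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}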

Next comes an optional step: introducing morphers to the engine. Morphers transform words and are used to increase the likelihood of matching a word pattern; rules can be written to match the transformed form of a word. The most commonly used morphers are stemmers and case converters. In this instance we load a stemmer into the engine.

// Optionally create and add morphers
Morpher stemmer;
AfixStemmerRules rules;
rules = new AfixStemmerRules();
rules.read("english.stm");
stemmer = new AfixStemmer(rules, "stem");
engine.addMorpher(stemmer);
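
A case converter is the other common morpher mentioned above. No class for it is shown in this guide, so the sketch below is hypothetical: it assumes the Morpher interface declares a single morph() method, which may not match the real API.

// Hypothetical case-converting morpher. The morph() method name
// and signature are assumptions; check the Morpher interface.
public class LowerCaseMorpher implements Morpher {
    // Reduce a word to lower case so rules can match
    // regardless of capitalisation
    public String morph(String word) {
        return word.toLowerCase();
    }
}

engine.addMorpher(new LowerCaseMorpher());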

Now we add some optional markers. Markers find meaningful blocks of text in the document. Rules can be written to match within these blocks of text.

// Optionally create and add markers
SentenceMarker sentenceMarker;
sentenceMarker = new SentenceMarker();
engine.addMarker(sentenceMarker);

Next we can load our ontology. An efficient in-memory representation of the ontology is created.

// Create and load an ontology
ontology = new Ontology(Ontology.XML_PERSISTENCE);
ontology.read("myontology.xml");
engine.load(ontology);

Finally we can process our document, which must be supplied as XML. The return value from the process() method is the metadata, containing the categorisation information.

// Process a document
md = engine.process(new FileReader("mydocument.xml"));

The metadata can then be examined and used as required.
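
For example, you might list the categories that were found. The getCategories() accessor below is a hypothetical name used for illustration; consult the MetaData API documentation for the real methods.

// List the categories found in the document.
// getCategories() is a hypothetical accessor, not a documented method.
Iterator it = md.getCategories().iterator();
while (it.hasNext()) {
    System.out.println(it.next());
}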

For information on building ontologies, refer to the guides and specifications.
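
Putting it all together, a minimal program using only the calls shown above might look like this. The Exteca import statements are left as a comment because the package names depend on your distribution.

// A complete categorisation run assembled from the steps above.
// Exteca imports omitted: package names depend on your distribution.
import java.io.FileReader;

public class CategoriseDocument {
    public static void main(String[] args) throws Exception {
        CategorisationEngine engine = new CategorisationEngine();

        // Tokeniser
        FsmTokeniserRules tokeniserRules = new FsmTokeniserRules();
        tokeniserRules.read("english.tok");
        engine.setTokeniser(new FsmTokeniser(tokeniserRules));

        // Stemming morpher
        AfixStemmerRules stemmerRules = new AfixStemmerRules();
        stemmerRules.read("english.stm");
        engine.addMorpher(new AfixStemmer(stemmerRules, "stem"));

        // Sentence marker
        engine.addMarker(new SentenceMarker());

        // Ontology
        Ontology ontology = new Ontology(Ontology.XML_PERSISTENCE);
        ontology.read("myontology.xml");
        engine.load(ontology);

        // Process the document and print the resulting metadata
        MetaData md = engine.process(new FileReader("mydocument.xml"));
        System.out.println(md);
    }
}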

Default rules

For ontologies that contain few or no rules, the categorisation results will be poor or non-existent! To address this, you can tell the engine to build rules according to a predefined methodology. This is simply a case of adding a default rules file before the ontology is loaded:

DefaultRules defaultRules = new DefaultRules();
defaultRules.read("mydefaultrules.xml");
engine.generateDefaultRules(defaultRules);
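
Note the ordering: the default rules must be registered before the ontology is loaded, so the complete sequence is:

// Register the default rules first...
DefaultRules defaultRules = new DefaultRules();
defaultRules.read("mydefaultrules.xml");
engine.generateDefaultRules(defaultRules);

// ...then load the ontology as before
ontology = new Ontology(Ontology.XML_PERSISTENCE);
ontology.read("myontology.xml");
engine.load(ontology);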