The Exteca Categorisation module can be used to find metadata for a document based on an ontology. The ontology provides a comprehensive description of a domain of knowledge, expressed as a set of interlinked concepts. These concepts can be embellished with rules that guide the categorisation process.
The API is simple yet flexible. Here's an example of typical usage.
First, let's create the objects we will need:
    Ontology ontology;
    MetaData md;
Next we create the categorisation engine itself:
    // Create the categorisation engine
    CategorisationEngine engine = new CategorisationEngine();
Then we create a tokeniser. The tokeniser is used to split any text into a sequence of tokens, tokens being words, punctuation and so on. It is used in two places in the categorisation engine: first to interpret any rules within the ontology, and second to interpret the documents that will be processed. In this instance we use the supplied FsmTokeniser, but you can provide your own tokenisation code as long as it implements the Tokeniser interface.
    // Create a tokeniser
    Tokeniser tokeniser;
    FsmTokeniserRules tokeniserRules;
    tokeniserRules = new FsmTokeniserRules();
    tokeniserRules.read("english.tok");
    tokeniser = new FsmTokeniser(tokeniserRules);
    engine.setTokeniser(tokeniser);
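To make the idea concrete, here is a minimal sketch of what a tokeniser does. This is not the Exteca FsmTokeniser (which is driven by finite-state rules loaded from a .tok file); the class name and the regular expression are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: splits raw text into word and punctuation tokens.
public class SimpleTokeniser {
    // Words are runs of letters/digits; any other non-space character
    // is treated as a single punctuation token.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    public static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("Hello, world!"));
        // prints [Hello, ,, world, !]
    }
}
```

A real tokeniser for the engine must produce tokens in whatever form the Tokeniser interface specifies; consult the Exteca documentation for the exact contract.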
Next comes an optional step to introduce morphers to the engine. Morphers can transform words and are used to increase the likelihood of matching a word pattern. Rules can be written to match the transformed form of the word. The most commonly used morphers are stemmers and case converters. In this instance we load a stemmer into the engine.
    // Optionally create and add morphers
    Morpher stemmer;
    AfixStemmerRules rules;
    rules = new AfixStemmerRules();
    rules.read("english.stm");
    stemmer = new AfixStemmer(rules, "stem");
    engine.addMorpher(stemmer);
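The following toy sketch shows the idea behind morphers: normalising word forms so that a single rule can match several surface variants. It is not the Exteca AfixStemmer, which is driven by affix rules read from a .stm file; the class name and hard-coded suffix list here are assumptions for illustration.

```java
import java.util.Locale;

// Illustrative only: a case converter and a crude suffix stemmer.
public class ToyMorphers {
    public static String lowerCase(String word) {
        return word.toLowerCase(Locale.ENGLISH);
    }

    public static String stem(String word) {
        // Strip a few common English suffixes, longest first,
        // leaving at least a three-letter stem.
        String[] suffixes = {"ing", "ed", "es", "s"};
        for (String suffix : suffixes) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem(lowerCase("Matching"))); // prints match
        System.out.println(stem("words"));               // prints word
    }
}
```

A production stemmer needs far more care (exceptions, doubled consonants, and so on), which is exactly why the real AfixStemmer takes its behaviour from an external rules file.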
Markers can also optionally be added to the engine to annotate structure in the text; for example, the supplied SentenceMarker identifies sentence boundaries.

    // Optionally create and add markers
    SentenceMarker sentenceMarker;
    sentenceMarker = new SentenceMarker();
    engine.addMarker(sentenceMarker);
Next we can load our ontology. An efficient in-memory representation of the ontology is created.
    // Create and load an ontology
    ontology = new Ontology(Ontology.XML_PERSISTENCE);
    ontology.read("myontology.xml");
    engine.load(ontology);
Finally we can process our document. This must be supplied as XML. The return value from the process() method is the metadata, containing the categorisation information.
    // Process a document
    md = engine.process(new FileReader("mydocument.xml"));
The metadata can then be examined and used as required.
For information on building ontologies refer to the guides and specifications.
For ontologies that contain few or no rules, the categorisation results will be poor or non-existent. To address this, you can tell the engine to build rules according to a predefined methodology. This is simply a case of adding a default rules file before the ontology is loaded:
    // Generate default rules before loading the ontology
    DefaultRules defaultRules = new DefaultRules();
    defaultRules.read("mydefaultrules.xml");
    engine.generateDefaultRules(defaultRules);