Textual documents written by humans for humans are not, in their raw form, readily digestible by computer software. Before many processing technologies can be applied to textual documents, the documents must first be transformed into a form a computer can work with. Our language technology provides algorithms to achieve this. These algorithms are guided by sets of rules that make supporting new languages straightforward.
Stemming is the process of finding the root of a set of related words. For example, the root of farming, farmer and farmed is farm. Stemming improves recall in knowledge management systems because different forms of the same word are treated as the same term. In addition, stemming can reduce the size of knowledge models and indexes.
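To illustrate the idea, the following sketch shows a minimal rule-based suffix-stripping stemmer. The suffix rules are hypothetical and chosen only to handle the farming/farmer/farmed example; they are far simpler than the rule sets a production stemmer would use and are not the rules used by our technology.

```python
# Minimal suffix-stripping stemmer (illustrative only; hypothetical rules).
SUFFIX_RULES = ["ing", "ers", "er", "ed", "s"]

def stem(word: str) -> str:
    """Return the root of a word by stripping the first matching suffix."""
    for suffix in SUFFIX_RULES:
        # Only strip if a reasonable stem (at least 3 characters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

if __name__ == "__main__":
    for w in ["farming", "farmer", "farmed", "farm"]:
        print(w, "->", stem(w))   # all map to "farm"
```

Because every inflected form maps to the same root, a search or knowledge model only needs to store and match the single entry "farm" rather than one entry per word form.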
Tokenisation is the process of breaking down a stream of data into meaningful units. For textual documents this typically involves splitting the text into words.
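As a simple illustration, the sketch below splits text into word tokens with a regular expression. The pattern is an assumption chosen for clarity; real tokenisers also handle punctuation, hyphenation, abbreviations and language-specific conventions.

```python
import re

# Illustrative word tokeniser: extract runs of letters, digits and apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")

def tokenise(text: str) -> list[str]:
    """Break a stream of text into word tokens."""
    return TOKEN_PATTERN.findall(text)

if __name__ == "__main__":
    print(tokenise("The farmer farmed the farm, farming all day."))
    # ['The', 'farmer', 'farmed', 'the', 'farm', 'farming', 'all', 'day']
```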