Barnes, Andrew James (2013) A modular architecture for systematic text categorisation. Doctoral thesis, University of Huddersfield.

This work examines and attempts to overcome issues caused by the lack of formal standardisation when defining text categorisation techniques and detailing how they might be appropriately integrated with each other. Despite text categorisation’s long history the concept of automation is relatively new, coinciding with the evolution of computing technology and subsequent increase in quantity and availability of electronic textual data. Nevertheless insufficient descriptions of the diverse algorithms discovered have lead to an acknowledged ambiguity when trying to accurately replicate methods, which has made reliable comparative evaluations impossible.

Existing interpretations of general data mining and text categorisation methodologies are analysed in the first half of the thesis and common elements are extracted to create a distinct set of significant stages. Their possible interactions are logically determined and a unique universal architecture is generated that encapsulates all complexities and highlights the critical components. A variety of text related algorithms are also comprehensively surveyed and grouped according to which stage they belong in order to demonstrate how they can be mapped.

The second part reviews several open-source data mining applications, placing an emphasis on their ability to handle the proposed architecture, potential for expansion and text processing capabilities. Finding these inflexible and too elaborate to be readily adapted, designs for a novel framework are introduced that focus on rapid prototyping through lightweight customisations and reusable atomic components.

Being a consequence of inadequacies with existing options, a rudimentary implementation is realised along with a selection of text categorisation modules. Finally a series of experiments are conducted that validate the feasibility of the outlined methodology and importance of its composition, whilst also establishing the practicality of the framework for research purposes. The simplicity of experiments and results gathered clearly indicate the potential benefits that can be gained when a formalised approach is utilised.

abarnesfinalthesis[1].pdf - Submitted Version

Download (4MB) | Preview


Downloads per month over past year

Add to AnyAdd to TwitterAdd to FacebookAdd to LinkedinAdd to PinterestAdd to Email