Recent technological advances in modern healthcare have lead to a vast wealth of patient data being collected. This data is not only utilised for diagnosis but also has the potential to be used for medical research. However, there are often many errors in datasets used for medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical research databases.
Previous methods of automatically assessing data quality have often relied on threshold rules. These rules can sometimes miss errors requiring complex domain knowledge to correctly identify. To combat this, a semantic framework has been developed to assess the quality of medical data expressed in the form of linked open data. Early work in this direction revealed that existing triplestores are unable to cope with the large amounts of medical data.
In this thesis, a system for storing and querying medical RDF data using Hadoop is de-veloped. This approach enables the creation of an inherently parallel framework that will scale the workload across a cluster. Unlike existing solutions, this framework uses highly optimised joining strategies to enable the completion of eight separate SPARQL queries, comprising over eighty distinct joins, in only two Map/Reduce iterations. Results are pre-sented comparing both na¨ıve and optimised versions of the solution against Jena TDB, demonstrating the superior performance of the Hadoop system and its viability for assess-ing the quality of medical data.
Downloads
Downloads per month over past year