Computing and Library Services - delivering an inspiring information environment

Exact string matching algorithms for searching DNA and protein sequences and searching chemical databases

Klaib, Ahmad (2014) Exact string matching algorithms for searching DNA and protein sequences and searching chemical databases. Doctoral thesis, University of Huddersfield.

[img] PDF - Submitted Version
Restricted to Repository staff only until 20 April 2017.

Download (6MB)


The enormous quantities of biological and chemical files and databases are likely to grow year on year, consequently giving rise to the need to develop string-matching algorithms capable of minimizing the searching response time. Being aware of this need, this thesis aims to develop string matching algorithms to search biological sequences and chemical structures by studying exact string matching algorithms in detail. As a result, this research developed a new classification of string matching algorithms containing eight categories according to the pre-processing function of algorithms and proposed five new string matching algorithms; BRBMH, BRQS, Odd and Even algorithm (OE), Random String Matching algorithm (RSMA) and Skip Shift New algorithm (SSN).
The main purpose behind the proposed algorithms is to reduce the searching response time and the total number of comparisons. They are tested by comparing them with four well- known standard algorithms, Boyer Moore Horspool (BMH), Quick Search (QS), TVSBS and BRFS.
This research applied all of the algorithms to sample data files by implementing three types of tests. The number of comparison tests showed a substantial difference in the number of comparisons our algorithms use compared to the non-hybrid algorithms such as QS and BMH. In addition, the tests showed considerable difference between our algorithms and other hybrid algorithm such as TVSBS and BRFS. For instance, the average elapsed search time tests showed that our algorithms presented better average elapsed search time than the BRFS, TVSBS, QS and BMH algorithms, while the average number of tests showed better number of attempts compared to BMH, QS, TVSBS and BRFS algorithms. A new contribution has been added by this research by using the fastest proposed algorithm, the SSN algorithm, to develop a chemical structure searching toolkit to search chemical structures in our local database. The new algorithms were paralleled using OpenMP and MPI parallel models and tested at the University of Science Malaysia (USM) on a Stealth Cluster with different number of threads and processors to improve the speed of searching pattern in the given text which, as we believe, is another contribution.

Item Type: Thesis (Doctoral)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Depositing User: Elizabeth Boulton
Date Deposited: 20 Apr 2015 10:19
Last Modified: 19 Apr 2016 04:08


Downloads per month over past year

Repository Staff Only: item control page

View Item View Item

University of Huddersfield, Queensgate, Huddersfield, HD1 3DH Copyright and Disclaimer All rights reserved ©