Classification is a well-known data mining task that aims to construct a
classification model from a labelled input data set. Most classification models are
designed for a static environment in which the complete training data set is presented to the
classification algorithm. This data set is assumed to cover all the information needed to
learn the pertinent concepts (rules and patterns) for classifying unseen
examples into predefined classes. However, in dynamic (non-stationary) domains, the set
of features (input data attributes) may change over time. For instance, some features
that are considered significant at time Ti might become useless or irrelevant at time Ti+j.
This situation results in a phenomenon called Virtual Concept Drift. Yet, the set of
features that are dropped at time Ti+j might become significant again in the
future. Such a situation results in the so-called Cyclical Concept Drift, which is directly
linked to what is often called the catastrophic forgetting dilemma. Catastrophic forgetting
happens when learning new knowledge completely erases previously
learned knowledge.
Phishing detection is a dynamic classification problem in which virtual concept drift might occur.
Yet, the virtual concept drift that occurs in phishing might be guided by some
malevolent intelligent agent rather than occurring naturally. One reason why phishers
keep changing the combination of features when creating phishing websites might be that
they are able to interpret the anti-phishing tool and thus pick a new set of
features that can circumvent it. However, besides its generalisation capability, fault
tolerance, and strong ability to learn, a Neural Network (NN) classification model is
considered a black box. Hence, even someone with the skills to hack into an NN-based
classification model would find it difficult to interpret and understand how the NN
processes the input data in order to produce the final decision (assign a class value).
In this thesis, we investigate the problem of virtual concept drift by proposing a
framework that can keep pace with the continuous changes in the input features. The
proposed framework has been applied to the phishing website classification problem, where
it shows competitive results with respect to various evaluation measures (F1-score, i.e.
the harmonic mean of precision and recall, precision, accuracy, etc.) when compared with several other data mining
techniques. The framework creates an ensemble of classifiers (a group of classifiers) and it
offers a balance between stability (maintaining previously learned knowledge) and
plasticity (learning knowledge from the newly offered training data set). Hence, the
framework can also handle the cyclical concept drift. The classifiers that constitute the
ensemble are created using an improved Self-Structuring Neural Network (SSNN)
algorithm. Traditionally, NN modelling techniques rely on trial and error, which is a
tedious and time-consuming process. The SSNN simplifies structuring NN classifiers
with minimal intervention from the user. The framework evaluates the ensemble
whenever a new data set chunk is collected. If the overall accuracy of the combined
results from the ensemble drops significantly, a new classifier is created using the SSNN
and added to the ensemble. Overall, the experimental results show that the proposed
framework affords a balance between stability and plasticity and can effectively handle
virtual concept drift when applied to the phishing website classification problem. Most
of the chapters of this thesis have been the subject of publications.
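
As a rough illustration of the chunk-by-chunk procedure described above, the following Python sketch evaluates an ensemble on each new labelled chunk and adds a classifier when accuracy drops. It is not the thesis implementation. Everything named here is an assumption made for illustration: the `DriftEnsemble` class, the `max_drop` threshold of 0.10, the use of the previous chunk's accuracy as the baseline, and the weighted vote in which each member's weight is its accuracy above chance on the latest chunk. The base learner `train_stump` is a single-feature stand-in used only to keep the example self-contained; it is not the SSNN.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[int], int]       # (binary features, label: 1 = phishing)
Classifier = Callable[[Sequence[int]], int]


def train_stump(chunk: List[Example]) -> Classifier:
    """Stand-in base learner (NOT the SSNN): choose the single feature,
    optionally inverted, that agrees with the labels most often."""
    best = (-1, 0, False)                 # (correct count, feature index, inverted)
    for i in range(len(chunk[0][0])):
        hits = sum(1 for x, y in chunk if x[i] == y)
        for correct, inverted in ((hits, False), (len(chunk) - hits, True)):
            if correct > best[0]:
                best = (correct, i, inverted)
    _, idx, inverted = best
    return lambda x: (1 - x[idx]) if inverted else x[idx]


def member_accuracy(clf: Classifier, chunk: List[Example]) -> float:
    return sum(clf(x) == y for x, y in chunk) / len(chunk)


class DriftEnsemble:
    """Hypothetical ensemble that grows when accuracy drops on a new chunk."""

    def __init__(self, train: Callable[[List[Example]], Classifier],
                 max_drop: float = 0.10) -> None:
        self.train = train
        self.max_drop = max_drop          # assumed meaning of "drops significantly"
        self.members: List[Classifier] = []
        self.weights: List[float] = []
        self.baseline = 0.0               # ensemble accuracy on the previous chunk

    def predict(self, x: Sequence[int]) -> int:
        # Weighted vote: members vote +w for class 1 and -w for class 0.
        score = sum(w if clf(x) == 1 else -w
                    for clf, w in zip(self.members, self.weights))
        return 1 if score > 0 else 0

    def accuracy(self, chunk: List[Example]) -> float:
        return sum(self.predict(x) == y for x, y in chunk) / len(chunk)

    def _reweight(self, chunk: List[Example]) -> None:
        # Members no better than chance on this chunk are muted, not deleted,
        # so their knowledge remains available if the old concept returns.
        self.weights = [max(0.0, member_accuracy(clf, chunk) - 0.5)
                        for clf in self.members]

    def process_chunk(self, chunk: List[Example]) -> bool:
        """Evaluate on a new labelled chunk; return True if a member was added."""
        self._reweight(chunk)
        acc = self.accuracy(chunk) if self.members else 0.0
        added = False
        if not self.members or self.baseline - acc > self.max_drop:
            self.members.append(self.train(chunk))   # plasticity
            self._reweight(chunk)
            acc = self.accuracy(chunk)
            added = True
        self.baseline = acc
        return added
```

In a toy run with two binary features, where the label follows the first feature in one chunk, then the second feature in the next chunk, and then the first feature again, this sketch adds a member for each of the first two chunks and none for the third, because the retained first member is simply re-weighted. That is one simple way to picture the stability and plasticity balance; the thesis may combine and retain its classifiers differently.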
Available under License Creative Commons Attribution Non-commercial No Derivatives.