Accuracy test in identifying the splice site type of DNA sequences by using probabilistic neural networks and support vector machines

Probabilistic Neural Networks are known to be very fast in computation time and to give better accuracy than other types of neural networks on real-world application problems. In recent years, Support Vector Machines have become a popular model among machine learning methods, because they can be analyzed theoretically and achieve good performance at the same time. This paper describes the use of these two machine learning models to solve pattern recognition problems, with a preliminary case study in detecting the type of splice site in DNA sequences, focusing in particular on the accuracy level. The results obtained show that Support Vector Machines reach an accuracy level of about 95 %, compared with approximately 92 % for Probabilistic Neural Networks.
Keywords: Probabilistic Neural Networks | Support Vector Machines | Splice site type detection | Accuracy level


Introduction
Neural Networks (NNs) and Support Vector Machines (SVMs) are both machine learning techniques that learn a model or pattern from training data and use that model to predict or classify future data. Both models can be regarded as tools of the soft computing approach. With a soft computing technique, we try to understand the behavior of a system while giving up precision that is unnecessary for the practical problem. In addition, no suppositions are made about a preexisting analytical model. This differs from the hard computing approach, which always requires a precisely stated analytical model.
Among the several types of NNs, Probabilistic Neural Networks (PNNs), introduced by Specht [1], are known for their advantage in speed and accuracy compared with other types of NNs. SVMs are a more recent technology than NNs [4]. Around 1979, Vladimir Vapnik developed the theoretical foundation for SVMs based on a statistical learning approach [3,5]. Since then, SVMs have become widely used and studied as a machine learning model [6,7,8].
In this paper, these two machine learning models are applied to a pattern recognition problem, in particular splice site type detection in DNA sequences. The paper focuses on the accuracy level (or generalization capability), which is regarded as one measure of machine learning performance.

Learning To Classify
Let us start with a general notion of the learning problem considered in this paper. The task of classification is to find a rule which, based on external observations, assigns an object to one of several classes. In the simplest case there are only two classes. One possible formalization of this task is to estimate a function $f : \mathbb{R}^n \to \{1, -1\}$ using input-output data pairs $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \{1, -1\}$, generated independently and identically distributed according to an unknown probability distribution $P(x, y)$. In the training stage we estimate $f$ by minimizing the expected error (risk)
$$R[f] = \int L(f(x), y)\, dP(x, y),$$
where $L$ denotes a suitably chosen loss function.
The classification task above is carried out by a machine learning model; in this paper we use PNNs and SVMs.
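Since $P(x, y)$ is unknown, the expected risk cannot be evaluated directly; in practice it is approximated by the average loss over the available samples. The following Python sketch illustrates this with a 0-1 loss. The classifier and the sample values are hypothetical and serve only as an illustration.

```python
def zero_one_loss(prediction, target):
    """Return 1 when the prediction misses the target, 0 otherwise."""
    return 0 if prediction == target else 1


def empirical_risk(f, samples, loss=zero_one_loss):
    """Average loss of classifier f over a list of (x, y) pairs."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)


def toy_classifier(x):
    """Hypothetical rule: the class is the sign of the first coordinate."""
    return 1 if x[0] >= 0 else -1


# Hypothetical samples with labels in {1, -1}; the third one is misclassified.
toy_samples = [
    ((0.5, 1.0), 1),
    ((-1.2, 0.3), -1),
    ((0.1, -0.4), -1),
    ((2.0, 0.0), 1),
]
toy_risk = empirical_risk(toy_classifier, toy_samples)
```

Here one of the four samples is misclassified, so the empirical risk is 0.25.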

Probabilistic Neural Networks
PNNs are one type of Neural Network model used in supervised learning. This model is widely known in pattern recognition problems. The initial model was introduced by Specht [1], applying the Parzen nonparametric estimate of the probability density function and the Bayes classification rule. The architecture consists of an input layer, a pattern layer, a summation layer, and an output layer (Figure 1).
Fig. 1 PNNs architecture, with input layer, pattern layer, summation layer, and output layer.
PNNs operate in two stages, a learning stage and an estimation stage [2]. In the learning stage, the input neurons (in the input layer) distribute the input vector according to their weights and send it to the pattern neurons (in the pattern layer). The pattern neurons then transmit their outputs to the summation neurons (in the summation layer), where they are summed. In the estimation stage, these summation results are passed to the decision step, which verifies whether or not the input vector is a member of a given class.
Suppose that the PNN consists of two input neurons (Figure 1), four pattern neurons, and two summation neurons. The input vector $x$ is sent through the input neurons to the pattern neurons $Z_{A1}$, $Z_{A2}$, $Z_{B1}$, and $Z_{B2}$. The input to pattern neuron $j$ is denoted by $z\_in_j = x^T w_j$, $j = 1, \ldots, 4$, where $w_j$ is the weight vector obtained at the learning stage. The output of the pattern neuron, $z\_out_j$, is obtained by applying the activation function $f$ as follows:
$$z\_out_j = f(z\_in_j) = \exp\left(\frac{z\_in_j - 1}{\sigma^2}\right). \tag{1}$$
Each $z\_out_j$ is the input to the corresponding summation neuron. Hence the input to summation neuron $S_A$ is $s\_in_A = z\_out_{A1} + z\_out_{A2}$, and the input to summation neuron $S_B$ is $s\_in_B = z\_out_{B1} + z\_out_{B2}$. Here the connection weight between the pattern neurons and the summation neurons is 1, the weight between summation neuron $S_A$ and the output neuron is $v_A$, and the weight between summation neuron $S_B$ and the output neuron is $v_B$. In this case the output of summation neuron $S_A$, denoted $S_{A\_out}$, is the approximated probability density function (obtained in the previous stage) for $x \in$ class A, and similarly $S_{B\_out}$ for $x \in$ class B.
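The forward pass described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: it assumes the exponential activation exp((z_in - 1)/σ²) commonly used with unit-length input and weight vectors, uses one stored training vector as the weight of each pattern neuron, and takes v_A = v_B = 1 by default. All function names and toy vectors are hypothetical.

```python
import math


def pattern_output(x, w, sigma):
    """Pattern neuron: z_in = x^T w, then exp((z_in - 1) / sigma^2).

    Assumes x and w are unit-length vectors (an assumption of this sketch).
    """
    z_in = sum(xi * wi for xi, wi in zip(x, w))
    return math.exp((z_in - 1.0) / sigma ** 2)


def pnn_classify(x, patterns_a, patterns_b, sigma, v_a=1.0, v_b=1.0):
    """Sum the pattern outputs per class and compare the weighted sums."""
    s_a = sum(pattern_output(x, w, sigma) for w in patterns_a)
    s_b = sum(pattern_output(x, w, sigma) for w in patterns_b)
    return "A" if v_a * s_a > v_b * s_b else "B"


# Hypothetical unit-length pattern weights: two pattern neurons per class.
patterns_a = [(1.0, 0.0), (0.8, 0.6)]
patterns_b = [(0.0, 1.0), (-0.6, 0.8)]

label_near_a = pnn_classify((1.0, 0.0), patterns_a, patterns_b, sigma=0.5)
label_near_b = pnn_classify((0.0, 1.0), patterns_a, patterns_b, sigma=0.5)
```

A smaller σ makes each pattern neuron respond only to inputs very close to its stored vector, while a larger σ smooths the estimated densities.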

Support Vector Machine
As a machine learning model, the SVM is a recently developed technique designed for efficient multidimensional function approximation. The basic idea is to determine a classifier that minimizes the empirical risk (that is, the training set error) in a way that also keeps the generalization (test set) error low [5,6,7,8]. The key to understanding SVMs is to see how they introduce an optimal hyperplane to separate the classes of data.
Given a set of input vectors $X = \{x_1, x_2, \ldots, x_m\}$, each $x_i \in \mathbb{R}^n$, and a set of targets $Y = \{y_1, y_2, \ldots, y_m\}$, each $y_i \in \{1, -1\}$, $i = 1, 2, \ldots, m$, let us consider the pairs $(x_i, y_i) \in X \times Y$ as training data that are linearly separable.
The main idea of SVMs is to determine a hyperplane $\langle w, x\rangle + b = 0$ ($w \in \mathbb{R}^n$, $b \in \mathbb{R}$) that separates $X$ into two classes, associated with the values 1 and -1, with maximum margin. Here the margin is the distance between the middle hyperplane and each of the two parallel hyperplanes $\langle w, x\rangle + b = +1$ and $\langle w, x\rangle + b = -1$ (Figure 2). The middle hyperplane is used as the decision function for two-class classification,
$$f(x) = \langle w, x\rangle + b,$$
with the class given by $\mathrm{sign}(f(x))$. The width of the margin $\gamma$ is the distance from $x^+$ (or $x^-$), a point on one of the two parallel hyperplanes, to the middle hyperplane,
$$\gamma = \frac{1}{\|w\|}.$$
This implies that maximizing the margin is equivalent to minimizing $\|w\|$ subject to $\langle w, x_i\rangle + b \ge +1$ if $y_i = 1$ and $\langle w, x_i\rangle + b \le -1$ if $y_i = -1$, which can be combined into the single constraint $y_i(\langle w, x_i\rangle + b) \ge 1$. Hence the SVM learning problem becomes the following quadratic programming (QP) problem:
$$\text{Minimize } \tfrac{1}{2}\langle w, w\rangle \quad \text{subject to } y_i(\langle w, x_i\rangle + b) \ge 1,\ i = 1, \ldots, m. \tag{2}$$
The solution of this QP problem (i.e. $w$, $b$) gives the optimal hyperplane, with maximum margin $\gamma = 1/\|w\|$. We transform (2) into its dual form by introducing Lagrange multipliers $\alpha_i \ge 0$ (Karush-Kuhn-Tucker conditions), which leads to the Lagrange function (3). The optimal value $b^*$ can be determined from the constraints of QP problem (2). Before continuing, we consider the inequality in (2). The equality $y_i(\langle w, x_i\rangle + b) = 1$ holds only for some of the training vectors. Those training vectors are called support vectors.
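To make the margin and the notion of support vectors concrete, the following Python sketch evaluates a given hyperplane on a small data set. It does not solve the QP problem; the hyperplane (w, b) and the four training points are hypothetical values chosen so that the constraints hold, and the helper names are illustrative only.

```python
import math


def decision_value(w, b, x):
    """Compute <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Margin gamma = 1 / ||w||."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_constraints(w, b, data):
    """Check y_i (<w, x_i> + b) >= 1 for every training pair."""
    return all(y * decision_value(w, b, x) >= 1.0 for x, y in data)


def support_vectors(w, b, data, tol=1e-9):
    """Training pairs for which the constraint holds with equality."""
    return [(x, y) for x, y in data
            if abs(y * decision_value(w, b, x) - 1.0) <= tol]


# Hypothetical separable data and a hyperplane that separates it.
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 2.0), -1)]
w, b = (1.0, 0.0), -1.0

feasible = satisfies_constraints(w, b, data)
gamma = margin_width(w)
sv = support_vectors(w, b, data)
```

In this toy example the two closest points of opposite classes, (2, 0) and (0, 0), lie exactly on the two parallel hyperplanes, so they are the support vectors and the margin is 1.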

Nonlinear Support Vector Machines
The learning technique described above works only for linearly separable training data. In some problems the training data are not linearly separable. A classifier for such data is obtained by introducing a function that maps the data space to another space, typically of higher dimension, which we call the feature space. Under this mapping the nonlinear training data become linearly separable (Figure 3). We then seek an optimal hyperplane in the feature space as described above.
Using the weight vector $w$ obtained, the decision function $f(x)$ can be calculated from inner products of the training data and the testing data,
$$f(x) = \sum_{i} \alpha_i y_i \langle x_i, x\rangle + b.$$
If we denote by $\varphi$ the function that maps the data to the feature space, the decision function above becomes
$$f(x) = \sum_{i} \alpha_i y_i \langle \varphi(x_i), \varphi(x)\rangle + b. \tag{6}$$
To compute the inner product $\langle \varphi(x_i), \varphi(x)\rangle$ directly, we would have to know $\varphi$ explicitly. However, there is a way to compute this inner product without using $\varphi$ in the feature space. This effective trick uses a kernel function, $K(x_i, x) = \langle \varphi(x_i), \varphi(x)\rangle$ [8].
Hence the decision function (6) can be written as
$$f(x) = \sum_{i} \alpha_i y_i K(x_i, x) + b.$$
Many kernel functions have been developed. In this paper we use a polynomial function of degree $d$ as the kernel (7).
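The kernel trick can be checked numerically. The Python sketch below assumes the inhomogeneous polynomial kernel K(x, z) = (⟨x, z⟩ + 1)^d, which is one common choice and not necessarily the exact form used in this paper. For d = 2 and two-dimensional inputs it compares the kernel value with the inner product of an explicit feature map φ. The names and numeric values are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, z, d=2):
    """Assumed kernel form: (<x, z> + 1) ** d."""
    return (dot(x, z) + 1.0) ** d


def phi_degree2(x):
    """Explicit feature map matching the assumed kernel for d = 2, 2-D input."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return (1.0, r * x1, r * x2, x1 * x1, x2 * x2, r * x1 * x2)


def kernel_decision(x, support, b, d=2):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over (alpha_i, y_i, x_i)."""
    return sum(alpha * y * polynomial_kernel(xi, x, d)
               for alpha, y, xi in support) + b


x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = polynomial_kernel(x, z)                  # 2-D data space
explicit_value = dot(phi_degree2(x), phi_degree2(z))    # 6-D feature space

# Hypothetical multipliers and support vectors, for illustration only.
support = [(0.5, 1, (1.0, 0.0)), (0.5, -1, (-1.0, 0.0))]
f_value = kernel_decision((1.0, 0.0), support, b=0.0)
```

Both computations give the same value, but the kernel needs only one inner product in the original two-dimensional space, whereas the explicit map works in six dimensions.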

SVMs with Soft Margin
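The main problem with the optimal hyperplane described above is the assumption that there are no errors on the training data, whereas in practice errors usually occur. This problem can be overcome by using a soft margin technique, in which we introduce slack variables $\xi_i$ into (2):
$$\text{Minimize } \tfrac{1}{2}\langle w, w\rangle + C \sum_{i=1}^{m} \xi_i \quad \text{subject to } y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, m. \tag{8}$$
With this technique, learning becomes the problem of determining an optimal hyperplane that simultaneously has minimum error on the learning data. We call this technique structural risk minimization. Following the same procedure as above, the dual form of (8) is
$$\text{Maximize } \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to } 0 \le \alpha_i \le C,\ \sum_{i} \alpha_i y_i = 0. \tag{9}$$
Suppose the solution obtained is $\alpha^*$; the decision function is then
$$f(x) = \sum_{i} \alpha_i^* y_i K(x_i, x) + b^*,$$
where $b^*$ is selected such that $y_i f(x_i) = 1$ for all $i$ with $C > \alpha_i^* > 0$. Hence the decision function $\mathrm{sign}(f(x))$ is a hyperplane in the feature space, defined implicitly through $K(x_i, x)$, with margin
$$\gamma = \Big( \sum_{i} \sum_{j} \alpha_i^* \alpha_j^* y_i y_j K(x_i, x_j) \Big)^{-1/2}.$$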
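The role of the slack variables can be illustrated numerically. The Python sketch below assumes the standard soft margin formulation, in which ξ_i = max(0, 1 - y_i(⟨w, x_i⟩ + b)) and the objective is ½⟨w, w⟩ + C Σ ξ_i. It only evaluates a given hyperplane and does not solve the optimization problem; the hyperplane, data, and value of C are hypothetical.

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y (<w, x> + b)); zero when the margin is respected."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * f)


def soft_margin_objective(w, b, data, C):
    """Assumed objective: 0.5 * <w, w> + C * sum of slacks."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in data)
    return regularizer + C * total_slack


# Hypothetical hyperplane and data: the third point is misclassified and
# the fourth lies inside the margin on the correct side.
w, b = (1.0, 0.0), -1.0
data = [
    ((2.0, 0.0), 1),
    ((0.0, 0.0), -1),
    ((0.5, 0.0), 1),
    ((1.5, 0.0), 1),
]
slacks = [slack(w, b, x, y) for x, y in data]
objective = soft_margin_objective(w, b, data, C=10.0)
```

A slack between 0 and 1 marks a point inside the margin but on the correct side, and a slack above 1 marks a misclassified point. A larger C penalizes such violations more heavily.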

Computer Simulation
A computer simulation was carried out using Matlab 6.5.1 running on a Pentium IV PC. In this simulation, the PNN and SVM models are used to determine whether or not a DNA (deoxyribonucleic acid) sequence is of donor type. This is part of the process of splice site type detection in DNA sequences, which are composed of the nucleotides Guanine (G), Adenine (A), Thymine (T), and Cytosine (C). Splice site detection will be used later in protein folding (intron removal and exon joining).

Data Used For Experiment
The data are DNA sequences taken from the Splice-junction Gene Sequences Data Base, which is part of the Molecular Biology Data Bases available in [10]. Each sequence is 60 base pairs long, and each nucleotide is represented by four bits, A: 1000, T: 0100, G: 0010 and C: 0001. Hence the input data dimension is 60 x 4 = 240 bits. The output (placed at the beginning of the DNA sequence) consists of 1 bit, with the value 1 if the splice site is of donor type and 0 otherwise.
In the experiments, the data above are divided into three groups, called learning (or training) data L, validation data V, and testing data T. These three groups are used in sequence. The members of each group are selected randomly, using a cross validation technique, from the total data available (we used 3175 sequences). This selection technique is similar to the technique used by Bolat et al. [8].
In the application of the PNN model, L is used to obtain the weights of the PNN and V is used to determine the best parameter σ in (1). In the application of SVMs (here we used linear SVMs and nonlinear SVMs), L and V are used in a similar way to obtain the values of the parameter C in (9) and the parameter d in the polynomial function (7).
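The four-bit encoding described above can be sketched in Python as follows. The mapping A: 1000, T: 0100, G: 0010, C: 0001 is taken from the text; the function name is hypothetical, and symbols other than these four letters are not handled in this sketch (they raise a KeyError).

```python
# Four-bit code per nucleotide, as given in the text.
NUCLEOTIDE_BITS = {
    "A": (1, 0, 0, 0),
    "T": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "C": (0, 0, 0, 1),
}


def encode_sequence(sequence):
    """Concatenate the four-bit codes of all nucleotides in the sequence."""
    bits = []
    for base in sequence.upper():
        bits.extend(NUCLEOTIDE_BITS[base])
    return bits


short_code = encode_sequence("ATGC")
full_length = len(encode_sequence("ACGT" * 15))  # a 60-base sequence
```

A 60-base sequence yields 240 bits, and every encoded vector contains exactly 60 ones, so all input vectors have the same length (norm).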
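A simple way to form the three groups is to shuffle the records and cut the shuffled list into consecutive parts. The Python sketch below shows such a random split; it is a simplified stand-in for the selection procedure used in the experiments, whose details are not given here. The function name, seed, and the testing size of 400 are assumptions for illustration.

```python
import random


def split_data(records, n_learn, n_valid, n_test, seed=0):
    """Randomly draw disjoint learning, validation, and testing groups."""
    if n_learn + n_valid + n_test > len(records):
        raise ValueError("requested group sizes exceed the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    learning = shuffled[:n_learn]
    validation = shuffled[n_learn:n_learn + n_valid]
    testing = shuffled[n_learn + n_valid:n_learn + n_valid + n_test]
    return learning, validation, testing


# Record indices stand in for the 3175 encoded sequences.
records = list(range(3175))
learning, validation, testing = split_data(records, 600, 250, 400, seed=1)
```

Because the groups are cut from one shuffled list, no record appears in more than one group.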
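Choosing a parameter with the validation data can be sketched as a small search over candidate values: for each candidate, a classifier is built from L and scored on V, and the best-scoring value is kept. The Python sketch below is generic; the candidate values and the toy threshold classifier are hypothetical and stand in for σ, C, or d and for the PNN or SVM training routine.

```python
def select_parameter(candidates, build_classifier, learning_data, validation_data):
    """Return the candidate with the highest validation accuracy, and that accuracy."""
    best_value, best_accuracy = None, -1.0
    for value in candidates:
        classifier = build_classifier(learning_data, value)
        correct = sum(1 for x, y in validation_data if classifier(x) == y)
        accuracy = correct / len(validation_data)
        if accuracy > best_accuracy:
            best_value, best_accuracy = value, accuracy
    return best_value, best_accuracy


def build_threshold_classifier(learning_data, threshold):
    """Hypothetical stand-in for training: classify by comparing x to a threshold."""
    return lambda x: 1 if x > threshold else 0


# Hypothetical one-dimensional validation data with targets in {0, 1}.
validation_data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]
best_value, best_accuracy = select_parameter(
    [0.1, 0.5, 0.9], build_threshold_classifier, [], validation_data
)
```

Keeping the testing data T out of this search is what allows the final accuracy on T to serve as an estimate of generalization capability.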

Test of Accuracy Level
In this paper we define the accuracy level (sometimes called generalization capability) as follows: for the set of testing data $T = \{(x_1, y_1), \ldots, (x_k, y_k)\}$, we verify whether or not the output $\hat{y}_i = f(x_i)$ is equal to the target $y_i$. The accuracy level is
$$\frac{|\{\, x_i : \hat{y}_i = y_i \,\}|}{|T|} \times 100\ \%.$$
Using testing data sets T of various sizes prepared beforehand, we determine the accuracy level of PNNs and SVMs. Here the 2nd through 241st bits of each DNA sequence are processed by the PNNs and SVMs, and the result is checked against the 1st bit of the sequence. The verification result gives the accuracy level of the PNNs and SVMs.
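The accuracy computation described above can be written as a short Python sketch. Each record is assumed to be a list of bits whose first element is the target and whose remaining elements are the input; the toy records are much shorter than the 241-bit records of the experiment, and the classifier is a hypothetical placeholder.

```python
def split_record(record):
    """Separate a record into (input bits, target bit); the target comes first."""
    return record[1:], record[0]


def accuracy_level(classifier, records):
    """Percentage of records whose predicted bit equals the target bit."""
    correct = 0
    for record in records:
        bits, target = split_record(record)
        if classifier(bits) == target:
            correct += 1
    return 100.0 * correct / len(records)


# Hypothetical records (target first) and a placeholder classifier that
# simply returns the first input bit.
records = [[1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
accuracy = accuracy_level(lambda bits: bits[0], records)
```

Three of the four toy records are classified correctly, which gives an accuracy level of 75 %.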

Results Analysis and Discussion
The implementation results are given in Table 1. Figure 5 shows the accuracy level of PNNs, linear SVMs, and nonlinear SVMs for learning data size |L| = 600 and validation data size |V| = 250, with various testing data sizes |T|.
From Table 1 we can see, roughly, that increasing the testing data size decreases the accuracy level, while increasing the learning data size increases the accuracy level. Intuitively, this phenomenon agrees with the meaning of learning. Table 1 also shows that SVMs have a better accuracy level than PNNs. However, there is no significant difference in accuracy level between linear SVMs and nonlinear SVMs, which suggests that the data are approximately linearly separable. Finally, the accuracy level decreases slightly when the testing data size becomes large (|T| > 400); this is perhaps caused by limitations of the computer.
Overall, these results show that SVMs have an accuracy level of about 95.7 % and PNNs of about 92.8 %.

Conclusion
Based on the results discussed above, for the problem of splice site detection in DNA sequences, SVMs have a better accuracy level than PNNs. There is no significant difference in accuracy level between nonlinear SVMs and linear SVMs.
Here $h_A$ is the a priori probability of input data in class A. In the same manner, $h_B$ is the a priori probability of input data in class B, while $\lambda_{AB}$ is a cost function representing the cost of misclassifying input data as belonging to class A instead of class B, and similarly for $\lambda_{BA}$. For classification into two classes we take $\lambda_{AB} = \lambda_{BA} = 1$. Finally, the input to the output neuron $Y$, denoted $y\_in$, is $\{v_A \cdot S_{A\_out},\ v_B \cdot S_{B\_out}\}$. A decision criterion determines whether the input vector is in class A or in class B: $x \in$ class A if $v_A \cdot S_{A\_out} > v_B \cdot S_{B\_out}$; otherwise $x \in$ class B. This criterion is called the Bayes criterion.

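$$L(w, b, \alpha) = \tfrac{1}{2}\langle w, w\rangle - \sum_{i=1}^{m} \alpha_i \left[ y_i(\langle w, x_i\rangle + b) - 1 \right] \tag{3}$$
is the Lagrange function of problem (2) with objective $\tfrac{1}{2}\langle w, w\rangle$. Setting the derivatives of $L$ with respect to $w$ and $b$ to zero gives
$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{4}$$
Substituting (4) into the Lagrange function (3) gives the dual problem
$$\text{Maximize } \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \quad \text{subject to } \alpha_i \ge 0,\ \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{5}$$
The solution $\alpha^*$ of (5) gives the optimal weight vector $w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i$, which defines the hyperplane with maximum margin $\gamma = 1/\|w^*\|$.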

Fig. 3 Mapping to feature space by function φ.