#### Comments

#### Description

**Download Advantages of Unbiased Support Vector Classifiers for Data Mining Applications**

#### Transcript

- Journal of VLSI Signal Processing 37, 223–235, 2004 c© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands. Advantages of Unbiased Support Vector Classifiers for Data Mining Applications A. NAVIA-V ´AZQUEZ, F. P ´EREZ-CRUZ, A. ART ´ES-RODR´IGUEZ AND A.R. FIGUEIRAS-VIDAL DTSC, Univ. Carlos III de Madrid, Avda Universidad 30, 28911-Legane´s, Madrid, Spain Abstract. Many learning algorithms have been used for data mining applications, including Support Vector Clas- sifiers (SVC), which have shown improved capabilities with respect to other approaches, since they provide a natural mechanism for implementing Structural Risk Minimization (SRM), obtaining machines with good generalization properties. SVC leads to the optimal hyperplane (maximal margin) criterion for separable datasets but, in the non- separable case, the SVC minimizes the L1 norm of the training errors plus a regularizing term, to control the machine complexity. The L1 norm is chosen because it allows to solve the minimization with a Quadratic Programming (QP) scheme, as in the separable case. But the L1 norm is not truly an “error counting” term as the Empirical Risk Minimization (ERM) inductive principle indicates, leading therefore to a biased solution. This effect is specially severe in low complexity machines, such as linear classifiers or machines with few nodes (neurons, kernels, ba- sis functions). Since one of the main goals in data mining is that of explanation, these reduced architectures are of great interest because they represent the origins of other techniques such as input selection or rule extraction. Training SVMs as accurately as possible in these situations (i.e., without this bias) is, therefore, an interesting goal. We propose here an unbiased implementation of SVC by introducing a more appropriate “error counting” term. This way, the number of classification errors is truly minimized, while the maximal margin solution is obtained in the separable case. QP can no longer be used for solving the new minimization problem, and we ap- ply instead an iterated Weighted Least Squares (WLS) procedure. This modification in the cost function of the Support Vector Machine to solve ERM was not possible up to date given the Quadratic or Linear Programming techniques commonly used, but it is now possible using the iterated WLS formulation. Computer experiments show that the proposed method is superior to the classical approach in the sense that it truly solves the ERM problem. Keywords: support vector, classifier, empirical risk minimization, unbiased, data mining, error count, weighted least squares 1. Introduction The Empirical Risk Minimization (ERM) principle states that the objective of any classifier is to attain the solution which provides a minimum number of training errors [1]. Functional terms that count er- rors are known, such as the classical Perceptron rule [2], but convergence is not guaranteed in the gen- eral case. A good initial discussion of the problem can be found in [3], where it is shown that a very steep sigmoid approaches the step function and, thus, serves as a “counting errors” loss function. A com- prehensive discussion of the topic, and its relationship with maximal margin solutions can be found in [4– 6]. In the latter, it is shown that, when the soft deci- sion is based on a sigmoidal saturation function, the minimization of any L p error norm approaches ERM as long as the slope is high enough, in which case, also a maximal margin solution is obtained. This ap- proach has been followed by many authors, which ap- ply these principles to several learning architectures: see for instance [3, 7, 8] or [9]. In this work we will
- 224 Navia-Va´zquez et al. extend these principles to the Support Vector Machines case. Instead of following the ERM principle, the Support Vector Classifier (SVC), as proposed in [1], minimizes a combination between the L2 norm of the weights and the L1 norm of margin slack variables. Although it has been recently demonstrated that this strategy min- imizes some generalization bounds, in Cristianini and Shawe-Taylor [10] it is recognized that “optimizing the norm of the margin slack vector does not necessarily mean minimizing the number of misclassifications”. To achieve an unbiased SVC (USVC) solutions truly solv- ing ERM it is necessary to use a cost function actually counting errors, while keeping the maximal margin cri- terion. Quadratic Programming (QP) can no longer be used to solve this new minimization problem, and we propose an alternative minimization mechanism based on a WLS formulation, eventually obtaining an Unbi- ased SVC (USVC). The paper is organized as follows: in the next sec- tion we derive the unbiased formulation for nonlinear machines (the linear case being directly obtained as a particular case), for both parametric and nonparametric machines. We then analyze and discuss about the con- vergence properties of the proposed algorithms. Finally we analyze the performance of the proposed algorithms on several biomedical datasets using limited complex- ity (semiparametric) machines and, finally, some con- clusions are presented. 2. Derivation of the Unbiased Algorithms Let us consider a set of training data pairs (xi , yi ), i = 1, . . . , �, corresponding to a binary classification prob- lem (xi ∈ Rn and yi = {+1, −1}), and a general clas- sifier used to estimate targets ŷi = sign(φT (xi )w + b), φ(·) being a mapping to a higher dimensional space F . SVCs follow the optimal hyperplane (OH) crite- rion, which maximizes the distance between the train- ing samples and the classification boundary (margin), thereby maximizing the generalization capability of the machine (linearly separable datasets). To extend the OH criterion to the non-separable case, Vapnik pro- poses to find the hyperplane that makes the smallest number of errors [1], by adding a term that takes errors into account, i.e., by minimizing L = 1 2 ‖w‖2 + C �∑ i=1 θ (ξi ) (1) with respect to w, b, ξi subject to yi (φT (xi )w + b) − 1 + ξi ≥ 0, ∀ i = 1, . . . , � (2) ξi ≥ 0, ∀ i = 1, . . . , � (3) where θ (·) represents a shifted step function (Heavi- side). This optimization problem is NP-complete, as noted in [1] and, to find a solution by means of a practical optimization algorithm, Vapnik proposes to minimize L = 1 2 ‖w‖2 + C �∑ i=1 ξi (4) leading in this case to the well-known QP formula- tion of the SVC problem [1] (QP-SVC from now on). This represents a rather coarse approximation to term θ (ξi ) in (1). To solve a SVC which combines maximal margin and minimum number of errors, we propose to minimize a modified functional ˜L = 1 2 ‖w‖2 + C �∑ i=1 ˜θs(ξi ) (5) where we use the following cost function ˜θs(ξi ) = {(( ξi + k1/s� )s)/(2(1 + k1/s� )s), 0 ≤ ξi < 1 1 − ((1 + k1/s� )s)/(2(ξi + k1/s� )s), ξi ≥ 1 (6) which is a smooth approximation to the step function, and with k� being a very small constant, necessary to ensure that the maximum margin solution is obtained when the training set is separable. We have represented in Fig. 1 the “error count”, QP-SVC and USVC ( ˜θs(ξi )) costs, the latter for different values of s. It can be seen how ˜θs(ξi ) asymptotically repre- sents an appropriate approximation to the step function shifted to ξi = 1, leading to a true error counting cost function as s tends to infinity (in fact, it will not be necessary that s takes a very large value to get good approximations to ERM). To minimize (5) we will follow parallel steps as those in the SVM formulation: we first introduce the linear restrictions (2) and (3), by means of Lagrange multipliers ˜L = 1 2 ‖w‖2 + C ∑ i ˜θs(ξi ) − ∑ i αi (yi (φT (xi )w + b) − 1 + ξi ) − ∑ i ηiξi (7)
- Advantages of Unbiased Support Vector Classifiers for Data Mining Applications 225 Figure 1. Comparison of QP-SVC, “error count” and USVC costs. and we obtain the Karush-Khun-Tucker (KKT) condi- tions that define the solution, which are, besides (2) and (3), the following: ∂ ˜L ∂ξi = C ∂ ˜θs(ξi ) ∂ξi − (αi + ηi ) = 0 ∀i = 1, . . . , � (8) ∂ ˜L ∂b = �∑ i=1 αi yi = 0 (9) ∂ ˜L ∂w = w − �∑ i=1 αi yiφ(xi ) = 0 (10) αi , ηi ≥ 0 (11) αi {yi (φT (xi )w + b) − 1 + ξi } = 0, ∀ i = 1, . . . , � (12) ηiξi = 0, ∀ i = 1, . . . , � (13) These conditions are identical to the ones in the Vapnik’s SVC, except the one in (8), which reflects the new cost function. To optimize (7) we first reorga- nize the summatories by grouping terms depending on ξi ˜L = 1 2 ‖w‖2 + ∑ i αi (1 − yi (φT (xi )w + b)) + ∑ i (C ˜θs(ξi ) − ξi (αi + ηi )) (14) We use the error definition ei = yi − (φT (xi )w + b) and, recalling that y2i = 1, we compute weights ai as ai = 2αi1 − yi (φT (xi )w + b) (15) to obtain ˜L = 1 2 ‖w‖2 + 1 2 ∑ i ai e 2 i + ∑ i (C ˜θs(ξi ) − ξi (αi + ηi )) (16) Functional (16) can be iteratively solved by first min- imizing it with respect to w, b and ξi and computing afterwards ai from the previously obtained solution. To minimize ˜L with respect to w, b and ξi we equal its partial derivatives to zero, obtaining1 ∂ ˜L ∂w = w − ΦDa[y − ΦT w − 1b] = 0 (17) ∂ ˜L ∂b = −aT [y − ΦT w − 1b] = 0 (18) ∂ ˜L ∂ξi = C ∂ ˜θs(ξi ) ∂ξi − (αi + ηi ) = 0 ∀ i = 1, . . . , � (19) where Φ = [φ(x1),φ(x2), . . .φ(x�)] a = [a1, a2 . . . , a�]T w = [w1, w2 . . . , wn]T (Da)i j = ai jδ[i − j] ∀ i, j = 1, . . . , � Equation (19) is the KKT condition in (8) and, there- fore, it will vanish once we attain the minimum. Equa- tions (17) and (18) define the solution for w and b and can be joined together to obtain the following system of equations, which has to be solved to obtain the USVC solution[ (ΦDaΦT + I) Φa aT ΦT aT 1 ][ w b ] = [ ΦDay aT y ] (20) Note that (20) is actually an implicit equation in both w, b and a, and can not be solved simultaneously for these parameters. However, if we fix ai values, we can obtain an explicit solution for w and b. Then, we can update ai values and solve (20) again. From KKT
- 226 Navia-Va´zquez et al. conditions in (8), (11)–(13) we find that the new αi can be calculated as αi = 0, ei yi < 0 C d ˜θs(ξi ) dξi ∣∣∣∣ ξi , ei yi ≥ 0 (21) and, from them, the values of ai can be computed as in (15), taking into account that ξi = 0 when ei yi < 0 and ξi = ei yi when ei yi ≥ 0. We can explicitly write ai as a function of alphai and ei yi ai = { 0, ei yi < 0 2αi/ei yi , ei yi ≥ 0 (22) Once ai has been calculated for every sample, the system in (20) can be solved again. This process must continue until there are no changes in neither ai nor w and b, i.e., the fixed point of the iterative procedure is found. This fixed point coincides with the desired USVC solution, since it simultaneously satisfies all the KKT conditions which are known to determine the so- lution of the minimization problem. The minimization problem to be solved by the unbiased algorithms is not convex and they may get trapped in local minima. In following sections we run initially the algorithm with QP-SVC and then we change the cost function and let the USVC algorithm progress until a better minimum is found: experience tells us that this solution is, in terms of Classification Error (CE), better or equal to the one found with SVC, but never (statistically significant) worst. In the following section we will illustrate this behaviour using a simple problem, and discuss about convergence of the methods. 2.1. Parametric Kernelization of the Unbiased Algorithm The above algorithm is apparently limited to those ap- plications where the nonlinear mapping φ(·) is known. If we only know its Reproducing Kernel Hilbert Space (RKHS), K (·, ·) = φT (·)φ(·), the system in (20) can not be solved in its present form, but it can be modified to be solved using RKHS, as was done in [24] with the SVC. In this paper we will propose a (semipara- metric) model as the unbiased solution (named USVC from now on). This will allow us to work with RKHS while controlling the complexity of the resulting clas- sifier with respect to the nonparametric solution where the final machine is built using as many RKHS as SVs are found (recall that the number of SVs can be very large). Therefore, we force the solution w to be w = R∑ i βiφ(ci ) = Ψβ (23) where ci are predetermined kernel prototypes which can be obtained by either nonlinear PCA [12] or clus- tering techniques such as FSCL [13], as shown in [14], and Ψ = [ψ(c1),ψ(c2), . . .ψ(cR)] (24) If we replace the definition of w in (23) into (20), multiply it by [ ΨT 00T 1 ], and reorder terms, we attain the following system of linear equations[ (ΨT ΦDaΦT Ψ + ΨT Ψ) ΨT Φa aT ΦT Ψ aT 1 ][ β b ] = [ ΨT ΦDay aT y ] (25) where every element of kernel matrices K = ΦT Ψ and I = ΨT Ψ is a RKHS, (K)i j = φT (ci )φ(x j ) = K (ci , x j ) ∀ i = 1, . . . , R ∀ j = 1, . . . , � (I )i j = φT (ci )φ(c j ) = K (ci , c j ) ∀ i, j = 1, . . . , R such that the solution can now be obtained in terms of RKHS. Note that now β is of finite dimension R, and the system in (25) can be solved using any LS method, for instance, the Moore-Penrose pseudoinverse2 β = (KT DaK + IΨ)−1KT Day (26) Note that when we introduce (23) into a model yˆ = sign(φT (x)w) we obtain the following compact architecture for the classifier yˆ = sign([φT (x)Ψ]β) = sign ( R∑ i=1 βi K (ci , x) ) (27) The USVC algorithm can be summarized as follows: 0.- Initialize ai = 1 1.- Determine an appropriate set of kernel prototypes ci
- Advantages of Unbiased Support Vector Classifiers for Data Mining Applications 227 2.- Compute kernel matrices K and IΨ 3.- Solve the WLS problem using (26) 4.- Obtain new weighting values ai using (22) 5.- Go back to 3 and repeat until the fixed point is attained. In what follows, to avoid confussion about the par- ticular algorithm is being used, we will refer as USVC to the unbiased algorithm for parametric machines, and ERM-SVC its counterpart for nonparametric SVM-like solutions [15]. 3. On USVC and ERM-SVC Convergence The convergence of the WLS procedure for solving the SVC is shown in [16] for the particular choice of the same loss-function as the QP-SVC. This demonstration can be directly applied to any other loss-function as the one proposed for the USVC. We will just highlight the key point of the demonstration without entering in deeper detail. First of all, in each iteration of the WLS procedure the resolution of (25) (3rd step of the WLS algorithm) guarantees that the functional in (5) is minimized. Sec- ond, the computation of ai in the fourth step of the WLS algorithm is done by fulfilling all the KKT conditions except (9) and (10). When the values of β and b are not modified in any given iteration, equation (25) can be simplified to (9) and (10), such that, at this point, all KKT conditions will be held and we will be at the USVC solution. To illustrate the convergence properties of both WLS-SVC and USVC algorithms we have defined a synthetic problem, such that a different data set can be generated in every trial following the same data distribution and, therefore, statistical relevance can be extracted from the experiments. We consider a bi- nary classification problem built with a combination of Gaussian and uniform distributions, and we define two different scenarios, cases A and B. In case A we have f (x | y = 1) = 1 4 N (0.75, 0.15) + 3 4 N (0.2, 0.02); P(y = 1) = 2 3 (28) f (x | y = −1) = U (0, 0.5); P(y = −1) = 1 3 and in case B f (x | y = 1) = 1 1.4 U (0.7, 0.95) + 0.4 1.4 N (0.25, 0.02); P(y = 1) = 0.7 (29) f (x | y = −1) = U (0, 0.5); P(y = −1) = 0.3 where N (µ, σ ) represents a Gaussian probability distri- bution function (pdf) with mean µ and standard devia- tion σ , and U (x1,x2) an uniform pdf in (x1,x2). These two scenarios are represented in Fig. 2. We select ‘a priori’ probabilities P(y = +1), P(y = −1) such that the classification error achievable us- ing a single decision threshold (Th) has a strong local minimum around x = 0.17 and a global minimum at x = 0.5 (CEmin = 20%) in case A, as shown in (c). In case B, we have a ‘narrow’ global minimum at x = 0.21 (with CEmin = 17%) and a ‘wide’ local min- imum around 0.6 (CE = 20%), as it can be observed in (d). Although very simple, these scenarios represent a fairly common situation in real world problems, where a portion of the patterns are immersed in the wrong class, and which, unless special care is taken, may in- duce a bias in the solution when either linear or nonlin- ear machines are used (specially when the maximum complexity of the machines is limited). We generate 500 training and 500 testing patterns, and we apply SVC (C = 10) and USVC (C = 10) al- gorithms with a linear machine, and with a “damped” weight update between iterations (wk = dwk−1 + (1 − d)w, d = 0.2 and w being the result of the pseudoin- verse in (26)). We carry out 100 independent trials of every ex- periment using a linear classifier and the resulting CE figures (indicating mean and standard deviation) and decision threshold (T h = −b/w) are shown in Table 1. In the average, CE obtained with USVC in cases A and B is around 20% (great reduction with respect to values attained using SVC). With respect to threshold values found in every case, it is worth mentioning that, in case A, USVC converges to values very close to T h = 0.5 (for both s = 2 and s = 4), therefore find- ing the minimum classification error solution. We have observed that the algorithm is not very sensitive with re- spect to the exact value of s, as long as it is large enough, therefore we will not further explore this aspect, and we will use s = 2 throughout the paper. Scenarios A and B being unidimensional and the linear classifica- tion machine represented by only two parameters, it is
- 228 Navia-Va´zquez et al. Figure 2. Data distribution for synthetic data scenarios A (a) and B (b) (where class “+1” distribution has been plotted with continuous line and that of class “−1” with a dashed one). The corresponding theoretical CE curves have also been plotted in (c) and (d), respectively. possible to graphically analyze the convergence of the algorithms, as represented in Fig. 3, where error sur- faces, CE functions and trajectories for a single trial are shown for both SVC and USVC. Error surface3 ˜L P (T h, w) and convergence trajec- tory for both SVC4 and USVC algorithms solving prob- lem A have been depicted in (a) and (c), respectively. It can be observed in (a) how, starting at point ‘o’ (Least Squares (LS) solution) the WLS-SVC algorithm con- verges to the QP-SVC solution (marked as ‘x’). Any other initialization of the algorithm can alternatively be used, because it has been shown in [1] that the asso- ciated error surface is convex with a single minimum. The cross-section represented by a dashed line in (a) represents the locus of the minima of ˜L P (T h, w) for fixed values of T h, for the SVC case. This profile is relevant, because it represents the CE function that the algorithm is actually minimizing. It has been plotted in (b) for the SVC case, where it can be observed that it represents a poor estimate of CE. The theoretical CE curve as a function of T h has been represented with a continuous line, and the empirical estimation of CE has been plotted with a dotted line. The QP- SVC minimum (around T h = 0.35, marked as ‘x’) is clearly biased with respect to the optimal solution (T h = 0.5). In (c) we can see the convergence of the USVC al- gorithm, also starting at the LS solution (the trajectory
- Advantages of Unbiased Support Vector Classifiers for Data Mining Applications 229 Table 1. Results for experiment 1, using a linear classifier trained with regularized LS, SVC and USVC (s = 2, 4) algorithms. Optimal LS USVC USVC values (reg.) SVC (s = 2) (s = 4) Case A CE (%) 20 30.5 26.8 19.9 21.1 (Train) (2.9) (2.6) (1.7) (2.0) CE (%) 20 30.7 27.1 20.4 21.8 (Test) (2.7) (2.4) (1.9) (2.6) Th 0.5 0.3 0.35 0.49 0.47 (0.02) (0.02) (0.06) (0.1) Case B CE (%) 17 31.8 24.8 19.7 20.1 (Train) (1.4) (2.6) (1.8) (5.2) CE (%) 17 31.7 25.5 19.8 19.7 (Test) (1.9) (2.4) (1.9) (5.4) Th 0.6 0.3 0.4 0.58 0.56 (0.02) (0.02) (0.01) (0.02) Mean and standard deviations (below, in brackets) of CE and thresh- old values are shown, as well as optimal (theoretical) ones. of WLS-SVC is kept as a reference). It can be observed how the error surface is no longer strictly convex (there is a local minimum around (0.15, 12)), but the abso- lute minimum is located at T h = 0.5, in coincidence with the optimal theoretical value. Cross-section rep- resented in (d) using a dashed line, provides a much better approximation to the theoretical CE curve than in the SVC case. The same convergency analysis has been carried out for scenario B, and the results can be observed in Fig. 4. In this case the QP-SVC solution is also biased, be- cause the minimum of ˜L P (T h, w) (marked as ‘x’ in (a) and (b)) is not placed at the optimal (maximal margin) value 0.6. The theoretical CE curve presents, in this case, a global minimum at T h = 0.2. Nevertheless, it can be observed in (c) how the error surface ˜L P (T h, w) that the USVC algorithm tries to minimize, presents a global minimum at T h = 0.6, marked as ‘∗’ in (d). This threshold value provides a ‘safer’ solution to the problem, in the sense that final performance is less sen- sitive to (possible) perturbations in the data. It can be observed in (c) how USVC, unlike SVC, converges to this solution. 3.1. Discussion About the Benefits of Unbiased SVCs We have shown that the SVC can be solved with a loss-function closer to the step-function than the one proposed in [1], which guarantees that the solution will asymptotically tend to the solution with minimum num- ber of errors (Bayes classifier). But, when working with a finite set, one may ask, how close is close enough or, alternatively, how much am I biasing the solution when using the SVC instead of USVC. The answers to these questions are not trivial and we will not make any at- tempt to simplify the problem to arrive to an analytical solution. Instead, we will give some heuristical rules that illustrate in which cases the USVC may improve the SVC solution. First of all, the number of training samples directly affects the benefits of using the USVC (asymptotical property), or more precisely the number of erroneously classified training samples, which are the samples that bias the SVC solution. Secondly, the larger the values of ξi , the best the USVC will be, because the samples with larger ξi are the samples that are worstly represented by the SVC L1 norm loss-function, i.e. the samples that bias most the SVC solution (see Fig. 1). As linear kernels present larger values of ξi than RBF kernels and the value of ξi increases with the width of the used RBF kernel (similar arguments can be carried out with any other kernel), one may find that the benefits of the USVC appears when the VC dimension is low.5 Basically the USVC works best when the number of training samples increases and the VC dimension of the classifier decreases. We must also point out that the USVC does not al- ways improve the SVC when working with finite data sets. This fact can be explained if one sees the loss- function selection as an extra “regularizing” parameter (although it is not actually one). The SVC has a true regularizing parameter in the norm of w. In regulariza- tion theory [18], it is shown that adding a non-negative convex functional to the empirical error helps control- ling the machine complexity (generalization ability). As the number of samples increases the empirical er- ror is more reliable and the weight of the regularizing functional is reduced to zero at a given rate, as the number of samples tends to infinity (see [18] for fur- ther details). The degree of similarity between the used loss-function and the step loss operates in a similar way. With few samples the use of Vapnik loss is pre- ferred over the USVC, because the later might provide a lower empirical error but with worse generalization. But as the number of samples increases the empirical error will tend to the true error and the use of Vapnik loss will bias the minimum error solution, because it is not truly an error counting term. As the number of
- 230 Navia-Va´zquez et al. Figure 3. Convergency analysis of SVC and USVC algorithms in case A. Trajectories over the error surface ˜L P (T h, w) are shown in (a) and (c), as well as CE functions being minimized in every case, in (b) and (d). samples increases we must therefore modify Vapnik loss-function to a shifted step loss-function that truly minimizes the empirical error. 4. Experiments 4.1. Comparison with Other Machine Learning Algorithms We use here a collection of datasets from the UCI Machine Learning repository,6 we normalize the input variables to have zero mean and unit standard deviation, we use 10 random splits of the data (80% for training and 20% for testing), and find optimal parameters for every algorithm and split using 5-fold cross-validation. In the experiments that follow, we concentrate on measuring Classification Error (CE) and machine size, to reveal the potential benefits of using a different cost function. We are not addressing here the computational cost problem, since the iterated WLS approach for solv- ing SVMs has already been benchmarked in [24], with favorable results with respect to state of the art SVM technology. Since the main modification of USVC is the different computation of weighting values, little dif- ference is obtained with respect to training cost. We
- Advantages of Unbiased Support Vector Classifiers for Data Mining Applications 231 Figure 4. Convergency analysis in case B for both SVC ((a) and (b)) and USVC with s = 2 ((c) and (d)) and USVC with s = 4 ((e) and (f)). Trajectories over the error surface ˜L P (T h, w) are shown in (a) and (c), as well as CE functions for every algorithm, in (b) and (d). compare QP-SVC (SVMlight, [20]), Adaboost7 [21] and the USVC and ERM-SVC algorithms, with both linear and nonlinear (RBF) machines. The results for the linear machine are detailed in Table 2. It can be observed how USVC is able to perform bet- ter than QP-SVC (SVMlight) in several cases (‘Image’, ‘Ringnorm’, and ‘Thyroid’), in the remaining datasets, it presents a performance similar to QP-SVC, and never is worst. Both SVC and USVC/ERM-SVC usually out- perform Least Squares (LS). The linear machine case is more than a simple “baseline experiment”, because very often input data is already in a high dimensional space, training patterns are scarce and, therefore, a lin- ear machine is simple to train and provides good results [22, 23]. We believe that this algorithm could be very useful in those data mining scenarios. To implement the kernelized USVC algorithm we applied firstly the Frequency Sensitive Competitive Learning (FSCL) algorithm [13] to find the R centroids needed in every case (parameter R has also been found using 5-fold cross-validation). The final performance of RBF kernel machines trained with USVC is shown in Table 3, as well as results with SVMlight, ERM-SVC and Adaboost. We observe, in this nonlinear case, how the USVC is also able to improve some of the QP-SVC results,
- 232 Navia-Va´zquez et al. Table 2. Results for the real world datasets, using a linear model trained with LS, SVMlight, and USVC/ERM-SVC algorithms. LS SVMlight USVC Diabetes 24.7 ± 2.6 24.2 ± 2.2 24.2 ± 3.3 Image 16.2 ± 1.2 15.8 ± 0.8 12.1 ± 1.1 Ringnorm 23.3 ± 1.1 23.4 ± 1.1 22.9 ± 1.1 Thyroid 17.7 ± 5 9.1 ± 4.5 7.6 ± 3.8 Twonorm 2.2 ± 0.3 2.2 ± 0.3 2.2 ± 0.2 Waveform 12.0 ± 0.7 11.5 ± 0.6 11.4 ± 0.5 CE is shown (mean ± std. dev.). Table 3. Results for the RBF classifiers trained with SVMlight, Adaboost and unbiased algorithms. SVMlight Adaboost USVC Diabetes CE(%) 24.5 ± 2.3 27.2 ± 3.7 22.4 ± 3.1 Mach. size 352 ± 12 3000 ± 0 40 Image CE(%) 3.7 ± 0.6 3.6 ± 0.9 3.7 ± 0.8 Mach. size 285 ± 14 17116 ± 1370 120 Ringnorm CE(%) 1.8 ± 0.4 1.9 ± 0.4 1.8 ± 0.6 Mach. size 615 ± 25 2400 ± 0 90 Thyroid CE(%) 3.2 ± 2.9 3.7 ± 3.2 2.6 ± 3.2 Mach. size 25 ± 4 1534 ± 143 40 Twonorm CE(%) 2.2 ± 0.9 2.9 ± 0.4 2.2 ± 0.8 Mach. size 565 ± 12 2400 ± 0 12 Waveform CE(%) 8.9 ± 0.8 9.1 ± 0.9 8.1 ± 1.1 Mach. size 1286 ± 25 2000 ± 0 30 CE is shown (mean ± std. dev.), as well as the machine size in every case. namely, cases ‘Diabetes’, ‘Thyroid’ and ‘Waveform’, obtaining similar (never worst) results in the remain- ing datasets. Furthermore, note how it usually outper- forms Adaboost, since the generalization criterion that Table 4. Results for the biomedical datasets, using a linear model trained with LS, SVMlight, and USVC/ERM-USVC algorithms. LS SVMlight ERM-SVC USVC Breast CE(%) 4.16 ± 2.1 4.17 ± 2.2 3.74 ± 2.9 3.73 ± 2.8 Dermat CE(%) 0.28 ± .7 0.28 ± 0.7 0.28 ± 0.79 0.28 ± 0.7 Haberman CE(%) 25.9 ± 8.7 27.3 ± 6.1 24.7 ± 6.28 24.3 ± 6.8 Postoperative CE(%) 29.5 ± 17.3 28.4 ± 18.5 27.3 ± 18.8 27.2 ± 18.46 Heart disease CE(%) 16.55 ± 3.3 16.8 ± 3.4 17.6 ± 5.6 15.5 ± 3.1 Hepatitis CE(%) 12.5 ± 7.4 15.1 ± 6.5 19.1 ± 6.2 12.5 ± 5.4 Liver disorder CE(%) 30.5 ± 4.1 31.1 ± 3.7 34.3 ± 10.8 30.2 ± 7.4 CE is shown (mean ± std. dev.). SVC theory imposes on the solution is really benefit- ing the USVC algorithm. Finally, note that, in some cases with Adaboost and SVMlight, we obtain very large machines (1200, 2400, or even 17000 RBF ker- nels in every classifier), which does not happen with USVC, where the semiparametric nature of the model imposed guarantees a certain (moderate) complexity of the solution (from 12 to 120 RBF kernels), obtain- ing thereby machine complexity reductions up to two orders of magnitude. 4.2. Unbiased SVCs for Biomedical Data Finally, we have chosen several data sets with medical or biological content that are characterized by present- ing very few examples, which makes even harder the use of USVC. At the same time we have restricted our- selves to using linear machines because, we wanted to facilitate the explainability of the achieved results. The results are shown in Table 4, where we have also in- cluded the ERM-SVC case (unbiased SVC procedure that yields a non-parametric solution and uses a sig- moidal loss-function [15]). The USVC is equal or best in most data sets because in this case the complexity of the machine is fixed and in spite of the small number of patterns in the training sets. In the ‘dermat’ case all algorithms attain the same solu- tion, since the training set is linearly separable and no error penalty terms are present. The ERM-SVC should lead to the same solution as the SVC but the chosen parameters for ‘Heart disease’, ‘Hepatitis’ and ‘Liver disorder’ were such that it obtained a lower empiri- cal (training) error but a higher testing error. This can be explained because the parameters of the ERM-SVC loss-function (sigmoidal type) lead to a much closer
- Advantages of Unbiased Support Vector Classifiers for Data Mining Applications 233 approximation to the step loss-function, which, as pre- viously argued, might lead to a worse generalization error if the data points are scarce. Therefore, these cases of scarce data are not ideally suited for USVC algorithms with a loss function “too” close to the ideal one, as already discussed in Section 3.1, since then the generalization capabilities severely degrade. As long as more data is available, the loss function can be chosen to be closer to the step function (in the limit of infinite data, the step function could be used). This is one of the reasons why QP-SVCs (with L1 norm) have proven to be successful in datasets with so few training points, since they provide in these cases a reasonable compro- mise between generalization and minimum number of training errors. 5. Conclusions and Further Work We have presented an unbiased implementation of SVC (USVC algorithm) which on average provides better solutions in terms of classification error than classi- cal QP-based SVC implementations (especially when the complexity of the machine is reduced -low VC dimension-, and there are a relatively large number of misclassified patterns), and also compares well with other competing algorithms (Adaboost). QP-SVC has shown to yield biased solutions because it is not ac- tually minimizing a true ERM cost, due to the coarse approximation to the “error counting” cost included in the QP-SVC functional. We include a term in the optimization which represents a better approximation to the number of misclassified patterns. The resulting functional can no longer be minimized using QP but, solving SVCs with this new cost function is possible thanks to our reformulation of the problem in terms of iterated Weighted Least Squares Problems, what we have called Unbiased SVC, since it corrects the ‘bias’ of QP-SVC solutions observed in practice. The resulting USVC algorithm outperforms QP-SVC in both clas- sification error and machine complexity, as illustrated with some experiments with synthetic and real-world data. As a final consideration, we believe that no univer- sally best SVC method exists to solve any classification problem and one should try to solve the SVC with sev- eral approximations to the step function and the loss- function must be viewed as another “hyperparameter” of the SVC as usually done with the kernel function or the penalty factor value. As an ongoing work, we are currently exploring USVC schemes that differently weight error types I and II (detection miss or false alarm, so to speak), to solve problems with asymmetric-penalty cost func- tions, -which is now feasible since with USVC algo- rithms every classification error contributes with the right weight in the cost functional- to eventually ob- tain Constant False Alarm Rate SVCs (CFAR-SVCs), useful in any “alert-type” application either in biomed- ical monitoring or in data (text) mining (news alert), by combining USVC algorithm and adaptive implemen- tations already available [14]. Acknowledgments This work has been partially supported by the Commu- nity of Madrid grant CAM-07T/0014/20031. We would like to thank Dr. Cid-Sueiro for his valu- able comments about error-counting cost functions and their relationship to Empirical Risk Minimization and maximal margin solutions. Notes 1. Some of these derivatives have been directly written in matricial form. 2. For simplicity, we assume from now on b = 0, usually done when solving SVCs with Gaussian kernels, although the system in (25) could also provide the value of b. 3. Note that, to make visualization easier, we have used T h = −b/w as a second parameter, instead of directly using b. 4. Although SVC solutions have been obtained using the WLS-SVC algorithm, it has been shown in Pe´rez-Cruz et al. [17] and Navia- Va´zquez et al. [14] that, in the linear case, it attains the same solution as QP-SVC. 5. The VC dimension depends not only on the selected kernel but also on the dimension of the input space and the value of penalty factor C . A way of computing an upper bound on the VC dimen- sion is shown in [19]. 6. http://www.ics.uci.edu/∼mlearn/MLSummary.html. 7. Software available at ida.first.gmd.de/homepages/raetsch/. References 1. V. Vapnik, The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995. 2. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Washington, DC: Spartan Press, 1961. 3. B. Telfer and H. Szu, “Energy Functions for Minimizing Miss- classification Error with Minimum Complexity Networks,” Neu- ral Networks, vol. 7, 1994, pp. 805–818.
- 234 Navia-Va´zquez et al. 4. S. Raudys, “Evolution and Generalization of a Single Neurone I. Single-Layer Perceptron as Seven Statistical Classifiers,” Neural Networks, vol. 11, 1998, pp. 283–296. 5. S. Raudys, “Evolution and Generalization of a Single Neu- rone II. Complexity of Statistical Classifiers and Sample Size Considerations,” Neural Networks, vol. 11, 1998, pp. 297– 313. 6. J. Cid-Sueiro and J.L. Sancho-Go´mez, “Saturated Perceptrons for Maximum Margin and Minimum Missclassification Er- ror,” Neural Processing Letters, vol. 14, no. 3, 2001, pp. 217– 226. 7. L. Bobrowsky and J. Sklansky, “Linear Classifiers by Window Training,” IEEE Transactions on Systems, Man and Cybernetics, vol. 25, 1995, pp. 1–9. 8. H. Do-Tu and M. Installe, “Learning Algorithms for Non- Parametric Solution to the Minimum Error Classification Prob- lem,” IEEE Transactions on Computers, vol. 27, no. 7, 1978, pp. 648–659. 9. A. Guerrero Curieses and J. Cid-Sueiro, “A Natural Approach to Sample Selection in Binary Classification,” in Proceed- ings of Learning’00, Madrid, Spain (CD-ROM), paper no. 29, 2000. 10. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge: Cambridge University Press, 2000. 11. F. Pe´rez-Cruz, A. Navia-Va´zquez, P. Alarco´n-Diana. and A. Arte´s-Rodrı´guez, “Support Vector Classifier with Hyperbolic Tangent Penalty Function,” in Proc. IEEE International Confer- ence on Acoustics, Speech, and Signal Processing ICASSP’2000, vol. 6, Piscataway, NJ, USA, 2000, pp. 3458–3461. 12. B. Scholkopf, P. Knirsch, A. Smola, and C. Burges, “Fast Ap- proximation of Support Vector Kernel Expansions, and an In- terpretation of Clustering as Approximation in Feature Spaces,” in Proc. 20. DAGM Symp. Mustererkennung, Springer Lecture Notes in Computer Science, vol. 1, 1998. 13. S. Ahalt, A. Krishnamurthy, P. Chen, and D. Melton, “Com- petitive Learning Algorithms for Vector Quantization,” Neural Networks, vol. 3, 1990, pp. 277–290. 14. A. Navia-Va´zquez, F. Pe´rez-Cruz, A. Arte´s-Rodrı´guez, and A. Figueiras-Vidal, “Weighted Least Squares Training of Support Vector Classifiers Leading to Compact and Adaptive Schemes,” IEEE Trans. Neural Networks, vol. 12, no. 5, 2001, pp. 1047– 1059. 15. F. Pe´rez-Cruz, A. Navia-Va´zquez, A.R. Figueiras-Vidal, and A. Arte´s-Rodrı´guez, “Empirical Risk Minimization for Support Vector Machines,” IEEE Trans. on Neural Networks, vol. 14, no. 2, 2003, pp. 296–303. 16. F. Pe´rez-Cruz, “Ma´quina de Vectores Soporte Adaptativa y Com- pacta,” Ph.D. thesis, Universidad Polite´cnica de Madrid, 2000. 17. F. Pe´rez-Cruz, Navia-Va´zquez, J. Rojo- ´Alvarez, and A. Arte´s- Rodrı´guez, “A New Training Algorithm for Support Vector Ma- chines,” in Proc. 5th Bayona Workshop on Emerging Technolo- gies in Telecomms., vol. 1, 1999, pp. 343–351. 18. A.N. Tikhonov and V.Y. Arsenin, Solution of Ill–Posed Prob- lems, Washington, DC: Winston, 1977. 19. C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, vol. 2, 1998, pp. 121–167. 20. T. Joachims, “Making large-Scale SVM Learning Practical,” in Advances in Kernel Methods—Support Vector Learning, B. Schlkopf, C. Burges, and A. Smola (Eds.), Cambridge, MA: MIT Press, 1999, pp. 169–184. 21. Y. Freund and R. Shapire, “A Decission-Theoretic Generaliza- tion of On-line Learning and an Application to Boosting,” Jour- nal of Computer Science, vol. 55, 1997, pp. 119–139. 22. T. Joachims, “Text Categorization with Support Vector Ma- chines: Learning with Many Relevant Features,” in Proc. 10th European Conf. on Machine Learning (ECML), 1998. 23. J. Platt, “Inductive Learning Algorithms and Representations for Text Categorization,” in Proc. 7th International Conference on Information and Knowledge Management, 1998. 24. F. Pe´rez-Cruz, P. Alarco´n-Diana, A. Navia-Va´zquez, and A. Arte´s-Rodrı´guez, “Fast Training of Support Vector Classifiers,” in Advances in Neural Information Processing Systems, vol. 13, 2000, pp. 734–740. 25. A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood from Incomplete Data via EM Algorithm (with discussion),” Journal of the Royal Statistical Society, vol. 39, 1977, pp. 1– 38. 26. J. Shawe-Taylor and N. Cristianini, “On the Generalisation of Soft Margin Algorithms,” NeuroCOLT2 Tech. Rep. 82, Dep. Computer Science, Univ. London, 2000. A. Navia-Va´zquez received his Degree in Telecommunications En- gineering in 1992 (Universidad de Vigo, Spain), and finished his PhD, also in Telecommunications Engineering in 1997 (Universidad Politécnica de Madrid, Spain). He is now an Associate Professor at the Department of Communication Technologies, Universidad Car- los III de Madrid, Spain. His research interests are focused on new architectures and algorithms for nonlinear processing, as well as their application to multimedia processing, communications, data min- ing, knowledge management and teleeducation. He has (co)authored more than 20 international journal and conference papers in these areas. Fernando Pe´rez-Cruz (member IEEE) born in Sevilla 1973. He received his Telecommunication Engineering degree in 1996
- Advantages of Unbiased Support Vector Classifiers for Data Mining Applications 235 (Universidad de Sevilla, Spain) and PhD degree in Telecommunica- tion Engineering in 2000 (Universidad Politécnica de Madrid, Spain). He is an Associated Professor at the Department of Signal Theory and Communication (Universidad Carlos III de Madrid). He is currently at a sabbatical period at Gatsby Computational Neuroscience Unit at University College London. His current interest lies in machine learning algorithmic and theoretical developments and its applica- tion to signal processing and financial data. He has authored over forty contributions in international journal and conferences. fernandop@ieee.org; fernando@gatsby.ucl.ac.uk Antonio Arte´s-Rodrı´guez was born in Alhama de Almerı´a, Spain, in 1963. He received the Ingeniero de Telecomunicacio´n and Doc- tor Ingeniero de Telecomunicacio´n degrees, both from the Uni- versidad Polite´cnica de Madrid, Spain, in 1988 and 1992, respec- tively. He is now a Professor at the Departamento de Teorı´a de la Seo`al y Comunicaciones, Universidad Carlos III de Madrid, Spain. His research interests include detection, estimation, and statistical learning methods, and their application to signal processing and com- munication. He is Senior Member of the IEEE. Anı´bal R. Figueiras received the Telecomm Engineer degree from Universidad Polite´cnica de Madrid, Spain, in 1973 and the Doctor degree in 1976 from Universidad Polite´cnica de Barcelona. He is a Professor at Universidad Carlos III de Madrid, and, at the present mo- ment, Head of its Department Signal Theory and Communications. His research interests are digital signal processing, digital commu- nications, neural networks, and learning theory. In these subjects he has coauthored more than 200 international journal and confer- ence papers. Dr. Figueiras is a member of the Spain Academy of Engineering.