# Advantages of Unbiased Support Vector Classifiers for Data Mining Applications

Navia-Vázquez et al. extend these principles to the Support Vector Machines case. Instead of following the ERM principle, the Support Vector Classifier (SVC), as proposed in [1], minimizes a combination of the L2 norm of the weights and the L1 norm of the margin slack variables. Although it has recently been demonstrated that this strategy minimizes some generalization bounds, Cristianini and Shawe-Taylor [10] acknowledge that "optimizing the norm of the margin slack vector does not necessarily mean minimizing the number of misclassifications". To achieve unbiased SVC (USVC) solutions that truly solve the ERM problem, it is necessary to use a cost function that actually counts errors, while keeping the maximal margin criterion. Quadratic Programming (QP) can no longer be used to solve this new minimization problem, so we propose an alternative minimization mechanism based on a Weighted Least Squares (WLS) formulation, eventually obtaining an Unbiased SVC (USVC).

The paper is organized as follows: in the next section we derive the unbiased formulation for nonlinear machines (the linear case follows directly as a particular case), for both parametric and nonparametric machines. We then analyze and discuss the convergence properties of the proposed algorithms. Next, we evaluate the performance of the proposed algorithms on several biomedical datasets using limited-complexity (semiparametric) machines and, finally, some conclusions are presented.

## 2. Derivation of the Unbiased Algorithms

Let us consider a set of training data pairs $(x_i, y_i)$, $i = 1, \ldots, \ell$, corresponding to a binary classification problem ($x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$), and a general classifier used to estimate targets $\hat{y}_i = \mathrm{sign}(\phi^T(x_i)w + b)$, $\phi(\cdot)$ being a mapping to a higher dimensional space $F$.
SVCs follow the optimal hyperplane (OH) criterion, which maximizes the distance between the training samples and the classification boundary (the margin), thereby maximizing the generalization capability of the machine (for linearly separable datasets). To extend the OH criterion to the non-separable case, Vapnik proposes to find the hyperplane that makes the smallest number of errors [1], by adding a term that takes errors into account, i.e., by minimizing

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell} \theta(\xi_i) \quad (1)$$

with respect to $w$, $b$, $\xi_i$, subject to

$$y_i(\phi^T(x_i)w + b) - 1 + \xi_i \geq 0, \quad \forall i = 1, \ldots, \ell \quad (2)$$

$$\xi_i \geq 0, \quad \forall i = 1, \ldots, \ell \quad (3)$$

where $\theta(\cdot)$ represents a shifted step (Heaviside) function. This optimization problem is NP-complete, as noted in [1], and, to find a solution by means of a practical optimization algorithm, Vapnik proposes to minimize

$$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell} \xi_i \quad (4)$$

leading in this case to the well-known QP formulation of the SVC problem [1] (QP-SVC from now on). This represents a rather coarse approximation to the term $\theta(\xi_i)$ in (1). To solve an SVC which combines maximal margin and minimum number of errors, we propose to minimize a modified functional

$$\tilde{L} = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell} \tilde{\theta}_s(\xi_i) \quad (5)$$

where we use the following cost function

$$\tilde{\theta}_s(\xi_i) = \begin{cases} \dfrac{\big(\xi_i + k_\ell^{1/s}\big)^s}{2\big(1 + k_\ell^{1/s}\big)^s}, & 0 \leq \xi_i < 1 \\[2mm] 1 - \dfrac{\big(1 + k_\ell^{1/s}\big)^s}{2\big(\xi_i + k_\ell^{1/s}\big)^s}, & \xi_i \geq 1 \end{cases} \quad (6)$$

which is a smooth approximation to the step function, with $k_\ell$ a very small constant, necessary to ensure that the maximum margin solution is obtained when the training set is separable. Figure 1 compares the "error count", QP-SVC and USVC ($\tilde{\theta}_s(\xi_i)$) costs, the latter for different values of $s$. It can be seen that $\tilde{\theta}_s(\xi_i)$ asymptotically provides an appropriate approximation to the step function shifted to $\xi_i = 1$, leading to a true error-counting cost function as $s$ tends to infinity (in fact, $s$ need not take a very large value to obtain good approximations to ERM).
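The smooth cost in (6) is easy to reproduce numerically. The sketch below is ours, not the authors' code, and assumes a concrete value $k_\ell = 10^{-6}$; it can be used to check the limiting behaviour of $\tilde{\theta}_s$: the cost passes through $1/2$ at $\xi = 1$ and approaches the shifted step function as $s$ grows.

```python
import numpy as np

def theta_s(xi, s=2, k_l=1e-6):
    """Smooth approximation to the shifted step loss of Eq. (6).

    For 0 <= xi < 1 the cost grows polynomially towards 1/2; for xi >= 1
    it saturates towards 1. As s -> infinity it approaches the step
    function shifted to xi = 1 (a true error-counting loss).
    """
    xi = np.asarray(xi, dtype=float)
    k = k_l ** (1.0 / s)
    c = (1.0 + k) ** s
    low = (xi + k) ** s / (2.0 * c)          # branch for 0 <= xi < 1
    high = 1.0 - c / (2.0 * (xi + k) ** s)   # branch for xi >= 1
    return np.where(xi < 1.0, low, high)
```

Both branches evaluate to exactly $1/2$ at $\xi = 1$, so the approximation is continuous there, unlike the step function it replaces.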
To minimize (5) we follow steps parallel to those of the SVM formulation: we first introduce the linear restrictions (2) and (3) by means of Lagrange multipliers,

$$\tilde{L} = \frac{1}{2}\|w\|^2 + C\sum_i \tilde{\theta}_s(\xi_i) - \sum_i \alpha_i\big(y_i(\phi^T(x_i)w + b) - 1 + \xi_i\big) - \sum_i \eta_i\xi_i \quad (7)$$
*Figure 1. Comparison of QP-SVC, "error count" and USVC costs.*

and we obtain the Karush-Kuhn-Tucker (KKT) conditions that define the solution, which are, besides (2) and (3), the following:

$$\frac{\partial \tilde{L}}{\partial \xi_i} = C\frac{\partial \tilde{\theta}_s(\xi_i)}{\partial \xi_i} - (\alpha_i + \eta_i) = 0, \quad \forall i = 1, \ldots, \ell \quad (8)$$

$$\frac{\partial \tilde{L}}{\partial b} = \sum_{i=1}^{\ell} \alpha_i y_i = 0 \quad (9)$$

$$\frac{\partial \tilde{L}}{\partial w} = w - \sum_{i=1}^{\ell} \alpha_i y_i \phi(x_i) = 0 \quad (10)$$

$$\alpha_i, \eta_i \geq 0 \quad (11)$$

$$\alpha_i\big\{y_i(\phi^T(x_i)w + b) - 1 + \xi_i\big\} = 0, \quad \forall i = 1, \ldots, \ell \quad (12)$$

$$\eta_i\xi_i = 0, \quad \forall i = 1, \ldots, \ell \quad (13)$$

These conditions are identical to the ones in Vapnik's SVC, except for (8), which reflects the new cost function. To optimize (7) we first reorganize the sums by grouping the terms depending on $\xi_i$:

$$\tilde{L} = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i\big(1 - y_i(\phi^T(x_i)w + b)\big) + \sum_i \big(C\tilde{\theta}_s(\xi_i) - \xi_i(\alpha_i + \eta_i)\big) \quad (14)$$

We use the error definition $e_i = y_i - (\phi^T(x_i)w + b)$ and, recalling that $y_i^2 = 1$, we compute weights $a_i$ as

$$a_i = \frac{2\alpha_i}{1 - y_i(\phi^T(x_i)w + b)} \quad (15)$$

to obtain

$$\tilde{L} = \frac{1}{2}\|w\|^2 + \frac{1}{2}\sum_i a_i e_i^2 + \sum_i \big(C\tilde{\theta}_s(\xi_i) - \xi_i(\alpha_i + \eta_i)\big) \quad (16)$$

Functional (16) can be iteratively solved by first minimizing it with respect to $w$, $b$ and $\xi_i$, and afterwards computing $a_i$ from the previously obtained solution. To minimize $\tilde{L}$ with respect to $w$, $b$ and $\xi_i$ we set its partial derivatives to zero, obtaining¹

$$\frac{\partial \tilde{L}}{\partial w} = w - \Phi D_a[y - \Phi^T w - \mathbf{1}b] = 0 \quad (17)$$

$$\frac{\partial \tilde{L}}{\partial b} = -a^T[y - \Phi^T w - \mathbf{1}b] = 0 \quad (18)$$

$$\frac{\partial \tilde{L}}{\partial \xi_i} = C\frac{\partial \tilde{\theta}_s(\xi_i)}{\partial \xi_i} - (\alpha_i + \eta_i) = 0, \quad \forall i = 1, \ldots, \ell \quad (19)$$

where

$$\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_\ell)], \quad a = [a_1, a_2, \ldots, a_\ell]^T, \quad w = [w_1, w_2, \ldots, w_n]^T, \quad (D_a)_{ij} = a_i\,\delta[i - j], \; \forall i, j = 1, \ldots, \ell$$

Equation (19) is the KKT condition in (8) and, therefore, it will vanish once we attain the minimum.
Equations (17) and (18) define the solution for $w$ and $b$ and can be joined together into the following system of equations, which has to be solved to obtain the USVC solution:

$$\begin{bmatrix} \Phi D_a \Phi^T + I & \Phi a \\ a^T \Phi^T & a^T \mathbf{1} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} \Phi D_a y \\ a^T y \end{bmatrix} \quad (20)$$

Note that (20) is actually an implicit equation in $w$, $b$ and $a$, and cannot be solved simultaneously for these parameters. However, if we fix the $a_i$ values, we can obtain an explicit solution for $w$ and $b$. Then, we can update the $a_i$ values and solve (20) again. From the KKT
conditions in (8) and (11)–(13) we find that the new $\alpha_i$ can be calculated as

$$\alpha_i = \begin{cases} 0, & e_i y_i < 0 \\[1mm] C\left.\dfrac{d\tilde{\theta}_s(\xi_i)}{d\xi_i}\right|_{\xi_i}, & e_i y_i \geq 0 \end{cases} \quad (21)$$

and, from them, the values of $a_i$ can be computed as in (15), taking into account that $\xi_i = 0$ when $e_i y_i < 0$ and $\xi_i = e_i y_i$ when $e_i y_i \geq 0$. We can explicitly write $a_i$ as a function of $\alpha_i$ and $e_i y_i$:

$$a_i = \begin{cases} 0, & e_i y_i < 0 \\ 2\alpha_i/(e_i y_i), & e_i y_i \geq 0 \end{cases} \quad (22)$$

Once $a_i$ has been calculated for every sample, the system in (20) can be solved again. This process continues until neither $a_i$ nor $w$ and $b$ change, i.e., until the fixed point of the iterative procedure is found. This fixed point coincides with the desired USVC solution, since it simultaneously satisfies all the KKT conditions, which are known to determine the solution of the minimization problem. The minimization problem solved by the unbiased algorithms is not convex, so they may get trapped in local minima. In the following sections we initially run the algorithm with QP-SVC and then change the cost function and let the USVC algorithm progress until a better minimum is found: experience tells us that this solution is, in terms of Classification Error (CE), better than or equal to the one found with the SVC, but never (statistically significantly) worse. In the following section we illustrate this behaviour using a simple problem, and discuss the convergence of the methods.

### 2.1. Parametric Kernelization of the Unbiased Algorithm

The above algorithm is apparently limited to those applications where the nonlinear mapping $\phi(\cdot)$ is known. If we only know its Reproducing Kernel Hilbert Space (RKHS) kernel, $K(\cdot,\cdot) = \phi^T(\cdot)\phi(\cdot)$, the system in (20) cannot be solved in its present form, but it can be modified to be solved using the RKHS, as was done in [24] with the SVC. In this paper we propose a (semiparametric) model as the unbiased solution (named USVC from now on).
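Returning to the iteration defined by (20), (21) and (22): for a linear machine ($\phi(x) = x$) the fixed-weight step is just a small symmetric linear system, and the reweighting step only needs the derivative of the smooth cost (6). The NumPy sketch below is our illustration of these two steps (function names are ours; the guard at $e_i y_i = 0$ is our addition, since (22) is singular there):

```python
import numpy as np

def solve_wls_system(X, y, a):
    """Solve system (20) for w and b with the weights a held fixed,
    for a linear machine (phi(x) = x). X is (n, ell), one column per
    sample; y and a are length-ell vectors."""
    n, _ = X.shape
    Da = np.diag(a)
    Xa = X @ a                                  # Phi a
    A = np.block([
        [X @ Da @ X.T + np.eye(n), Xa[:, None]],
        [Xa[None, :], np.array([[a.sum()]])],   # a^T 1
    ])
    rhs = np.concatenate([X @ Da @ y, [a @ y]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]

def update_weights(e, y, C=10.0, s=2, k_l=1e-6):
    """Recompute alpha_i (Eq. 21) and a_i (Eq. 22) from the errors
    e_i = y_i - (w^T x_i + b). Samples with e_i y_i < 0 get zero
    weight; the rest use the derivative of the smooth cost (6)."""
    ey = e * y
    xi = np.maximum(ey, 0.0)            # xi_i = e_i y_i when e_i y_i >= 0
    k = k_l ** (1.0 / s)
    c = (1.0 + k) ** s
    dtheta = np.where(xi < 1.0,         # derivative of (6), branch by branch
                      s * (xi + k) ** (s - 1) / (2.0 * c),
                      s * c / (2.0 * (xi + k) ** (s + 1)))
    alpha = np.where(ey >= 0.0, C * dtheta, 0.0)
    a = np.where(ey > 0.0, 2.0 * alpha / np.maximum(ey, 1e-12), 0.0)
    return alpha, a
```

With $a_i = 1$ (step 0 of the procedure), `solve_wls_system` reduces to a regularized least-squares fit; alternating the two functions implements the fixed-point iteration described above.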
This will allow us to work with the RKHS while controlling the complexity of the resulting classifier, in contrast with the nonparametric solution, where the final machine is built using as many kernel evaluations as SVs are found (recall that the number of SVs can be very large). Therefore, we force the solution $w$ to be of the form

$$w = \sum_{i=1}^{R} \beta_i\phi(c_i) = \Psi\beta \quad (23)$$

where the $c_i$ are predetermined kernel prototypes, which can be obtained by either nonlinear PCA [12] or clustering techniques such as FSCL [13], as shown in [14], and

$$\Psi = [\phi(c_1), \phi(c_2), \ldots, \phi(c_R)] \quad (24)$$

If we replace the definition of $w$ in (23) into (20), multiply it by $\begin{bmatrix} \Psi^T & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix}$, and reorder terms, we obtain the following system of linear equations

$$\begin{bmatrix} \Psi^T\Phi D_a\Phi^T\Psi + \Psi^T\Psi & \Psi^T\Phi a \\ a^T\Phi^T\Psi & a^T\mathbf{1} \end{bmatrix} \begin{bmatrix} \beta \\ b \end{bmatrix} = \begin{bmatrix} \Psi^T\Phi D_a y \\ a^T y \end{bmatrix} \quad (25)$$

where every element of the kernel matrices $K = \Phi^T\Psi$ and $I_\Psi = \Psi^T\Psi$ is a kernel evaluation,

$$(K)_{ij} = \phi^T(x_i)\phi(c_j) = K(x_i, c_j), \quad \forall i = 1, \ldots, \ell, \; \forall j = 1, \ldots, R$$

$$(I_\Psi)_{ij} = \phi^T(c_i)\phi(c_j) = K(c_i, c_j), \quad \forall i, j = 1, \ldots, R$$

such that the solution can now be obtained in terms of kernel evaluations alone. Note that $\beta$ is of finite dimension $R$, and the system in (25) can be solved using any LS method, for instance, the Moore-Penrose pseudoinverse²

$$\beta = (K^T D_a K + I_\Psi)^{-1} K^T D_a y \quad (26)$$

Note that when we introduce (23) into the model $\hat{y} = \mathrm{sign}(\phi^T(x)w)$ we obtain the following compact architecture for the classifier:

$$\hat{y} = \mathrm{sign}\big([\phi^T(x)\Psi]\beta\big) = \mathrm{sign}\left(\sum_{i=1}^{R} \beta_i K(c_i, x)\right) \quad (27)$$

The USVC algorithm can be summarized as follows:

0.- Initialize $a_i = 1$.
1.- Determine an appropriate set of kernel prototypes $c_i$.
2.- Compute the kernel matrices $K$ and $I_\Psi$.
3.- Solve the WLS problem using (26).
4.- Obtain the new weighting values $a_i$ using (22).
5.- Go back to 3 and repeat until the fixed point is attained.

In what follows, to avoid confusion about which particular algorithm is being used, we will refer to the unbiased algorithm for parametric machines as USVC, and to its counterpart for nonparametric SVM-like solutions as ERM-SVC [15].

## 3. On USVC and ERM-SVC Convergence

The convergence of the WLS procedure for solving the SVC is shown in [16] for the particular choice of the same loss function as the QP-SVC. This demonstration can be directly applied to any other loss function, such as the one proposed for the USVC. We will just highlight the key point of the demonstration without going into deeper detail. First, in each iteration of the WLS procedure the resolution of (25) (third step of the WLS algorithm) guarantees that the functional in (5) is minimized. Second, the computation of $a_i$ in the fourth step of the WLS algorithm is done by fulfilling all the KKT conditions except (9) and (10). When the values of $\beta$ and $b$ are not modified in a given iteration, equation (25) can be simplified to (9) and (10), such that, at this point, all KKT conditions hold and we are at the USVC solution.

To illustrate the convergence properties of both the WLS-SVC and USVC algorithms we have defined a synthetic problem, such that a different data set can be generated in every trial following the same data distribution and, therefore, statistical relevance can be extracted from the experiments. We consider a binary classification problem built from a combination of Gaussian and uniform distributions, and we define two different scenarios, cases A and B.
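Pulling steps 0–5 of the Section 2.1 algorithm together, an illustrative implementation (ours, not the authors') might look as follows. Several simplifications are assumed: a random subset of the training set stands in for the FSCL prototypes, the bias $b$ is omitted (as in (26)), an RBF kernel with a fixed width is used, and the weight clipping and the guard at $e_i y_i = 0$ are our numerical choices:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix: K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def usvc_fit(X, y, R=10, C=10.0, s=2, k_l=1e-6, sigma=1.0,
             n_iter=50, damping=0.2, seed=0):
    """Iterate steps 2-5: solve the WLS problem (26), then reweight via (22)."""
    rng = np.random.default_rng(seed)
    cen = X[rng.choice(len(X), size=R, replace=False)]  # stand-in prototypes
    K = rbf_kernel(X, cen, sigma)        # (ell, R) kernel matrix Phi^T Psi
    I_psi = rbf_kernel(cen, cen, sigma)  # (R, R) regularizer Psi^T Psi
    a = np.ones(len(X))                  # step 0: a_i = 1
    beta = np.zeros(R)
    kq = k_l ** (1.0 / s)
    cs = (1.0 + kq) ** s
    for _ in range(n_iter):
        Da = np.diag(a)
        beta_new = np.linalg.solve(K.T @ Da @ K + I_psi, K.T @ Da @ y)  # Eq. (26)
        beta = damping * beta + (1.0 - damping) * beta_new  # damped update
        ey = y * (y - K @ beta)          # e_i y_i = 1 - y_i f(x_i)
        xi = np.maximum(ey, 0.0)
        dtheta = np.where(xi < 1.0,      # derivative of the smooth cost (6)
                          s * (xi + kq) ** (s - 1) / (2.0 * cs),
                          s * cs / (2.0 * (xi + kq) ** (s + 1)))
        a = np.where(ey > 1e-12,
                     np.clip(2.0 * C * dtheta / np.maximum(ey, 1e-12), 0.0, 1e4),
                     0.0)
    return cen, beta

def usvc_predict(X, cen, beta, sigma=1.0):
    """Compact classifier of (27): sign(sum_i beta_i K(c_i, x))."""
    return np.sign(rbf_kernel(X, cen, sigma) @ beta)
```

The damped update mirrors the $w_k = d\,w_{k-1} + (1-d)w$ scheme used in the experiments below; without damping, the large weights produced near the margin can make the fixed-point iteration oscillate.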
In case A we have

$$f(x \mid y = 1) = \tfrac{1}{4}N(0.75, 0.15) + \tfrac{3}{4}N(0.2, 0.02); \quad P(y = 1) = \tfrac{2}{3} \quad (28)$$

$$f(x \mid y = -1) = U(0, 0.5); \quad P(y = -1) = \tfrac{1}{3}$$

and in case B

$$f(x \mid y = 1) = \tfrac{1}{1.4}U(0.7, 0.95) + \tfrac{0.4}{1.4}N(0.25, 0.02); \quad P(y = 1) = 0.7 \quad (29)$$

$$f(x \mid y = -1) = U(0, 0.5); \quad P(y = -1) = 0.3$$

where $N(\mu, \sigma)$ represents a Gaussian probability density function (pdf) with mean $\mu$ and standard deviation $\sigma$, and $U(x_1, x_2)$ a uniform pdf on $(x_1, x_2)$. These two scenarios are represented in Fig. 2. We select the 'a priori' probabilities $P(y = +1)$, $P(y = -1)$ such that the classification error achievable using a single decision threshold ($Th$) has a strong local minimum around $x = 0.17$ and a global minimum at $x = 0.5$ ($CE_{min} = 20\%$) in case A, as shown in (c). In case B, we have a 'narrow' global minimum at $x = 0.21$ (with $CE_{min} = 17\%$) and a 'wide' local minimum around 0.6 ($CE = 20\%$), as can be observed in (d). Although very simple, these scenarios represent a fairly common situation in real-world problems, where a portion of the patterns are immersed in the wrong class, which, unless special care is taken, may induce a bias in the solution when either linear or nonlinear machines are used (especially when the maximum complexity of the machines is limited).

We generate 500 training and 500 testing patterns, and we apply the SVC ($C = 10$) and USVC ($C = 10$) algorithms with a linear machine, and with a "damped" weight update between iterations ($w_k = d\,w_{k-1} + (1 - d)w$, with $d = 0.2$ and $w$ being the result of the pseudoinverse in (26)).

We carry out 100 independent trials of every experiment using a linear classifier; the resulting CE figures (mean and standard deviation) and decision thresholds ($Th = -b/w$) are shown in Table 1. On average, the CE obtained with USVC in cases A and B is around 20% (a great reduction with respect to the values attained using the SVC).
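The mixtures above are straightforward to sample; a sketch of the generator for scenario A of Eq. (28), written by us for illustration, is:

```python
import numpy as np

def sample_scenario_a(n, rng):
    """Draw n labelled samples from the case-A mixture of Eq. (28):
    class +1 (prior 2/3) is 1/4 N(0.75, 0.15) + 3/4 N(0.2, 0.02);
    class -1 (prior 1/3) is U(0, 0.5). Note that N(mu, sigma) takes
    the standard deviation, following the paper's convention."""
    y = np.where(rng.random(n) < 2.0 / 3.0, 1, -1)
    x = np.empty(n)
    pos = y == 1
    comp = rng.random(pos.sum()) < 0.25      # which Gaussian component
    x[pos] = np.where(comp,
                      rng.normal(0.75, 0.15, pos.sum()),
                      rng.normal(0.20, 0.02, pos.sum()))
    x[~pos] = rng.uniform(0.0, 0.5, (~pos).sum())
    return x, y

rng = np.random.default_rng(0)
x, y = sample_scenario_a(500, rng)  # one 500-pattern training set
```

Scenario B of Eq. (29) follows the same pattern, with the uniform and Gaussian components and the priors swapped in accordingly.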
With respect to the threshold values found in every case, it is worth mentioning that, in case A, USVC converges to values very close to $Th = 0.5$ (for both $s = 2$ and $s = 4$), therefore finding the minimum classification error solution. We have observed that the algorithm is not very sensitive to the exact value of $s$, as long as it is large enough; therefore we will not further explore this aspect, and we will use $s = 2$ throughout the paper. Since scenarios A and B are unidimensional and the linear classification machine is represented by only two parameters, it is
possible to graphically analyze the convergence of the algorithms, as represented in Fig. 3, where error surfaces, CE functions and trajectories for a single trial are shown for both SVC and USVC.

*Figure 2. Data distribution for synthetic data scenarios A (a) and B (b), where the class "+1" distribution has been plotted with a continuous line and that of class "−1" with a dashed one. The corresponding theoretical CE curves have also been plotted in (c) and (d), respectively.*

The error surface³ $\tilde{L}_P(Th, w)$ and convergence trajectory for both the SVC⁴ and USVC algorithms solving problem A are depicted in (a) and (c), respectively. It can be observed in (a) how, starting at point 'o' (the Least Squares (LS) solution), the WLS-SVC algorithm converges to the QP-SVC solution (marked 'x'). Any other initialization of the algorithm could alternatively be used, because it has been shown in [1] that the associated error surface is convex with a single minimum. The cross-section represented by a dashed line in (a) is the locus of the minima of $\tilde{L}_P(Th, w)$ for fixed values of $Th$, for the SVC case. This profile is relevant, because it represents the CE function that the algorithm is actually minimizing. It has been plotted in (b) for the SVC case, where it can be observed that it is a poor estimate of the CE. The theoretical CE curve as a function of $Th$ is represented with a continuous line, and the empirical estimate of the CE is plotted with a dotted line. The QP-SVC minimum (around $Th = 0.35$, marked 'x') is clearly biased with respect to the optimal solution ($Th = 0.5$).

In (c) we can see the convergence of the USVC algorithm, also starting at the LS solution (the trajectory
of WLS-SVC is kept as a reference). It can be observed that the error surface is no longer strictly convex (there is a local minimum around (0.15, 12)), but the absolute minimum is located at $Th = 0.5$, coinciding with the optimal theoretical value. The cross-section represented in (d) with a dashed line provides a much better approximation to the theoretical CE curve than in the SVC case.

*Table 1. Results for experiment 1, using a linear classifier trained with regularized LS, SVC and USVC (s = 2, 4) algorithms. Mean and standard deviations (in brackets) of CE and threshold values are shown, as well as the optimal (theoretical) ones.*

| | Optimal values | LS (reg.) | SVC | USVC (s = 2) | USVC (s = 4) |
|---|---|---|---|---|---|
| **Case A** | | | | | |
| CE (%) (Train) | 20 | 30.5 (2.9) | 26.8 (2.6) | 19.9 (1.7) | 21.1 (2.0) |
| CE (%) (Test) | 20 | 30.7 (2.7) | 27.1 (2.4) | 20.4 (1.9) | 21.8 (2.6) |
| Th | 0.5 | 0.3 (0.02) | 0.35 (0.02) | 0.49 (0.06) | 0.47 (0.1) |
| **Case B** | | | | | |
| CE (%) (Train) | 17 | 31.8 (1.4) | 24.8 (2.6) | 19.7 (1.8) | 20.1 (5.2) |
| CE (%) (Test) | 17 | 31.7 (1.9) | 25.5 (2.4) | 19.8 (1.9) | 19.7 (5.4) |
| Th | 0.6 | 0.3 (0.02) | 0.4 (0.02) | 0.58 (0.01) | 0.56 (0.02) |

The same convergence analysis has been carried out for scenario B, and the results can be observed in Fig. 4. In this case the QP-SVC solution is also biased, because the minimum of $\tilde{L}_P(Th, w)$ (marked 'x' in (a) and (b)) is not placed at the optimal (maximal margin) value 0.6. The theoretical CE curve presents, in this case, a global minimum at $Th = 0.2$. Nevertheless, it can be observed in (c) how the error surface $\tilde{L}_P(Th, w)$ that the USVC algorithm tries to minimize presents a global minimum at $Th = 0.6$, marked '∗' in (d). This threshold value provides a 'safer' solution to the problem, in the sense that the final performance is less sensitive to (possible) perturbations in the data. It can be observed in (c) how USVC, unlike the SVC, converges to this solution.
### 3.1. Discussion About the Benefits of Unbiased SVCs

We have shown that the SVC can be solved with a loss function closer to the step function than the one proposed in [1], which guarantees that the solution will asymptotically tend to the solution with the minimum number of errors (the Bayes classifier). But, when working with a finite set, one may ask how close is close enough or, alternatively, how much the solution is biased when using the SVC instead of the USVC. The answers to these questions are not trivial, and we will not attempt to simplify the problem in order to arrive at an analytical solution. Instead, we will give some heuristic rules that illustrate in which cases the USVC may improve on the SVC solution.

First of all, the number of training samples directly affects the benefits of using the USVC (an asymptotic property), or, more precisely, the number of erroneously classified training samples, which are the samples that bias the SVC solution. Secondly, the larger the values of $\xi_i$, the better the USVC will perform, because the samples with larger $\xi_i$ are the samples worst represented by the SVC L1-norm loss function, i.e., the samples that bias the SVC solution the most (see Fig. 1). As linear kernels present larger values of $\xi_i$ than RBF kernels, and the value of $\xi_i$ increases with the width of the RBF kernel used (similar arguments can be made for any other kernel), one finds that the benefits of the USVC appear when the VC dimension is low.⁵ Basically, the USVC works best when the number of training samples increases and the VC dimension of the classifier decreases.

We must also point out that the USVC does not always improve on the SVC when working with finite data sets. This fact can be explained if one sees the loss-function selection as an extra "regularizing" parameter (although it is not actually one). The SVC has a true regularizing parameter in the norm of $w$.
In regularization theory [18], it is shown that adding a non-negative convex functional to the empirical error helps control the machine complexity (generalization ability). As the number of samples increases, the empirical error becomes more reliable, and the weight of the regularizing functional is reduced towards zero at a given rate as the number of samples tends to infinity (see [18] for further details). The degree of similarity between the loss function used and the step loss operates in a similar way. With few samples the use of the Vapnik loss is preferred over the USVC, because the latter might provide a lower empirical error but worse generalization. But as the number of samples increases, the empirical error tends to the true error, and the use of the Vapnik loss biases the minimum error solution, because it is not truly an error-counting term. As the number of
samples increases, we must therefore modify the Vapnik loss function into a shifted step loss function that truly minimizes the empirical error.

*Figure 3. Convergence analysis of the SVC and USVC algorithms in case A. Trajectories over the error surface $\tilde{L}_P(Th, w)$ are shown in (a) and (c), as well as the CE functions being minimized in each case, in (b) and (d).*

## 4. Experiments

### 4.1. Comparison with Other Machine Learning Algorithms

We use a collection of datasets from the UCI Machine Learning repository,⁶ normalize the input variables to have zero mean and unit standard deviation, use 10 random splits of the data (80% for training and 20% for testing), and find the optimal parameters for every algorithm and split using 5-fold cross-validation. In the experiments that follow, we concentrate on measuring Classification Error (CE) and machine size, to reveal the potential benefits of using a different cost function. We do not address the computational cost here, since the iterated WLS approach for solving SVMs has already been benchmarked in [24], with favorable results with respect to state-of-the-art SVM technology. Since the main modification of USVC is the different computation of the weighting values, little difference is obtained with respect to training cost. We
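The evaluation protocol above (standardization plus random 80/20 splits) can be sketched as follows. This is our reading of the protocol, not the authors' code; in particular, we compute the normalization statistics on the training part only, a defensible choice that avoids leaking test information into the normalization:

```python
import numpy as np

def normalize_and_split(X, y, train_frac=0.8, rng=None):
    """One random 80/20 split with per-feature standardization to zero
    mean and unit standard deviation, the statistics being estimated on
    the training part and then applied to both parts."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    n_tr = int(train_frac * len(X))
    tr, te = idx[:n_tr], idx[n_tr:]
    mu = X[tr].mean(axis=0)
    sd = X[tr].std(axis=0) + 1e-12  # guard against constant features
    return (X[tr] - mu) / sd, y[tr], (X[te] - mu) / sd, y[te]
```

Repeating this with 10 different random generators yields the 10 splits used in the tables below; model parameters would then be selected by 5-fold cross-validation within each training part.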
compare QP-SVC (SVMlight, [20]), Adaboost⁷ [21] and the USVC and ERM-SVC algorithms, with both linear and nonlinear (RBF) machines. The results for the linear machine are detailed in Table 2.

*Figure 4. Convergence analysis in case B for SVC ((a) and (b)), USVC with s = 2 ((c) and (d)) and USVC with s = 4 ((e) and (f)). Trajectories over the error surface $\tilde{L}_P(Th, w)$ are shown in (a) and (c), as well as the CE functions for every algorithm, in (b) and (d).*

It can be observed how USVC performs better than QP-SVC (SVMlight) in several cases ('Image', 'Ringnorm' and 'Thyroid'); in the remaining datasets it presents a performance similar to QP-SVC, and it is never worse. Both SVC and USVC/ERM-SVC usually outperform Least Squares (LS). The linear machine case is more than a simple "baseline experiment": very often the input data is already in a high-dimensional space and training patterns are scarce, so a linear machine is simple to train and provides good results [22, 23]. We believe that this algorithm could be very useful in those data mining scenarios.

To implement the kernelized USVC algorithm, we first applied the Frequency Sensitive Competitive Learning (FSCL) algorithm [13] to find the R centroids needed in every case (parameter R has also been found using 5-fold cross-validation). The final performance of the RBF kernel machines trained with USVC is shown in Table 3, together with the results for SVMlight, ERM-SVC and Adaboost. We observe, in this nonlinear case, how the USVC is also able to improve some of the QP-SVC results,
namely in the cases 'Diabetes', 'Thyroid' and 'Waveform', obtaining similar (never worse) results in the remaining datasets.

*Table 2. Results for the real-world datasets, using a linear model trained with LS, SVMlight, and USVC/ERM-SVC algorithms. CE is shown (mean ± std. dev.).*

| | LS | SVMlight | USVC |
|---|---|---|---|
| Diabetes | 24.7 ± 2.6 | 24.2 ± 2.2 | 24.2 ± 3.3 |
| Image | 16.2 ± 1.2 | 15.8 ± 0.8 | 12.1 ± 1.1 |
| Ringnorm | 23.3 ± 1.1 | 23.4 ± 1.1 | 22.9 ± 1.1 |
| Thyroid | 17.7 ± 5 | 9.1 ± 4.5 | 7.6 ± 3.8 |
| Twonorm | 2.2 ± 0.3 | 2.2 ± 0.3 | 2.2 ± 0.2 |
| Waveform | 12.0 ± 0.7 | 11.5 ± 0.6 | 11.4 ± 0.5 |

*Table 3. Results for the RBF classifiers trained with SVMlight, Adaboost and the unbiased algorithms. CE is shown (mean ± std. dev.), as well as the machine size in every case.*

| | | SVMlight | Adaboost | USVC |
|---|---|---|---|---|
| Diabetes | CE (%) | 24.5 ± 2.3 | 27.2 ± 3.7 | 22.4 ± 3.1 |
| | Mach. size | 352 ± 12 | 3000 ± 0 | 40 |
| Image | CE (%) | 3.7 ± 0.6 | 3.6 ± 0.9 | 3.7 ± 0.8 |
| | Mach. size | 285 ± 14 | 17116 ± 1370 | 120 |
| Ringnorm | CE (%) | 1.8 ± 0.4 | 1.9 ± 0.4 | 1.8 ± 0.6 |
| | Mach. size | 615 ± 25 | 2400 ± 0 | 90 |
| Thyroid | CE (%) | 3.2 ± 2.9 | 3.7 ± 3.2 | 2.6 ± 3.2 |
| | Mach. size | 25 ± 4 | 1534 ± 143 | 40 |
| Twonorm | CE (%) | 2.2 ± 0.9 | 2.9 ± 0.4 | 2.2 ± 0.8 |
| | Mach. size | 565 ± 12 | 2400 ± 0 | 12 |
| Waveform | CE (%) | 8.9 ± 0.8 | 9.1 ± 0.9 | 8.1 ± 1.1 |
| | Mach. size | 1286 ± 25 | 2000 ± 0 | 30 |

*Table 4. Results for the biomedical datasets, using a linear model trained with LS, SVMlight, and the ERM-SVC/USVC algorithms. CE is shown (mean ± std. dev.).*

| | LS | SVMlight | ERM-SVC | USVC |
|---|---|---|---|---|
| Breast | 4.16 ± 2.1 | 4.17 ± 2.2 | 3.74 ± 2.9 | 3.73 ± 2.8 |
| Dermat | 0.28 ± 0.7 | 0.28 ± 0.7 | 0.28 ± 0.79 | 0.28 ± 0.7 |
| Haberman | 25.9 ± 8.7 | 27.3 ± 6.1 | 24.7 ± 6.28 | 24.3 ± 6.8 |
| Postoperative | 29.5 ± 17.3 | 28.4 ± 18.5 | 27.3 ± 18.8 | 27.2 ± 18.46 |
| Heart disease | 16.55 ± 3.3 | 16.8 ± 3.4 | 17.6 ± 5.6 | 15.5 ± 3.1 |
| Hepatitis | 12.5 ± 7.4 | 15.1 ± 6.5 | 19.1 ± 6.2 | 12.5 ± 5.4 |
| Liver disorder | 30.5 ± 4.1 | 31.1 ± 3.7 | 34.3 ± 10.8 | 30.2 ± 7.4 |

Furthermore, note how it usually outperforms Adaboost, since the generalization criterion that
SVC theory imposes on the solution is really benefiting the USVC algorithm. Finally, note that, in some cases, Adaboost and SVMlight produce very large machines (1200, 2400, or even 17000 RBF kernels per classifier), which does not happen with USVC, where the semiparametric nature of the imposed model guarantees a certain (moderate) complexity of the solution (from 12 to 120 RBF kernels), thereby obtaining machine complexity reductions of up to two orders of magnitude.

### 4.2. Unbiased SVCs for Biomedical Data

Finally, we have chosen several data sets with medical or biological content that are characterized by having very few examples, which makes the use of USVC even harder. At the same time, we have restricted ourselves to linear machines because we wanted to facilitate the explainability of the achieved results. The results are shown in Table 4, where we have also included the ERM-SVC case (an unbiased SVC procedure that yields a nonparametric solution and uses a sigmoidal loss function [15]). The USVC is equal to or better than the alternatives on most data sets because, in this case, the complexity of the machine is fixed, in spite of the small number of patterns in the training sets. In the 'Dermat' case all algorithms attain the same solution, since the training set is linearly separable and no error penalty terms are present. The ERM-SVC should lead to the same solution as the SVC, but the parameters chosen for 'Heart disease', 'Hepatitis' and 'Liver disorder' were such that it obtained a lower empirical (training) error but a higher testing error. This can be explained because the parameters of the ERM-SVC loss function (of sigmoidal type) lead to a much closer