The tangent kernel SVM for calibration-stable histogram classification
Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Mekelweg 4, 2628 CD Delft, The Netherlands

Keywords: Similarity; Invariant representation; Classification; Kernels; SVM
Abstract

In many practical situations the precise calibration of a sensor is difficult or impractical. If the histograms of initial measurements are used as an input of a recognition system, this leads to noisy affine transformations of their arguments. The straightforward way of histogram standardization is not suitable because of the presence of different histogram classes. We studied the tangent kernel approach to build a similarity measure robust to imprecise calibration. The technique is tested on both artificial and real-world data. Advantages and difficulties are discussed.

Introduction

In this paper we study how prior knowledge about the noise in a sensor helps to build a better classifier. We consider a histogram-like class of data which consists of, but is not limited to, histograms themselves (biomedical data [5], color histograms of conventional images, channel histograms of multi-spectral images, etcetera) and any other data represented by non-negative vectors with a fixed sum of elements (say, equal to one), e.g. normalized spectra. It is assumed that the imprecisions in the calibration are small and can be considered as a global shift and scaling, i.e. affine imprecisions. The problem is solved by developing an appropriate tangent kernel SVM classifier.

In solving pattern recognition problems one may consider at least two possibilities to improve the classification performance: larger training sets and the use of prior knowledge. The first way is universal, while the second depends on the problem at hand. Prior knowledge helps to eliminate irrelevant variability of the data and thus lightens the classification task. Examples of prior knowledge: class-conditional distributions are Gaussian, hand-written letters are invariant to (not too large) rotations, some shape properties are invariant to affine transformations.

There are different ways of taking prior knowledge into account. First, one may build a relevant probability density model. Another approach consists of extracting features which are invariant or robust to transformations which have nothing to do with class differences. This approach can be rethought of as a modification of the definition of a similarity measure between objects. More specifically, one can use the tangent distance [1, 2] (dissimilarity) or the tangent kernel [3, 4] (similarity).

A short review of the tangent kernel technique is presented in the next section. Then, in section 3, we define the problem in question and possible ways of solving it. Section 4 describes the datasets. This is followed by the explanation and discussion of the numerical experiments (section 5) and the conclusions.

Review on the tangent kernel technique

This section is a short account of the ideas from [3, 4] adapted to our goals. Suppose that we have a two-class classification problem: given an object x ∈ R^d, a label y ∈ {−1, +1} should be assigned, defining to which one of the two classes the object belongs. In other words, the task consists of building a classification function f: R^d → {−1, +1}. Let us assume that f(x) = sign(g(x)), where g: R^d → R is a smooth discriminant function.

Suppose also that we are given the prior knowledge that a one-parametric family of transformations L_t: R^d → R^d,

    L_0 x = x,   L_{t1}(L_{t2} x) = L_{t1+t2} x,                    (1)

does not change the class membership of an object.¹ Thus it seems natural to demand that the classification function is invariant under these transformations:

    f(L_t x) = f(x)  for all t.                                     (2)

To simplify the task one can demand that this property holds not only for f but also for g, i.e. g(L_t x) = g(x), which is stronger. Now, for the smooth function g it is possible, using the properties of the transformation L_t, to reformulate this invariance property as

    d/dt g(L_t x) |_{t=0} = 0.

After rewriting the left side, it transforms into

    (∇g(x))^T M x = 0,   M x = d/dt L_t x |_{t=0},

for all x. Note that M is not necessarily a linear operator.

The typical pattern recognition setup of finding a discriminant function consists of solving the minimization problem

    g* = argmin_g R(g; X, y),

where R is the learning criterion (e.g. the inverse margin, the noise-to-signal ratio, etc.), X = (x_1, ..., x_N)^T is the training data set and y = (y_1, ..., y_N)^T are the labels of the training objects. In order to incorporate the prior knowledge into this approach one should perform the minimization under the additional restriction (2). On the other hand, instead of requiring the strict equality (2), it is possible to minimize the weighted sum of the original criterion and a regularizer:

    R_γ(g; X, y) = (1 − γ)R(g; X, y) + γ r(g; X),                   (3)

where the regularizer penalizes the derivative of g along the transformation,

    r(g; X) = (1/2N) Σ_{i=1..N} [(∇g(x_i))^T M x_i]².

Note that r does not depend on the object labels y. Thus, additional unlabeled objects may be used for its estimation.

The formulation given by Eq. (3) is theoretically suitable for any learning criterion R and any smooth function g. To make the problem tractable one needs to select some specific form of them. We selected the linear discriminant function g(x) = w^T x + b, for which the regularizer becomes

    r(g; X) = (1/2) w^T C w,   C = (1/N) Σ_{i=1..N} (M x_i)(M x_i)^T.   (4)

As a learning algorithm the SVM classifier [6] was selected. After the substitution of the conventional SVM criterion R = ‖w‖²/2 and Eq. (4) into Eq. (3), the new criterion reads R_γ = w^T[(1 − γ)I + γC]w / 2. After the change of variables w̃ = [(1 − γ)I + γC]^{1/2} w we recover the original form of the SVM regularizer ‖w̃‖²/2. The discriminant function is also invariant under this change if the objects are transformed accordingly: g(x) = w̃^T x̃ + b with x̃ = [(1 − γ)I + γC]^{−1/2} x. Thus, in order to implement the technique we need only to redefine the matrix of inner products (kernel matrix):

    k(x, y) = x^T [(1 − γ)I + γC]^{−1} y.                           (5)

So, this approach can be interpreted as the selection of a new similarity measure that is robust to the unwanted transformations.

If there is an overlap between the classes, then a soft-margin version of the SVM algorithm should be used. The modifications are straightforward and do not change Eq. (5). For the sake of brevity we use the abbreviation LTK-SVM (Linear Tangent Kernel SVM) for the soft-margin SVM algorithm with the kernel defined by Eq. (5).

¹ Actually, the set of equations (1) defines a one-parametric Abelian (commutative) group of transformations.
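To make Eq. (5) concrete, the sketch below computes the modified Gram matrix in plain NumPy. It is only an illustration of the kernel redefinition, not the authors' implementation; the tangent vectors M x_i are passed in as a matrix (here filled with random stand-ins).

```python
import numpy as np

def ltk_gram(X, tangents, gamma):
    """Gram matrix of the linear tangent kernel, Eq. (5):
    k(x, y) = x^T [(1-gamma) I + gamma C]^{-1} y,
    where C is the tangent covariance matrix built from the
    tangent vectors M x_i (one row per training object)."""
    N, d = X.shape
    C = tangents.T @ tangents / N        # C = (1/N) sum_i (M x_i)(M x_i)^T
    B = (1.0 - gamma) * np.eye(d) + gamma * C
    # Solve a linear system instead of forming the inverse explicitly.
    return X @ np.linalg.solve(B, X.T)

# Toy check with random stand-in tangent vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
T = rng.normal(size=(5, 3))              # placeholder tangent vectors M x_i
K0 = ltk_gram(X, T, gamma=0.0)
assert np.allclose(K0, X @ X.T)          # gamma = 0: ordinary linear kernel
```

At γ = 0 the construction falls back to the ordinary linear kernel; as γ grows, directions spanned by the tangent vectors are increasingly discounted in the inner product.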
It is possible to derive a similar result for a non-linear discriminant function g which can be represented in a kernelized form (and thus for SVMs with non-linear kernels) [3]. In such a situation one must typically deal with a subspace of an infinite-dimensional space, which means that kernel PCA [3] should be used for the calculation of the matrix inversion.

Robustness to linear imprecisions in sensor calibration

Suppose that the input of the recognition system is a normalized one-dimensional histogram x ∈ R^d with

    x_j = (1/N) #{ η_i : e_{j−1} ≤ η_i < e_j },  j = 1, ..., d.     (6)

Here {η_1, ..., η_N} are samples of the random variable η and {e_0, ..., e_d} are the edges of the histogram bins. Let us assume that η can be expressed as

    η = a ξ + b,                                                    (7)

where ξ is a measurement made on a perfectly calibrated device, θ is the set of parameters of the distribution of ξ, and a > 0 and b define the possible affine calibration imprecision. The parameters a, b and θ randomly change from one histogram to another, introducing the calibration noise (a and b) and the "biological" intra-class and inter-class variability of the histograms (θ).

For the sake of simplicity, we assume that the number of samples and the bin sizes are selected in such a way that we can treat the histogram as a good approximation of the probability density of η. It means that there are undesirable shift and scaling of the histogram argument induced by a and b. Below we consider possible ways of dealing with this problem.

Histogram standardization

Histograms with different (a, b) values can be put on top of each other if we scale them in such a way that the means and the variances of the distributions represented by them are equal to some predefined values. This approach would solve the problem if there were no "biological" variability between the histograms and only one class of histograms were present. After applying standardization, it may in general happen that all histograms from the same class have approximately the same scaling (if the "biological" variability is not too large) but histograms from different classes are scaled differently. Two possible ways to deal with these problems are: 1) select the common mean and variance for all classes in such a way that the inter-intra distance or some other separability criterion is optimized; 2) combine the scaling with a (preliminary) classification, i.e. decide to which class the histogram belongs before standardization (which in fact implies a vicious circle). Another approach [7] makes use of a dissimilarity representation [8] in which the scaling is dynamically selected to guarantee the smallest distance between two histograms.

Extraction of invariant features

It is possible to eliminate the calibration dependence of a histogram by mapping it to a feature space invariant to the unwanted changes in scaling and shifting (the above-mentioned histogram standardization is a particular case of such a mapping). It can be done e.g. by the calculation of the mathematical expectation of a function of the normalized argument,

    u_i = E[ p_i((η − µ)/σ) ],

where µ and σ are the mean and the standard deviation of the distribution defined by the probability density of η. Such features are invariant to the calibration. The main problem of this method is how to choose the appropriate functions p_i: an improper selection of the particular set {u_1, ...} can lead us to a problem formulation in terms of features for which we have no good class separability.

Derivation of the tangent kernel

In this article we apply the alternative approach based on the tangent kernel technique. Unlike in the histogram standardization, we do not change the support of the histogram and we also do not need explicit invariant features. In our case there are two invariant transformations: the shift and the scaling of the histogram argument. In reality we deal with histograms, not with densities. Thus we should redefine the M operators in a discrete form:

    M_tr x = ∆x,   M_sc x = Z ∆x.

Here, Z is the diagonal matrix whose diagonal elements contain the positions of the histogram bin centers and ∆ is the operator calculating the (smoothed) finite differences of the histogram x as a function of the histogram bin centers. Thus, we have the combined tangent covariance matrix

    C = (1 − δ)C_tr + δC_sc,

with C_tr and C_sc built from M_tr x_i and M_sc x_i as in Eq. (4). The parameter δ ∈ [0, 1] defines which one of the two transformations is given more weight. Finally, the new similarity measure reads as

    k(x, y) = x^T {(1 − γ)I + γ[(1 − δ)C_tr + δC_sc]}^{−1} y.
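A minimal sketch of this construction, under two assumptions not fixed by the text: the bin centers are supplied by the caller, and plain central differences via numpy.gradient stand in for the smoothed finite-difference operator ∆.

```python
import numpy as np

def tangent_covariances(X, centers):
    """C_tr and C_sc for the shift/scale tangents of histograms.
    Central differences stand in for the smoothed (Savitzky-Golay)
    derivative used in the paper."""
    N, d = X.shape
    T_tr = np.gradient(X, centers, axis=1)   # M_tr x = Delta x
    T_sc = T_tr * centers                    # M_sc x = Z Delta x
    C_tr = T_tr.T @ T_tr / N
    C_sc = T_sc.T @ T_sc / N
    return C_tr, C_sc

def ltk_kernel(X, Y, C_tr, C_sc, gamma, delta):
    """k(x, y) = x^T {(1-gamma)I + gamma[(1-delta)C_tr + delta C_sc]}^{-1} y."""
    d = X.shape[1]
    B = (1 - gamma) * np.eye(d) + gamma * ((1 - delta) * C_tr + delta * C_sc)
    return X @ np.linalg.solve(B, Y.T)
```

ltk_kernel mixes the two tangent covariances with δ and the identity with γ exactly as in the final formula above.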
Data description

Real-world data: flow cytometry

In our study we used a real-world flow cytometric DNA content data set (RD). The flow cytometric DNA analysis is performed on cells stained with an appropriate fluorochrome specific for DNA. The population of such cells passes an excitation source and the fluorescent light emitted by the stained cells is registered by a photomultiplier tube (PMT). The result of an experiment is a histogram of these signals recorded for a population of cells. Provided an appropriate device calibration, this histogram can be interpreted as the distribution of the DNA content over the cells.

The data set used was kindly provided by Dr. Marius Nap and Nick van Rodijnen from the Pathology Department of the "De Wever" Hospital, Heerlen, The Netherlands. The data were obtained from multiparameter flow cytometric (MP-FCM) DNA analysis of breast cancer samples that were collected consecutively during routine diagnostic procedures from January 2000 until November 2004. The resection specimens of the breast cancer samples were all routinely fixed in formalin and embedded in paraffin for diagnostic workup. One paraffin block of representative tumor tissue was selected for MP-FCM and used to create a single cell suspension according to a procedure described in [9]. The DNA content of the single cell suspension was stained with propidium iodide (PI) and contamination with RNA was prevented by pretreatment of the solution with RNase. The measurements were performed on a DAKO-galaxy flow cytometer with 4 photomultiplier tubes (PMTs), provided by DAKO Cytomation. The data were stored in the FCS2.0 format [10]. The resolution of the DNA histograms was scaled from 4096 values to 256 by applying the unbiased summation method [5]. All the corrections for signal crossover (from two other dyes), aggregation and debris in the raw FCS2.0 files were manually made by an experienced operator using the software package Flomax 2.2 (PARTEC). The labeling was performed on the compensated data in Modfit LT 2.0 (Verity Software House, Inc.) on the DNA histograms. All histograms were first analyzed in an automatic mode and manually re-analyzed if the operator considered the outcome of the automated model unsatisfactory. Despite the intention to measure 20 000 counts per FCS2.0 file, this was not always possible due to variations in the amount of single cells obtained from each tumor sample.

The typical DNA content histograms are shown in Figure 1. One can observe a number of peaks. In healthy tissues, or when a disease does not change the distribution of the DNA content, the first peak corresponds to diploid cells (normal state) and the second peak is produced by cells with a doubled DNA amount because of the division process (tetraploid cells). The position of the second peak is two times larger than the position of the first one. This describes the diploid type of histogram. It may happen that during division the DNA content of the cell is doubled but cell division does not occur. Thus, some part of the cells stays in the tetraploid state: the height of the first peak is decreased while the height of the second peak is increased. Also, the division of tetraploid cells produces a peak at the position four times larger than the first one. These changes specify the tetraploid class of histograms. It is also possible that two or three cells stick together. This aggregation process is responsible for the changes in the peak heights and the appearance of a peak at a position three times larger than the first one.

In this paper we used the output of the third PMT. The first and the last histogram bins, which contain only debris and noise, were set to zero. No outlier detection was applied. For the experiments 446 histograms were used: 205 from the diploid class and 241 from the tetraploid class.
Figure 1: Flow cytometric real-world data, selected samples.

Synthetic data: mixture of lognormals

The real-world data may contain many sources of variability such as possible mislabelings and device faults. Also, heavy-tailed histograms may be clipped. Because of this we test the proposed technique on a synthetic data set SD which mimics the basic properties of the real-world data described above. The objects of this data set are normalized histograms (see Eq. (6)) of a random variable η which is a mixture of K lognormally distributed random variables {ζ_1, ..., ζ_K} with shift and scaling imprecisions. Namely, a component k is drawn with probability π_k, a value ζ_k is sampled from the corresponding lognormal distribution, and the recorded value is η = t ζ_k + b, where t and b model the calibration imprecision. The weights of the mixture components π_k stay the same for all histograms of the same class and actually define the difference between the classes.

Each of the four components represents one of the peaks of the real-world flow cytometric data. We tried to set the parameters in order to imitate both diploid (class A) and tetraploid (class B) histograms: only the first two components are present in class A histograms, and the second and fourth components are given more importance in class B histograms. To simplify the problem we completely omitted the third peak for both classes.

The number of samples N_s per histogram and the calibration imprecision parameters t and b were randomly drawn for each histogram from the distributions summarized in Table 1. In our experiments we used two types of synthetic datasets: SD0, a synthetic dataset having a perfect calibration, and SDσ, a dataset with σ_t = σ_b = σ. All the synthetic datasets consist of a large test dataset (5000 examples of each class) and 50 training datasets (1000 examples of each class). All the parameters used for the generation of the synthetic datasets are summarized in Table 1.

Figure 2: Selected samples from the synthetic data set without calibration imprecision (SD0). The difference between histograms of the same class is caused only by the intrinsic intra-class variation.
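The generation scheme described above can be sketched as follows. The component parameters and the calibration spread below are illustrative placeholders, not the values from Table 1, and the "biological" noise added to (µ∗, σ∗) is omitted.

```python
import numpy as np

def synth_histogram(rng, weights, mus, sigmas, n_samples=2000,
                    sigma_cal=0.05, edges=np.linspace(0.0, 8.0, 65)):
    """One synthetic histogram: a lognormal mixture distorted by
    affine calibration noise (eta = t * zeta + b), then binned and
    normalized as in Eq. (6). Parameter values are illustrative."""
    k = rng.choice(len(weights), size=n_samples, p=weights)
    zeta = rng.lognormal(mean=np.array(mus)[k], sigma=np.array(sigmas)[k])
    t = rng.lognormal(mean=0.0, sigma=sigma_cal)   # scaling imprecision
    b = rng.normal(loc=0.0, scale=sigma_cal)       # shift imprecision
    eta = t * zeta + b
    counts, _ = np.histogram(eta, bins=edges)
    return counts / counts.sum()
```

A class is then defined by its weight vector, e.g. a "diploid-like" class putting most mass on the first component and a "tetraploid-like" class re-weighting the second and fourth components.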
To introduce the "biological" variation we added Gaussian noise to the component parameters (µ∗, σ∗).

Figure 3: Selected samples from the synthetic data set with calibration imprecisions included (SD0.05). The intra-class variation is larger than that of SD0.

Experiments

We studied the difference in the performance of the conventional linear SVM classifier and its tangent kernel version. To see how useful the usage of prior knowledge can be for different sizes of the training set, learning curves were computed. For the synthetic datasets we used one large test set (5000 objects per class) and 50 independently generated training sets. The results for the real-world dataset were obtained by averaging over 200 hold-out experiments: for each experiment we randomly took out 10% of all objects, and the remaining 90% of the objects were used as the training set. The learning curves were obtained for each independent training set (for the synthetic data) or hold-out experiment (for the real-world data) by training classifiers on a sequence of nested training sets.

Table 1: Synthetic dataset parameters. For the lognormal distributions (sample size N_s and components ζ_k) we have specified not the means and standard deviations of the underlying normal distributions (µ∗, σ∗) but their own (µ, σ), which can be expressed as µ = exp(µ∗ + 0.5σ∗²), σ = [exp(2µ∗ + 2σ∗²) − exp(2µ∗ + σ∗²)]^{1/2}.

In all experiments the ν regularization parameter [11] of the SVM/LTK-SVM classifier was optimized by a grid search over a set of predefined values. For each candidate value a preliminary classifier was trained on 75% of the data selected for the training procedure. The other 25% were used to measure the performance and to select the actual ν value. Using this value, the final classifier was trained on the whole current training set.

The additional parameters of the linear tangent kernel SVM were treated as follows: the δ parameter was always set to 0.5; in one part of the experiments the classifier was calculated at constant γ values, and in the rest of the experiments γ was optimized along with the ν parameter by the same procedure. For the derivative smoothing we used a Savitzky-Golay filter of the first order. The size of the filter window w was set to 5 in the experiments with constant γ values and was optimized by grid search together with ν and γ in the other experiments.

Figure 4 shows the learning curves of the conventional linear SVM and the LTK-SVM. It can be seen that with increasing imprecisions in the calibration the classification performance of the conventional SVM becomes worse. Applying the modified similarity measure alleviates this effect. It is interesting that the usage of the LTK-SVM on the slightly deteriorated data SD0.01 resulted in smaller errors than the application of the conventional SVM to the perfectly calibrated data SD0. This can probably be explained by the fact that the intra-class "biological" variations are component-specific shifts and scalings. It means that our technique is not only able to soften the effect of the imprecise calibration but also to reduce specific types of intra-class variations. This topic deserves additional study.

Figure 4: Comparison of the learning curves of SVM and LTK-SVM on synthetic data sets. In all LTK-SVM algorithms the parameters were set to γ = 0.9, w = 5.

The influence of the selection of the parameters γ and w of the modified similarity measure on the classification error was studied. For the data set with small imprecisions SD0.01 (see Fig. 5) the best results were achieved at γ = 0.9. At small sizes of the training set the learning curve with γ = 0.99 shows the same behaviour, but its performance becomes worse at large training data sets; thus we observe an over-regularization. On the other hand, the learning curve with optimized γ becomes significantly better than the one of the conventional SVM only at large training data sets. This result suggests that the procedure used for the similarity measure optimization heavily depends on the sample size. Although the method itself is able to compensate calibration imprecisions even at the beginning of the learning curve, the automated procedure of parameter selection requires further improvement. One of the possible approaches to try is kernel-target alignment [12]. For the data with larger imprecisions SD0.05 the same behaviour can be observed (see Fig. 5); the effect of the over-regularization at γ = 0.99 is present there as well.

Figure 7: Classification error at a fixed training set size (45 objects per class) as a function of the amount of calibration imprecisions.
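The 75%/25% hold-out grid search can be sketched as below. To keep the sketch self-contained, a least-squares discriminant in the tangent-kernel feature space stands in for the ν-SVM; the function name and the use of plain shift tangents are illustrative assumptions, not the paper's procedure verbatim.

```python
import numpy as np

def select_gamma_holdout(X, y, centers, gammas, rng):
    """Grid search for the trade-off gamma on a 75%/25% hold-out split,
    mirroring the procedure the paper uses for nu. A least-squares
    discriminant replaces the nu-SVM so the sketch stays dependency-free."""
    n, d = X.shape
    idx = rng.permutation(n)
    tr, va = idx[: 3 * n // 4], idx[3 * n // 4:]
    T = np.gradient(X, centers, axis=1)      # shift tangents M x_i
    C = T.T @ T / n                          # tangent covariance matrix
    best_gamma, best_err = None, np.inf
    for g in gammas:
        B = (1.0 - g) * np.eye(d) + g * C
        L = np.linalg.cholesky(B)            # B = L L^T
        Z = np.linalg.solve(L, X.T).T        # rows satisfy Z Z^T = X B^{-1} X^T
        w, *_ = np.linalg.lstsq(Z[tr], y[tr], rcond=None)
        err = np.mean(np.sign(Z[va] @ w) != y[va])
        if err < best_err:
            best_gamma, best_err = g, err
    return best_gamma, best_err
```

The final classifier would then be retrained on the full training set with the selected value, as described above.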
Fig. 7 shows how the amount of imprecisions influences the performance of the SVM and LTK-SVM classifiers. In the ideal case we would expect that the performance of the LTK-SVM does not change with the increase of the calibration imprecisions. But in our experiments the classification error increases with the amount of imprecisions. The probable reason for this behaviour is that the linear kernel was used, although the calibration transformations are not linear. Thus the use of an appropriate nonlinear kernel may improve the results.

The results of the experiments on the real-world flow cytometric data set (see Fig. 8) are in agreement with the ones on the synthetic data. The value of the parameter γ of the best learning curve (0.99) suggests that the calibration imprecisions in the data are rather high. However, for this data set we have only a limited number of samples. Because of that it is impossible to assert (in contrast to the synthetic data sets, for which very large test sets are available) that γ = 0.99 is optimal not only for this data but also for future measurements.

Figure 8: Learning curves for the real-world flow cytometric data set.

Conclusions

We studied the possibility of using the tangent kernel approach to stabilize the recognition of histogram-like data liable to affine calibration imprecision. It has been demonstrated on the synthetic data set that the application of the tangent kernel approach to deteriorated data allows one to reach a performance which is close to that of the conventional SVM algorithm on perfectly calibrated data. The experiments on the real-world data show that the tangent kernel approach helps to decrease the classification error. The tangent kernel approach has significant advantages at smaller sample sizes. On the other hand, the automatic procedure of selecting the trade-off parameter γ becomes inefficient with a decreasing size of the training dataset. It is also expected that the use of non-linear kernels can improve the performance further. The studied approach seems to be a promising one.

Acknowledgments

The authors would like to thank Dr. Marius Nap and Nick van Rodijnen for the possibility to use the flow cytometric data and for their contribution to section 4. We also thank E. Pekalska for fruitful discussions. This research was supported by the Technology Foundation STW, applied science division of NWO, and the technology program of the Ministry of Economic Affairs.

References

[1] P. Simard, Y. Le Cun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 50–58, San Mateo, CA, 1993. Morgan Kaufmann.

[2] P. Simard, Y. Le Cun, J. S. Denker, and B. Victorri. Transformation Invariance in Pattern Recognition - Tangent Distance and Tangent Propagation, volume 11, pages 181–197. 2000.

[3] B. Schölkopf. Support Vector Learning. PhD thesis.

[4] B. Schölkopf, P. Y. Simard, A. J. Smola, and V. N. Vapnik. Prior knowledge in support vector kernels. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 640–646, Cambridge, MA, 1998. MIT Press.

[5] C. D. Bray and M. Langweiler. Effects of resolution reduction on data analysis. Cytometry, 53A(2):103–111, 2003.

[6] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, USA, 2000.

[7] E. Pekalska and R. P. W. Duin. Dissimilarity based breast cancer classification.

[8] E. Pekalska and R. P. W. Duin. Dissimilarity representations allow for building good classifiers. Pattern Recognition Letters, 23(8):943–956, 2002.

[9] M. P. G. Leers, B. Schutte, P. H. M. H. Theunissen, F. C. S. Ramaekers, and M. Nap. Heat pretreatment increases resolution in DNA flow cytometry of paraffin-embedded tumor tissue. Cytometry.

[10] Data File Standards Committee of the Society for Analytical Cytology. Data file standard for flow cytometry.

[11] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.

[12] N. Cristianini, A. Elisseeff, J. Shawe-Taylor, and J. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

