Mod-06 Lec-17 Fisher Linear Discriminant
Hello and welcome to this next talk in the course on pattern recognition. We have been considering linear classifiers for the last several lectures. Specifically, in the last few lectures we have been looking at the linear least squares approach, both for classification and regression. The least squares approach minimizes the sum of squared errors, and we have seen how we can use that to learn linear models; we will continue with that now.
So, the least squares method is based on the criterion of minimizing mean squared error, and we have seen how we can derive the linear least squares algorithm for classifiers and for learning regression functions. We also looked at logistic regression for learning classifiers. There is one other approach to learning a linear classifier, which we very briefly mentioned at the end of the last class, and that is the Fisher linear discriminant. While least squares is a good way of learning linear models and is a very standard one, the Fisher linear discriminant actually somewhat predates the least squares approach, and it is a very interesting approach to learning linear classifiers. Today we will be looking at the Fisher linear discriminant as a method for constructing linear classifiers.
Any linear discriminant function based classifier
essentially works as follows: given a feature vector X, it decides that X belongs to class 1 if W transpose X plus w naught is greater than 0, for some given W and w naught. If W transpose X plus w naught is greater than 0 it is one class; if less than 0, the other class. That is what a linear discriminant function based classifier is about. Now, because W is a vector, we can take it as a direction in R d, the appropriate space. Then W transpose X is nothing but the projected value of X onto the direction W. So, think of the inequality W transpose X greater than minus w naught, and suppose it is a good classifier. What it means is: if I take any X that belongs to class 1 and project it onto W, the projected value W transpose X is greater than minus w naught. On the other hand, if I take any X in class C 0 and project it onto the direction W, then the projected value W transpose X is less than minus w naught.
So, essentially, what a linear discriminant
function is doing is looking for a direction W such that, if I project the feature vectors onto that direction, then along that direction the 2 classes are well separated. Along that direction there will be some threshold, in this case minus w naught, some point along that direction, such that the projected versions of all feature vectors of 1 class will be on one side of this threshold, whereas the projected versions of all the feature vectors of the other class will be on the other side of this threshold along the direction W. So, we can think of the best W for a linear classifier as a direction along which the 2 classes are well separated.
So, we are looking for a particular direction in the feature space such that, if I project all the feature vectors onto that direction, I get a very good separation between the feature vectors of the 2 classes.
So, the idea is that we project the data along the direction W, and the separation between points of different classes in the projected data is a good way to rate how good W is. So, now, to learn a good W, I can take different W's, project the data onto these W's, and ask which W maximizes the separation between the classes. This is the basic idea of the Fisher linear discriminant.
Before we go into the mathematical details
let us look at a simple example. Let us say this is a 2-class problem: reds are 1 class and blues are another class. If I project all the data onto the X axis, for example, the reds and blues get intermingled on the X axis. Similarly, if I project onto the Y axis, they also get intermingled.
But, we can easily find a direction, namely this line. If I project the data along this line, as you can see, if I recode my 2-dimensional feature vectors as 1-dimensional feature vectors by projecting them along this direction, then all patterns of 1 class are on one side and all patterns of the other class are on the other side. The actual separating hyperplane is perpendicular to this direction. So, if I think of W as this direction, then obviously W is perpendicular to, is the normal to, the separating hyperplane. So, that will be the separating hyperplane. So, I can think of finding the best W as finding a direction along which to project the data, so that all the feature vectors of 1 class, in the projected space, are well separated from the feature vectors of the other class.
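This projection picture can be sketched in a few lines of Python; the blob positions, the seed, and the chosen direction below are all hypothetical, picked purely so that the direction joining the two blobs separates the classes:

```python
import numpy as np

# Hypothetical 2-class data: two Gaussian blobs in R^2
# (positions, spread, and seed chosen purely for illustration).
rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))   # class C0
X1 = rng.normal([3.0, 3.0], 0.5, size=(50, 2))   # class C1

def project(X, w):
    """Projected values w^T x for every row x of X."""
    return X @ w

# Direction along the line joining the two blobs.
w_diag = np.array([1.0, 1.0]) / np.sqrt(2.0)
z0, z1 = project(X0, w_diag), project(X1, w_diag)

# With blobs this far apart, every class-0 projection falls below
# every class-1 projection along this direction: perfect separation.
print(z0.max() < z1.min())
```

Projecting the same data onto an axis direction, by contrast, would mix the two sets of projected values.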
So, the Fisher linear discriminant is a way of formalizing this notion of finding the best direction to project onto so that the 2 classes get well separated.
So, what does the Fisher linear discriminant do? We want to find a direction W such that the training data of the 2 classes are well separated when projected onto this direction. To do this, of course, we have to find a way of formalizing this notion of what we mean by well separated. So, essentially we need a figure of merit, some number that I can assign to W. I want some J of W which assigns a number, a figure of merit, to each W, such that the number tells me how well projection onto that W will result in proper separation. So, we will formalize this by
first considering a 2-class case. So, let us get the notation right first. As we have been doing, our training data is X i, y i; we have n samples. The X i are the feature vectors, in R d, and y i is the class label. We are considering the 2-class case, so let us say y i is 0 or 1, and the 2 classes we will denote by C 0 and C 1.
What this means is: when y i is 0, I say X i belongs to C 0, and when y i is 1, I say X i belongs to C 1. Let us say, out of the n samples, n 0 training samples are in class 0, that is C 0, and n 1 training samples are from class C 1. Obviously, n equals n 0 plus n 1. Given any W, let z i denote W transpose X i. So, the z i are the projected data along the direction W; the z i are the one-dimensional data that we get after projection.
This is our notation.
Let us say M 0 and M 1 denote the means of the data from the 2 classes. These are essentially sample means: M 0 is simply the mean of all X i that are in class 0. So, 1 by n 0 times the summation of X i, over X i belonging to C 0, gives me the sample mean of the class 0 data vectors. Similarly, 1 by n 1 times the summation over X i belonging to C 1 gives me the sample mean of class 1.
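In code, these sample means are one line each; the data below is a hypothetical stand-in (random blobs, arbitrary seed, arbitrary class sizes):

```python
import numpy as np

# Hypothetical training set: rows of X are the feature vectors X_i in R^d,
# and y holds the 0/1 class labels y_i; here n0 = 40, n1 = 60, d = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 3)), rng.normal(2.0, 1.0, (60, 3))])
y = np.array([0] * 40 + [1] * 60)

# M0 = (1/n0) * sum of X_i over X_i in C0, and similarly M1 over C1.
M0 = X[y == 0].mean(axis=0)
M1 = X[y == 1].mean(axis=0)
print(M0.shape, M1.shape)   # both are d-vectors, here d = 3
```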
So, what would the corresponding means of the projected data be? If I project X i onto W, I get W transpose X i, and the mean of the W transpose X i will be the same as W transpose M 0. On the projected data, let us denote the means by small m 0 and small m 1. So, the mean small m 0 will be W transpose capital M 0, and small m 1 will be W transpose capital M 1. These are the means of the projected data when the data is projected onto the direction W.
So, if I want good separation, what I mean
is that the difference between m 0 and m 1 should be large. So, we can say m 0 minus m 1, or m 1 minus m 0, it really does not matter, gives us a good idea of the separation between the samples of the 2 classes when the data is projected onto the direction W. So, a good direction W is one which maximizes this separation; we may want the W that maximizes m 0 minus m 1 whole square, because that gives me large separation. But there are 2 caveats here; just maximizing m 0 minus m 1 is not a good idea. Why? What is m 0 minus m 1? It is W transpose capital M 0 minus W transpose capital M 1, so its square is W transpose into capital M 0 minus capital M 1, the whole square.
So, just by scaling W, I can increase this difference, but that is meaningless: a positive scale factor makes no difference to the linear classifier. So, first we have to find some criterion that is scale independent. But, more importantly, what counts as good separation depends on how much spread there is in the data. Here we are only looking at the distance between the means; as we have already seen when we considered Bayes classifiers with normal class conditional densities, it is essentially the difference between the means relative to the variances that tells us whether the classes are well separated or not. So, we have to look at the distance between the means, m 0 minus m 1, relative to the variances of the 2 classes. That is how we have to formulate our objective function J of W.
So, let us first calculate something that is
like a variance. Define s 0 square to be the summation of W transpose X i minus m 0, the whole square, over X i belonging to C 0. W transpose X i is the projected data, that is z i, and for all X i in C 0 the mean of the projected data is m 0. So, W transpose X i minus m 0 whole square, summed over X i belonging to C 0, is like the variance of the projected data of class C 0. I say "like" because I did not put a 1 by n 0 factor; I could have, but we did not. So, except for a multiplicative constant, this is essentially the variance of the projected data from class C 0.
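As a small sketch, with hypothetical class-0 data and a hypothetical direction w, s 0 square and the identity m 0 equals W transpose M 0 look like this:

```python
import numpy as np

# Hypothetical class-C0 data and a hypothetical direction w.
rng = np.random.default_rng(2)
X0 = rng.normal(0.0, 1.0, (40, 3))
w = np.array([1.0, -1.0, 0.5])

z0 = X0 @ w                          # projected data of class C0
m0 = z0.mean()                       # mean of the projected class-0 data
s0_sq = np.sum((z0 - m0) ** 2)       # s0^2: variance up to the missing 1/n0

# The projected mean equals w^T M0, where M0 is the sample mean of X0;
# this holds exactly because the mean is linear.
print(np.isclose(m0, w @ X0.mean(axis=0)))   # prints True
```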
Similarly, I define s 1 square to be the summation, over X i belonging to C 1, of W transpose X i minus m 1 whole square, because m 1 is the mean of the projected data of class 1. This, except for a multiplicative factor, is the variance of the projected data of class 1.
So, what do we want? We want a large separation between m 0 and m 1 relative to the variances. That is our objective, and I can formulate an objective function for W as follows.
So, we say we want to maximize J of W, with
J of W defined as m 1 minus m 0 whole square divided by s 0 square plus s 1 square. So, we want a large separation, m 1 minus m 0 whole square, relative to the variances, s 0 square plus s 1 square. By the way, even though this expression looks as if there is no W on the right-hand side, all of m 0, m 1, s 0 square, and s 1 square depend on W.
Right; s 0 square and s 1 square depend on W, and m 0 and m 1 depend on W. So, this is a function of W, and we want to maximize the difference between the means in the projected space relative to the variances. This, of course, is also scale independent; that is not really evident right now, but we will see it by rewriting this in a more convenient form. To rewrite this in a more convenient form, let us start with
the numerator. What is m 1 minus m 0 whole square? It is W transpose capital M 1 minus W transpose capital M 0, the whole square. This I can write as W transpose into M 1 minus M 0, the whole square. Because W transpose into M 1 minus M 0 is a scalar, I can always write its square as W transpose, M 1 minus M 0, into M 1 minus M 0 transpose, W. So, m 1 minus m 0 whole square can be written as W transpose, M 1 minus M 0, M 1 minus M 0 transpose, W. M 1 and M 0 are the means of the data; if the data are in R d, these are d-vectors. Recall that our notation is that all vectors are column vectors. So, M 1 minus M 0 is a d by 1 vector, M 1 minus M 0 transpose is a 1 by d vector, and their product is a matrix, often called the outer product matrix of the 2 vectors. So, I can write m 1 minus m 0 whole square
as W transpose some matrix into W. So, I can write m 1 minus m 0 whole square as W transpose, a matrix we call S subscript B, W; that is, W transpose S B W, where S B is M 1 minus M 0, into M 1 minus M 0 transpose. As I said, the first factor is a d by 1 vector and the second is a 1 by d vector, so S B is a d by d matrix. Given any vector x, x x transpose is called an outer product, and an outer product is always a symmetric matrix. Also, I hope all of you remember that the outer product is a rank 1 matrix, because all its columns are multiples of a single column. So, it is a rank 1 matrix.
Such a matrix is d by d, because our feature vectors are in R d. This matrix S B is often called the between-class scatter matrix. So, I can write m 1 minus m 0 whole square as a quadratic form involving the matrix S B, the between-class scatter matrix, defined as above. In a similar way, I can write s 0 square and s 1 square also as quadratic forms.
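Here is a quick numerical check of these claims about S B, with hypothetical means in R 3: the outer product is symmetric, has rank 1, and its quadratic form reproduces m 1 minus m 0 whole square:

```python
import numpy as np

# Hypothetical class means in R^3.
M0 = np.array([0.0, 0.0, 0.0])
M1 = np.array([2.0, 1.0, -1.0])

diff = M1 - M0
S_B = np.outer(diff, diff)            # between-class scatter: d x d, rank 1

print(np.allclose(S_B, S_B.T))        # symmetric: True
print(np.linalg.matrix_rank(S_B))     # rank: 1

# The quadratic form w^T S_B w equals (w^T (M1 - M0))^2 = (m1 - m0)^2.
w = np.array([1.0, 2.0, 3.0])
print(np.isclose(w @ S_B @ w, (w @ diff) ** 2))   # True
```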
So, let us try to do that now. s 0 square, by definition, is the sum of W transpose X i minus small m 0, whole square. We know small m 0 is W transpose capital M 0. So, s 0 square is the sum of W transpose X i minus W transpose capital M 0, whole square, summed over X i belonging to C 0. I can rewrite this as the sum of W transpose into X i minus M 0, the whole square. Now, once again, just like what we did earlier, I can write what is inside this summation in terms of outer products: each term is W transpose, X i minus M 0, into X i minus M 0 transpose, W. Now, the summation is over X i, so W can come out of the summation. If I pull W out, I get W transpose, summation over X i belonging to C 0 of X i minus M 0, X i minus M 0 transpose, W.
See, M 0 is the mean of all X i in class 0. So, if I took this sum and divided by n 0, if you remember, that is the maximum likelihood estimate of the covariance matrix; it is the ML estimate of the covariance matrix of class 0. So, I can write s 0 square as W transpose, summation over X i belonging to C 0 of X i minus M 0, X i minus M 0 transpose, W. The matrix in the middle, of course, is not exactly the covariance matrix estimate; it is the covariance matrix estimate except for a multiplicative factor, because I need a 1 by n 0 to make it an estimate of the covariance matrix. If I do the same thing for s 1, the only difference will be that the summation will be over C 1 and M 0 will become M 1. So, similarly, s 1 square will become W transpose
the same kind of matrix, but with summation over C 1: X i minus M 1, X i minus M 1 transpose, W. Which now means I can write s 0 square plus s 1 square as W transpose S w W, where S w is another symmetric matrix, given by the sum of these 2 matrices: summation over X i belonging to C 0 of X i minus M 0, X i minus M 0 transpose, plus summation over X i belonging to C 1 of X i minus M 1, X i minus M 1 transpose. This matrix S w, also d by d, is called the within-class scatter matrix. Essentially, the first term here is proportional to the covariance matrix of the first class, class C 0, and the second term is proportional to the covariance matrix of class C 1; by covariance matrix we mean the sample estimate of the covariance matrix. So, S w is called the within-class scatter matrix, and we can write s 0 square plus s 1 square as a quadratic form on S w, just as m 1 minus m 0 whole square is a quadratic form on S B.
So, now J of W becomes W transpose S B W by
W transpose S w W, a ratio of 2 quadratic forms. So, basically this is the same J as earlier; we have just rewritten it. Originally, J of W is m 1 minus m 0 whole square divided by s 0 square plus s 1 square; because both the numerator and the denominator can be written as quadratic forms on some matrices, we have written them like that. This is the J that we want to maximize. Now, it is very clear that J is not affected by scaling: if W is replaced by k W, the k squared cancels from both the numerator and the denominator.
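This scale invariance is easy to verify numerically; everything below (the data, the seed, and the particular W) is hypothetical:

```python
import numpy as np

# Build hypothetical scatter matrices from random 2-class data.
rng = np.random.default_rng(3)
X0 = rng.normal(0.0, 1.0, (40, 2))
X1 = rng.normal(3.0, 1.0, (60, 2))
M0, M1 = X0.mean(axis=0), X1.mean(axis=0)

S_B = np.outer(M1 - M0, M1 - M0)                              # between-class
S_w = (X0 - M0).T @ (X0 - M0) + (X1 - M1).T @ (X1 - M1)       # within-class

def J(w):
    """Fisher criterion: ratio of the two quadratic forms."""
    return (w @ S_B @ w) / (w @ S_w @ w)

w = np.array([1.0, 0.5])
print(np.isclose(J(w), J(5.0 * w)))   # scaling w leaves J unchanged: True
```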
So, J of W is not affected by scaling of W, as it should be. And it is easy enough to calculate: given the data, I know how to calculate S B and I know how to calculate S w. So, I can calculate both matrices, and hence, given any W, I can calculate J of W. And notice, this is a very nice optimization problem. Essentially, given the 2 matrices S B and S w, you want to find a vector W which maximizes the ratio of the quadratic forms. Maximizing a ratio of quadratic forms is a very standard optimization problem, and we can generally solve it easily enough. So, let us see how to do that. So, this is
what we want to maximize. To maximize it, we differentiate, that is, find the gradient and equate it to 0. So, what do I get if I differentiate with respect to W and equate to 0? By the quotient rule: the first term is 1 by W transpose S w W, times the derivative of the numerator, which is 2 S B W; plus the numerator, W transpose S B W, times the derivative of 1 by W transpose S w W, which is minus 1 by the square of W transpose S w W, times the derivative of W transpose S w W, namely 2 S w W. So, just by differentiating and equating to 0, I get this. In this expression, W transpose S w W is a quadratic form, so it is a scalar, and W transpose S B W is a quadratic form, also a scalar. So, I have one vector here, S B W, and another vector here, S w W, and the equation says some constant times the vector S B W minus some constant times the vector S w W equals 0. That means these two vectors are in the same direction, because a constant times one of them is the other. So, what this implies is that the vector S B W is in the same direction as the vector S w W.
So, the W that maximizes J is such that the vector S B times W is in the same direction as the vector S w times W. Thus, any maximizer of J has to satisfy S w
W equals some constant times S B W, where lambda is the constant. This is reminiscent of the eigenvalue problem. In the eigenvalue problem you have A x equals lambda x: some matrix times a vector is lambda times the same vector. Here, it is not the same vector alone; another matrix also multiplies W. But this is very similar to the eigenvalue problem, and it is known as the generalized eigenvalue problem.
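As a sketch with hypothetical data: when S w happens to be invertible, the generalized problem S B W equals lambda S w W reduces to an ordinary eigenvalue problem for S w inverse S B (this lambda is the reciprocal of the constant in the S w W equals lambda S B W form above, which does not affect the direction). SciPy users could instead call a generalized-eigenvalue routine such as scipy.linalg.eigh directly:

```python
import numpy as np

# Hypothetical 2-class data and the two scatter matrices.
rng = np.random.default_rng(4)
X0 = rng.normal(0.0, 1.0, (40, 2))
X1 = rng.normal(3.0, 1.0, (60, 2))
M0, M1 = X0.mean(axis=0), X1.mean(axis=0)
S_B = np.outer(M1 - M0, M1 - M0)
S_w = (X0 - M0).T @ (X0 - M0) + (X1 - M1).T @ (X1 - M1)

# Ordinary eigenvalue problem for S_w^{-1} S_B; the top eigenvector
# maximizes the Fisher criterion J.
vals, vecs = np.linalg.eig(np.linalg.inv(S_w) @ S_B)
lam = np.max(np.real(vals))
w = np.real(vecs[:, np.argmax(np.real(vals))])

# It satisfies the generalized eigenvalue relation S_B w = lam * S_w w.
print(np.allclose(S_B @ w, lam * (S_w @ w), atol=1e-6))
```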
There are standard ways of solving the generalized eigenvalue problem; there are methods based on matrix decompositions and so on. So, once we know it is a generalized eigenvalue problem, then, for example, MATLAB will have a standard routine for solving it. Anyway, we will not get into the details of special methods for generalized eigenvalue problems, because often we can solve ours more easily; but right now we know that, because this is a generalized eigenvalue problem, by solving it I can always find the best direction
W.
But, as I said, often I may not have to solve the generalized eigenvalue problem. This is because the real symmetric matrix S w is often invertible. And why is that so? S w is this matrix: X i minus M 0, X i minus M 0 transpose, summed over X i belonging to C 0, plus X i minus M 1, X i minus M 1 transpose, summed over X i belonging to C 1. There are at least 2 reasons to expect that this matrix is invertible. One is, we know that for each i, X i minus M 0 into X i minus M 0 transpose, an outer product, is a rank one matrix; so S w adds up a lot of rank one matrices. When you add rank one matrices, suppose I add x 2 minus M 0 into x 2 minus M 0 transpose to x 1 minus M 0 into x 1 minus M 0 transpose: if the two vectors are linearly independent, the sum becomes a rank 2 matrix.
So, as I keep adding rank one matrices, the rank can go up by at most 1 each time, but we add sufficiently many of them. So, the
first reason is that this is a sum of a large number of rank one matrices. If the number is large and the X i are in general position, it is unlikely that all the X i will lie in some lower-dimensional subspace; and if they do not, then adding all these rank one matrices should give me a full rank matrix. Normally, the number of samples is much larger than the dimension d, and hence adding that many rank one matrices should give me a full rank matrix. Another reason why we think this is often invertible is that, as you have seen, each term in this sum is, especially when the data is large, proportional to a good estimate of the covariance matrix, and the covariance matrix will be invertible. So, on both counts we expect S w to be invertible. And if S w is invertible, then it is very much easier to solve for W.
Let us say S w is invertible. So, we have
that W has to satisfy S w W equals lambda S B W. If this matrix S w is invertible, I can multiply by S w inverse, and I get an equation for W, or at least a relation, W equals something. So, let us look at that relation. Suppose S w is invertible; then W can be written as S w inverse S B times W. There has to be a lambda here; I omitted the lambda because ultimately constant factors do not make much difference, but anyway, there is a lambda here. Let us not worry about that right now. This looks, of course, like an eigenvalue problem: putting the lambda back, W is an eigenvector of S w inverse S B.
But we do not even have to look at the eigenvalues, because of what S B W is. What is S B? It is M 1 minus M 0, M 1 minus M 0 transpose; so S B W is M 1 minus M 0, into M 1 minus M 0 transpose W. We know M 1 minus M 0 transpose W is nothing but little m 1 minus little m 0, which is a scalar. So, S B W is some scalar k times M 1 minus M 0, as I wrote here, with k equal to little m 1 minus little m 0.
So, S B W is a vector proportional to M 1 minus M 0, and I can replace S B W by some constant times M 1 minus M 0. This gives me W equals S w inverse into M 1 minus M 0. There are so many constants: there is a lambda here, there is a k here. So, ultimately W will be some constant times S w inverse M 1 minus M 0. But, as I said, constants do not make a difference in a linear classifier. Right. We can scale W and b by anything that we want.
So, essentially we have found W, up to a scale factor, to be S w inverse into M 1 minus M 0; and since I do not need to worry about the scale factor, I can take this to be W, because I am in a linear classifier context. So, this is the W that I can use as the Fisher linear discriminant, if the matrix S w happens to be invertible, which, as we just now said, would very often be the case.
So, let us sum up how we obtain the Fisher linear discriminant. Given the data, we first form the scatter matrix S w and also calculate the means M 0 and M 1. If the within-class scatter matrix S w is invertible, I can directly find W: I calculate W as S w inverse into M 1 minus M 0. That is the Fisher linear discriminant direction; or rather, it is the normal to the hyperplane of the classifier given by the Fisher linear discriminant.
Just for completeness, let us remember that,
even if S w is not invertible, there are techniques to find a maximizer W by solving some generalized eigenvalue problem. If S w is not invertible, there are some issues; I will just indicate them, so that you appreciate that there are some issues. If S w is rank deficient, what it would mean is that there can be some W at which W transpose S w W goes to 0. Right. Then there would be some W for which J is infinite, but that is not the W we want. So, we would have to define the problem a little more carefully. But it does not matter; there are techniques to take care of it. We will not go there, but we will just remember that. If S w is invertible, W is given by this.
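The invertible case is essentially a one-liner; here is a minimal sketch with hypothetical data, solving the linear system rather than forming S w inverse explicitly, which is the numerically preferred way:

```python
import numpy as np

# Hypothetical 2-class data in R^2.
rng = np.random.default_rng(5)
X0 = rng.normal([0.0, 0.0], 1.0, (50, 2))
X1 = rng.normal([3.0, 1.0], 1.0, (50, 2))

M0, M1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - M0).T @ (X0 - M0) + (X1 - M1).T @ (X1 - M1)

# Fisher direction W = S_w^{-1} (M1 - M0), computed via a linear solve.
W = np.linalg.solve(S_w, M1 - M0)
W = W / np.linalg.norm(W)        # scale is irrelevant for the classifier

# Class-1 projections sit above class-0 projections on average,
# since W^T (M1 - M0) = (M1 - M0)^T S_w^{-1} (M1 - M0) > 0.
print((X1 @ W).mean() > (X0 @ W).mean())   # prints True
```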
Even otherwise, one can find a maximizer by solving some generalized eigenvalue problem. So, that is how we obtain the best direction W; very often, this is what is given as the Fisher linear discriminant, and as I said, in most pattern recognition problems S w would be invertible. So, this is the Fisher linear discriminant. But, obviously, we have found only the best direction W; we have not yet found a classifier. Right.
Let us remember that a discriminant based classifier is the sign of W transpose X plus b. So, I have to find the b also. Right. All that we have done so far is find the best W. Having found the best W, to convert it into a classifier I still have to find b. But this is a much simpler problem: it is just learning a threshold. We have the best direction; now we have one-dimensional data and I want the best threshold. So, given one-dimensional data and class labels, I just want a threshold based classifier. If I ask which threshold is good, there are many methods.
For example, we considered the ROC: we can experimentally compute the ROC and come to a threshold; that is one approach. We can do any such search for a threshold. Essentially, along W, we keep putting the threshold at various points and keep asking whether our accuracy on the training data improves. I need only one threshold, so I can start from one end and keep moving, just like a line search in an optimization problem. So, we can do a simple line search to find the best threshold, the one that maximizes the probability of correct classification on the training data.
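A sketch of this line search over thresholds, using hypothetical one-dimensional projected data and labels:

```python
import numpy as np

# Hypothetical projected data z_i = W^T X_i with 0/1 labels.
rng = np.random.default_rng(6)
z = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(4.0, 1.0, 50)])
y = np.array([0] * 50 + [1] * 50)

best_t, best_acc = None, -1.0
# Candidate thresholds: midpoints between consecutive sorted projections,
# since accuracy only changes when the threshold crosses a data point.
zs = np.sort(z)
for t in (zs[:-1] + zs[1:]) / 2.0:
    acc = np.mean((z > t).astype(int) == y)   # classify as class 1 if z > t
    if acc > best_acc:
        best_t, best_acc = t, acc

print(best_acc >= 0.9)   # well-separated projections give a near-perfect split
```

The final classifier then takes b equal to minus the chosen threshold.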
Or, what we can do is this: the projected one-dimensional data z i, together with the class labels, is what we have. We can simply treat it as a one-dimensional pattern recognition problem, and a one-dimensional pattern recognition problem is very easy to solve. For example, we can take the class conditional densities to be normal, estimate the associated normal densities, which is very easy to do in one dimension as you have seen, and then take the classifier based on the estimated densities. This is also often done. So, whichever way we do it, that gives us the final Fisher linear discriminant, and this is how one obtains it.
The Fisher linear discriminant is also a popular
classifier, just like the least squares one. Now, the way we derived the Fisher linear discriminant, it looks quite different from the least squares method. As I told you, the Fisher linear discriminant also came from statistics, but from a different direction; linear least squares and the Fisher linear discriminant evolved historically as different algorithms. However different they may look, though, there are close connections. As a matter of fact, we can think of the Fisher linear discriminant as a special case of linear least squares. The reason we did not start there is that the historical view is important, the method is very often used in applications, and it is nice to know another way of formulating what a good linear classifier is. But, having said all that, one can actually mathematically show the equivalence of linear least squares and the Fisher discriminant, in a very specialized sense. What is this specialized sense? It is the following.
We start with some data X i, y i. This is
our given data; X i is in R d, y i is 0 or 1; this is our classification data. Given this data, suppose we form some new training data where the X i are the same but the y i are changed to y i prime.
How do I form the new y i prime? Originally, the y i are 0 and 1. If y i is 0, that is, X i is coming from class 0, I take y i prime to be n by n 0. So, I am just changing the targets: for all class 0 patterns, the correct value to predict is n by n 0, where n is the total number of samples and n 0 is the number of samples of class 0. On the other hand, if y i happens to be 1, I take y i prime to be minus n by n 1. So, now I have new data X i, y i prime, where the y i prime take real values, not 0 and 1, because n by n 0 and minus n by n 1 are just some real numbers. Even though these targets take only two particular values, we will think of the y i prime as generic real numbers. Then, we can treat this X i, y i prime data as a regression problem: X i is in R d, the y i prime are real numbers, and I want a linear predictor of the form
W transpose X plus b.
Simply viewing X i, y i prime as data for a regression problem, I can certainly use least squares to find a linear predictor of the form W transpose X plus b, using the standard linear least squares method that you have seen in the last couple of lectures. One can show algebraically, the algebra is a little tedious, but it is there in one of the prescribed textbooks, namely Bishop's book, that if I take this recoded data X i, y i prime, think of it as data for a regression problem, and fit a linear regression, then the least squares solution W obtained for fitting this model to the changed data will be exactly the same as the W that we get out of the Fisher linear discriminant. I am not proving this; as I said, the algebra is tedious, it possibly does not add much to go through it, and it is there in Bishop's book. But you should know that it can be shown that the least squares solution obtained here would be the same as the Fisher linear discriminant.
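Though we skip the algebra, the equivalence is easy to check numerically. The sketch below uses hypothetical data: it fits ordinary least squares (with a bias term) to the recoded targets n by n 0 and minus n by n 1, then compares the fitted weight vector against S w inverse into M 1 minus M 0; the two directions agree up to a scale factor:

```python
import numpy as np

# Hypothetical 2-class data with n0 = 40, n1 = 60 samples.
rng = np.random.default_rng(7)
n0, n1 = 40, 60
n = n0 + n1
X = np.vstack([rng.normal(0.0, 1.0, (n0, 2)), rng.normal(3.0, 1.0, (n1, 2))])
y = np.array([0] * n0 + [1] * n1)

# Recoded targets: n/n0 for class 0, -n/n1 for class 1.
y_prime = np.where(y == 0, n / n0, -n / n1)

# Least squares with a bias column: minimize ||A [W; b] - y'||^2.
A = np.hstack([X, np.ones((n, 1))])
sol, *_ = np.linalg.lstsq(A, y_prime, rcond=None)
W_ls = sol[:2]

# Fisher direction for the same data.
M0, M1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
S_w = ((X[y == 0] - M0).T @ (X[y == 0] - M0)
       + (X[y == 1] - M1).T @ (X[y == 1] - M1))
W_fld = np.linalg.solve(S_w, M1 - M0)

# The two solutions are parallel (the sign depends on which class
# received the negative target), so |cos angle| = 1.
cos = (W_ls @ W_fld) / (np.linalg.norm(W_ls) * np.linalg.norm(W_fld))
print(np.isclose(abs(cos), 1.0))   # prints True
```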
Thus, one can see that the Fisher linear discriminant can be viewed as a special case of linear least squares. So, it is not really different from linear least squares but, at the same time, it gives you another very important and useful idea of how one can look at what makes a good linear classifier.
There are other ways of looking at how good the
Fisher linear discriminant is. For example, take the simplest 2-class problem: let us say we have a 2-class problem with both class conditional densities normal, and let us say they have the same covariance matrix. Why the same covariance matrix? Because if the 2 classes have the same covariance matrix, we know that the Bayes optimal classifier is a linear classifier. Right.
So, now one can ask: will the Fisher linear discriminant give me the optimal classifier? The Fisher linear discriminant can give me only linear classifiers, so we can only ask whether it gives the optimal classifier when the optimal classifier itself is linear. So, let us take the case where the class conditional densities are normal with the same covariance matrix, so that we know the Bayes optimal classifier is linear, and we will show that the Fisher linear discriminant gives the same classifier as the Bayes classifier. Let us say mu 0 and mu 1 are the means of the 2 classes; because the covariance matrix is the same, there is only one sigma for both classes.
Thus, sigma is the common covariance matrix. Suppose, given this data, we want to actually implement the Bayes classifier: we would estimate the class conditional densities, let us say using the maximum likelihood method. If we use maximum likelihood, we know mu 0 and mu 1 will be estimated by the sample means; similarly, a sample-based estimate exists for the covariance matrix.
So, then we know that M 0 and M 1 would be the sample mean estimates for mu 0 and mu 1; that is how we obtained them. M 0 is 1 by n 0 times the summation of X i over class 0, which is the sample mean estimate for mu 0, and similarly M 1. So, I would have obtained the same M 0 and M 1 as the sample mean estimates for mu 0 and mu 1 had I tried to estimate the class conditional densities by maximum likelihood. Similarly, I would have got some sigma hat as the estimate of the covariance matrix.
Then, what would be the Bayes classifier we implement with these estimated quantities? If you still remember the Bayes classifier that we derived for the common covariance matrix case, it is a linear classifier, W transpose X plus b, where W is given by sigma inverse into mu 1 minus mu 0. So, what I would have implemented is sigma hat inverse into M 1 minus M 0. Now, this has to be the same as the FLD, because the S w inverse that appears there involves what is essentially the covariance matrix estimate. So, very quickly, we will show that this W would be the same as the W given by the Fisher linear discriminant.
Coming back, this is the S w matrix. Because
both classes have the same covariance matrix, I could estimate the covariance matrix either from class 1 or from class 0. If I estimate it from class 0, that estimate would be, except for a proportionality constant, the first term; similarly, if I estimate it from class 1, it will be, except for a proportionality constant, the second term. And since the classes have the same covariance matrix, each of the 2 terms above would be proportional to the same sample estimate. Right. So, the first term will be some a times sigma hat and the second term will be some b times sigma hat. So, essentially, S w will be proportional to sigma hat.
So, since S w is proportional to the sample estimate of sigma, we know that the Fisher linear discriminant, given by S w inverse into M 1 minus M 0, is the same as the Bayes optimal classifier. So, the Fisher linear discriminant is an interesting classifier because, at least in this kind of case, it gives you the Bayes optimal classifier. So, it has a very interesting geometric interpretation,
namely, finding a direction along which classes are well separated. That is that is in itself
is a nice interesting geometric interpretation. In addition, as you seen it can be viewed
as a special case of least squares. And also, you know in this in the cases that bayes optimal
is linear, I am iam getting atleast in the normal class conditional densities case. Fisher
linear discriminant gives me the same classifier as the bayes optimal. So, that is the story
of the fisher linear discriminant. So, to move on; we consider various methods
of learning linear classifiers and linear regression models, starting with the perceptron; then we spent a lot of time on linear least squares methods, and now the Fisher linear discriminant. But most of the time we have been restricting ourselves: when we do regression, we are talking of targets being in R. That is, in the regression case we assume the training data is X i, y i, where y i belongs to R. What that means is, we are only estimating real valued functions.

So, one generalization that we may want to do is to generalize this to vector valued functions. Right. Similarly, whenever we considered the classification problem, we always restricted ourselves to two class problems; so we have to generalize this to the multi class problem. These are the two issues still left. So, in this class we will at least look at the issues involved and tell you what they are, and pretty much that is all, at least for the multi class classifier. We may just briefly touch upon it next class and leave it at that. So, let us do it one by one. So, let us first look at the regression problem.
First consider estimating vector valued functions. What does vector valued function mean? Now, my training data is given by X i, y i. As earlier, X i is a feature vector; it belongs to R d. y i is the target. Now, because I am learning a vector valued function, let us say the function maps from R d to R m; so y i will be an m vector. (I am sorry, this should not be R d, this should be R m.) But anyway, y i is an m vector. So, because we have already used i for denoting the i'th sample, let us denote the components of y i by y i 1, y i 2, up to y i m. So, the main difference in the vector valued case is that y i is now a vector of m components, for some arbitrary value of m. So, given any X, we want a model that can predict the target, which is an m vector; given any X, we want to predict y, whose components are y 1 to y m. And we want to do it using our basic linear methods. So, what that means is that we want to learn m W's and b's such that the j'th component can be predicted by some affine function, that is, W j transpose X plus b j. So, essentially the problem boils down to finding m vectors W j and m scalars b j.

So, my prediction model would be: the estimate of the j'th component of the target, y hat j, is W j transpose X plus b j. What does this mean? If I just take this data and split it into many data sets, so that X i, y i 1 is one data set, X i, y i 2 is another data set, and so on, then for each data set I can use the standard least squares method. Right. So, learning a vector valued function is no different from simply solving m linear least squares regression problems.
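As a quick sketch of this point (illustrative Python on made-up data, not code from the lecture), solving the m regressions jointly with a target matrix gives exactly the same answer as running m separate least squares fits, one per target component:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 3, 4                      # samples, input dimension, target dimension
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, m))             # each row y_i is an m-vector target

# Absorb the bias b_j by appending a constant-1 column to X.
Xa = np.hstack([X, np.ones((n, 1))])

# One joint solve: column j of W_all is the solution for the j'th component.
W_all, *_ = np.linalg.lstsq(Xa, Y, rcond=None)

# The same thing done as m separate scalar-target regressions.
W_sep = np.column_stack(
    [np.linalg.lstsq(Xa, Y[:, j], rcond=None)[0] for j in range(m)]
)
print(np.allclose(W_all, W_sep))        # True: the joint solve is just m regressions
```

The least squares objective decouples across the target components, which is why the two computations agree.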
So, simply by running m linear least squares regression problems, we can solve the vector valued case. Of course, we can put it all into a better formalism: we can put all these W j's as columns of a W matrix and write the entire least squares solution in vector-matrix notation. But, except for some complication in notation, there is nothing really conceptually new here, because we are effectively running m linear least squares regression problems. Which means, in principle, we know how to solve the linear model problem even if we have to learn a target function which is vector valued. Now, let us come to the multi class problem.
So far we know many methods for learning two class problems. So, if I know how to learn two class problems, can I use them to learn multi class problems? The issue is a little more complicated than what we have considered earlier. So, let us say we have K classes, C 1, C 2, up to C K. We will come to how the data is represented; for now, let us assume that we get X i, y i, where y i takes any value between 1 and K. Note that earlier we have always been looking at our classes as some C 0, C 1, and so on; this time we look at C 1, C 2, up to C K. Now, as I said some time ago, essentially the two class problem is the more fundamental problem, because we can solve multi class problems if we know how to solve two class problems. So, we can, in principle, solve a number of two class problems to solve a multi class problem. There are at least two very generic techniques for learning a multi class problem using two class methods. One is that I can learn K two class classifiers, where K is the number of classes, as follows: I learn C i versus not-C i.
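A minimal one versus rest sketch (illustrative Python, not code from the lecture; the plus/minus-one least squares binary learner here is just one possible choice of two class method):

```python
import numpy as np

def train_one_vs_rest(X, y, K):
    """For each class i, fit a least squares linear classifier C_i vs not-C_i.
    Labels are +1 for class i and -1 for the rest (an illustrative encoding)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # constant column absorbs the bias
    W = []
    for i in range(K):
        t = np.where(y == i, 1.0, -1.0)         # class i against everything else
        w, *_ = np.linalg.lstsq(Xa, t, rcond=None)
        W.append(w)
    return np.array(W)                          # shape (K, d+1)

# Tiny synthetic 3-class data set (made-up centers)
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([c + rng.normal(size=(20, 2)) for c in centers])
y = np.repeat(np.arange(3), 20)

W = train_one_vs_rest(X, y, K=3)
print(W.shape)   # (3, 3): one weight vector (W_i, b_i) per C_i vs not-C_i classifier
```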
So, the i'th classifier I learn is learning to classify C i versus not-C i. What do I mean by C i versus not-C i? I have the training data of the K class case; so I take all the X i that belong to a particular class, say C 1, as one class, and all the rest of the data as another class. Then, I solve a two class problem whose objective is to say, given an X, is it in C 1 or not in C 1; similarly, is it in C 2 or not in C 2. So, I learn K two class classifiers, each classifier learning to distinguish between C i and not-C i. This approach is often called one versus rest (or one versus all), and is a very standard approach to the multi class case; very often one goes for the one versus rest approach.

Another way of doing it is, instead of learning K classifiers, I learn K choose 2 two class classifiers. Now, for every distinct pair i and j, I am learning one classifier to distinguish C i from C j. Of course, I am learning more classifiers here. Why should I learn more classifiers; why can't I use the previous method, where I have to learn only K classifiers? One problem with the C i versus not-C i learning could be that the effective training set for this problem is very skewed. Let us say I have 100 classes and 50 examples of each class, making a total of 5000 examples. Now, when I want to do C 1 versus not-C 1, I have only 50 examples of C 1, whereas I have 4950 examples of not-C 1. So, one of the classes has much less representation in the training set than the other in this two class problem. When predominantly all the training data belongs to one class, it is very difficult to learn the classifier. For example, in this case, if I have 50 of one class and 5000 minus 50 of the other class, then a classifier that always says not-C 1 has an error rate of only 50 by 5000 on the training data, which is very small. Right. So, stupid classifiers can have very low error on the training set, and that always makes learning difficult. Often, what one does when one class has too many samples is to resample: out of these 4950 examples, maybe I take some 50 or 100 randomly selected samples and use those to do C i versus not-C i. But when I do that, the sample may not represent all the different classes inside not-C 1.

So, if the number of classes is very large, then one versus rest can be difficult, because the training data will be very skewed for the resulting two class problems. If I have only 3 or 4 classes, then it may not be so bad; but if I have 100 or 200 classes, or even 50 classes, then learning C i versus not-C i could be a skewed data set problem, and I would have to properly balance the data set.
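The 100-class example above is easy to check numerically (illustrative Python, using the lecture's numbers of 100 classes with 50 examples each):

```python
import numpy as np

# 100 classes x 50 examples each = 5000 training points (the lecture's numbers).
y = np.repeat(np.arange(100), 50)

# For the C_1-vs-rest problem, the "stupid" classifier always predicts not-C_1.
is_c1 = (y == 0)                         # 50 positives, 4950 negatives
always_not_c1 = np.zeros_like(is_c1)     # predicts "not-C_1" for every point

train_error = np.mean(always_not_c1 != is_c1)
print(train_error)                       # 0.01, i.e. 50/5000: tiny error, useless classifier
```

A 1 percent training error sounds excellent, yet this classifier never detects C 1 at all, which is exactly the skew problem being described.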
It is in those cases that learning C i versus C j classifiers could be better. Essentially, when I have a K class problem and training data, one expects all classes to be represented in about equal measure. So, for a two class problem C i versus C j, I will have roughly equal numbers of examples from both classes; there is no skew in the training data distribution. So, this is a somewhat easier and more stable two class learning problem, and that is the reason why, even though I am learning more classifiers, I may want to do this. So, these are essentially two ways (there are others, but these are the two main ways) of using an algorithm that can learn a two class classifier to classify K classes.

However, there are issues with both of these. Right. We will look at the problems; this is not to say that these approaches are not used. But there are pitfalls with both approaches, in as much as such a set of classifiers, whether the set of K classifiers or the set of K into K minus 1 by 2 classifiers, does not make a perfect, well defined classifier. Perfect is the wrong word to use, because we are not saying it has to have zero error or anything like that; but at least it should be a well defined classifier. What do I mean by this? A classifier is a black box to which, if I give an X, there is no doubt in the classifier's mind as to which class X should go, because this has to be an algorithm. Whether it is right or wrong is a different issue; but for something to be called a classifier, at least structurally, given an X, it should assign it to one unique class.
But if I learn this many two class classifiers, whether K or K choose 2 of them, together they do not really define a proper classifier in this sense. Thus, what we will see next is that neither of these approaches is really satisfactory for generalizing the linear discriminant function to multiple classes. I will show you some examples of why this happens, and then tell you how we can solve it.

But before I go there: I spent a lot of time telling you about these two ways of dealing with multi class problems because, in spite of the problems I am going to tell you about, sometimes people do use these approaches for solving multi class problems. So, we should still know what the pitfalls in the approaches are. So, let us first consider the one versus rest
approach. The problem is that there are regions of the feature space where the classification is ambiguous. Right. Let us say this is the linear classifier that separates class C 1 on one side and not-C 1 on the other side, and this is the linear classifier that separates class C 2 on one side and not-C 2 on the other. This region belongs to class 3, and, as it should be, it is not C 1 and not C 2. But there will be a region like this where the first classifier says C 1, the second classifier says C 2, and maybe my third classifier says not C 3. What should I do here? If I take any X in this region, one classifier says C 1, another says C 2, and the third says not C 3. The not C 3 part is fine; but is it C 1 or C 2? There is no way of knowing. So, this does not even define a proper function that maps every feature vector to a class. Right.

The same thing happens even if you use the C i versus C j approach. Once again, here is an example in the 3 class case. This is the C 1 versus C 2 line; that is the C 1 versus C 3 line; this is the C 2 versus C 3 line. In the centre triangle, where I have put the question marks, if I take any x, one classifier will say C 1, another will say C 2, and the third will say C 3: a perfect disagreement between the 3 classifiers. What should I assign as the class label?
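This perfect disagreement is easy to reproduce with three concrete pairwise linear classifiers (illustrative Python; the weight vectors are made up for the demonstration, not taken from the lecture's figure):

```python
import numpy as np

# Each pairwise classifier decides between its two classes by the sign of
# w^T x + b (positive -> first class of the pair, otherwise the second).
def vote(w, b, x, pos, neg):
    return pos if w @ x + b > 0 else neg

x = np.array([1.0, 1.0])                        # a point in the "?" triangle

votes = [
    vote(np.array([1.0, 0.0]), -2.0, x, 1, 2),  # C1-vs-C2 says: C2
    vote(np.array([0.0, 1.0]), -2.0, x, 2, 3),  # C2-vs-C3 says: C3
    vote(np.array([1.0, 1.0]), -1.0, x, 1, 3),  # C1-vs-C3 says: C1
]
print(votes)   # [2, 3, 1]: each class gets exactly one vote, a three-way tie
```

Majority voting cannot resolve this point: the three responses are cyclically inconsistent, so no class wins.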
So, there is always this issue when I have that many two class classifiers. If the responses they give are consistent, then, given any x, I can decide on the class label for that x. But there will always be regions in the feature space like this, where the responses are inconsistent, and one does not have any simple way of putting all those responses together into a proper class label. This is the problem with generalizing to multiple classes.

Of course, there is a better way of formulating this. We have seen it when we considered multi class Bayes classification for minimizing risk. The same thing can be done with linear discriminant functions. Essentially, the way to generalize to multiple classes is: we will have K functions, which I represent as g subscript s, with s going from 1 to K, and each function is a standard linear, rather, affine function: g s of X is W s transpose X plus b s. Right.

Now, I define a way of turning all these functions into a classifier. Once I have all these functions, if you give me an X, I will assign X to class C j if g j of X is greater than g s of X for all s. So, I calculate the value of each of these functions at X, and whichever function has the maximum value at that point X, I assign X to that class. Of course, this is not yet complete, because there can be ties. Right. So, g 1 of X and g 2 of X might have the same value, and that value might be bigger than g s of X for all other s. Should I put X in class 1 or class 2? But, once we have come this far, we can have a very simple and arbitrary rule for breaking ties. For example, I can say that if the maximum is attained at more than one j, then I will put X in the least such j.
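The decision rule, including the break-ties-to-the-least-index convention, can be sketched as follows (illustrative Python; conveniently, np.argmax already returns the first, that is, the least, index among tied maxima):

```python
import numpy as np

def classify(W, b, x):
    """Assign x to argmax_s g_s(x), with g_s(x) = W_s^T x + b_s.
    Ties are broken in favour of the least index, as in the lecture."""
    g = W @ x + b                 # all K discriminant values at once
    return int(np.argmax(g))      # argmax returns the first maximal index

# Hypothetical K=3, d=2 example chosen so that g_0(x) == g_1(x) > g_2(x).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
b = np.array([0.0, 0.0, 0.0])
x = np.array([2.0, 2.0])          # g = [2, 2, 0]: classes 0 and 1 tie
print(classify(W, b, x))          # 0: the tie goes to the least index
```

Any other fixed tie-breaking rule would work equally well; the point is only that the rule is deterministic, so the classifier becomes a proper function.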
So, if g 1 of X and g 2 of X are the two maximal function values at x, then I will put it in class 1. Right. So, I can have any arbitrary but fixed rule for breaking ties, so that it becomes a proper function: given any X, it uniquely defines a class label. So, essentially, this is how one would like to generalize linear discriminant functions to multiple classes. Recall that this is also how we generalized the Bayes classifier.

Now, to learn a linear classifier for the K class case, we need to learn all the K functions g s. The best way to do this is to use our vector valued idea of regression functions. So, what we do is make the class label a vector of K components. What does that mean? If X i belongs to C j, then in the training data the class label y i, instead of simply taking the value j, will actually be a K vector with the j'th component 1 and all others 0.

So, y i will now be a K vector, and to denote the j'th class I put 1 in the j'th component and 0 elsewhere. So, each y i will be one of the coordinate vectors. We have seen similar coding when we considered K-valued random variables in maximum likelihood estimation. The same coding we can use to make the class labels K vectors. Now, learning the K functions is the same as linear regression with vector valued targets: I am given X i, y i, where the y i are vectors, and I want to learn a linear function. So, I can learn the K functions using linear regression with vector valued targets. This is one standard way in which linear least squares can be extended to multiple classes. There are of course some slightly different ways; and this by itself does not tell you how one can generalize logistic regression or the Fisher linear discriminant, because the structure of learning is slightly different there.
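Putting the pieces together, here is a minimal end-to-end sketch (illustrative Python, not the lecture's code): one-of-K coding of the labels, one vector valued least squares fit, and the argmax decision rule:

```python
import numpy as np

def fit_multiclass_lsq(X, y, K):
    """One-of-K code the labels, then solve one vector valued least squares."""
    n = len(X)
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0                    # y_i -> coordinate vector e_{y_i}
    Xa = np.hstack([X, np.ones((n, 1))])        # constant column absorbs b_s
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)  # column s holds (W_s, b_s)
    return W

def predict(W, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(Xa @ W, axis=1)            # class with the largest g_s(x)

# Three well separated synthetic classes (made-up centers)
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [6, 0], [0, 6]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
y = np.repeat(np.arange(3), 30)

W = fit_multiclass_lsq(X, y, K=3)
acc = np.mean(predict(W, X) == y)
print(acc)   # close to 1.0 on this easy, well separated data
```

Unlike the one versus rest or one versus one constructions, this classifier is well defined everywhere: every x gets exactly one label from the argmax.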
But similar ideas can be used for generalizing logistic regression and the Fisher linear discriminant also. So, in the next class we will briefly look at how to generalize logistic regression to K classes. And then we will step back: since we now have several algorithms for learning classifiers, we will take a look at all of them and ask, is there some theoretical way in which we can say whether one classifier is good or not; how do we decide whether a classifier is optimal; is there a statistical way of defining the goal of learning a classifier? So, we will do a bit of statistical learning theory, and then we will return to learning non-linear classifiers. Now that we have finished all the linear classifiers, we will step back to do some statistical learning theory and then come back and do the non-linear classifiers.