0:00:34 | this work was supported in part by grants from the |
---|
0:00:37 | united states, uh, and the national science foundation |
---|
0:00:43 | let me go back |
---|
0:00:48 | okay so um |
---|
0:00:50 | the topic of the talk is on classification |
---|
0:00:53 | so |
---|
0:00:53 | in model-based classification, as you are all aware, |
---|
0:00:57 | you are given |
---|
0:00:58 | a prior distribution on the classes and, uh, |
---|
0:01:02 | the likelihood function of the observations given the class |
---|
0:01:05 | and given these two things we can come up with the uh minimum probability of error decision rule |
---|
0:01:10 | which is the well-known maximum a posteriori probability rule, |
---|
0:01:13 | which simplifies to the maximum likelihood rule for equally likely classes |
---|
0:01:17 | so that's model-based |
---|
0:01:19 | uh, classification: if the model is fully specified, then you can in principle come up with the optimum |
---|
0:01:24 | decision |
---|
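For reference, the minimum probability of error rule mentioned here is the standard MAP rule; with prior P(c) and class-conditional likelihood p(x|c) (standard notation, not taken from the slide):

```latex
\hat{c}(x) \;=\; \arg\max_{c}\; P(c)\, p(x \mid c)
\qquad\text{which, for equal priors, reduces to}\qquad
\hat{c}(x) \;=\; \arg\max_{c}\; p(x \mid c).
```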
0:01:25 | in contrast to this, uh, is what is known as learning-based classification, |
---|
0:01:29 | where everything is data driven |
---|
0:01:32 | so you are only given examples of the two classes, say, |
---|
0:01:35 | and you want to come up with an algorithm which separates these classes |
---|
0:01:39 | the challenge, however, that we wish to address in this scenario is |
---|
0:01:43 | that |
---|
0:01:44 | very often you encounter situations where you have high dimensional data, for example you have |
---|
0:01:48 | surveillance video, which might be gigabytes of data, |
---|
0:01:51 | you have hyperspectral images you have you know synthetic aperture radar images and so forth |
---|
0:01:56 | so you get high dimensional data on the one hand, and very few examples |
---|
0:02:00 | compared to the dimensionality of the data on the other hand |
---|
0:02:03 | now you might say well why not just use a generic uh |
---|
0:02:07 | dimensionality reduction technique like, |
---|
0:02:09 | say, PCA or LLE or isomap |
---|
0:02:12 | well, on the one hand these are really generic methods |
---|
0:02:16 | which are you know not |
---|
0:02:17 | really devised to solve the classification problem, so they optimize sort of other generic, uh, measures |
---|
0:02:24 | such as preserving data distances and so forth, on the one hand, |
---|
0:02:27 | and on the other hand |
---|
0:02:29 | they haven't been designed with a view to the high dimensionality problem, where you have very few examples |
---|
0:02:34 | so our approach is to sort of exploit |
---|
0:02:37 | what i shall call the |
---|
0:02:38 | latent low dimensional sensing structure |
---|
0:02:41 | now to make |
---|
0:02:42 | this clear, let's take a cartoon example |
---|
0:02:45 | let's suppose that |
---|
0:02:47 | you are given examples of each class, only two classes here, |
---|
0:02:50 | and a learning-based classification algorithm, such as an SVM or a kernel SVM, |
---|
0:02:54 | would simply take the data and learn a classification rule |
---|
0:02:57 | and completely ignore |
---|
0:02:58 | whether any sensing structure was present or not |
---|
0:03:02 | in contrast to this is |
---|
0:03:04 | what i would call sensing-aware classification, where let's say we know that these observations came from some |
---|
0:03:09 | underlying sensing process, |
---|
0:03:11 | say for example a blurring operator, |
---|
0:03:13 | where we may have either full or partial information about the blurring operator, |
---|
0:03:17 | together with some noise, |
---|
0:03:19 | and the question is: can we exploit knowledge of the fact that these observations came from some underlying sensing structure |
---|
0:03:25 | to improve the classification performance |
---|
0:03:28 | now |
---|
0:03:29 | what we are actually interested in studying here is the fundamental asymptotic limits of classification in |
---|
0:03:34 | the |
---|
0:03:34 | scenario of high dimensional data and very few samples |
---|
0:03:38 | now, to make things more concrete, let's assume that the, uh, data dimension and possibly the |
---|
0:03:43 | number of samples |
---|
0:03:44 | uh, go to infinity |
---|
0:03:45 | while the samples per dimension |
---|
0:03:48 | go to zero |
---|
0:03:49 | so this is a model of having very few samples of very high dimensional data |
---|
0:03:54 | but |
---|
0:03:55 | in contrast to a number of studies in the literature which have focused on |
---|
0:03:59 | an asymptotically easy situation, we want to fix the problem difficulty asymptotically, meaning that even as the |
---|
0:04:04 | dimension increases to infinity |
---|
0:04:06 | it's not going to be easy to classify |
---|
0:04:09 | and |
---|
0:04:10 | what this essentially means is that we are fixing the signal to noise ratio as the problem scales, and this |
---|
0:04:15 | will be made precise in the mathematical model |
---|
0:04:18 | the fundamental question that we wish to answer is: what is the asymptotic classification performance |
---|
0:04:23 | uh, in this asymptotic regime |
---|
0:04:25 | does the probability of error go to half, which means |
---|
0:04:27 | it is no better than random guessing, |
---|
0:04:29 | or does it go to the optimum bayes |
---|
0:04:31 | probability of error, which by the way is not equal to half |
---|
0:04:36 | and not equal to zero, which is what i mean by fixing the problem difficulty, or does it go to something else |
---|
0:04:41 | now |
---|
0:04:42 | to make things more concrete i have to |
---|
0:04:44 | posit a model, so the rest of the talk is based on this specific |
---|
0:04:48 | model |
---|
0:04:49 | because we need to understand the core of these issues, we decided to use a very simple model |
---|
0:04:53 | the model is simple in that |
---|
0:04:55 | the observations are made up of um |
---|
0:04:58 | uh, there is a, uh, mean location which is lying in some sensing subspace; think |
---|
0:05:04 | of H as the sensing subspace |
---|
0:05:06 | and given the class label you are at this, uh, mean location or another one, |
---|
0:05:10 | and then |
---|
0:05:11 | you have a scalar gaussian perturbation along the H axis, |
---|
0:05:16 | followed by a vector gaussian noise perturbation which takes you outside this |
---|
0:05:21 | subspace into the general P dimensional space |
---|
0:05:24 | so that's the, uh, sensing model for which we, uh, analyse the performance |
---|
0:05:28 | and the class conditional means of each class are different, so we know that the means |
---|
0:05:32 | lie along a subspace, and that |
---|
0:05:35 | there's a scalar perturbation component along the subspace, followed by a vector gaussian perturbation that takes you outside the |
---|
0:05:40 | subspace |
---|
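A minimal sketch of one plausible reading of the sensing model just described (the ±h class means, the noise levels sigma_along and sigma_noise, and all names are illustrative assumptions, not taken from the talk):

```python
import numpy as np

def sample_observations(h, n, label=+1, sigma_along=1.0, sigma_noise=1.0, rng=None):
    """Draw n observations from one class of the latent one-dimensional sensing model:
    class mean (+h or -h, lying along the sensing direction h), plus a scalar Gaussian
    perturbation along h, plus an isotropic Gaussian noise vector that leaves the subspace."""
    rng = np.random.default_rng(rng)
    p = h.shape[0]
    u = rng.normal(0.0, sigma_along, size=(n, 1))   # scalar perturbation along h
    z = rng.normal(0.0, sigma_noise, size=(n, p))   # vector noise in R^p
    return label * h + u * h + z                    # shape (n, p)
```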
0:05:43 | so that's the, uh, simple model, and the goal now is that you are given, uh, a number of |
---|
0:05:47 | P dimensional training vectors from each class, |
---|
0:05:51 | and you have to come up with a classifier |
---|
0:05:52 | and understand the asymptotic classification performance for different, uh, scenarios |
---|
0:05:57 | now |
---|
0:05:58 | we chose the model to be simple to keep things tractable, since we are after an analytical understanding, but |
---|
0:06:02 | even though it's fairly simple |
---|
0:06:04 | there are scenarios where it does make sense, for example a sensor network scenario, |
---|
0:06:08 | where you could have, let's say, P sensors, P being the dimension of the observation in the previous slide, |
---|
0:06:13 | uh, each component being a sensor in this case, |
---|
0:06:15 | observing some kind of a weak signal, if you will, |
---|
0:06:18 | and under one class you are observing H, which is the signal, |
---|
0:06:21 | plus noise, |
---|
0:06:23 | and under the other class you observe the negative of H plus noise |
---|
0:06:27 | and the point of course is that |
---|
0:06:28 | you are given n observations of the weak signal across the sensors |
---|
0:06:31 | from each class, and |
---|
0:06:33 | the question is |
---|
0:06:34 | you have to come up with a classifier which decides, |
---|
0:06:36 | uh, whether the next observation belongs to the positive class or |
---|
0:06:40 | the negative class |
---|
0:06:42 | now, moving ahead, the kinds of classifiers that for the rest of the talk we will |
---|
0:06:46 | consider are the following |
---|
0:06:48 | we will look at the baseline, uh, classifier, which is the full bayes classifier, which means you know everything about |
---|
0:06:53 | the model, so what is |
---|
0:06:54 | the test which implements that; we review it first |
---|
0:06:57 | just to get familiar with the notation there |
---|
0:06:59 | then we want to look at what i call the unstructured, |
---|
0:07:02 | uh, classifier, which means that i know that these are conditionally gaussian observations, but i don't know the means |
---|
0:07:07 | or the variances and covariances, |
---|
0:07:09 | so i would then have to estimate everything |
---|
0:07:11 | using maximum likelihood estimates |
---|
0:07:13 | how does that perform |
---|
0:07:14 | and then |
---|
0:07:15 | and finally we look at structure based, uh, |
---|
0:07:17 | classification approaches |
---|
0:07:19 | in the first case we look at the structure-aware classifier where the exact sensing subspace is known; |
---|
0:07:22 | how do things behave in that case |
---|
0:07:25 | in the second case we go for a structured maximum likelihood, |
---|
0:07:28 | which means that |
---|
0:07:29 | we estimate the parameters |
---|
0:07:31 | knowing, uh, that there is a latent low dimensional subspace, but without knowing the subspace itself |
---|
0:07:35 | and finally um |
---|
0:07:36 | we will see that |
---|
0:07:37 | we have negative results in these cases, and that will motivate, uh, a structured sparsity, uh, model |
---|
0:07:44 | okay, so for the baseline model |
---|
0:07:45 | so through a likelihood ratio test you can, uh, churn through the algebra and you can come up with |
---|
0:07:50 | what is, uh, the optimal decision rule |
---|
0:07:52 | it's going to be a linear discriminant rule and it's based on these parameters delta and mu; uh, it's |
---|
0:07:57 | not important to know exactly what the expressions are |
---|
0:08:00 | delta stands for the difference in the class conditional means, |
---|
0:08:03 | mu is the average of the class conditional means, and sigma is the covariance of the observations, so |
---|
0:08:08 | the decision rule depends on these parameters |
---|
0:08:11 | and, uh, the misclassification probability can be evaluated in closed form |
---|
0:08:14 | it is in terms of the Q function, which is nothing but the tail probability of a standard normal, |
---|
0:08:18 | and in terms of these, uh, parameters which were written up |
---|
0:08:22 | here |
---|
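For reference, the usual closed form for this linear discriminant rule and its error, for two equally likely Gaussian classes with common covariance Σ, mean difference δ and mean average μ̄ (standard expressions; the slide's notation may differ slightly):

```latex
\text{decide class } +1 \iff \delta^{\top}\Sigma^{-1}(x-\bar{\mu}) > 0,
\qquad
P_e = Q\!\left(\tfrac{1}{2}\sqrt{\delta^{\top}\Sigma^{-1}\delta}\right),
\qquad
Q(t)=\int_{t}^{\infty}\tfrac{1}{\sqrt{2\pi}}\,e^{-s^{2}/2}\,ds .
```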
0:08:23 | now, the important thing is that we want to fix the difficulty of the problem as the dimension scales, |
---|
0:08:28 | which means that i have to fix the argument of the Q function |
---|
0:08:31 | and that essentially amounts to fixing almost everything here, in particular the energy of the sensing vector |
---|
0:08:37 | H |
---|
0:08:38 | so we want to keep the norm of H fixed as things scale, and that's an important, uh, part of |
---|
0:08:42 | this work |
---|
0:08:44 | so that's what the full bayes rule looks like |
---|
0:08:48 | now let's move to the case where we know that it's conditionally gaussian, but, you know, we don't know any |
---|
0:08:52 | of these parameters, so |
---|
0:08:54 | this is what the bayes classifier looks like, |
---|
0:08:56 | but i don't know, i don't know the model, |
---|
0:08:59 | so i have to estimate all these parameters from the data i am given |
---|
0:09:03 | so one approach, a natural approach, is to use a plug-in estimator, which means estimate all these parameters |
---|
0:09:08 | using the data given |
---|
0:09:10 | and plug them into the optimum decision rule |
---|
0:09:13 | then you get what is known as the, uh, empirical fisher rule |
---|
0:09:17 | and you can analyse the, uh, probability of error (you can get a closed form expression) and |
---|
0:09:22 | look at what happens to that probability of error as |
---|
0:09:25 | the samples per dimension go down to zero while the dimension increases to infinity, |
---|
0:09:28 | but you fix the difficulty level |
---|
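A sketch of what such a plug-in (empirical Fisher) rule might look like for two equally likely classes, using standard sample estimates; the function and variable names are illustrative, not the paper's:

```python
import numpy as np

def empirical_fisher_rule(X_pos, X_neg):
    """Plug-in linear discriminant: estimate the mean difference, the midpoint and the
    pooled covariance from the training data, then plug them into the optimal rule."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    delta_hat = mu_pos - mu_neg
    mu_bar = 0.5 * (mu_pos + mu_neg)
    pooled = np.cov(np.vstack([X_pos - mu_pos, X_neg - mu_neg]).T)
    w = np.linalg.pinv(pooled) @ delta_hat   # pseudo-inverse: sample covariance is singular when n < p
    return lambda x: +1 if w @ (x - mu_bar) > 0 else -1
```

In the regime discussed here (samples per dimension going to zero) the pooled covariance estimate is badly conditioned, which is one intuitive way to see why this rule degrades.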
0:09:31 | now |
---|
0:09:32 | it turns out, |
---|
0:09:33 | not surprisingly, that the probability of error goes to half, |
---|
0:09:36 | which means it is no better than random guessing |
---|
0:09:38 | now this is not surprising because |
---|
0:09:40 | you're trying to estimate far more parameters than you have data for |
---|
0:09:44 | so asymptotically you don't catch up with the, uh, load of information that you have to estimate |
---|
0:09:50 | so ignoring the structure and estimating all the parameters is not a good idea, and that, |
---|
0:09:54 | uh, leads us to |
---|
0:09:56 | structured, uh, approaches |
---|
0:09:58 | so, as a reminder, that's the sensing model |
---|
0:10:01 | and let's suppose, at one extreme, that i know the sensing structure, which means that i know the subspace |
---|
0:10:06 | in which the observations lie, |
---|
0:10:08 | okay, the underlying one dimensional subspace |
---|
0:10:11 | so a natural thing to do in this case is: why not project everything down to the one dimensional subspace, |
---|
0:10:15 | treat it as a scalar learning based classification problem, |
---|
0:10:19 | estimate all the parameters |
---|
0:10:21 | in that reduced one dimensional problem using the data you have, via maximum likelihood estimates, and |
---|
0:10:26 | see what happens |
---|
0:10:27 | okay |
---|
0:10:28 | that leads you to, uh, what i call the projected empirical fisher rule |
---|
0:10:32 | and, uh, there is an exact expression for it shown here; as i said, the exact expression is not very important, |
---|
0:10:37 | but the idea is that |
---|
0:10:38 | you know the sensing subspace, you project everything down to it, and reduce it to a one dimensional problem |
---|
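A sketch of the projection step when the sensing direction h is known exactly (illustrative naming; equal priors assumed):

```python
import numpy as np

def projected_fisher_rule(X_pos, X_neg, h):
    """Project all training data onto the known sensing direction h, then solve the
    resulting scalar classification problem with simple plug-in estimates."""
    u = h / np.linalg.norm(h)                # unit vector spanning the sensing subspace
    s_pos, s_neg = X_pos @ u, X_neg @ u      # scalar projections of each class
    m_pos, m_neg = s_pos.mean(), s_neg.mean()
    threshold = 0.5 * (m_pos + m_neg)        # midpoint test for equal variances and priors
    sign = 1.0 if m_pos > m_neg else -1.0
    return lambda x: +1 if sign * (x @ u - threshold) > 0 else -1
```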
0:10:43 | and the, uh, probability of error is shown here, |
---|
0:10:46 | asymptotically, as the number of samples goes to infinity |
---|
0:10:49 | it turns out, not surprisingly again, that |
---|
0:10:51 | if you keep the difficulty level of the problem fixed and |
---|
0:10:54 | take the number of samples to infinity, |
---|
0:10:57 | the probability of, uh, error goes to the bayes error probability, which is the optimum thing |
---|
0:11:02 | you can do |
---|
0:11:03 | now this is, uh, to be expected because |
---|
0:11:06 | you know there is a latent one dimensional structure, uh, in this |
---|
0:11:10 | problem and you know it exactly, so when you project down to that problem, |
---|
0:11:14 | the actual dimension of the data becomes irrelevant, |
---|
0:11:17 | so P doesn't appear in this equation at all |
---|
0:11:20 | you have a scalar classification problem, and as we know, when you, uh, do maximum likelihood |
---|
0:11:25 | estimation with an increasing, uh, number of samples you can asymptotically get |
---|
0:11:29 | optimal performance |
---|
0:11:30 | when the data dimension is fixed |
---|
0:11:32 | so in this case, effectively, the dimensionality reduction |
---|
0:11:35 | uh, exploits the inherent low dimensional element of this problem |
---|
0:11:41 | now |
---|
0:11:42 | but the idea here is that in general we don't even know the sensing structure, |
---|
0:11:46 | okay, we don't know the sensing subspace, so one might want to estimate the sensing subspace from the data |
---|
0:11:50 | you have |
---|
0:11:52 | so what would be one approach to estimate the sensing subspace |
---|
0:11:55 | well, one thing we know is that if we look at the difference in the class conditional means, delta, |
---|
0:11:59 | it's actually aligned with H, |
---|
0:12:01 | okay |
---|
0:12:02 | so in view of that, a natural thing to do is to use the maximum likelihood estimate |
---|
0:12:05 | of delta, which was done before, |
---|
0:12:08 | and use that as a proxy for H, |
---|
0:12:10 | and then project everything down to that delta hat, |
---|
0:12:14 | and then you're back to the previous situation |
---|
0:12:17 | and, uh, again you get a projected empirical fisher rule, except that the direction onto which |
---|
0:12:21 | you're projecting is not the true H, because it's not known to you, but the estimated H |
---|
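In code, this naive structured approach would look roughly like the following, reusing the projection sketch above but feeding it the estimated direction delta hat instead of the true h (again an illustrative sketch, not the paper's exact procedure):

```python
def naive_structured_rule(X_pos, X_neg):
    """Estimate the sensing direction by the difference of sample means (the ML estimate
    of delta, which is aligned with h), then project onto that estimate."""
    delta_hat = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    return projected_fisher_rule(X_pos, X_neg, delta_hat)
```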
0:12:26 | what do you expect to get here |
---|
0:12:28 | it turns out that if you analyse the probability of misclassification error |
---|
0:12:32 | as the samples per dimension go to zero |
---|
0:12:35 | and the, uh, difficulty level is fixed, |
---|
0:12:37 | the probability of classification error goes to half, |
---|
0:12:41 | which means that even though you knew there was an underlying one dimensional sensing structure, and you knew that |
---|
0:12:45 | delta was aligned with it, |
---|
0:12:47 | trying to estimate it using a maximum likelihood kind of estimate |
---|
0:12:50 | didn't, |
---|
0:12:51 | doesn't do the job |
---|
0:12:52 | okay, you are no better than random guessing asymptotically, |
---|
0:12:55 | and this all suggests that you need additional sensing structure to exploit here |
---|
0:13:00 | now, although this was not presented in our icassp paper, um, since then we have been able to show that this |
---|
0:13:04 | is fundamental, meaning that |
---|
0:13:05 | for this particular problem we are analysing here, |
---|
0:13:08 | without any additional structure on H, |
---|
0:13:10 | it's impossible for any, uh, learning algorithm |
---|
0:13:13 | to do any better than random guessing asymptotically |
---|
0:13:16 | so that's not presented at icassp, it will be appearing elsewhere, but it's actually a fundamental lower |
---|
0:13:20 | bound on the misclassification probability, which actually goes to half |
---|
0:13:24 | if you don't make any assumptions on the sensing structure |
---|
0:13:27 | so that motivates the need for additional structure on H |
---|
0:13:30 | and one of the structures that we would like to study, which of course, uh, is a popular thing |
---|
0:13:35 | these days, |
---|
0:13:36 | is, uh, sparsity, okay |
---|
0:13:38 | so |
---|
0:13:38 | uh |
---|
0:13:39 | let's say that the signal, uh, the subspace direction H, is sparse, meaning that, |
---|
0:13:44 | uh, the energy in H |
---|
0:13:46 | is localised in a few components |
---|
0:13:48 | compared to the number of dimensions |
---|
0:13:50 | so in particular, let's look at the tail energy, that is, the energy of the vector H beyond |
---|
0:13:54 | a certain point in its components, |
---|
0:13:56 | and there are P components; |
---|
0:13:58 | so, uh, let's pick a truncation point d, um, and look at the energy beyond this truncation, in the |
---|
0:14:02 | tail of the, |
---|
0:14:04 | uh, H vector here |
---|
0:14:05 | as d and P go to infinity, you want this tail energy to go to zero |
---|
0:14:10 | so that is essentially a statement about the sparsity, |
---|
0:14:13 | or asymptotic compressibility, of the signal |
---|
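One way to write the tail-energy condition just described, with h = (h_1, ..., h_P) and truncation point d (my notation, chosen to match the verbal description):

```latex
\varepsilon(d,P) \;=\; \sum_{i>d} h_i^{2} \;\longrightarrow\; 0
\quad\text{as } d,P\to\infty,
\qquad\text{while } \|h\|^{2}=\sum_{i=1}^{P}h_i^{2}\ \text{stays fixed.}
```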
0:14:17 | so in this case, uh, a natural thing to do is, so earlier we used the |
---|
0:14:22 | maximum likelihood estimate delta hat of delta |
---|
0:14:25 | and that didn't work |
---|
0:14:27 | but now you know something more about H, namely that its tail energy goes to zero, so one interesting |
---|
0:14:32 | thing you can try is: why not truncate that estimator, |
---|
0:14:35 | the components of the estimator, |
---|
0:14:36 | and use that as a proxy for H instead |
---|
0:14:39 | the idea is to keep the estimated entries only for all components less than some truncation parameter T |
---|
0:14:44 | and then set to zero everything beyond |
---|
0:14:47 | so that leads to a truncation based estimate of the direction along H |
---|
0:14:50 | and then use that, |
---|
0:14:52 | and let's see how things behave |
---|
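A sketch of the truncation step, again reusing the illustrative helpers above; the truncation parameter T and its growth rate are exactly the assumptions discussed next:

```python
def truncated_structured_rule(X_pos, X_neg, T):
    """Keep only the first T components of the ML direction estimate delta_hat,
    zero out the (noisy) tail, and project onto the truncated direction."""
    delta_hat = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    delta_trunc = delta_hat.copy()
    delta_trunc[T:] = 0.0                    # discard everything beyond the truncation point
    return projected_fisher_rule(X_pos, X_neg, delta_trunc)
```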
0:14:54 | we can show that as the dimension, the number of samples and the truncation point go to infinity, |
---|
0:14:59 | where the truncation point is chosen in such a way |
---|
0:15:02 | that it grows slower than the number of samples, |
---|
0:15:05 | then |
---|
0:15:06 | asymptotically we can estimate |
---|
0:15:08 | the signal subspace perfectly, meaning that the mean squared error between the truncated estimate and |
---|
0:15:14 | the true direction goes to zero; we can asymptotically estimate the one dimensional subspace; and of |
---|
0:15:19 | course if we can estimate the subspace perfectly asymptotically, it's unsurprising then that, |
---|
0:15:22 | uh, as things scale and you keep the difficulty level fixed, |
---|
0:15:25 | the probability of classification error goes to the bayes probability of error |
---|
0:15:28 | in other words, knowledge of the sensing structure |
---|
0:15:30 | plus additional sparsity assumptions, or some additional structural information, |
---|
0:15:34 | can asymptotically give you the, uh, bayes probability of error |
---|
0:15:40 | here is a little simulation that, uh, reinforces some of these insights |
---|
0:15:43 | so here we have fixed the, uh, bayes probability of error, the difficulty, to be 0.1, fixed throughout |
---|
0:15:48 | as the dimension scales, |
---|
0:15:50 | the energy of H is fixed to some value, and here are some parameters chosen in the model, |
---|
0:15:55 | and the number of samples grows slower than the ambient dimension, as shown here |
---|
0:16:01 | um |
---|
0:16:01 | the truncation point, |
---|
0:16:03 | uh, is chosen to grow slower than the number of samples, as shown here, |
---|
0:16:08 | and here we assume a polynomial decay for H |
---|
0:16:11 | and in the plot here, for example, the green line is the true H, |
---|
0:16:14 | or rather, one particular realisation of H, |
---|
0:16:16 | and |
---|
0:16:18 | the, uh, red line is actually the noisy, um, maximum likelihood estimate |
---|
0:16:23 | delta hat |
---|
0:16:24 | they are normalized to have unit energy |
---|
0:16:26 | as shown here, |
---|
0:16:27 | and the blue one is a truncated version of the red one |
---|
0:16:30 | the truncation point here is exactly twenty or so |
---|
0:16:34 | on the right side is the probability of error on the vertical axis versus the ambient dimension |
---|
0:16:39 | so as the dimension scales, |
---|
0:16:41 | uh, the unstructured, uh, approach, where you don't know anything about the sensing structure and you try to estimate all the |
---|
0:16:48 | parameters using maximum likelihood estimates, |
---|
0:16:50 | will approach the probability of error of |
---|
0:16:52 | one half |
---|
0:16:54 | on the other hand, uh, if you knew there was a one dimensional sensing subspace but you estimated it naively, |
---|
0:16:59 | using |
---|
0:17:00 | simply delta hat, |
---|
0:17:01 | which is the maximum likelihood estimate, |
---|
0:17:03 | then also you get one half |
---|
0:17:05 | but if you use the truncation based estimate, |
---|
0:17:08 | you approach the bayes optimal performance |
---|
0:17:12 | so, to conclude my talk, |
---|
0:17:13 | uh, the |
---|
0:17:14 | the key take-away points are that |
---|
0:17:16 | there are many problems where you encounter situations in which the number of samples is far fewer than the |
---|
0:17:21 | ambient, uh, data dimension |
---|
0:17:24 | in addition, there often exists a latent low dimensional sensing structure which can be exploited |
---|
0:17:30 | if you totally ignore the sensing structure and naively try to estimate everything using maximum likelihood estimates, uh, |
---|
0:17:35 | you would probably be no better than random guessing in many scenarios |
---|
0:17:39 | and even having general knowledge of the sensing structure, like knowing that there is a one dimensional signal H but i |
---|
0:17:43 | don't know what it is, |
---|
0:17:44 | and trying to estimate it naively, |
---|
0:17:46 | cannot do the job |
---|
0:17:49 | so you only recover it if you have general knowledge of the sensing structure plus some additional structure on H; |
---|
0:17:54 | then you can often recover the optimum, |
---|
0:17:56 | asymptotically optimal, performance |
---|
0:17:59 | and that brings me to the end of my talk |
---|
0:18:16 | yeah i think |
---|
0:18:17 | i know which i mean |
---|
0:18:19 | was gonna be departing |
---|