0:00:14 | Thanks. |
0:00:15 | So the title might not make it obvious what the talk is about, but hopefully by the end it will. |
0:00:21 | I'm from the IBM T. J. Watson Research Center. |
0:00:27 | Alright, so the topic of the paper is sensor networks. |
0:00:32 | Just to get everyone on the same page: these are collections of spatially distributed nodes that take measurements and communicate, |
0:00:39 | and we're interested in them for detection or classification tasks, |
0:00:44 | for example in environmental monitoring or surveillance. |
0:00:48 | The one question all of us are interested in this week is whether the Iceland volcano is still erupting or not. |
0:00:56 | The measurements are of various types: temperature, sound, pressure, vibration, et cetera. |
0:01:02 | These measurements exhibit spatial correlation, |
0:01:06 | and the sensor nodes are usually resource constrained. |
0:01:11 | So, detection and classification: they're basically the same problem, but I want to point out the differences. |
0:01:18 | In the detection problem, we use measurements to make decisions about binary hypotheses, and in this case |
0:01:24 | the likelihood functions of the measurements and the prior probabilities are known, |
0:01:28 | and the likelihood ratio test is optimal in this case. |
0:01:33 | For the sensor network setting, the distributed detection problem has been studied quite widely. |
0:01:38 | With the classification problem, specifically the supervised classification problem, we again use measurements to make decisions about binary classes. |
0:01:47 | In this case the likelihood functions of the measurements and the prior probabilities are not known; |
0:01:52 | what we are given instead is a labeled training set: measurement vectors with their true classes. |
0:01:58 | Machine learning algorithms are applied, such as, for example, linear discriminant analysis, which I'll talk about in more detail; it's a very classical technique. |
0:02:08 | The difference between detection and supervised classification is this idea of overfitting: |
0:02:15 | you don't want to fit the labeled training set you're given too much. |
0:02:22 | This distributed supervised classification problem doesn't have as much previous work as the distributed detection problem. |
0:02:28 | Practically, oftentimes we don't have knowledge of the probability densities a priori, |
0:02:36 | and it's much easier to imagine a situation in which training samples can be acquired, so that's why we consider the classification problem. |
0:02:43 | This is just a high-level overview. In what I'm going to talk about specifically, we're not really going to consider network constraints, communication, et cetera, |
0:02:54 | just, more or less, what you can learn from data in a sensor network. |
0:02:59 | So the outline of the talk: I'll say a few words about the supervised classification problem and about statistical learning theory, |
0:03:10 | go through linear discriminant analysis, especially focusing on approximations to its generalization error that have been developed, |
0:03:18 | go over a Gauss-Markov random field sensor model that we're applying, |
0:03:23 | derive the Mahalanobis distance for that sensor model and use it to derive a generalization error expression for sensor networks, |
0:03:33 | provide a geometric probability approximation to the Mahalanobis distance in order to simplify things, and then get to some simulation results at the end. |
0:03:42 | Just going over the notation a bit: we have a true hypothesis or class label Y, which is either plus one or minus one, |
0:03:50 | for example, erupting or not. |
0:03:54 | We have a noisy measurement vector X, which is in R^p, |
0:03:58 | and then we have a joint probability density function f(x, y), |
0:04:02 | where the x's are the measurements and y is the label. |
0:04:05 | We want to learn a decision rule ŷ(x), a mapping from the space of measurements to plus one and minus one, |
0:04:12 | where we learn this decision rule from n training samples (x_1, y_1) through (x_n, y_n), |
0:04:16 | but we don't apply it to the training samples; we apply it to unseen samples x drawn from the same distribution f(x, y). |
0:04:23 | The thing we're interested in characterizing is the generalization error: the probability that the decision rule will be wrong on new, unseen samples from the same distribution. |
0:04:31 | The training error we can measure empirically on the training set itself: how many of the training samples the rule gets wrong. |
0:04:41 | The idea of overfitting, and the structural risk minimization principle, is that as the complexity of the decision rule increases, the training error can be driven to zero, but that overfits; the generalization error, the thing we want to minimize, is optimized at some intermediate complexity level. |
0:04:58 | In general, statistical learning theory analyses yield characterizations of the generalization error that are usually very loose. |
0:05:07 | Let me just read a quick quote that gives the idea: "One should not be concerned about the quantitative value of the bound, but rather about its fundamental form, |
0:05:15 | about the terms that appear in the bound. In that respect, a useful bound is one which allows us to understand which quantities are involved in the learning process. |
0:05:23 | As a result, performance bounds should be used for what they are good for; they should not be used to actually predict the value of the expected error," et cetera. |
0:05:31 | But what if we actually are interested in developing approximations or bounds for the generalization error that we can use as an optimization criterion? |
0:05:41 | We could use modern statistical learning theory results, things like VC dimension, Rademacher complexity, et cetera, |
0:05:47 | but we want to do this for sensor network design optimization, so we want some sort of expression that we can actually optimize for generalization. |
0:05:58 | So we turn to the former Soviet Union literature, where very high quality approximations to the generalization error were developed, specifically for linear discriminant analysis and some of its variations. |
0:06:11 | The main contributor to this line of work was Šarūnas Raudys, |
0:06:17 | and the expressions developed in this literature can be used as optimization criteria; as we'll see in some of the simulation results later, they're very tight approximations. |
0:06:31 | Specifically, linear discriminant analysis is the decision rule I have on the screen. |
0:06:37 | This decision rule ŷ(x) is basically what you would do to get the Bayes rule if you had Gaussian distributions for the two classes with different means and the same (or different) covariance, |
0:06:51 | except that here the sample means and sample covariance are plugged in. |
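As a rough illustration of the plug-in rule being described here (a minimal sketch in Python, not the speaker's code; the variable names and the use of a pooled sample covariance are my assumptions):

```python
import numpy as np

def lda_decision_rule(X_train, y_train, x_new):
    """Plug-in linear discriminant analysis: the Bayes rule for two Gaussians
    with equal priors and a shared covariance, using sample estimates."""
    X_plus = X_train[y_train == +1]
    X_minus = X_train[y_train == -1]
    mu_plus = X_plus.mean(axis=0)
    mu_minus = X_minus.mean(axis=0)
    # Pooled sample covariance (assumed shared between the two classes).
    Sigma_hat = np.cov(np.vstack([X_plus - mu_plus, X_minus - mu_minus]).T)
    J_hat = np.linalg.inv(Sigma_hat)       # sample inverse covariance
    w = J_hat @ (mu_plus - mu_minus)       # discriminant direction
    b = -0.5 * (mu_plus + mu_minus) @ w    # threshold for equal priors
    return 1 if x_new @ w + b >= 0 else -1
```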
0:06:59 | When the true likelihood functions are Gaussian and the true prior probabilities are equal, but we don't know this and we're learning ŷ from the n training samples, the generalization error approximation given by Raudys et al. is the following. |
0:07:16 | The generalization error is approximately Φ of an expression, where Φ is the CDF of the Gaussian distribution, and we have these terms involving δ, the Mahalanobis distance, which I'll get to in a second; |
0:07:28 | p is the dimensionality, which in our case will be the number of sensors, and n is the number of training samples. |
0:07:37 | It's not a simple expression, but it is very easy to evaluate, and we'll see how close it is. |
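The slide's exact formula is not legible in this transcript, so here is one commonly cited Raudys/Deev-style approximation of this flavor as a placeholder (a hedged sketch only; the precise form and constants are my assumption, and the paper's own expression should be substituted where it differs):

```python
import numpy as np
from scipy.stats import norm

def lda_generalization_error_approx(delta, p, n):
    """Raudys/Deev-style approximation of the expected LDA error for two
    Gaussian classes with equal priors, n training samples in dimension p.
    Treat the exact form as an assumption; the talk's expression may differ."""
    term_mean = 1.0 + 4.0 * p / (n * delta**2)   # effect of estimating the means
    term_cov = 1.0 + p / (n - p)                 # effect of estimating the covariance
    return norm.cdf(-0.5 * delta / np.sqrt(term_mean * term_cov))
```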
0:07:43 | This δ² is the Mahalanobis distance: μ_+ and μ_- are the means of the two classes, and J_- and J_+ are the inverse covariance matrices; that's one of the terms in the expression. |
0:07:59 | We want to use this generalization error approximation to analyze sensor networks with spatial correlation, and as I said before, we're not concerned in this talk with network communication or power. |
0:08:13 | Within the sensor network setup specifically, we have p sensors, each taking a scalar-valued measurement X_1 through X_p, so the combined measurement vector is this big X, which is in R^p. |
0:08:27 | The sensors are deployed randomly on the plane according to some distribution f_UV(u, v), which is supported on a square with area p, so the region grows as more sensors are placed. |
0:08:38 | The generating likelihood functions are Gaussian, as I said before, with μ_- and μ_+ as the means and Σ_- and Σ_+ as the covariances. |
0:08:49 | We model the covariance structure, the spatial correlation between sensor measurements, using a Gauss-Markov random field model, which I'll get to in a second. |
0:08:57 | For simplicity we'll let μ_- be the all-zeros vector and μ_+ be the all-ones vector, and we'll let the two covariances be equal; call them Σ, and call the inverse covariance matrix J. |
0:09:14 | The actual Markov random field we use has nearest-neighbor dependency: we construct a Euclidean undirected nearest-neighbor graph between the sensors, |
0:09:25 | so there's an edge between sensor i and sensor j if i is the nearest neighbor of j or if j is the nearest neighbor of i, |
0:09:31 | and we denote the edge set of this graph as E. |
0:09:35 | Then we use this graph to define the Markov random field covariance, in three parts: the diagonal elements of Σ are all equal to σ²; |
0:09:47 | the elements of Σ corresponding to edges in the nearest-neighbor graph are σ² times g(d_ij), where d_ij is the distance between sensor i and sensor j, |
0:10:01 | and this g is a monotonically decreasing function that encodes correlation decay, so the farther apart two sensors are, the less correlated they are; this is related to what is known as a semivariogram; |
0:10:14 | and then the off-diagonal elements of J corresponding to non-edges in the nearest-neighbor graph are all zero. |
0:10:22 | This is a model that has been used, for example, by Anandkumar and Swami in some of their work as well. |
0:10:30 | Now we want to get this δ², the Mahalanobis distance. |
0:10:36 | With the simplifying assumptions of μ_- being the all-zeros vector, μ_+ being the all-ones vector, and the two inverse covariance matrices being equal, |
0:10:44 | it turns out the Mahalanobis distance δ² is just the sum of the entries of the inverse covariance matrix. |
0:10:53 | If we substitute the expressions for the covariance matrices and do a bit of algebra, we find that |
0:11:01 | the Mahalanobis distance δ² equals p/σ² (p again is the number of sensors) |
0:11:07 | minus 2/σ² times a sum over the edge set E |
0:11:12 | of g/(1 + g), where the arguments to g are again the distances between the sensors. |
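A minimal sketch of this computation as stated (function and variable names are mine; only the nearest-neighbor edge set and the formula δ² = p/σ² minus (2/σ²) times the sum over edges of g/(1 + g) come from the talk):

```python
import numpy as np

def mahalanobis_gmrf(locations, sigma2, g):
    """δ² for the nearest-neighbor Gauss-Markov field described above.
    `locations` is a (p, 2) array of sensor coordinates; `g` is the
    correlation-decay (semivariogram-type) function."""
    p = len(locations)
    dists = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nn = dists.argmin(axis=1)                     # nearest neighbor of each sensor
    # Undirected nearest-neighbor edge set E (i-j if i is j's NN or j is i's NN).
    edges = {tuple(sorted((i, nn[i]))) for i in range(p)}
    edge_sum = sum(g(dists[i, j]) / (1.0 + g(dists[i, j])) for i, j in edges)
    return p / sigma2 - (2.0 / sigma2) * edge_sum
```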
0:11:21 | So now that we have this δ² expression, we can plug it into the previous expression I had up here and get an approximation, with everything defined, for the generalization error. |
0:11:39 | But note that this expression for δ² depends on the actual realization of the sensor locations (u_i, v_i), |
0:11:49 | so we want to do something to integrate that out, to understand what's happening in general without having a specific instantiation of the sensor network. |
0:12:02 | so um |
---|
0:12:03 | as i said it depends on the particular realisation of the random deployment of sensor location so we want wanna |
---|
0:12:07 | characterise some sort of average behavior of delta uh across the realisations we want a V P |
---|
0:12:13 | um |
---|
0:12:13 | so it turns out that the average behavior functionals of the nearest neighbor graph can be described a using a |
---|
0:12:19 | average behavior of a margin as on point processes |
---|
0:12:22 | in work got developed by pen rows and you kids |
---|
0:12:25 | so as P goes to infinity the number of sensors is that goes to infinity um |
---|
0:12:30 | a functional uh defined on the edge uh |
---|
0:12:33 | um based on the nearest neighbor graph for distances uh converges to this other expression but |
---|
0:12:38 | uh a much as point process not up a poisson point process |
---|
0:12:42 | so um for us that five function on the left hand side is uh G over one plus G |
---|
0:12:47 | and um |
---|
0:12:48 | so if we let the right hand side of the limit be as a eight over to then |
---|
0:12:53 | the delta squared expression simplifies to pure over sigma square times one minus it |
---|
0:12:58 | and this uh as they a um can be approximated a using monte carlo simulation very easily because a poisson |
---|
0:13:05 | point process |
---|
0:13:06 | easy to generate |
---|
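A rough way to estimate ζ by Monte Carlo, consistent with the relation δ² ≈ (p/σ²)(1 − ζ) and the edge sum above (a sketch under the assumption that ζ ≈ (2/p) times the sum over edges of g/(1 + g) for a unit-intensity Poisson process; the exact limiting functional from Penrose and Yukich is not spelled out in the talk):

```python
import numpy as np

def estimate_zeta(g, area=1000.0, trials=20, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of ζ: simulate unit-rate Poisson point processes on a
    square of the given area, build the nearest-neighbor graph, and average
    (2/p) * sum over edges of g(d) / (1 + g(d))."""
    estimates = []
    for _ in range(trials):
        p = rng.poisson(area)                     # number of points ~ Poisson(area)
        pts = rng.uniform(0.0, np.sqrt(area), size=(p, 2))
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nn = d.argmin(axis=1)
        edges = {tuple(sorted((i, nn[i]))) for i in range(p)}
        edge_sum = sum(g(d[i, j]) / (1.0 + g(d[i, j])) for i, j in edges)
        estimates.append(2.0 * edge_sum / p)
    return float(np.mean(estimates))
```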
0:13:09 | Alright, so now we have this new expression for δ², which we can again plug into the generalization error expression and get a value out. |
0:13:23 | Basically, if we substitute that Mahalanobis distance approximation into the generalization error approximation, we get this expression up here. |
0:13:32 | Then the question arises: what is the optimal number of sensors for a given number of training samples, say? |
0:13:41 | We can find the p that minimizes our approximation: if we differentiate it with respect to p and set it equal to zero, we find that the optimal p is n/2, |
0:13:54 | and it doesn't actually depend on σ² or ζ. |
0:13:58 | What that tells us is that it's not always beneficial to throw down more sensors with a fixed number of training samples, |
0:14:05 | and the reason is the phenomenon of overfitting. |
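A quick numerical sanity check of this claim, reusing the hedged `lda_generalization_error_approx` sketch from earlier (so it inherits that sketch's assumptions; σ², ζ, and n here are illustrative values only):

```python
import numpy as np

sigma2, zeta, n = 1.0, 0.3, 100
p_grid = np.arange(2, n - 1)    # keep p < n so the approximation is defined
errors = [lda_generalization_error_approx(np.sqrt(p * (1 - zeta) / sigma2), p, n)
          for p in p_grid]
print(p_grid[int(np.argmin(errors))])   # prints a value at or near n/2 = 50
```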
0:14:10 | The minimum generalization error, when we set p to n/2, is the other expression on the slide. |
0:14:17 | That minimum generalization error is monotonically increasing in σ², which is expected: you have more errors when there's more noise. |
0:14:27 | It's monotonically decreasing in n as well, which is also expected: as the number of training samples increases, the error decreases. |
0:14:36 | The interesting thing to note is that it's monotonically increasing in ζ. |
0:14:40 | Since ζ only depends on the placement distribution of the sensors, that tells us we should choose f_UV in order to minimize ζ. |
0:14:49 | We have that choice when deploying a sensor system: should all the sensors be clustered in the middle of the square, should they be at the edges, should they be uniformly placed, or any question like that. |
0:15:00 | So now let me show you some simulations. |
0:15:04 | The overall message of the simulations is that these are really good approximations that can be used for system design optimization, and you'll see that. |
0:15:13 | The semivariogram-type function g that we use is e to the minus distance over two, |
0:15:18 | and the sensor location distribution that we consider, again supported over the square with area p, is an appropriately scaled and shifted beta distribution that is i.i.d. in both of its spatial dimensions. |
0:15:32 | Both parameters of the beta distribution are taken to be equal, so if beta equals one, that's a uniform distribution over the square; if beta is greater than one, the sensors are concentrated in the middle of the square; and if beta is less than one, the sensors are concentrated at the edges of the square. |
0:15:48 | We do this for different values of p, the number of sensors; twenty realizations of the sensor location placements; ten realizations of the training set per realization of the sensor locations, for different values of the number of training samples; and a hundred thousand test samples per training set. |
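A minimal sketch of this simulation setup as described (the function and variable names are mine; only g(d) = exp(-d/2) and the scaled, i.i.d. beta placement over a square of area p come from the talk):

```python
import numpy as np

def g(d):
    """Correlation-decay function used in the simulations: g(d) = exp(-d/2)."""
    return np.exp(-d / 2.0)

def draw_sensor_locations(p, beta, rng=np.random.default_rng(0)):
    """Sample p sensor locations, i.i.d. in each coordinate, from a Beta(beta, beta)
    distribution scaled to a square of area p (side length sqrt(p)).
    beta = 1 is uniform; beta > 1 clusters toward the middle; beta < 1 toward the edges."""
    return np.sqrt(p) * rng.beta(beta, beta, size=(p, 2))

# Example: combine with the earlier sketches to get δ² for one placement.
locations = draw_sensor_locations(p=50, beta=1.0)
delta2 = mahalanobis_gmrf(locations, sigma2=1.0, g=g)
```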
0:16:06 | Right, so this is the Mahalanobis distance approximation as a function of the number of sensors. There are two lines here, but you can hardly tell them apart because the approximation is so good: there's a blue line and a black line, and the black line is the approximation. |
0:16:19 | As I didn't actually point out before, here we see that δ² is a linear function of p. |
0:16:28 | So we see that the approximation is a linear function of p and the empirical Mahalanobis distance is essentially indistinguishable from it; this is the geometric probability, Poisson point process approximation. |
0:16:46 | Then we plug that approximation into the other approximation, which was the Raudys et al. approximation for the generalization error of linear discriminant analysis. |
0:16:54 | Here the red line is the empirical generalization error, the empirical test error actually, and the black line is the approximation. |
0:17:02 | This is for n equal to one hundred, so the number of training samples is one hundred. |
0:17:08 | We see first of all that the error is minimized when p is approximately fifty, which is what we expected, since we saw that the optimum is n/2, |
0:17:19 | and the approximation is extremely good here as well: the red line and the black line are almost the same. |
0:17:29 | This is the same plot for n equal to two hundred; again the minimum is at p equal to one hundred, as we expect, and the approximation is quite good. |
0:17:40 | Here's the case for n equal to one thousand, so this is one thousand training samples. |
0:17:47 | If we wanted to empirically get these very tiny error values, these 10^-10 error probabilities, we would have to run our simulations for a very long time, |
0:17:57 | so this is a case where it's helpful to actually have the approximation, in order to understand the performance in these low-error regimes as well; |
0:18:09 | we don't need to do anything empirical, we can just use that black line in general for optimizing the sensors. |
0:18:18 | The other question was about ζ, which depends on the beta parameter of the sensor placement distribution: beta equal to one, again, is uniform placement, beta less than one is clustered at the edges, and beta greater than one is clustered in the middle. |
0:18:34 | Here is ζ as a function of beta, and we see that beta equal to one, the uniform distribution, is best, and that is also what minimizes the generalization error. |
0:18:45 | So in this case we want the sensors to be placed uniformly when we have this nearest-neighbor dependency. |
0:18:50 | So, in conclusion: time is always a limited resource, so training sets are always finite. |
0:18:58 | It's optimal to use half as many sensors as training samples in this sensor network with local Gauss-Markov dependency, |
0:19:06 | and the fact that a finite rather than infinite number of sensors is optimal follows from the phenomenon of overfitting. |
0:19:13 | We derived a generalization error approximation that involves the Mahalanobis distance, and we gave an exact statement of that Mahalanobis distance for our Gauss-Markov sensor measurements. |
0:19:25 | We approximated it using a geometric probability expression, and we saw that the combined approximations, both the geometric probability one and the generalization error one, together closely matched the empirical results; and we saw that uniform sensor placement is good in this setting. |
0:19:44 | So I'll be happy to take some questions. |
0:20:03 | [Audience question, largely inaudible.] |
0:20:20 | [Audience] So you are assuming that all of these likelihoods are Gaussian? If not, what is the effect on the generalization error? |
0:20:31 | Right, so as I mentioned, this generalization error approximation holds when the distributions actually generating the data are Gaussian and have equal prior probabilities. |
0:20:50 | That's what we used in what followed. |
0:20:56 | If the true likelihood functions were not Gaussian, then there would be a different expression for the generalization error, and usually it's very hard to come up with something that matches the actual empirical test error very well. |
0:21:09 | In the Soviet Union literature that I mentioned there are a few variations on this, but usually the assumption is that the actual generating data is Gaussian; otherwise it's very hard to come up with expressions. |
0:21:22 | People have been trying to find these expressions for thirty or forty years, and it's very hard. |
0:21:30 | [Audience] And you're making an assumption of equal covariances? |
0:21:32 | Yeah, we don't have to do that; that's just to further simplify the notation. I think this paper, or an extended version of it, has the case where the two covariances are different. |
0:21:45 | Thank you. |