0:00:14 Thanks. The title might not make it obvious what the talk is about, but hopefully by the end it will be. I'm from the IBM T. J. Watson Research Center.
0:00:26 Alright, so the topic of the paper is sensor networks. Just to get everyone on the same page, these are collections of spatially distributed nodes that take measurements and communicate, and we're interested in them for detection or classification tasks.
0:00:43 For example, in environmental monitoring or surveillance; the one question all of us are interested in this week is whether the Iceland volcano is erupting or not. The measurements are of various types, whether temperature, sound pressure, vibration, et cetera. These measurements exhibit spatial correlation, and the sensor nodes are usually resource constrained.
0:01:11 Now, detection and classification are basically the same problem, but I want to point out the differences. In the detection problem we use measurements to make decisions about binary hypotheses; in this case the likelihood functions of the measurements and the prior probabilities are known, and the likelihood ratio test is optimal. The version of this problem with sensor networks, the distributed detection problem, has been studied quite widely.
0:01:38 In the classification problem, specifically the supervised classification problem, we also want to use measurements to make decisions about binary classes, but in this case the likelihood functions of the measurements and the prior probabilities are not known. What we are given instead is a labeled training set: measurement vectors with their true class labels. Machine learning algorithms are applied; one example is linear discriminant analysis, which I'll talk about in more detail, a very classical technique.
0:02:08 The difference between detection and classification is this idea of overfitting: you don't want to fit the training vectors, the labeled training set you're given, too much. This distributed supervised classification problem doesn't have as much prior work as the distributed detection problem. Practically, oftentimes we don't have a priori knowledge of the probability densities, and it's much easier to imagine a situation in which training samples can be acquired, so that's why we consider the classification problem.
0:02:43 This is just a high-level overview of what I'm going to talk about; we're not really going to consider network constraints or communication, et cetera, but more or less what you can learn from data in sensor networks.
0:02:59 So the outline of the talk: I'll say a few words about the supervised classification problem and statistical learning theory, go through linear discriminant analysis and especially focus on approximations to the generalization error that have been developed, go over the Gauss-Markov random field sensor model that we're applying, derive the Mahalanobis distance for that sensor model and use it to derive a generalization error expression for sensor networks, provide a geometric probability approximation to the Mahalanobis distance in order to simplify things, and then give some simulation results.
0:03:42 Just going over the notation a bit: we have a true hypothesis or class label Y, which is either plus one or minus one, so something like erupting or not. We have a noisy measurement vector X, which is in R^p, and we have a joint probability density function f(x, y), where the x's are the measurements and y is the label. We want to learn a decision rule ŷ(x), a mapping from the space of measurements to plus one and minus one. We learn this decision rule from n training samples (x_1, y_1) through (x_n, y_n), but we don't apply it to the training samples; we apply it to unseen samples X from the same distribution f(x, y).
0:04:23 The thing we're interested in characterizing is the generalization error: the probability that the decision rule will be wrong on new, unseen samples from the same distribution. The training error is something we can measure empirically on the training set itself, namely how many training samples are misclassified. The idea of overfitting and the structural risk minimization principle is that as the complexity of the decision rule increases, the training error can be driven to zero, but that overfits; the generalization error, the thing we want to minimize, is optimized at some intermediate complexity level.
0:04:58 In general, statistical learning theory analyses yield bounds, and these characterizations of the generalization error are usually very loose. Let me just read a quick quote which gives this idea: one should not be concerned about the quantitative value of the bound, but rather about its fundamental form and the terms that appear in the bound; in that respect a useful bound is one which allows us to understand which quantities are involved in the learning process. As a result, performance bounds should be used for what they are good for; they should not be used to actually predict the value of the expected error, et cetera.
0:05:31 But what if we are actually interested in developing approximations or bounds for the generalization error that we can use as an optimization criterion? Can we use modern statistical learning theory results, things like the VC dimension, Rademacher complexity, et cetera? We want to do this for sensor network design optimization, so we want some sort of expression that we can optimize for generalization error.
0:05:58 So we turn to the former Soviet Union literature, which developed very high quality approximations to the generalization error, specifically of linear discriminant analysis and some of its variations. This work was motivated by Kolmogorov, and the main contributor was Šarūnas Raudys. The expressions developed in this literature can be used as optimization criteria, as we'll see in some of the simulation results later; they are very tight approximations.
0:06:31 Specifically, linear discriminant analysis is the decision rule I have on the screen. This decision rule ŷ(x) is based on what you would do to get the Bayes rule if you had Gaussian distributions for the two classes with different means and the same (or different) covariances, except that here the sample means and sample covariance are plugged in.
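[Editor's note: to make the plug-in rule concrete, here is a minimal Python sketch of the LDA classifier described above, assuming equal priors and a pooled sample covariance, the simplified setting used later in the talk. The function names are illustrative, not from the paper.]

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class sample means and the pooled sample covariance.

    X : (n, p) array of training measurement vectors
    y : (n,) array of labels in {-1, +1}
    """
    mu_plus = X[y == +1].mean(axis=0)
    mu_minus = X[y == -1].mean(axis=0)
    # Pooled (equal-covariance) sample estimate
    centered = np.vstack([X[y == +1] - mu_plus, X[y == -1] - mu_minus])
    Sigma_hat = centered.T @ centered / (len(X) - 2)
    return mu_plus, mu_minus, Sigma_hat

def lda_predict(X, mu_plus, mu_minus, Sigma_hat):
    """Plug-in LDA rule with equal priors: sign of the linear discriminant."""
    w = np.linalg.solve(Sigma_hat, mu_plus - mu_minus)
    b = -0.5 * (mu_plus + mu_minus) @ w
    return np.sign(X @ w + b)
```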
0:06:58 When the true likelihood functions are Gaussian and the true prior probabilities are equal, but we don't know this and we're learning ŷ from the n training samples, the generalization error approximation given by Raudys et al. is the expression shown: the generalization error is approximately Φ, the CDF of the Gaussian distribution, evaluated at terms involving δ, the Mahalanobis distance, which I'll get to in a second, p, the dimensionality, which in our case will be the number of sensors, and n, the number of training samples. It's not simple, but it is a very easy expression to evaluate, and we'll see how close it is.
0:07:43 So this δ² is the Mahalanobis distance: μ₊ and μ₋ are the means of the two classes, and J₋ and J₊ are the inverse covariance matrices. That's one of the terms in the expression.
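[Editor's note: the exact expression is on the speaker's slide rather than in the transcript, so the sketch below uses a standard Raudys-type approximation for plug-in LDA with equal priors and equal covariances. Its precise form is an assumption, chosen to be consistent with the properties quoted later in the talk, in particular a minimum at p = n/2 that does not depend on σ² or ζ.]

```python
import numpy as np
from scipy.stats import norm

def mahalanobis_sq(mu_plus, mu_minus, J):
    """delta^2 = (mu_plus - mu_minus)^T J (mu_plus - mu_minus) for a common precision J."""
    d = mu_plus - mu_minus
    return float(d @ J @ d)

def gen_error_approx(delta_sq, p, n):
    """Raudys-type approximation to the LDA generalization error (assumed form).

    delta_sq : squared Mahalanobis distance between the classes
    p        : dimensionality (number of sensors), must satisfy p < n
    n        : number of labeled training samples
    """
    term_means = 1.0 + 4.0 * p / (n * delta_sq)   # penalty for estimating the means
    term_cov = 1.0 + p / (n - p)                  # penalty for estimating the covariance
    return norm.cdf(-0.5 * np.sqrt(delta_sq) / np.sqrt(term_means * term_cov))
```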
0:07:59 We want to use this generalization error approximation to analyze sensor networks with spatial correlation, and as I said before, we're not going to be concerned in this talk about network communication constraints, et cetera.
0:08:13 Within the sensor network setting specifically, we have p sensors, each taking a scalar-valued measurement X_1 through X_p, so the combined measurement vector is this big X, which is in R^p. The sensors are deployed randomly on the plane according to some distribution f_UV(u, v), which is supported on a square with area p, so the region grows as more sensors are placed. The generating likelihood functions are Gaussian, as I said before, with μ₋ and μ₊ as the means and Σ₋ and Σ₊ as the covariances.
0:08:49 We model the covariance structure, the spatial correlation between sensor measurements, using a Gauss-Markov random field model, which I'll get to in a second. For simplicity we let μ₋ be the zero vector and μ₊ be the all-ones vector, and we let the two covariances be equal; call them Σ, with inverse covariance matrix J.
0:09:14 The actual Markov random field we use has nearest-neighbor dependency. We construct a Euclidean undirected nearest-neighbor graph between the sensors: there is an edge between sensor i and sensor j if i is the nearest neighbor of j or if j is the nearest neighbor of i, and we denote the edge set of this graph as E.
0:09:35 Then we use this graph to define the Markov random field covariance, in three parts: the diagonal elements of Σ are all equal to σ²; the elements of Σ corresponding to edges in the nearest-neighbor graph are σ² times g(d_ij), where d_ij is the distance between sensor i and sensor j, and g is a monotonically decreasing function that encodes correlation decay, so the farther apart two sensors are, the less correlated they are (this is known as a semivariogram and is used in geostatistics); and the off-diagonal elements of J corresponding to non-edges in the nearest-neighbor graph are all zero. This is a model that has been used, for example, by Anandkumar and co-authors in some of their work as well.
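[Editor's note: here is a small sketch of how one might instantiate this nearest-neighbor dependency model. The semivariogram g(d) = exp(-d/2) is the one the speaker specifies later for the simulations, and the helper names are illustrative.]

```python
import numpy as np

def semivariogram(d):
    """Monotonically decreasing correlation-decay function; the talk uses g(d) = exp(-d/2)."""
    return np.exp(-d / 2.0)

def nn_edge_set(locations):
    """Undirected 1-nearest-neighbor edge set E and the pairwise distance matrix.

    locations : (p, 2) array of sensor positions.
    An edge (i, j) is present if i is j's nearest neighbor or j is i's.
    """
    diff = locations[:, None, :] - locations[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)
    edges = {tuple(sorted((i, int(nn[i])))) for i in range(len(locations))}
    return edges, dist

def specified_covariance_entries(locations, sigma_sq):
    """Entries of Sigma fixed by the model: sigma^2 on the diagonal and
    sigma^2 * g(d_ij) on nearest-neighbor edges; the remaining entries are
    determined implicitly by requiring J_ij = 0 off the edge set."""
    edges, dist = nn_edge_set(locations)
    entries = {(i, i): sigma_sq for i in range(len(locations))}
    for (i, j) in edges:
        entries[(i, j)] = sigma_sq * semivariogram(dist[i, j])
    return entries, edges, dist
```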
0:10:29 Now we want to get this δ², the Mahalanobis distance. With these simplifying assumptions of μ₋ being zero, μ₊ being all ones, and the two inverse covariance matrices being equal, it turns out the Mahalanobis distance squared is just the sum of the entries of the inverse covariance matrix. If we substitute the expressions for the covariance matrices and do a bit of algebra, we find that the Mahalanobis distance δ² is equal to p over σ², where p again is the number of sensors, minus 2 over σ² times a sum over the edge set of g/(1 + g), where the arguments to g are the distances between the sensors.
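[Editor's note: a minimal sketch of that closed form, reusing the helpers above; the formula δ² = p/σ² - (2/σ²) Σ over edges of g(d)/(1 + g(d)) is the one stated in the talk.]

```python
def delta_sq_sensors(locations, sigma_sq):
    """Mahalanobis distance squared for the nearest-neighbor Gauss-Markov model:
    delta^2 = p / sigma^2 - (2 / sigma^2) * sum over edges of g(d) / (1 + g(d))."""
    edges, dist = nn_edge_set(locations)
    g = np.array([semivariogram(dist[i, j]) for (i, j) in edges])
    p = len(locations)
    return p / sigma_sq - (2.0 / sigma_sq) * float(np.sum(g / (1.0 + g)))
```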
0:11:21 Now that we have this δ² expression, we can plug it into the previous expression I had up there for the generalization error. That gives us, not an exact value, but an approximation with everything defined. But notice that this expression for δ² depends on the actual realization of the sensor locations (u_i, v_i). So we'd like to do something to integrate that out, to understand what happens in general without having a specific instantiation of the sensor locations.
0:12:02 As I said, it depends on the particular realization of the random deployment of sensor locations, so we want to characterize some sort of average behavior of δ² across those realizations. It turns out that the average behavior of functionals of the nearest-neighbor graph can be described using the average behavior of functionals of homogeneous Poisson point processes, in work developed by Penrose and Yukich: as p, the number of sensors, goes to infinity, a functional defined on the edge distances of the nearest-neighbor graph converges to another expression based on a homogeneous Poisson point process. For us, the function on the left-hand side is g/(1 + g), and if we let the right-hand side of the limit be ζ/2, then the δ² expression simplifies to p over σ² times (1 - ζ). This ζ can be approximated using Monte Carlo simulation very easily, because a Poisson point process is easy to generate.
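[Editor's note: as a rough stand-in for the Penrose-Yukich limit, one can estimate ζ by plain Monte Carlo over finite deployments, using the fact implied above that (2/p) Σ over edges of g/(1+g) tends to ζ. This is a direct finite-p estimate rather than the Poisson-point-process functional itself, and the sampler argument is a placeholder; helpers from the earlier sketches are reused.]

```python
def estimate_zeta(p, sample_locations, n_trials=200, seed=0):
    """Monte Carlo estimate of zeta via (2/p) * sum_E g/(1+g), averaged over deployments.

    sample_locations(p, rng) must return a (p, 2) array of positions drawn from
    the placement density over the side-length-sqrt(p) square.
    """
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        locs = sample_locations(p, rng)
        edges, dist = nn_edge_set(locs)
        g = np.array([semivariogram(dist[i, j]) for (i, j) in edges])
        vals.append(2.0 * float(np.sum(g / (1.0 + g))) / p)
    return float(np.mean(vals))
```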
0:13:09 Alright, so now we have this new expression for δ², which we can again plug into the generalization error expression. If we substitute this Mahalanobis distance approximation into the generalization error approximation, we get the expression up here.
0:13:32 Then the question might arise: what is the optimal number of sensors for a given number of training samples? We can find the p that minimizes our approximation: if we differentiate it with respect to p and set the derivative equal to zero, we find that the optimal p is n/2, and it doesn't actually depend on σ² or ζ.
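[Editor's note: a quick numerical sanity check of this claim, using the assumed approximation from earlier and the simplified δ² = (p/σ²)(1 - ζ); the σ² and ζ values below are illustrative, not from the talk.]

```python
n = 100                      # training samples (the value used in the simulations)
sigma_sq, zeta = 1.0, 0.3    # illustrative noise and placement parameters
ps = np.arange(2, n)         # candidate numbers of sensors, p < n
errs = [gen_error_approx(p * (1.0 - zeta) / sigma_sq, p, n) for p in ps]
print("optimal p =", ps[int(np.argmin(errs))])   # expected to land at n/2 = 50
```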
0:13:57 What that tells us is that it's not always beneficial to throw down more sensors with a fixed number of training samples, and the reason is that the phenomenon of overfitting happens. The minimum generalization error, when we set p equal to n/2, is the other expression on the slide. That minimum generalization error is monotonically increasing in σ², which is expected since you'd have more errors with more noise, and monotonically decreasing in n as well, which is also expected: as the number of training samples increases, the error decreases.
0:14:36 The interesting thing to note is that it is monotonically increasing in ζ. Since ζ only depends on the placement distribution of the sensors, we should choose the density f_UV in order to minimize ζ. We have that choice when deploying a sensor system: should all the sensors be clustered in the middle of the square, should they be at the edges, should they be uniformly placed, or any question like that.
0:15:00 So now let me show you some simulations. The overall message the simulations present is that these are really good approximations that can be used for system optimization, and you'll see that. The semivariogram function g that we use is e to the minus distance over two, g(d) = exp(-d/2).
0:15:18 The sensor location distribution that we consider, again supported on the square with area p, is an appropriately scaled and shifted beta distribution that is i.i.d. in both of its spatial dimensions, and both parameters of the beta distribution are taken to be equal. If beta equals one, that's a uniform distribution over the square; if beta is greater than one, the sensors are concentrated in the middle of the square; and if beta is less than one, the sensors are concentrated at the edges of the square. We do this for different values of p, the number of sensors, with twenty realizations of the sensor location placements, ten realizations of the training set per realization of the sensor locations, different values of the number of training samples, and a hundred thousand test samples per training set.
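[Editor's note: here is a sketch of that placement distribution, which can also serve as the sampler argument for the ζ estimate above. Beta(b, b) with b = 1 is uniform, b > 1 concentrates sensors in the middle, and b < 1 pushes them toward the edges; the function name and usage are illustrative.]

```python
def sample_beta_locations(p, rng, b=1.0):
    """Sensor positions with i.i.d. Beta(b, b) coordinates, scaled to the area-p square."""
    side = np.sqrt(p)
    return side * rng.beta(b, b, size=(p, 2))

# Hypothetical usage: estimate zeta for uniform placement of p = 50 sensors
# zeta_hat = estimate_zeta(50, lambda p, rng: sample_beta_locations(p, rng, b=1.0))
```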
0:16:06 So this is the Mahalanobis distance approximation as a function of the number of sensors. There are two lines here, but you can hardly tell them apart because the approximation is so good: there's a blue line and a black line, and the black line is the approximation. As I didn't actually point out earlier, δ² is a linear function of p; the approximation is a linear function of p, and the empirical Mahalanobis distance is essentially indistinguishable from it. This is the geometric probability, Poisson point process approximation. Then we plug that approximation into the other approximation, the Raudys et al. approximation for the generalization error of linear discriminant analysis.
0:17:00the black as the approximation
0:17:02so um this is for and "'cause" one hundred us so the number of training samples is one hundred
0:17:08so we see not first of all that uh
0:17:11the air is minimised one a a P is a are approximately fifty which is what we wanted it uh
0:17:16which we saw that it's uh and over two
0:17:19and um the approximation is extremely good to here as well um
0:17:24the red line and the black line or uh
0:17:26almost the same
0:17:29 This is the same plot for n equals two hundred. Again the minimum is at p equals one hundred, as expected, and the approximation is quite good.
0:17:40 Here's the case for n equals one thousand, so one thousand training samples. If we wanted to empirically get these very tiny error values, the ten-to-the-minus-ten error probabilities, we would have to run our simulations for a long, long time. So this is a case where it's helpful to actually have the approximation in order to understand the performance in these low-error regimes as well; we don't need to do any empirical evaluation, we can just use that black line in general for optimizing the sensors.
0:18:18 The other question was about ζ, which depends on the beta parameter of the sensor distribution: beta equals one, again, was uniform placement; beta less than one was clustered at the edges; and beta greater than one was clustered in the middle. Here is ζ as a function of beta, and we see that beta equals one, the uniform distribution, is best, and that's also what minimizes the generalization error. So in this case we want the sensors to be placed uniformly when we have this nearest-neighbor dependency.
0:18:50 So in conclusion: time is always a limited resource, so training sets are always finite. It is optimal to use half as many sensors as training samples in this sensor network with local Gauss-Markov dependency, and the fact that a finite rather than infinite number of sensors is optimal follows from the phenomenon of overfitting. We derived a generalization error approximation that involves the Mahalanobis distance, and we gave an exact statement of that Mahalanobis distance for our Gauss-Markov sensor measurements. We approximated it using a geometric probability expression, and we saw that the combination of approximations, both the geometric probability one and the generalization error one, together closely matched the empirical results, and that uniform sensor placement is good in this setting. So I'll be happy to take some questions.
0:20:03 [Audience question, largely inaudible: are you assuming that all of these measurements are multivariate normal, and if not, what is the effect on the generalization error?]
0:20:31 Right, so as I mentioned, this generalization error approximation holds when the distributions actually generating the data are Gaussian and have equal prior probabilities; that's what we used in everything that followed. If the true likelihood functions were not Gaussian, there would be a different expression for the generalization error, and usually it's very hard to come up with something that matches the actual empirical test error very well. In the Soviet Union literature that I mentioned there are a few variations on this, but usually the assumption is that the data-generating distributions are Gaussian; otherwise it's very hard to come up with expressions. People have been trying to find these expressions for thirty or forty years, and it is very hard.
0:21:30 [Audience question: are you making an assumption of equal covariances?] We don't have to do that; it's just to further simplify the notation. I think this paper, or an extended version of it, has results for the case where the two covariances are different. Thank you.