0:00:13 | Good afternoon everyone, and thank you very much for your patience, sitting until this time, with a delay, for the final talk of the day. |
---|
0:00:26 | Today I am going to talk about a new Bayesian approach to solving the multi-target tracking problem using audio-visual data. |
---|
0:00:37 | In this talk, after a background, I will introduce you to a random finite set approach to the general problem of multi-object estimation. |
---|
0:00:49 | By multi-object estimation I mean any problem in which you are dealing with multiple objects, each one having its own state; problems where there is uncertainty not only in the states of the objects but also in their number. |
---|
0:01:10 | Then I will switch to a special type of random finite set called multi-Bernoulli sets, and we will go through the cardinality-balanced MeMBer filter. |
---|
0:01:25 | Then I will explain the main contribution of the paper, which is audio-visual speaker tracking, and some simulation results and conclusions will finish the talk. |
---|
0:01:41 | The problem that we are focusing on in this paper and presentation is the tracking of multiple, occasionally speaking targets. |
---|
0:01:53 | Let me give you an example. This is an example of audio-visual data. |
---|
0:02:05 | We have a bit of sound; the people are speaking only occasionally, so the audio data is intermittent. |
---|
0:02:14 | And the people can get out of the camera scene; therefore, occasionally we don't have visual information coming in. |
---|
0:02:22 | But we are interested in detecting and tracking the multiple targets. |
---|
0:02:30 | As we will see in this example, the targets of interest can disappear while still talking. |
---|
0:02:42 | And we want to design a filter that can detect and track simultaneously all existing active targets that might be, in a given period of time, in the scene. I will tell you what the definition of an active target is and how it is mathematically formulated. |
---|
0:03:13 | In such problems there are a few main challenges. We have occasionally silent targets and occasionally invisible targets, and we can also have clutter measurements, both in the visual cues and in the audio features that we extract from the raw audio-visual information. |
---|
0:03:39 | Our contribution is a principled approach to combining audio and video data in a Bayesian framework. |
---|
0:03:49 | Okay. All of you are familiar with nonlinear filtering approaches and single-target tracking methods. |
---|
0:04:00 | There is a single target, which corresponds to a single measurement, with a single state, and from k-1 to k it transitions to a new state. |
---|
0:04:13 | In a general Bayesian filtering scheme we have a prediction step and an update step. In the prediction step we use the information that we have about the dynamics of the object; in the update step we use the information provided by the measurements. |
---|
0:04:39 | If we assume that the distribution of the state of the single target is Gaussian and the dynamics and measurement models are linear, then the optimal solution is Kalman filtering. In nonlinear cases, particle filters are used. |
---|
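As a rough illustration of this prediction and update cycle, here is a minimal bootstrap particle filter sketch for a single target; the constant-velocity model, noise levels, and position-only measurement are assumptions for illustration, not the setup of the paper:

```python
import numpy as np

def predict(particles, dt=1.0, q=1.0):
    """Prediction step: propagate particles with a constant-velocity model plus process noise."""
    # state layout: [x, y, vx, vy]
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]])
    return particles @ F.T + q * np.random.randn(*particles.shape)

def update(particles, weights, z, r=5.0):
    """Update step: reweight particles by a Gaussian likelihood of a position-only measurement z."""
    d2 = np.sum((particles[:, :2] - np.asarray(z)) ** 2, axis=1)
    weights = weights * np.exp(-0.5 * d2 / r ** 2)
    return weights / weights.sum()

def resample(particles, weights):
    """Systematic resampling to avoid weight degeneracy."""
    n = len(weights)
    idx = np.searchsorted(np.cumsum(weights), (np.arange(n) + np.random.rand()) / n)
    return particles[idx], np.full(n, 1.0 / n)
```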
0:04:59 | A multi-object filtering problem is something like this, but with special complexity and challenges. |
---|
0:05:12 | We can have the number of objects randomly changing and the number of available measurements randomly changing; we can have some objects undetected, or missed detections; we can have clutter; and data association is another challenge that needs to be tackled. |
---|
0:05:42 | A relatively recent approach to tackling the multi-object filtering problem is the random finite set approach: using random finite set theory to develop principled solutions to these problems. |
---|
0:06:03 | In this approach the objects are modelled as a set, as a random finite set, in which the uncertainty both in the states and in the number of the targets or objects is mathematically modelled. |
---|
0:06:24 | Therefore, instead of multiple objects or targets, we will be dealing with a single object that is modelled as a set, and we will be dealing with the mathematics of sets: integration and differentiation, and the statistical properties of the set. |
---|
0:06:50 | However, the problem is encapsulated as a single-target tracking or single-object estimation problem. |
---|
0:07:03 | In the mathematical formulation of the various solutions that have been developed in this framework, the random finite set theory framework, detection uncertainty, clutter, and association uncertainty are mathematically formulated in a principled manner. |
---|
0:07:30 | I am not going to go through all of these solutions, which include the well-known PHD filter, the CPHD filter, the MeMBer filter, and the cardinality-balanced MeMBer filter, which is the filter that I am using to solve the focal problem of this presentation. |
---|
0:07:55 | A special kind of random finite set is the multi-Bernoulli random finite set. It is the union of an ensemble of M Bernoulli sets, where M is not known but can be determined iteratively. |
---|
0:08:16 | Each Bernoulli set is prescribed by two parameters: an r, which is the existence probability of a possible object, and a p, which is the pdf of the state of that object. |
---|
0:08:33 | The union of all these Bernoulli sets forms a multi-Bernoulli random finite set, and a multi-Bernoulli RFS can be fully prescribed by the ensemble of all the r_i's and p_i's. |
---|
0:08:55 | So, as you can see, the whole uncertainty in the number of objects that exist in the scene, and the distribution of their states, can be mathematically modelled having these r_i values and p_i functions. |
---|
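To make this parameterization concrete, here is a small illustrative sketch (not code from the paper) of a multi-Bernoulli RFS stored as a list of (r, p) pairs, with each state pdf approximated by weighted particles; the expected number of objects is simply the sum of the existence probabilities:

```python
import numpy as np

# A multi-Bernoulli RFS is fully described by M pairs (r_i, p_i):
#   r_i = existence probability of the i-th possible object
#   p_i = pdf of its state, here approximated by weighted particles
multi_bernoulli = [
    {"r": 0.9, "particles": np.random.randn(500, 4), "weights": np.full(500, 1 / 500)},
    {"r": 0.4, "particles": np.random.randn(500, 4), "weights": np.full(500, 1 / 500)},
]

# Expected cardinality (expected number of objects) is the sum of the r_i's.
expected_n = sum(c["r"] for c in multi_bernoulli)

# A simple point estimate: report components whose existence probability exceeds a
# threshold, each with its particle-weighted mean state.
estimates = [c["particles"].T @ c["weights"] for c in multi_bernoulli if c["r"] > 0.5]
print(expected_n, estimates)
```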
0:09:18 | As with general Bayesian filters, we have a prediction and an update step, and when we model the random finite set of targets as a multi-Bernoulli random finite set, it is the r_i's and p_i's which are predicted and updated. |
---|
0:09:47 | The MeMBer filter, and its extended version, the cardinality-balanced or CB-MeMBer filter, are more useful than PHD filters in practical implementations because of their computational requirements and also their accuracy. |
---|
0:10:07 | So, similar to a general Bayesian filter, in the prediction step the r_i's and p_i's are predicted, and the predicted r_i and p_i equations involve a survival probability for each object and the state transition density of each object. |
---|
0:10:38 | These are the dynamic information that we have about the movements, or the state changes, of the objects. |
---|
0:10:49 | In addition, in the prediction step, a new set of Bernoulli sets is introduced to the system as the result of new objects coming into the scene. |
---|
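Roughly, in particle form the prediction of each Bernoulli component and the birth of new components could be sketched as below; the survival probability, birth parameters, and motion model here are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

P_SURVIVE = 0.99  # assumed survival probability

def predict_component(comp, motion_model):
    """Predict one Bernoulli component: scale r by the survival probability and
    push the particles through the state transition density."""
    particles = motion_model(comp["particles"])      # sample from f(x_k | x_{k-1})
    return {"r": P_SURVIVE * comp["r"], "particles": particles, "weights": comp["weights"]}

def birth_components(n_birth=1, n_particles=500):
    """New Bernoulli components introduced for objects that may enter the scene."""
    return [{"r": 0.03,                              # small initial existence probability
             "particles": np.random.uniform(0.0, 1.0, (n_particles, 4)),  # illustrative birth region
             "weights": np.full(n_particles, 1.0 / n_particles)}
            for _ in range(n_birth)]

def predict(components, motion_model):
    """Prediction step: surviving components plus newly born components."""
    return [predict_component(c, motion_model) for c in components] + birth_components()
```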
0:11:08 | In the update step, the ensemble of r_i's and p_i's of the Bernoulli sets is updated to the union of two sets. One set includes the legacy tracks, the sets that are there because they might not have been detected in that frame, and the other includes the sets that are there and are updated using the measurement data. |
---|
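Schematically, the update produces the union of the legacy components and the measurement-updated components. The sketch below shows only that structure; the existence-probability and association terms are simplified placeholders, and the exact CB-MeMBer expressions (which also involve the clutter intensity) are in the paper:

```python
def update(components, measurements, p_detect, likelihood):
    """Structural sketch of the update: legacy tracks plus one measurement-updated track
    per measurement. The r formulas below are simplified placeholders."""
    legacy = []
    for c in components:
        # Legacy track: the object may still exist but was simply not detected this frame.
        r_legacy = c["r"] * (1 - p_detect) / (1 - c["r"] * p_detect)
        legacy.append({"r": r_legacy, "particles": c["particles"], "weights": c["weights"]})

    updated = []
    for z in measurements:
        # Measurement-updated track: reweight particles by the likelihood g_k(z | x).
        best = max(components, key=lambda c: c["r"])   # crude association placeholder
        w = best["weights"] * likelihood(z, best["particles"])
        w_sum = w.sum()
        if w_sum > 0:
            updated.append({"r": min(1.0, best["r"] * p_detect * w_sum),
                            "particles": best["particles"], "weights": w / w_sum})
    return legacy + updated
```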
0:11:41 | In these equations I want to draw your attention to two important parameters: the detection probability and the measurement likelihood, p_D and g_k. These are defined for single objects. |
---|
0:12:02 | You have measurements, and the relationship between a measurement and the object state is defined by the measurement likelihood. It depends on your sensor performance and your equipment dynamics, and also some environmental parameters, such as the clutter rate, characterize the whole measurement process. |
---|
0:12:29 | And the detection probability is another parameter using which we can tune the performance of the system and define our definition of active speakers, or active targets, in the scene. You will see how. |
---|
0:12:48 | So, for audio-visual speaker tracking, in our implementation the target state includes the x in the image, the y in the image, x-dot and y-dot, and the size of the rectangular boxes that we will get, as a result of tracking, in the image. |
---|
0:13:12 | The video measurements are obtained by performing background subtraction followed by morphological image operations. The result is a set of rectangular blobs in each frame. |
---|
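A minimal version of this measurement extraction might look as follows with OpenCV; the specific background-subtraction model, kernel size, and area threshold are assumptions rather than the exact choices of the paper:

```python
import cv2

bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def visual_measurements(frame):
    """Return the measurement set Z of rectangles (x, y, w, h) detected in this frame."""
    mask = bg_subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 200]
```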
0:13:26 | If we denote the result as a random finite set Z, in which each element includes the x, y, w, and h of a rectangle, then the likelihood can be defined by this function, which is a Gaussian-shaped likelihood. |
---|
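As a sketch of such a Gaussian-shaped likelihood between a rectangle measurement z = (x, y, w, h) and a particle-represented state, one could write the following; the standard deviations and the state layout are illustrative assumptions:

```python
import numpy as np

SIGMA = np.array([10.0, 10.0, 8.0, 8.0])   # assumed std-devs for x, y, w, h (pixels)

def g_visual(z, particles):
    """Gaussian-shaped likelihood of a rectangle measurement for each particle.
    Particle columns are assumed to be [x, y, vx, vy, w, h]; z = (x, y, w, h)."""
    pred = particles[:, [0, 1, 4, 5]]        # project the state into measurement space
    d = (pred - np.asarray(z)) / SIGMA
    return np.exp(-0.5 * np.sum(d * d, axis=1))
```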
0:13:48 | With the audio measurements I have taken the simplest approach, assuming that there are two microphones on the two sides of the camera. |
---|
0:13:59 | The time difference of arrival, or TDOA, is calculated using a cross-correlation, the generalized cross-correlation phase transform, or GCC-PHAT. |
---|
0:14:11 | Because of reverberation effects, there are several peaks in the GCC-PHAT curve when it is plotted versus the time difference. |
---|
0:14:21 | In our experiments we have considered at most the five largest peaks of the GCC-PHAT values, and we consider them as the TDOA measurements in each frame. |
---|
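A common way to compute GCC-PHAT and keep its largest peaks as TDOA candidates is sketched below; the frame length, sampling rate, and peak-picking details are assumptions, with the talk only specifying that at most five peaks are kept per frame:

```python
import numpy as np

def gcc_phat_tdoas(sig_l, sig_r, fs, max_tau=None, n_peaks=5):
    """Return up to n_peaks TDOA candidates (seconds) from the GCC-PHAT of two microphone signals."""
    n = len(sig_l) + len(sig_r)
    X = np.fft.rfft(sig_l, n) * np.conj(np.fft.rfft(sig_r, n))
    X /= np.abs(X) + 1e-12                       # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.fftshift(np.fft.irfft(X, n))     # cross-correlation, zero lag at the center
    lags = (np.arange(n) - n // 2) / fs
    if max_tau is not None:                      # restrict to physically possible delays
        keep = np.abs(lags) <= max_tau
        cc, lags = cc[keep], lags[keep]
    peak_idx = np.argsort(cc)[-n_peaks:]         # indices of the largest correlation values
    return lags[peak_idx]
```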
0:14:35 | So, in order to parameterize and calculate the relationship between these TDOA measurements and the object state, which is x, y, w, and h, first there is a practical consideration. |
---|
0:14:56 | The distance of the targets from the microphones is relatively large compared to the distance between the two microphones; therefore we can practically assume that there is a linear relationship between x and the corresponding TDOA. |
---|
0:15:14 | In order to find the parameters of this linear relationship, I have used the ground-truth states that I have in one of the cases, one of the videos in the database used in this paper. |
---|
0:15:29 | In each frame I have calculated five peaks, or five TDOAs, shown as the red points. Many of them are outliers and only some of them are inliers, and using a robust estimation technique we can detect and remove the outliers and then use regression to find the linear relationship that exists between the TDOA and the x of each of the two persons that are active in the scene in that case study. |
---|
0:16:10 | I have considered two persons for comparison purposes, because if the two fitted equations are very close to each other in terms of their parameters, that proves that this assumption is practically correct and our estimates are accurate, and that was the case. |
---|
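The robust fit described here could be reproduced roughly as follows, using scikit-learn's RANSAC regressor as one possible robust estimator; the estimator actually used in the paper may differ, and the residual threshold is an assumption:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression

def fit_tdoa_vs_x(tdoa_candidates, x_ground_truth):
    """Fit x ~ a * tdoa + b robustly, treating spurious GCC-PHAT peaks as outliers."""
    tdoa = np.asarray(tdoa_candidates).reshape(-1, 1)
    x = np.asarray(x_ground_truth)
    model = RANSACRegressor(estimator=LinearRegression(), residual_threshold=20.0)
    model.fit(tdoa, x)
    a = model.estimator_.coef_[0]
    b = model.estimator_.intercept_
    return a, b, model.inlier_mask_
```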
0:16:36 | Okay. Now we have two measurement likelihoods. In each frame of the audio-visual data coming in, we have audio measurements as the set of TDOAs, and we have video, or image, measurements as a result of background subtraction followed by morphological operations. |
---|
0:17:02 | How do we use them? How do we fuse this information to find the active targets? |
---|
0:17:12 | We define active targets in terms of the probability of detection values of the system. |
---|
0:17:22 | For example, if an active speaker is considered to be a person who is expected to be visible to the camera no less than ninety-five percent of the time, and to be speaking at least forty percent of the time, then we set the detection probability for the visual data as ninety-five percent and for the audio data as forty percent. |
---|
0:17:47 | By increasing and decreasing these detection probabilities, we can tune how long we expect an active target to be speaking or to be visible. It is application dependent and can be tuned by the user. |
---|
0:18:10 | Then sensor fusion happens by repeating the update step twice; it is that simple. |
---|
0:18:25 | First we do the update step, and I remind you that in the update step of the filter we are using the measurement likelihood functions, which we have. So we do the update step first using the visual measurements and then using the audio measurements. |
---|
0:18:46 | In each of these repetitions we use the corresponding detection probability, and again I remind you that in each step we have the legacy tracks and the measurement-updated tracks, which are tuned by these detection probabilities. |
---|
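Putting the pieces together, the fusion amounts to two consecutive update passes per frame, each with its own detection probability and likelihood. A sketch, reusing the hypothetical predict(), update(), and g_visual() from the earlier snippets and assuming an analogous TDOA likelihood g_audio():

```python
# Detection probabilities encode the definition of an "active" target (values from the example above).
P_D_VIDEO = 0.95   # an active target is expected to be visible at least 95% of the time
P_D_AUDIO = 0.40   # ... and to be speaking at least 40% of the time

def filter_step(components, video_rects, audio_tdoas, motion_model):
    """One filter iteration with audio-visual fusion: predict once, then update twice."""
    components = predict(components, motion_model)                       # prediction step
    components = update(components, video_rects, P_D_VIDEO, g_visual)    # update with visual measurements
    components = update(components, audio_tdoas, P_D_AUDIO, g_audio)     # repeat the update with audio measurements
    return components
```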
0:19:05 | So here are some results. For example, in this case, as you see, people are talking and they are detected and tracked. |
---|
0:19:28 | The sound is not coming through on my machine; let me run it outside the presentation, probably that will fix it. |
---|
0:20:13 | The left frame shows the image result of tracking; the right frame shows the ensemble of particles that I have used to implement the filter, that is, to approximate the density of each Bernoulli component in the random finite set. Here there are one, two, three, four, five, six components. |
---|
0:20:38 | The results that you see on the left frame are actually the averages of all the winning particles corresponding to each track, and as targets enter and leave, the number of tracks follows: here we have one, two, three, four. |
---|
0:21:14 | And here is another example; all of them are from the same database. |
---|
0:21:29 | I will just finish my talk with some quantitative results, because I think we are getting close to the end of the time. |
---|
0:21:36 | In 98.5 percent of all the frames, the existing targets were all detected, and in these frames and cases they were correctly labeled and tracked. |
---|
0:21:48 | Labels were never switched after or during occlusions, and the invisible target was successfully tracked using the audio cues. |
---|
0:21:59 | The false negative ratio, false alarm ratio, and label switching ratio, without and with the audio, are shown here. As you see, the false alarm ratio and label switching ratio are almost zero, or at zero and not available in this case, and they are less than in the case where we are not using the audio data. |
---|
0:22:30 | And I will skip the conclusions. Thank you, and I will be happy to answer your questions. |
---|
0:22:36 | Thank you very much. Okay, I would like to thank all of you for remaining until this time, and all of the speakers for their very good talks. I think we can do the rest also separately; otherwise the noise level in here will increase. |
---|