0:00:18 | Good morning. My name is [inaudible], |
---|
0:00:21 | and I'm here to present this work. |
---|
0:00:28 | This topic has attracted a lot of attention. |
---|
0:00:35 | The general framework of audio-visual signal processing that we start from is the following: we have |
---|
0:00:40 | an audio-visual signal, which is composed |
---|
0:00:45 | of two modalities, the video part |
---|
0:00:48 | and the audio part. |
---|
0:00:49 | The video is recorded with a camera and the audio with a microphone. |
---|
0:00:53 | There could be more cameras or microphones, |
---|
0:00:56 | but here we prefer to address the simplest problem, which is also the most common |
---|
0:01:02 | in this domain. |
---|
0:01:03 | The video and audio signals are very different, |
---|
0:01:07 | but they share the temporal axis. |
---|
0:01:09 | The resolution of these axes is, however, |
---|
0:01:11 | different: |
---|
0:01:12 | in a video we have many more audio samples than video frames, |
---|
0:01:16 | because the sampling rate of the sound is much higher. |
---|
0:01:19 | The main idea, |
---|
0:01:21 | in general, in audio-visual signal processing is |
---|
0:01:25 | to combine |
---|
0:01:26 | both modalities in order to extract |
---|
0:01:29 | the maximum amount of information |
---|
0:01:32 | about a given scene. |
---|
0:01:34 | There are several applications. For example, in speech recognition you can use |
---|
0:01:38 | the video modality |
---|
0:01:40 | in order to better understand speech, |
---|
0:01:43 | or you can |
---|
0:01:45 | combine the information in both signals in order to localize the sound sources. |
---|
0:01:50 | The main assumption in this domain |
---|
0:01:52 | is that related events |
---|
0:01:55 | in both channels happen more or less at the same time. |
---|
0:01:58 | In this example, for instance, you have |
---|
0:02:01 | a guy who is playing a guitar, |
---|
0:02:03 | and the guitar sounds |
---|
0:02:05 | are correlated |
---|
0:02:06 | with the movements of the hand: |
---|
0:02:08 | they happen more or less at the same time. |
---|
0:02:11 | That's the main assumption that we will use in |
---|
0:02:13 | this work. |
---|
0:02:14 | So now let's move to the goal: what we want to do in this work |
---|
0:02:18 | is to extract the audio-visual objects in the scene. |
---|
0:02:23 | It looks like this: we have |
---|
0:02:25 | a sequence where there are |
---|
0:02:27 | two objects. The first one, |
---|
0:02:30 | a speaker in this case, |
---|
0:02:31 | is associated to the soundtrack, |
---|
0:02:34 | and there is |
---|
0:02:35 | another person |
---|
0:02:36 | who is moving the lips, |
---|
0:02:38 | but we cannot listen to this sound. |
---|
0:02:40 | So this one |
---|
0:02:41 | is the target, |
---|
0:02:43 | and this one |
---|
0:02:44 | is the visual distractor. |
---|
0:02:45 | What we want to do is to extract |
---|
0:02:48 | the video part which is associated to the soundtrack, |
---|
0:02:51 | without the interference of |
---|
0:02:53 | the rest. |
---|
0:02:54 | Why do we want to do this? |
---|
0:02:55 | Because |
---|
0:02:56 | many applications in this domain don't use the entire signal; they just use |
---|
0:03:01 | the part of the signal which is associated to the sound. |
---|
0:03:05 | In lip reading, for example, you just need the speaker's lips, |
---|
0:03:08 | or a region |
---|
0:03:09 | around the mouth. |
---|
0:03:11 | We don't care if there is another person or other objects in the scene. |
---|
0:03:19 | This is the procedure that we are going to follow. |
---|
0:03:22 | First, we have |
---|
0:03:24 | this |
---|
0:03:24 | sequence, |
---|
0:03:25 | and what we want to do is to identify the regions |
---|
0:03:29 | whose motion is correlated with the soundtrack: |
---|
0:03:31 | the regions of interest from the audio-visual point of view. |
---|
0:03:34 | For this purpose we use an audio-visual diffusion process. |
---|
0:03:39 | Once |
---|
0:03:40 | we have, |
---|
0:03:40 | here, |
---|
0:03:42 | a map |
---|
0:03:43 | of the correlation between |
---|
0:03:46 | the motion in the video and the soundtrack |
---|
0:03:49 | (the red region, the most correlated one, is |
---|
0:03:51 | the one synchronized with the sound), |
---|
0:03:54 | then we can extract the video regions |
---|
0:03:58 | which are most correlated, |
---|
0:04:00 | shown in white here. |
---|
0:04:02 | As you see in this sequence, for example, |
---|
0:04:04 | we had |
---|
0:04:05 | this person |
---|
0:04:06 | who is |
---|
0:04:06 | speaking, |
---|
0:04:10 | and we can |
---|
0:04:11 | extract this region. |
---|
0:04:13 | Then, once we have this starting point, we use a |
---|
0:04:17 | segmentation approach: |
---|
0:04:19 | we use graph cuts |
---|
0:04:20 | in order to extract the whole region of the audio-visual object, |
---|
0:04:27 | which is more or less correlated to the sound. |
---|
0:04:29 | So we want to extract the region which is |
---|
0:04:32 | homogeneous |
---|
0:04:33 | in color and has a high synchrony with the sound. |
---|
0:04:38 | That's the first |
---|
0:04:39 | part: we want to know where |
---|
0:04:41 | the sound sources are. |
---|
0:04:43 | Here |
---|
0:04:44 | we use an |
---|
0:04:46 | audio-visual diffusion process |
---|
0:04:48 | that was presented at ICASSP |
---|
0:04:50 | last year. |
---|
0:04:52 | The objective is to remove |
---|
0:04:54 | all information which is not associated to the soundtrack. |
---|
0:04:58 | We want to preserve just the information which is interesting from an audio-visual point of view. |
---|
0:05:04 | In this case, for example, there is a hand playing |
---|
0:05:06 | a piano, |
---|
0:05:07 | and there is an object moving in the background. |
---|
0:05:10 | We would like to preserve this information |
---|
0:05:12 | and |
---|
0:05:13 | blur what is not interesting. |
---|
0:05:16 | To do it, |
---|
0:05:17 | what we do is we define |
---|
0:05:19 | an audio-visual diffusion coefficient |
---|
0:05:21 | which is a function of the synchrony between the motion and the sound. |
---|
0:05:26 | Here we see that |
---|
0:05:27 | this diffusion coefficient is a function of this |
---|
0:05:31 | audio-visual synchrony measure: |
---|
0:05:33 | the combination of |
---|
0:05:34 | the audio energy |
---|
0:05:36 | and the temporal derivative of the video signal, |
---|
0:05:39 | which is the motion |
---|
0:05:40 | in the video. |
---|
0:05:42 | Then we can smooth this estimated motion in order to reduce the effect of noise. |
---|
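The synchrony measure just described can be sketched in code. This is an illustrative reconstruction, not the talk's exact estimator: it takes the per-pixel absolute temporal derivative of the video as motion, and correlates it over time with the audio energy envelope (the function name and inputs `frames`, `audio_energy` are hypothetical).

```python
import numpy as np

def av_synchrony(frames, audio_energy):
    """Per-pixel audio-visual synchrony map (illustrative sketch).

    frames:       (T, H, W) grayscale video volume
    audio_energy: (T,) audio energy envelope, one value per frame

    Motion is the absolute temporal derivative of pixel intensity;
    synchrony is its normalized correlation with the audio energy.
    """
    motion = np.abs(np.diff(frames, axis=0))          # (T-1, H, W)
    a = audio_energy[1:].astype(float)                # align with diffs
    a = (a - a.mean()) / (a.std() + 1e-9)             # zero-mean, unit-std
    m = motion - motion.mean(axis=0)                  # center per pixel
    corr = (m * a[:, None, None]).mean(axis=0)
    return corr / (motion.std(axis=0) + 1e-9)         # normalized per pixel
```

A pixel whose motion follows the audio envelope gets a value near one; uncorrelated or static pixels stay near zero.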
0:05:48 | As you can see here, the diffusion coefficient |
---|
0:05:52 | is a function of this synchrony measure. |
---|
0:05:54 | When the synchrony |
---|
0:05:56 | is high, |
---|
0:05:57 | then the diffusion |
---|
0:05:59 | stops: |
---|
0:06:00 | the diffusion coefficient is close to zero. |
---|
0:06:03 | When the synchrony is low, |
---|
0:06:07 | so the video activity |
---|
0:06:08 | and the sounds are not correlated, |
---|
0:06:11 | then |
---|
0:06:12 | the diffusion coefficient is constant and equal to one, and the region is blurred. |
---|
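A minimal sketch of the behavior just described, assuming an exponentially decreasing coefficient (the talk does not give the exact functional form): where synchrony is high the coefficient goes to zero and diffusion stops, and where it is low the coefficient is one and the image is smoothed.

```python
import numpy as np

def diffusion_coefficient(sync, k=1.0):
    """~1 where audio-visual synchrony is low (region gets blurred),
    ~0 where synchrony is high (region is preserved).
    The exponential form is an assumption, not the talk's exact choice."""
    return np.exp(-(np.maximum(sync, 0.0) / k) ** 2)

def diffuse_step(img, sync, dt=0.2, k=1.0):
    """One explicit diffusion step, modulated pixel-wise by the
    audio-visual coefficient (4-neighbour Laplacian, periodic borders)."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
           np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)
    return img + dt * diffusion_coefficient(sync, k) * lap
```

Iterating `diffuse_step` blurs distractor regions while leaving high-synchrony pixels essentially untouched (dt is kept below the 0.25 stability limit of the explicit scheme).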
0:06:17 | And here is the result |
---|
0:06:19 | on this sequence. |
---|
0:06:21 | I'm not sure if you can see it, but here |
---|
0:06:23 | the salient points stay sharp, |
---|
0:06:26 | though it is difficult to see: |
---|
0:06:28 | the diffusion smooths everything but |
---|
0:06:30 | the audio-visual object, |
---|
0:06:32 | and blurs |
---|
0:06:34 | the distractor. |
---|
0:06:36 | So let's see what happens |
---|
0:06:39 | to the motion in this sequence. |
---|
0:06:41 | Here we have |
---|
0:06:44 | (I'm not sure if you can see it) |
---|
0:06:45 | the motion in one |
---|
0:06:47 | frame, |
---|
0:06:48 | which is |
---|
0:06:49 | equally distributed |
---|
0:06:51 | between the distracting moving object, the head |
---|
0:06:54 | of the rocking horse, |
---|
0:06:56 | and |
---|
0:06:56 | the audio-visual object. |
---|
0:06:58 | After the diffusion process, |
---|
0:07:02 | the main intensity of the motion |
---|
0:07:05 | is situated in the audio-visual object. |
---|
0:07:08 | Now, |
---|
0:07:10 | since we don't want to penalize |
---|
0:07:12 | regions with low |
---|
0:07:13 | motion, what we need to do is to compare |
---|
0:07:17 | the motion after the diffusion |
---|
0:07:19 | to the motion before |
---|
0:07:20 | the diffusion, |
---|
0:07:22 | so that |
---|
0:07:23 | we see how each region has been diffused |
---|
0:07:27 | through |
---|
0:07:28 | the |
---|
0:07:29 | process. |
---|
0:07:30 | Here is the result. |
---|
0:07:32 | Again you may not see it, but if we plot |
---|
0:07:35 | just the highest values for these features, |
---|
0:07:38 | we see that at the beginning |
---|
0:07:40 | the high values of the original motion |
---|
0:07:43 | are equally distributed between the |
---|
0:07:46 | audio-visual object |
---|
0:07:47 | and the distractor, |
---|
0:07:50 | so we have more or less the same number of points |
---|
0:07:52 | in the hand |
---|
0:07:53 | and in the head of the horse. |
---|
0:07:55 | After the diffusion, most of the |
---|
0:07:57 | high values are already situated in the hand, |
---|
0:08:01 | which is generating the sound. |
---|
0:08:03 | And finally, when comparing both, |
---|
0:08:05 | there are just |
---|
0:08:07 | two spots |
---|
0:08:08 | where |
---|
0:08:09 | the |
---|
0:08:10 | high values are |
---|
0:08:12 | misclassified. |
---|
0:08:15 | Now we have the points where we have the highest correlation, |
---|
0:08:20 | and what we want to do is to |
---|
0:08:22 | extract |
---|
0:08:23 | the entire region. |
---|
0:08:24 | So what we do is we use an audio-visual segmentation approach. |
---|
0:08:28 | We need some starting points |
---|
0:08:30 | for the segmentation process; |
---|
0:08:32 | we treat them |
---|
0:08:33 | like |
---|
0:08:34 | seeds for the source. |
---|
0:08:36 | So the starting points for the source |
---|
0:08:38 | are the points, |
---|
0:08:39 | the pixels, with the highest |
---|
0:08:42 | audio-visual synchrony: the pixels or the regions which move according to the soundtrack. |
---|
0:08:47 | And since |
---|
0:08:48 | we don't want to make any assumption about the background, |
---|
0:08:51 | we don't want to say that |
---|
0:08:52 | the regions |
---|
0:08:54 | which are less correlated to the soundtrack |
---|
0:08:57 | are the background. |
---|
0:08:58 | We don't want to make this assumption, to pretend to know anything about the background, |
---|
0:09:01 | but what we do |
---|
0:09:02 | instead is |
---|
0:09:03 | to fix |
---|
0:09:05 | the seeds for the background |
---|
0:09:07 | on the image border. |
---|
0:09:08 | And then |
---|
0:09:09 | we don't fix any seed inside the image, because we don't want to condition the result: |
---|
0:09:14 | pixels with low synchrony could in fact belong to the source, |
---|
0:09:19 | so we prefer not to fix |
---|
0:09:21 | them. |
---|
0:09:25 | And these are the results |
---|
0:09:26 | with our method. |
---|
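The seeding strategy just described can be sketched as follows, assuming the synchrony map is available as a 2-D array (the function name and parameters are illustrative, not from the talk): foreground seeds at the pixels with the highest audio-visual synchrony, background seeds only on the image border, and no seed anywhere else.

```python
import numpy as np

def make_seeds(sync, n_fg=5, border=1):
    """Foreground seeds: the n_fg pixels with the highest audio-visual
    synchrony. Background seeds: only the image border, so no assumption
    is made about low-synchrony pixels inside the frame."""
    fg = np.zeros(sync.shape, dtype=bool)
    idx = np.argsort(sync, axis=None)[-n_fg:]     # flat indices of top values
    fg[np.unravel_index(idx, sync.shape)] = True

    bg = np.zeros(sync.shape, dtype=bool)
    bg[:border, :] = True
    bg[-border:, :] = True
    bg[:, :border] = True
    bg[:, -border:] = True
    bg &= ~fg      # a border pixel chosen as foreground stays foreground
    return fg, bg
```

All interior pixels remain unlabeled, so the graph cut is free to assign them to either side.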
0:09:28 | Now I will explain |
---|
0:09:29 | how, from these |
---|
0:09:31 | starting points, |
---|
0:09:32 | we reach this result. |
---|
0:09:35 | What we use is a graph cut |
---|
0:09:37 | segmentation approach, |
---|
0:09:39 | and we introduce an audio-visual term |
---|
0:09:42 | whose purpose is to keep together regions with high audio-visual synchrony. |
---|
0:09:46 | So we have, |
---|
0:09:47 | typically, this energy that we want to minimize. |
---|
0:09:50 | First we have the data term, |
---|
0:09:52 | which compares the color of each |
---|
0:09:54 | pixel with the estimated color models for the background |
---|
0:09:57 | and the foreground. |
---|
0:09:58 | Then we have |
---|
0:09:59 | the boundary term, |
---|
0:10:01 | which keeps together |
---|
0:10:03 | neighboring pixels which have similar colors. |
---|
0:10:06 | And we define |
---|
0:10:08 | this audio-visual term |
---|
0:10:10 | so that it keeps together regions |
---|
0:10:12 | which present a high |
---|
0:10:14 | audio-visual synchrony. |
---|
0:10:17 | So the |
---|
0:10:18 | first two terms are commonly used, |
---|
0:10:20 | and the last term |
---|
0:10:21 | is new. |
---|
0:10:23 | So let's study |
---|
0:10:26 | more deeply this audio-visual term. |
---|
0:10:28 | Its purpose is to keep together regions with high audio-visual synchrony, |
---|
0:10:32 | but, |
---|
0:10:33 | in contrast, it doesn't affect regions with low synchrony. |
---|
0:10:37 | So what we do is we define it like this, |
---|
0:10:40 | and |
---|
0:10:41 | it is proportional to the audio-visual coherence. |
---|
0:10:44 | When two neighboring pixels have high and similar |
---|
0:10:48 | audio-visual synchrony, then we keep them together through the segmentation process; |
---|
0:10:52 | they act like a block. |
---|
0:10:54 | In contrast, when two neighboring pixels have very different |
---|
0:10:58 | audio-visual synchrony, |
---|
0:11:00 | they are likely to be situated one on each side of a boundary, and we don't link them. |
---|
0:11:04 | And when the audio-visual synchrony is low, |
---|
0:11:07 | we don't do anything: this term doesn't affect the segmentation, |
---|
0:11:11 | and we let the other terms, the data term and the boundary term, |
---|
0:11:16 | do the work. |
---|
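The three cases just described can be sketched as a pairwise weight between neighboring pixels. The threshold and the exponential form are illustrative assumptions; the talk only specifies the qualitative behavior of the term.

```python
import numpy as np

def av_pairwise_weight(s_p, s_q, thresh=0.5, lam=1.0):
    """Extra graph-cut cost for cutting the edge between neighbours p, q,
    given their audio-visual synchrony values s_p, s_q (a sketch):
      - both low             -> 0 (term inactive, other terms decide)
      - high and similar     -> large cost, pixels kept together
      - high but dissimilar  -> small cost, a likely object boundary"""
    if max(s_p, s_q) < thresh:
        return 0.0
    return lam * np.exp(-abs(s_p - s_q))
```

In a full energy, this weight would simply be added to the usual color-based boundary weight on each neighboring pair before running the min-cut.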
0:11:17 | So here |
---|
0:11:19 | are the starting points of the segmentation, the seeds: |
---|
0:11:23 | for the source, |
---|
0:11:24 | and, for the background, the seeds situated on the image border. |
---|
0:11:29 | And here we see that |
---|
0:11:31 | in this case the left person is speaking, so most of the seeds |
---|
0:11:34 | are around the mouth |
---|
0:11:36 | of this person; |
---|
0:11:37 | then the right person is speaking |
---|
0:11:40 | in the bottom row. |
---|
0:11:42 | And these are the results. |
---|
0:11:44 | Without the audio-visual term, |
---|
0:11:46 | we extract just |
---|
0:11:48 | a part of the mouth of the speaker. |
---|
0:11:51 | And when we add this term, |
---|
0:11:53 | the whole mouth of the speaker is classified as a block, |
---|
0:11:57 | and we can extract |
---|
0:11:59 | a bigger region: |
---|
0:12:00 | in this case we extract the whole face, or |
---|
0:12:03 | the whole |
---|
0:12:04 | mouth. |
---|
0:12:06 | Let's compare our method with a previous method |
---|
0:12:10 | that we found, |
---|
0:12:10 | here. |
---|
0:12:12 | The main difference between our method and previous methods |
---|
0:12:15 | is that |
---|
0:12:16 | they assume that |
---|
0:12:19 | pixels or regions presenting a low |
---|
0:12:22 | audio-visual synchrony |
---|
0:12:23 | cannot belong to the foreground; they cannot belong to the source. |
---|
0:12:28 | In our case we don't make |
---|
0:12:30 | this assumption, since we want to extract |
---|
0:12:32 | a region which is |
---|
0:12:34 | homogeneous |
---|
0:12:35 | in color, |
---|
0:12:37 | so that we can extract regions like, for example, the forehead |
---|
0:12:41 | of the speaker. |
---|
0:12:42 | In that case, |
---|
0:12:43 | as you see here, the forehead is not extracted; they cannot extract the whole face, |
---|
0:12:47 | because they assume that if the synchrony is low, |
---|
0:12:50 | the region cannot be part of the source. |
---|
0:12:53 | In our case, |
---|
0:12:54 | since we do not make this assumption, |
---|
0:12:58 | the results are |
---|
0:12:59 | more satisfactory. |
---|
0:13:02 | More results, |
---|
0:13:04 | with distracting motion here: |
---|
0:13:06 | the left person is speaking and the right person is just moving the lips. |
---|
0:13:11 | We extract the face of the |
---|
0:13:13 | speaker. Our method works with |
---|
0:13:15 | general |
---|
0:13:16 | audio-visual sources; |
---|
0:13:17 | that's why we don't use a face detector: because we want to extract any kind of audio-visual source, |
---|
0:13:22 | not just |
---|
0:13:23 | speakers. |
---|
0:13:24 | So in fact the hand which is playing the piano |
---|
0:13:28 | is extracted, and the rocking horse |
---|
0:13:30 | is not. |
---|
0:13:31 | And what happens |
---|
0:13:32 | if we have |
---|
0:13:33 | two persons that |
---|
0:13:35 | speak at the same time? |
---|
0:13:36 | In fact, we don't force our algorithm to choose |
---|
0:13:40 | between them. |
---|
0:13:41 | In some frames |
---|
0:13:43 | one person will be more synchronized, so |
---|
0:13:46 | there will be more seeds |
---|
0:13:48 | in the mouth of one person; in other frames |
---|
0:13:51 | there will be more seeds in the mouth of the other person. |
---|
0:13:55 | But |
---|
0:13:55 | in general we can extract |
---|
0:13:57 | the two of them, |
---|
0:13:58 | without making an assumption about |
---|
0:14:01 | just |
---|
0:14:01 | one. |
---|
0:14:04 | Some results on more sequences. |
---|
0:14:07 | The first of them is with speakers: |
---|
0:14:09 | at the beginning nobody speaks, |
---|
0:14:12 | then the right person will speak, |
---|
0:14:14 | and then |
---|
0:14:15 | the left person will start speaking. |
---|
0:14:18 | We would like to extract first the face |
---|
0:14:20 | of the right |
---|
0:14:21 | person, |
---|
0:14:22 | and then the face of the left person. |
---|
0:14:26 | Let's see the result. |
---|
0:14:38 | So when the right person stops speaking, |
---|
0:14:41 | our method is able to |
---|
0:14:43 | stop the extraction, |
---|
0:14:47 | and we extract the face of the person that is speaking. |
---|
0:14:51 | Our method deals with general audio-visual sources, even with |
---|
0:14:54 | distracting motion. |
---|
0:14:56 | As you can see in the first frame, |
---|
0:14:58 | here |
---|
0:14:59 | we have a hand which is playing a piano again, |
---|
0:15:02 | and there is a fan moving at the bottom |
---|
0:15:05 | during the entire sequence. |
---|
0:15:07 | Then, |
---|
0:15:08 | in the first frames, |
---|
0:15:09 | we extract |
---|
0:15:11 | this |
---|
0:15:11 | audio-visual object, |
---|
0:15:12 | the object which is interesting for us, |
---|
0:15:15 | but we |
---|
0:15:16 | get a little bit of the fan. |
---|
0:15:18 | But you can see that |
---|
0:15:19 | this disappears |
---|
0:15:21 | quickly, and we |
---|
0:15:22 | keep on extracting the hand. |
---|
0:15:32 | At the end we extract also the keyboard. |
---|
0:15:35 | The thing is that |
---|
0:15:36 | when the fingers |
---|
0:15:37 | press the keys, |
---|
0:15:39 | there is some motion in the keys which is associated |
---|
0:15:43 | to the sound: it happens at the same time as the sound. |
---|
0:15:47 | This is normal, because the fingers are pushing the keys, |
---|
0:15:50 | and then the sound occurs at the same time. |
---|
0:15:53 | So there |
---|
0:15:54 | are seeds which are situated on the keys |
---|
0:15:58 | of the keyboard, |
---|
0:15:59 | and since this region is also homogeneous in terms of color, |
---|
0:16:04 | we keep extracting, and we extract the whole |
---|
0:16:07 | keyboard. |
---|
0:16:08 | Notice however that we don't extract the black keys, because they are not pressed at any time. |
---|
0:16:15 | So, some conclusions. I have presented a method to extract audio-visual objects from |
---|
0:16:21 | a scene. |
---|
0:16:23 | This method is based on the main assumption in this domain, which states that |
---|
0:16:27 | motion |
---|
0:16:28 | and sound happen at the same time: related events in the audio and video channels are more or less synchronous, |
---|
0:16:34 | so the video motion |
---|
0:16:36 | is more or less synchronous with the appearance of the sounds in the soundtrack. |
---|
0:16:41 | Our method can deal with any kind of audio-visual sources, |
---|
0:16:46 | even with different temporal patterns of audio-visual activity, and it is |
---|
0:16:52 | able to extract more complete |
---|
0:16:54 | audio-visual objects, |
---|
0:16:56 | since we don't require |
---|
0:16:58 | all the region |
---|
0:16:59 | to be synchronized with the sound. |
---|
0:17:01 | The main limitation |
---|
0:17:03 | is that, since our approach is unsupervised, we cannot |
---|
0:17:06 | control the aspect of the extracted region. |
---|
0:17:11 | If we wanted to |
---|
0:17:13 | make it semi-supervised |
---|
0:17:14 | and say "we want the region to look |
---|
0:17:17 | like this", then we would compromise |
---|
0:17:19 | the unsupervised nature of this approach. |
---|
0:17:24 | Another thing that we |
---|
0:17:26 | could do, with a semi-supervised approach, |
---|
0:17:29 | is for example, |
---|
0:17:31 | when we have an audio-visual sequence where there are multiple sources, to allow the user to |
---|
0:17:36 | choose |
---|
0:17:37 | the source |
---|
0:17:38 | to be extracted, |
---|
0:17:39 | and then extract both the video part of this source |
---|
0:17:44 | and |
---|
0:17:45 | also the audio part of this source. |
---|
0:17:50 | So, do you have any questions? |
---|