0:00:13 | this is the multimedia signal processing session, jointly on uh audio and visual signal processing |
---|
0:00:19 | we have uh |
---|
0:00:21 | several papers |
---|
0:00:22 | and uh for each one we have more or less twenty minutes, which is how long each talk will last |
---|
0:00:27 | uh |
---|
0:00:28 | uh we start with the first paper, which is on audio-visual synchronization recovery in multimedia content |
---|
0:00:35 | presented by uh the first author, okay |
---|
0:00:38 | uh and the uh co-authors are uh from the Swiss Federal Institute of Technology |
---|
0:00:43 | in Lausanne, Switzerland |
---|
0:00:45 | please |
---|
0:00:50 | uh good morning everyone, my name is uh [unintelligible], uh from EPFL, Switzerland |
---|
0:00:54 | and uh the title of my talk is |
---|
0:00:56 | audio-visual synchronization recovery in multimedia content |
---|
0:01:01 | this is the outline of my talk: first i'm gonna introduce the general problem of audio-visual synchronisation in |
---|
0:01:07 | multimedia content |
---|
0:01:09 | and then i'm going to explain the contribution of my work |
---|
0:01:12 | and then i will |
---|
0:01:14 | explain in detail uh the proposed method |
---|
0:01:18 | and some details about the correlation measures used to measure the correlation between audio and video signals |
---|
0:01:24 | and then i will show some experimental results and then |
---|
0:01:27 | i'm |
---|
0:01:28 | going to conclude my talk with uh a summary and |
---|
0:01:31 | future work |
---|
0:01:35 | so |
---|
0:01:35 | the problem of audio-visual synchronization in multimedia content can be explained in this context |
---|
0:01:41 | so |
---|
0:01:43 | when you have some multimedia content it |
---|
0:01:45 | usually contains both audio and video |
---|
0:01:48 | so |
---|
0:01:49 | when you talk about the quality of multimedia |
---|
0:01:52 | we have |
---|
0:01:53 | the quality components |
---|
0:01:54 | from these two uh |
---|
0:01:56 | two modalities |
---|
0:01:57 | so for audio |
---|
0:01:59 | we have loudness, noise, uh jitter |
---|
0:02:01 | components |
---|
0:02:03 | and in video we have |
---|
0:02:05 | blurriness, jerkiness |
---|
0:02:07 | pixel noise et cetera |
---|
0:02:09 | but another important part is that |
---|
0:02:12 | the two signals have some mutual |
---|
0:02:16 | uh interaction |
---|
0:02:17 | so for example the qualities of the two signals |
---|
0:02:20 | mutually interact with each other |
---|
0:02:23 | and also there is a problem of synchronization of the two modalities |
---|
0:02:27 | so this is the problem that i wanna talk about |
---|
0:02:31 | so |
---|
0:02:32 | usually we expect synchronized audio and video signals in our life |
---|
0:02:36 | this is uh my wife and uh |
---|
0:02:39 | if she calls my name then i expect this |
---|
0:02:43 | shape of mouth |
---|
0:02:44 | uh at the same time |
---|
0:02:46 | and |
---|
0:02:47 | this is our expectation |
---|
0:02:49 | uh of the synchronization in our daily life |
---|
0:02:53 | and there are some studies about this synchronization problem in uh audio and video signals |
---|
0:03:00 | and people found that there is also some tolerance in the synchronisation, for example |
---|
0:03:06 | there is |
---|
0:03:08 | an inter-sensory integration window which is about two hundred milliseconds wide |
---|
0:03:13 | during which the audio-visual perception is not degraded |
---|
0:03:17 | when the synchronization error is within this uh |
---|
0:03:20 | bound |
---|
0:03:21 | so for example |
---|
0:03:22 | if you see this graph |
---|
0:03:24 | the |
---|
0:03:25 | the two signals |
---|
0:03:26 | even if they are not perfectly uh synchronised |
---|
0:03:30 | if they are in this |
---|
0:03:31 | uh area |
---|
0:03:32 | then people still |
---|
0:03:34 | perceive |
---|
0:03:35 | that the two signals are synchronised |
---|
0:03:38 | so based on many studies of uh synchronization, there is also a standard document from ITU |
---|
0:03:45 | so ITU, this document, specifies the acceptability threshold |
---|
0:03:50 | uh as around uh plus/minus one hundred milliseconds |
---|
0:03:53 | and uh |
---|
0:03:54 | so |
---|
0:03:55 | the |
---|
0:03:56 | uh standards or uh multimedia systems |
---|
0:03:59 | should follow this guideline |
---|
0:04:02 | but uh |
---|
0:04:03 | over this boundary |
---|
0:04:06 | people |
---|
0:04:07 | uh start to perceive the |
---|
0:04:10 | uh misaligned |
---|
0:04:12 | uh audio and video signals |
---|
0:04:15 | so |
---|
0:04:17 | but |
---|
0:04:17 | unfortunately you may have some asynchrony in the audio-visual |
---|
0:04:22 | signals |
---|
0:04:23 | and this |
---|
0:04:24 | may happen during |
---|
0:04:26 | all steps in the multimedia processing chain |
---|
0:04:28 | so for example in acquisition |
---|
0:04:30 | we know the speed |
---|
0:04:31 | of light and the speed of sound are different |
---|
0:04:34 | and during the editing they may have different processing times |
---|
0:04:39 | or people can simply make a mistake |
---|
0:04:41 | and during uh transmission they may suffer from different network |
---|
0:04:45 | transfer delays |
---|
0:04:46 | or during the |
---|
0:04:48 | restitution maybe they have different uh delays in decoding |
---|
0:04:52 | uh the result of this uh asynchrony |
---|
0:04:55 | is first of all the quality is degraded, so maybe uh people get angry about that |
---|
0:05:02 | and |
---|
0:05:03 | uh furthermore |
---|
0:05:04 | people may not understand the content |
---|
0:05:06 | actually |
---|
0:05:08 | so to solve this problem in |
---|
0:05:10 | our work |
---|
0:05:12 | we developed an automatic algorithm to detect |
---|
0:05:16 | whether there is uh asynchrony between the audio and video signals |
---|
0:05:20 | and recover the original synchronization |
---|
0:05:23 | and uh for this we exploit the audio-visual |
---|
0:05:26 | correlation structure |
---|
0:05:28 | which is inherent |
---|
0:05:30 | in the two signals |
---|
0:05:32 | so the features of the method are: first, we don't have any assumption |
---|
0:05:36 | on the content |
---|
0:05:37 | so |
---|
0:05:38 | therefore we don't need any training |
---|
0:05:41 | and also this can be uh applied to any kind of content |
---|
0:05:45 | both speech and non-speech content |
---|
0:05:48 | as long as there is uh a visible motion |
---|
0:05:51 | responsible for the |
---|
0:05:53 | sound |
---|
0:05:54 | and uh in particular we |
---|
0:05:56 | uh |
---|
0:05:57 | used two different correlation measures |
---|
0:06:00 | and we compared the results |
---|
0:06:05 | so let me explain in detail the proposed method |
---|
0:06:09 | the idea is quite simple |
---|
0:06:10 | so when we have |
---|
0:06:12 | two uh audio and video signals |
---|
0:06:15 | we don't know whether they are uh |
---|
0:06:18 | aligned well or not |
---|
0:06:20 | we shift the audio signal relative to the video signal step by step |
---|
0:06:25 | and measure the correlation |
---|
0:06:27 | and we find the time shift at the moment where |
---|
0:06:30 | we get the maximum correlation between the two signals |
---|
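The shift-and-search idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `correlation` stands for whichever measure is plugged in (mutual information or canonical correlation), and all names are made up for the sketch.

```python
import numpy as np

def estimate_time_shift(audio_feat, video_feat, max_shift, correlation):
    """Shift the audio features relative to the video features step by
    step and return the shift (in frames) giving the highest correlation."""
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        # overlap the two feature sequences at this hypothetical shift
        if shift >= 0:
            a = audio_feat[shift:]
            v = video_feat[:len(video_feat) - shift]
        else:
            a = audio_feat[:shift]
            v = video_feat[-shift:]
        n = min(len(a), len(v))
        c = correlation(a[:n], v[:n])
        if c > best_corr:
            best_shift, best_corr = shift, c
    return best_shift, best_corr
```

With a simple Pearson correlation plugged in, a signal delayed by five frames is recovered as a shift of five.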
0:06:34 | so the algorithm can be summarised |
---|
0:06:36 | like this |
---|
0:06:38 | so the first step is to extract some features |
---|
0:06:42 | and then we divide the signal |
---|
0:06:44 | into some small uh units |
---|
0:06:47 | where we can apply some correlation analysis |
---|
0:06:50 | so first we divide the whole signal |
---|
0:06:54 | in the temporal dimension so that we have some small |
---|
0:06:58 | uh segments |
---|
0:06:59 | we call it a temporal block here |
---|
0:07:01 | and this is applied for both audio and video |
---|
0:07:04 | and then |
---|
0:07:05 | further we segment the video signal, uh the image frames, into small |
---|
0:07:10 | tiles |
---|
0:07:11 | which uh in our case are four by four pixels |
---|
0:07:16 | and uh |
---|
0:07:18 | uh |
---|
0:07:19 | by doing this we |
---|
0:07:22 | find where the sound is actually coming from |
---|
0:07:27 | so then |
---|
0:07:28 | for each hypothetical time shift |
---|
0:07:30 | so uh this hypothetical time shift means we |
---|
0:07:34 | shift the audio signal step by step, one by one |
---|
0:07:37 | and then for each temporal block we do some analysis |
---|
0:07:41 | and then get the correlation |
---|
0:07:43 | and the correlation is the maximum correlation |
---|
0:07:46 | uh between the time-shifted audio and |
---|
0:07:50 | the |
---|
0:07:51 | video signal in uh each tile |
---|
0:07:54 | and |
---|
0:07:55 | after we measure |
---|
0:07:57 | the correlation over the whole image frame, we take the maximum |
---|
0:08:01 | and we expect |
---|
0:08:02 | the |
---|
0:08:03 | location |
---|
0:08:04 | uh having this maximum correlation |
---|
0:08:07 | is |
---|
0:08:08 | the sound source |
---|
0:08:10 | and then uh we |
---|
0:08:13 | compute the average of this maximum correlation over the temporal blocks, so from the beginning of the signal |
---|
0:08:18 | to the end of the signal |
---|
0:08:20 | and we perform this |
---|
0:08:22 | so then |
---|
0:08:23 | now for each time shift we have the correlation measure |
---|
0:08:27 | of the two signals |
---|
0:08:28 | and then we choose the maximum value |
---|
0:08:33 | and finally |
---|
0:08:35 | after all these |
---|
0:08:36 | steps |
---|
0:08:36 | we find the time shift at uh |
---|
0:08:40 | a finer resolution |
---|
0:08:42 | here the time shift is done at the resolution of the video frame rate |
---|
0:08:47 | so |
---|
0:08:47 | we get the correlation measures at each uh video |
---|
0:08:51 | frame |
---|
0:08:53 | so when you have this kind of correlation curve |
---|
0:08:56 | for different time shifts |
---|
0:08:58 | let's say we get the maximum here, but actually |
---|
0:09:00 | we do the parabolic fitting over the three points |
---|
0:09:04 | and then we get the maximum value here, so this is the final time shift that we can get |
---|
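The three-point parabolic refinement mentioned above is a standard interpolation trick; a minimal sketch, with names chosen for illustration: given the correlation at the best discrete shift and at its two neighbours, the vertex of the fitted parabola gives a sub-frame offset.

```python
def parabolic_refine(c_prev, c_max, c_next):
    """Fit a parabola through the correlation at the best discrete
    shift (c_max) and its two neighbours, and return the sub-frame
    offset of the vertex, in units of one frame (within [-0.5, 0.5])."""
    denom = c_prev - 2.0 * c_max + c_next
    if denom == 0.0:          # flat triple: keep the discrete maximum
        return 0.0
    return 0.5 * (c_prev - c_next) / denom
```

Multiplying the returned offset by the video frame period converts it to seconds and adds it to the discrete shift.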
0:09:11 | so um |
---|
0:09:13 | yeah, it is quite clear here, but the question is what kind of correlation measure we can use |
---|
0:09:20 | so i compared two different methods |
---|
0:09:22 | one is the mutual information and the other one is canonical correlation |
---|
0:09:28 | the mutual information is |
---|
0:09:30 | as you know |
---|
0:09:31 | probably well known; it's uh |
---|
0:09:33 | a measure of the uh |
---|
0:09:35 | dependence between two signals |
---|
0:09:38 | and |
---|
0:09:38 | in particular |
---|
0:09:39 | i used the quadratic mutual information proposed |
---|
0:09:43 | uh in a previous work |
---|
0:09:46 | this uses uh the |
---|
0:09:48 | quadratic entropy |
---|
0:09:50 | and it |
---|
0:09:51 | also uses the Parzen pdf estimation for estimating |
---|
0:09:55 | the marginal and the joint pdfs |
---|
0:10:00 | so the equation is given by this |
---|
0:10:03 | so |
---|
0:10:04 | and here we need to |
---|
0:10:05 | model each pdf |
---|
0:10:08 | using uh a sum of |
---|
0:10:10 | uh Gaussian |
---|
0:10:11 | kernels |
---|
0:10:12 | and so |
---|
0:10:13 | since |
---|
0:10:14 | uh |
---|
0:10:15 | since it's a Parzen pdf estimation |
---|
0:10:18 | the kernels are set on each data point |
---|
0:10:21 | and |
---|
0:10:22 | we have a parameter |
---|
0:10:24 | that has to be |
---|
0:10:26 | uh fixed |
---|
0:10:27 | which is the |
---|
0:10:28 | width of the Gaussian kernels; this is a user parameter that we have to set |
---|
0:10:33 | in our experiments uh we did some search and then |
---|
0:10:37 | took the best one |
---|
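A sketch of quadratic mutual information with Gaussian Parzen windows, in the Euclidean-distance form common in the information-theoretic-learning literature; whether this is the exact variant used in the talk is an assumption. Scalar features are assumed, and `sigma` is the user-set kernel width discussed above.

```python
import numpy as np

def gaussian(d, var):
    """1-D Gaussian kernel evaluated at pairwise differences d."""
    return np.exp(-d * d / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def quadratic_mutual_information(x, y, sigma):
    """Euclidean-distance quadratic MI between scalar sample vectors
    x and y. Each pdf is a Parzen estimate (a sum of Gaussian kernels
    of width sigma centred on the data points); the product of two
    such kernels integrates to a Gaussian of variance 2*sigma**2,
    which is why the pairwise terms below use that variance."""
    gx = gaussian(x[:, None] - x[None, :], 2.0 * sigma ** 2)
    gy = gaussian(y[:, None] - y[None, :], 2.0 * sigma ** 2)
    v_joint = np.mean(gx * gy)                            # joint-pdf term
    v_marg = np.mean(gx) * np.mean(gy)                    # marginals term
    v_cross = np.mean(gx.mean(axis=1) * gy.mean(axis=1))  # cross term
    # squared L2 distance between the joint pdf and the product of marginals
    return v_joint + v_marg - 2.0 * v_cross
```

The measure is non-negative and grows with dependence, so a strongly coupled pair should score higher than an independent one.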
0:10:42 | the other correlation measure is the canonical correlation; it is a measure of correlation in the space where the projected |
---|
0:10:48 | uh |
---|
0:10:50 | signals |
---|
0:10:50 | have |
---|
0:10:51 | the maximum correlation |
---|
0:10:53 | so |
---|
0:10:54 | finding this projection |
---|
0:10:56 | is uh equivalent to finding |
---|
0:10:58 | a common representation space of the two signals |
---|
0:11:02 | so |
---|
0:11:03 | this is the equation of the |
---|
0:11:06 | canonical correlation, so as you can see |
---|
0:11:09 | we need to find this uh projection vector W here |
---|
0:11:13 | uh which projects the input vectors X and Y, which |
---|
0:11:19 | correspond to the audio and video |
---|
0:11:21 | uh |
---|
0:11:22 | signals |
---|
0:11:23 | and we try to maximise uh |
---|
0:11:26 | the correlation measure |
---|
0:11:28 | and this problem can be solved as an eigenvalue problem |
---|
0:11:31 | which is uh available in many |
---|
0:11:34 | uh publications |
---|
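The canonical correlation step can be sketched as follows; rather than the eigenvalue formulation the talk refers to, this sketch uses the equivalent whitened-SVD route, and all names are illustrative.

```python
import numpy as np

def canonical_correlation(X, Y, reg=1e-6):
    """First canonical correlation between feature matrices X (n x p)
    and Y (n x q): the maximum correlation between the projections
    X @ wx and Y @ wy. Computed as the largest singular value of the
    whitened cross-covariance; reg regularises the covariances."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # inverse matrix square root of a symmetric positive-definite matrix
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(M, compute_uv=False)[0]
```

The singular vectors of M, mapped back through the whitening, give the projection vectors wx and wy; a near-linear dependence between X and Y yields a canonical correlation close to one, while independent signals score low.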
0:11:37 | so these are the two correlation measures that i used |
---|
0:11:40 | so |
---|
0:11:42 | now let me explain some experimental results |
---|
0:11:46 | so i tested the algorithm on three audio-visual sequences; two are |
---|
0:11:52 | speech and the other one is non-speech |
---|
0:11:55 | and i simulated the asynchrony between zero to uh plus/minus one second |
---|
0:12:01 | and for |
---|
0:12:02 | features i use uh quite simple methods |
---|
0:12:04 | "'cause" uh i found these |
---|
0:12:06 | to work very well but |
---|
0:12:08 | of course more complex ones can also be used |
---|
0:12:11 | for visual features i take the pixel values and then uh take the uh |
---|
0:12:17 | derivative |
---|
0:12:18 | along the time dimension, and also for the audio feature i computed the energy |
---|
0:12:24 | and then took the derivative in the temporal dimension |
---|
0:12:28 | and the analysis uh |
---|
0:12:30 | unit in time |
---|
0:12:31 | was uh fifty video frames which corresponds to around |
---|
0:12:36 | two seconds |
---|
0:12:37 | depending on the sequence |
---|
0:12:39 | and |
---|
0:12:39 | as i mentioned the spatial tile |
---|
0:12:41 | was four by four pixels |
---|
0:12:43 | but this is after down-sampling the image frames |
---|
0:12:47 | the image frame was down-sampled |
---|
0:12:49 | uh |
---|
0:12:49 | to |
---|
0:12:50 | one |
---|
0:12:51 | uh sixteenth |
---|
0:12:53 | so uh |
---|
0:12:54 | one fourth in each spatial dimension |
---|
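The simple features described above (per-tile temporal derivative of pixel values, temporal derivative of the audio energy) might look like the following; array shapes and names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def visual_features(frames, tile=4):
    """frames: (T, H, W) grayscale video, already down-sampled.
    Per-tile mean absolute temporal difference of pixel values:
    returns an array of shape (T-1, H//tile, W//tile)."""
    diff = np.abs(np.diff(frames.astype(float), axis=0))
    t, h, w = diff.shape
    ht, wt = (h // tile) * tile, (w // tile) * tile   # trim to the tile grid
    tiles = diff[:, :ht, :wt].reshape(t, h // tile, tile, w // tile, tile)
    return tiles.mean(axis=(2, 4))

def audio_features(samples, samples_per_frame):
    """Short-time audio energy per video frame, then its temporal
    derivative, so both modalities share the video frame rate."""
    n = len(samples) // samples_per_frame
    blocks = samples[:n * samples_per_frame].reshape(n, samples_per_frame)
    energy = np.sum(blocks.astype(float) ** 2, axis=1)
    return np.diff(energy)
```

Aligning the audio analysis window to the video frame period is what lets the two feature streams be correlated frame by frame.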
0:12:58 | so here you can see uh some of the video samples |
---|
0:13:02 | so these are the three sequences that i used; the first one is uh a monologue by a guy |
---|
0:13:07 | and in the second one there are two guys but only this guy is speaking |
---|
0:13:11 | the other guy moves a bit uh |
---|
0:13:14 | he moves his head or his eyes |
---|
0:13:18 | and the third one is uh |
---|
0:13:21 | non-speech; it includes the bumping sound by the pen on the table |
---|
0:13:27 | and |
---|
0:13:27 | this is the result |
---|
0:13:29 | so the x |
---|
0:13:30 | axis means the |
---|
0:13:31 | simulated asynchrony |
---|
0:13:33 | from zero to plus/minus one |
---|
0:13:36 | thousand milliseconds and the y axis |
---|
0:13:38 | the estimation error |
---|
0:13:40 | in milliseconds |
---|
0:13:42 | and |
---|
0:13:44 | the |
---|
0:13:45 | black bars mean the results from using the mutual information and the white bars mean the |
---|
0:13:50 | results from canonical correlation |
---|
0:13:54 | so if you see the results |
---|
0:13:56 | uh first if you |
---|
0:13:57 | see the results of mutual information |
---|
0:13:59 | normally it is okay but |
---|
0:14:01 | there are |
---|
0:14:01 | some cases where the |
---|
0:14:03 | error is uh not acceptable; for example this case is more than a hundred milliseconds, and this is |
---|
0:14:09 | more than |
---|
0:14:09 | four hundred milliseconds |
---|
0:14:10 | this is out of the |
---|
0:14:12 | uh acceptability threshold |
---|
0:14:15 | and the main uh reason for this |
---|
0:14:19 | was that |
---|
0:14:21 | as i mentioned, in the mutual information we need to set the parameter |
---|
0:14:25 | of the Gaussian width |
---|
0:14:27 | and i tried uh different kinds of variances but |
---|
0:14:30 | this was the best |
---|
0:14:31 | and i couldn't find a single best value, so |
---|
0:14:35 | for some cases the best |
---|
0:14:37 | width is some value but on the other hand for the other cases it's |
---|
0:14:41 | some different value; so that was the main difficulty |
---|
0:14:44 | in using mutual information |
---|
0:14:46 | uh on the other hand if you look at the canonical correlation results uh the error is |
---|
0:14:50 | always uh |
---|
0:14:53 | less than one hundred milliseconds |
---|
0:14:55 | and as i mentioned before the acceptability threshold is around one hundred milliseconds so |
---|
0:15:01 | here we can uh conclude that |
---|
0:15:03 | for these sequences the canonical correlation |
---|
0:15:06 | was uh successful |
---|
0:15:12 | and this figure simply shows how the correlation measure changes according to the |
---|
0:15:19 | the hypothetical |
---|
0:15:20 | time shift |
---|
0:15:21 | uh for this case i used the perfectly synchronized uh signals |
---|
0:15:26 | so the |
---|
0:15:27 | column in the middle |
---|
0:15:28 | is the correct hypothesis |
---|
0:15:31 | while this column, the |
---|
0:15:33 | right side column, shows a wrong |
---|
0:15:35 | hypothesis |
---|
0:15:36 | so this is the shift of uh thirty-one which means around about one second |
---|
0:15:42 | so here you can see that the uh canonical correlation measure |
---|
0:15:47 | is uh |
---|
0:15:48 | larger when they are synchronized |
---|
0:15:51 | in the middle column |
---|
0:15:52 | uh in comparison to |
---|
0:15:54 | the right side |
---|
0:15:55 | so |
---|
0:15:56 | for example |
---|
0:15:58 | zero point seven versus uh zero point eight, and in this case |
---|
0:16:01 | the bottom case, zero point six |
---|
0:16:03 | versus |
---|
0:16:04 | zero point nine |
---|
0:16:06 | and one more thing you can see here is that |
---|
0:16:08 | the black |
---|
0:16:09 | area |
---|
0:16:10 | when i measure the correlation between different tiles and the audio signal |
---|
0:16:16 | i took only the tiles which have the motion inside |
---|
0:16:20 | so |
---|
0:16:21 | when the motion is uh negligible then i didn't do the analysis to |
---|
0:16:26 | save the computation |
---|
0:16:28 | and that is shown as the uh black part |
---|
0:16:31 | so in this case you can see the |
---|
0:16:34 | uh active |
---|
0:16:35 | tiles are quite small in comparison to the whole |
---|
0:16:37 | scene |
---|
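The motion-based tile selection (the black area) can be sketched as a simple threshold on accumulated per-tile motion; the function name and threshold semantics are illustrative assumptions.

```python
import numpy as np

def active_tiles(tile_motion, threshold):
    """tile_motion: (T, Hc, Wc) per-tile motion features for one
    temporal block. Returns a boolean (Hc, Wc) mask of tiles whose
    total motion over the block exceeds the threshold; correlation
    analysis is run only on these, skipping static (black) tiles."""
    return tile_motion.sum(axis=0) > threshold
```

Skipping static tiles saves computation without affecting the maximum, since tiles with no motion cannot be the sound source.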
0:16:42 | uh finally the conclusion |
---|
0:16:44 | so to summarise uh we proposed an automatic synchronization method |
---|
0:16:49 | and i tried |
---|
0:16:50 | uh different |
---|
0:16:52 | correlation measures and found that |
---|
0:16:54 | the quadratic mutual information |
---|
0:16:56 | implementation |
---|
0:16:58 | was uh quite sensitive to the Gaussian parameters |
---|
0:17:01 | whereas the canonical correlation uh showed overall quite robust results |
---|
0:17:09 | and one thing i'd like to mention here is that uh |
---|
0:17:12 | a similar approach was uh also applied to three-D |
---|
0:17:17 | uh stereoscopic video uh synchronization |
---|
0:17:20 | so in stereoscopic video you have two video streams and |
---|
0:17:24 | if they are not synchronized then you see uh double |
---|
0:17:28 | regions, for example here |
---|
0:17:31 | i'm not sure if you can see clearly, but if you see the laptop |
---|
0:17:34 | the lid of the laptop is |
---|
0:17:35 | uh |
---|
0:17:36 | viewed twice |
---|
0:17:37 | one is here and uh |
---|
0:17:38 | the other one is here |
---|
0:17:40 | so this |
---|
0:17:41 | synchronization problem may also arise in this case |
---|
0:17:44 | and we applied a similar technique and uh could solve the problem, and this was uh |
---|
0:17:50 | presented uh |
---|
0:17:52 | the previous year at uh IC |
---|
0:17:56 | uh finally the future work is uh |
---|
0:17:59 | we want to test the method uh more on uh diverse contents because we used only three contents |
---|
0:18:04 | here |
---|
0:18:05 | and also we'd like to continue studying uh this synchronization problem in different uh |
---|
0:18:10 | media like mobile uh HDTV or three-D |
---|
0:18:14 | video |
---|
0:18:14 | thank you |
---|
0:18:20 | we have time uh just for one question |
---|
0:18:22 | questions? |
---|
0:18:24 | uh |
---|
0:18:31 | i think uh, if i remember correctly, i think it is only for speech |
---|
0:18:38 | uh |
---|
0:18:38 | for |
---|
0:18:39 | so i think they |
---|
0:18:41 | tried to find the |
---|
0:18:43 | lip area first |
---|
0:18:45 | and then they used some lip-specific features |
---|
0:18:48 | to recover the synchronisation; that's what i remember |
---|
0:18:52 | but in |
---|
0:18:52 | this case the difference is i didn't do that |
---|
0:18:56 | i think uh that's it |
---|
0:18:58 | thank you; uh we move to uh the |
---|
0:19:00 | second |
---|
0:19:02 | paper |
---|