0:00:13 | the paper on model-based compressive sensing for multiparty speech recognition |
---|
0:00:17 | this is joint work with Hervé Bourlard and Volkan Cevher |
---|
0:00:22 | what we focus on is the problem of competing speech sources, which is a |
---|
0:00:27 | common and one of the most challenging issues in many speech applications |
---|
0:00:32 | and the goal of our work is basically to perform speech separation prior to recognition |
---|
0:00:40 | the scenario that we are considering is the case where the number of microphones is less than the number of |
---|
0:00:46 | sources, so it is actually an underdetermined speech separation problem |
---|
0:00:50 | but it goes further: |
---|
0:00:52 | the number of measurements is even less than the number of unknown |
---|
0:00:56 | sources |
---|
0:00:58 | and sparse component analysis is one of the most promising approaches to deal with this |
---|
0:01:03 | problem |
---|
0:01:05 | the idea is that we cast the underdetermined |
---|
0:01:07 | speech separation problem as sparse recovery, where we leverage compressive sensing theory to solve it |
---|
0:01:15 | in other words, we |
---|
0:01:17 | propose to integrate sparse component analysis into the front-end processing of speech recognition systems |
---|
0:01:24 | to provide some complementary processing to make them robust |
---|
0:01:28 | to overlapping speech |
---|
0:01:32 | i will begin with a |
---|
0:01:33 | very brief introduction on compressive sensing |
---|
0:01:37 | to help put the work |
---|
0:01:39 | into context, and then i will explain the details of our method, which is blind source separation via |
---|
0:01:44 | model-based compressive |
---|
0:01:46 | sensing |
---|
0:01:47 | then i will present the experimental setup and speech recognition results, and the concluding |
---|
0:01:53 | remarks |
---|
0:01:55 | compressive sensing, in a nutshell, is sensing via dimensionality reduction |
---|
0:02:00 | the idea is that when |
---|
0:02:02 | a signal, |
---|
0:02:07 | say a sparse signal x, |
---|
0:02:09 | is high-dimensional, the dimensionality of the signal is somewhat misleading, because the true |
---|
0:02:15 | information content lies in only very few of the coordinates of |
---|
0:02:20 | such a signal |
---|
0:02:22 | the information content of a sparse signal like this could be preserved |
---|
0:02:28 | by a kind of dimensionality-reducing measurement, which we denote here by Phi, |
---|
0:02:33 | and thus be captured |
---|
0:02:35 | with very few measurements y |
---|
0:02:40 | so in any |
---|
0:02:42 | case where this kind of dimensionality reduction happens naturally |
---|
0:02:46 | we can leverage compressive sensing theory in these cases to |
---|
0:02:51 | recover the signal |
---|
0:02:55 | compressive sensing theory |
---|
0:02:58 | relies on three ingredients; first of all is a sparse representation: we have to come up with a representation of |
---|
0:03:04 | the signal |
---|
0:03:05 | which is sparse, meaning that very few of the coefficients carry most of the energy |
---|
0:03:10 | of the signal |
---|
0:03:11 | from a geometric perspective, if the signal lives in R^N, |
---|
0:03:15 | in fact most of the space is |
---|
0:03:18 | empty |
---|
0:03:19 | and the sparse signals lie only on hyperplanes |
---|
0:03:23 | aligned with the coordinate axes |
---|
0:03:25 | for a signal like this, |
---|
0:03:27 | the information content could be captured with very few measurements |
---|
0:03:32 | the second ingredient is that the measurement matrix Phi provide an isometry for the sparse |
---|
0:03:37 | vectors x, |
---|
0:03:39 | meaning that it preserves |
---|
0:03:41 | the information: the pairwise distances between sparse vectors are preserved in our observation domain |
---|
0:03:51 | given these two key ingredients, a sparse representation and an isometric measurement, |
---|
0:03:56 | compressive sensing guarantees recovery of the unknown directly, and provably, by searching for the sparsest |
---|
0:04:03 | solution which matches the observations |
---|
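the recovery principle above, searching for the sparsest solution that matches the observations, can be sketched numerically; the greedy solver (orthogonal matching pursuit), the Gaussian matrix, and all dimensions below are illustrative assumptions, not the method of the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 256, 64, 5          # ambient dimension, measurements, sparsity

# a k-sparse signal and a random Gaussian measurement matrix
x = np.zeros(N)
x[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x                    # far fewer observations than unknowns

# orthogonal matching pursuit: greedily pick the column most correlated
# with the residual, then re-fit the coefficients on the chosen support
support, residual = [], y.copy()
for _ in range(k):
    support.append(int(np.argmax(np.abs(Phi.T @ residual))))
    sub = Phi[:, support]
    coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
    residual = y - sub @ coef

x_hat = np.zeros(N)
x_hat[support] = coef
print(np.linalg.norm(x - x_hat))   # reconstruction error
```

with 64 measurements of a 256-dimensional but 5-sparse signal, the greedy search typically identifies the support exactly, illustrating why sparsity makes the underdetermined system solvable.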
0:04:06 | but in practice we don't have an exactly sparse representation in most of the cases; |
---|
0:04:12 | however, for many natural signals such as images and speech, a kind of sparse representation, |
---|
0:04:18 | which we call compressible, could be obtained |
---|
0:04:21 | via some |
---|
0:04:22 | transformation |
---|
0:04:23 | in the case of speech, such a transformation is |
---|
0:04:26 | in fact the Gabor |
---|
0:04:27 | projection |
---|
0:04:28 | it |
---|
0:04:29 | is a kind of spectral transform of the speech, as has been illustrated here, and you see that very |
---|
0:04:34 | few of the coefficients of the spectrographic representation |
---|
0:04:38 | have large values |
---|
0:04:41 | and |
---|
0:04:44 | if we sort the coefficients of the signal, the sorted coefficients show a very |
---|
0:04:50 | rapid decay |
---|
0:04:52 | which follows a power law |
---|
0:04:55 | a signal like this would be |
---|
0:04:58 | called compressible, and it could be handled in our framework of compressive |
---|
0:05:04 | sensing |
---|
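the power-law decay of the sorted coefficients can be illustrated with a toy harmonic signal standing in for a voiced speech frame; the sampling rate, partials, and noise level are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 8000
t = np.arange(fs) / fs          # one second of signal

# stand-in for a voiced speech frame: a few harmonics plus weak noise
sig = sum(a * np.sin(2 * np.pi * f * t)
          for a, f in [(1.0, 200), (0.6, 400), (0.3, 600)])
sig = sig + 0.01 * rng.standard_normal(len(t))

# sorted spectral magnitudes: the rapid decay is what "compressible" means
mags = np.sort(np.abs(np.fft.rfft(sig)))[::-1]
energy = np.cumsum(mags ** 2) / np.sum(mags ** 2)
top = int(np.searchsorted(energy, 0.99)) + 1
print(top, "of", len(mags), "coefficients hold 99% of the energy")
```

a handful of coefficients out of thousands carry essentially all of the energy, which is the compressibility that makes the compressive sensing framework applicable.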
0:05:05 | moreover, |
---|
0:05:06 | you can even see in the spectrographic map that there is an underlying structure in the sparse coefficients; for |
---|
0:05:13 | instance, here you see that most of the large coefficients are sort of clustered together |
---|
0:05:19 | we could further leverage the structure underlying the coefficients to improve the recovery performance |
---|
0:05:26 | and to further reduce the number of required observations |
---|
0:05:31 | after this very brief introduction to compressive sensing, i will explain the details of our method: |
---|
0:05:37 | blind source separation via model-based sparse recovery, which from now on i will just call |
---|
0:05:43 | BSS-MSR |
---|
0:05:47 | in fact we drew a lot of inspiration from the very rich literature in the context of sparse |
---|
0:05:52 | component analysis |
---|
0:05:53 | and i have listed very few of |
---|
0:05:57 | the papers here, |
---|
0:06:00 | but in fact the list is much longer |
---|
0:06:03 | mostly the papers of Özgür Yılmaz and Scott Rickard were very much inspiring for us |
---|
0:06:09 | to have the intuition that |
---|
0:06:12 | sparse component analysis could help speech recognition systems in overlapping conditions |
---|
0:06:18 | many ideas of sparse component analysis and its spatial cues have been used for blind |
---|
0:06:23 | recovery of the signals |
---|
0:06:26 | and in this context, the work of Volkan Cevher and colleagues |
---|
0:06:32 | published at IPSN |
---|
0:06:33 | motivated us; they had formulated |
---|
0:06:40 | source localization as a sparse recovery problem |
---|
0:06:42 | finally, our BSS-MSR is nothing other than a sparse component analysis |
---|
0:06:49 | framework |
---|
0:06:50 | which provides joint |
---|
0:06:52 | source localization and separation |
---|
0:06:55 | what is new in BSS-MSR is that we |
---|
0:06:59 | exploit the model underlying the sparse coefficients, we deal with convolutive mixtures, |
---|
0:07:04 | and we use an efficient and accurate recovery algorithm |
---|
0:07:10 | to apply compressive sensing, the first thing we need is to come up with a |
---|
0:07:14 | kind of sparse representation of the unknown signal that we desire to recover |
---|
0:07:21 | the idea |
---|
0:07:22 | here is that we discretize the planar area of the room into a grid of G |
---|
0:07:27 | cells |
---|
0:07:27 | for this characterization we |
---|
0:07:30 | assume that |
---|
0:07:31 | the grid is dense enough that each of the speakers occupies an exclusive |
---|
0:07:35 | grid cell |
---|
0:07:36 | so if three |
---|
0:07:37 | of the speakers are competing |
---|
0:07:40 | only three of the grid cells are active and all the rest have absolutely no |
---|
0:07:46 | energy |
---|
0:07:47 | that is the kind of spatial sparse representation that we obtain for simultaneous speech sources |
---|
0:07:54 | to exploit the spectral sparsity we use |
---|
0:07:58 | the short-time fourier transform |
---|
0:08:00 | as a spectro-temporal representation |
---|
0:08:03 | now we entangle these two representations, the spatial and the spectral, together, and this induces |
---|
0:08:08 | the spatial-spectral representation |
---|
0:08:11 | of our unknown |
---|
0:08:13 | we denote it X here; each component of it is in fact the signal coming |
---|
0:08:17 | from each grid cell of the room |
---|
0:08:19 | and inside are the spectral components |
---|
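a minimal sketch of this spatial-spectral representation, assuming a hypothetical 5-by-5 grid, 257 STFT bins per frame, and made-up speaker positions: each grid cell owns one block of spectral coefficients, and only occupied cells are nonzero.

```python
import numpy as np

# room floor discretized into a grid; each cell may host one speaker
grid_x, grid_y, n_freq = 5, 5, 257        # 25 cells, STFT bins per frame
G = grid_x * grid_y

# spatial sparsity: only the cells occupied by speakers carry energy
active_cells = [6, 18, 23]                # hypothetical speaker positions

# spatial-spectral unknown X: one block of STFT coefficients per grid cell
rng = np.random.default_rng(2)
X = np.zeros((G, n_freq), dtype=complex)
for c in active_cells:
    X[c] = rng.standard_normal(n_freq) + 1j * rng.standard_normal(n_freq)

x = X.reshape(-1)        # stacked vector handed to the sparse recovery
print(np.count_nonzero(np.abs(X).sum(axis=1)), "of", G, "cells active")
```

the stacked vector is block-sparse by construction: three competing speakers activate three whole blocks out of twenty-five, which is exactly the structure the model-based recovery later exploits.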
0:08:27 | the second ingredient is the measurement; the recent work of Carin and colleagues, |
---|
0:08:37 | which recognized a kind of natural manifestation of compressive sensing measurements through the Green's function projection, |
---|
0:08:46 | inspired us to model our measurement matrix using the image model |
---|
0:08:53 | a technique which had already been proposed by Allen |
---|
0:08:56 | and Berkley |
---|
0:08:57 | and the idea of the image model is that when the room has four walls and |
---|
0:09:02 | i'm speaking here |
---|
0:09:04 | it is not |
---|
0:09:05 | only me, but all of my images with respect to all these walls that speak together |
---|
0:09:10 | and we could model this with the Green's function, with this particular form in the frequency domain, in which |
---|
0:09:17 | each component has been attenuated with respect to the distance of the image to the |
---|
0:09:22 | corresponding sensor |
---|
0:09:23 | and has been delayed |
---|
0:09:24 | according to the distance between the sensor and the speaker |
---|
0:09:26 | so |
---|
0:09:28 | using this model we could find the projection corresponding to each sensor measurement, for each |
---|
0:09:35 | cell of the grid in the room |
---|
0:09:36 | and now we pile up all these projections and construct our measurement |
---|
0:09:40 | matrix Phi |
---|
0:09:42 | whose dimension is the number of microphones |
---|
0:09:44 | by the number of grid cells |
---|
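a hedged sketch of such a measurement matrix, using only the free-field (direct-path) Green's function; a full image-model projection would add one attenuated and delayed term per wall image, and every position and frequency below is invented for illustration:

```python
import numpy as np

C_SOUND = 343.0   # speed of sound in air (m/s)

def greens_projection(mic, src, freqs):
    """Free-field frequency-domain Green's function from a source point
    to a microphone: a 1/d attenuation and an exp(-j*2*pi*f*d/c) delay.
    An image-model version would sum one such term per wall image."""
    d = np.linalg.norm(np.asarray(mic) - np.asarray(src))
    return np.exp(-2j * np.pi * np.asarray(freqs) * d / C_SOUND) / d

# measurement matrix Phi for one frequency bin: microphones x grid cells
freqs = [500.0]                                   # single bin, for brevity
mics = [(0.2, 0.1), (0.35, 0.1)]                  # hypothetical array
cells = [(ix + 0.5, iy + 0.5) for ix in range(3) for iy in range(3)]
Phi = np.array([[greens_projection(m, g, freqs)[0] for g in cells]
                for m in mics])
print(Phi.shape)   # one row per microphone, one column per grid cell
```

piling up one such row per microphone (and one such matrix per frequency bin) yields the fat measurement matrix, with far fewer rows than grid cells, that the sparse recovery inverts.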
0:09:47 | now |
---|
0:09:49 | introducing the sparse representation X, which is our unknown, we have |
---|
0:09:54 | the observations of the microphones, which we call |
---|
0:09:58 | y |
---|
0:09:59 | we suppose that we have M microphones, and we have the measurement matrix built with the image model |
---|
0:10:05 | and our goal is to recover X |
---|
0:10:08 | from very few measurements |
---|
0:10:10 | y |
---|
0:10:12 | the challenge here is that |
---|
0:10:13 | Phi has a nontrivial null space |
---|
0:10:17 | and any source component coming from the null space would give the same |
---|
0:10:21 | observation |
---|
0:10:23 | so according to linear algebra such a system doesn't have a unique solution |
---|
0:10:27 | however, the solution would be to seek the sparsest solution, |
---|
0:10:31 | and that is what |
---|
0:10:32 | sparse recovery helps us with: it gives us enough information to overcome the ill-posedness |
---|
0:10:39 | of |
---|
0:10:40 | our inverse problem |
---|
0:10:45 | what we do here is that we use a sparse recovery algorithm |
---|
0:10:48 | that was presented in a |
---|
0:10:51 | session |
---|
0:10:51 | yesterday |
---|
0:10:53 | on learning low-dimensional signal models; the algorithm in fact belongs |
---|
0:11:00 | to the family of iterative hard thresholding methods |
---|
0:11:03 | and the idea is that, since searching |
---|
0:11:05 | the solution space for the sparsest solution is NP-hard and a combinatorial |
---|
0:11:12 | problem |
---|
0:11:14 | the iterative hard thresholding approach approximates, in an iterative |
---|
0:11:19 | manner, |
---|
0:11:20 | the sparse solution by keeping only the |
---|
0:11:23 | largest-value coefficients |
---|
0:11:25 | and discarding the rest |
---|
0:11:28 | and this has been done in a |
---|
0:11:31 | model-based way: what we did is that |
---|
0:11:34 | we kept only the blocks with the largest |
---|
0:11:38 | energy and discarded the rest of the blocks |
---|
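the iterative hard thresholding idea, and its model-based block variant, can be sketched as follows; this is a generic textbook loop under illustrative dimensions, not the speakers' exact algorithm:

```python
import numpy as np

def hard_threshold(z, k):
    """Keep the k largest-magnitude entries of z, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

def block_hard_threshold(z, k, block):
    """Model-based variant: keep the k blocks with the largest energy,
    discard the remaining blocks whole (clustered-coefficient model)."""
    out = np.zeros_like(z)
    zb = z.reshape(-1, block)
    keep = np.argsort((np.abs(zb) ** 2).sum(axis=1))[-k:]
    out.reshape(-1, block)[keep] = zb[keep]
    return out

def iht(y, Phi, k, threshold=hard_threshold, iters=300):
    """Iterative hard thresholding: a gradient step on ||y - Phi x||^2
    followed by projection onto the (structured) sparsity model."""
    x = np.zeros(Phi.shape[1])
    for _ in range(iters):
        x = threshold(x + Phi.T @ (y - Phi @ x), k)
    return x

# toy demo: recover a 4-sparse vector from 64 random measurements
rng = np.random.default_rng(3)
N, M, k = 128, 64, 4
x_true = np.zeros(N)
x_true[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
Phi = rng.standard_normal((M, N))
Phi /= np.linalg.norm(Phi, 2)   # spectral norm 1 keeps the iteration stable
y = Phi @ x_true
x_hat = iht(y, Phi, k)
print(np.linalg.norm(x_hat - x_true))   # reconstruction error
```

a block-structured recovery would swap in the block projector, e.g. `iht(y, Phi, k, threshold=lambda z, kk: block_hard_threshold(z, kk, block=4))`, so that clustered coefficients are kept or discarded together, mirroring the block selection described in the talk.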
0:11:48 | and now i move on to our experiments and results |
---|
0:11:54 | for the speech corpus we used Aurora 2, which was not overlapping, but we overlapped it with |
---|
0:11:59 | interferences selected randomly from the same corpus |
---|
0:12:02 | we discretized the |
---|
0:12:04 | planar area of the room into fifty-by-fifty-centimeter grid cells |
---|
0:12:08 | and the reverberation time was two hundred milliseconds |
---|
0:12:12 | in this scenario we tested our method when two or |
---|
0:12:15 | three competing speakers were active: |
---|
0:12:18 | our target speech, |
---|
0:12:22 | with interferences one and two active, and in the second scenario |
---|
0:12:26 | interferences three and four |
---|
0:12:28 | were also |
---|
0:12:30 | active |
---|
0:12:34 | the results in the case of stereo recording and separation, |
---|
0:12:40 | when three sources are competing, |
---|
0:12:43 | are the following: Aurora 2 is a digit recognition task which provides training in two |
---|
0:12:48 | conditions; in one of them |
---|
0:12:50 | the HMM back-end has been trained only using clean |
---|
0:12:53 | utterances, and the other one uses multi-condition, or noisy, utterances to train the |
---|
0:12:58 | acoustic model |
---|
0:12:59 | and the baseline on overlapping speech in the clean condition is fifty-nine percent, sixty-one percent with multi-condition |
---|
0:13:05 | training; after our |
---|
0:13:07 | separation we performed speech recognition, and we could |
---|
0:13:11 | achieve up to ninety-two percent with multi-condition training |
---|
0:13:14 | so |
---|
0:13:16 | about eighty percent relative improvement has been achieved |
---|
0:13:22 | then in the second scenario five sources were active |
---|
0:13:25 | and we tested |
---|
0:13:28 | that case too |
---|
0:13:29 | one appealing |
---|
0:13:30 | aspect of this work is that we are very much free in the choice of the |
---|
0:13:34 | microphone |
---|
0:13:37 | array geometry that we could use |
---|
0:13:40 | so in two cases: in one we used only two microphones, and in the second we used |
---|
0:13:45 | only four microphones |
---|
0:13:47 | and we separated the speech and then performed speech recognition, and the |
---|
0:13:52 | word accuracy rates are provided |
---|
0:13:54 | in the table |
---|
0:13:56 | it goes up to ninety-four percent if four microphones have been used for the |
---|
0:14:00 | source separation |
---|
0:14:01 | and the relative improvement would be up to eighty-five percent |
---|
0:14:11 | to conclude, the message that i would like you to take away is that the information-bearing components for |
---|
0:14:16 | speech recognition are indeed sparse, and |
---|
0:14:18 | the work presented here gives some compelling evidence that |
---|
0:14:22 | sparse component analysis is a potential approach to deal with the problem of overlapping speech in |
---|
0:14:27 | realistic applications of speech recognition |
---|
0:14:30 | and |
---|
0:14:31 | moreover, we used a kind of model-based sparse recovery, and we showed |
---|
0:14:36 | that we could go beyond simple sparsity |
---|
0:14:39 | [question from the audience, largely inaudible: |
---|
0:14:41 | whether the separated audio could be reconstructed |
---|
0:14:44 | and listened to] |
---|
0:14:55 | yeah |
---|
0:14:56 | we reconstructed the audio, so we could also have some |
---|
0:15:00 | kind of quantitative evaluation using measures like SIR |
---|
0:15:05 | or other |
---|
0:15:06 | measures which have been proposed for |
---|
0:15:08 | source separation |
---|
0:15:09 | but the thing is, |
---|
0:15:11 | our goal was finally |
---|
0:15:12 | speech recognition, and we judged |
---|
0:15:16 | that the speech recognition results give the best final |
---|
0:15:20 | evaluation of how the system would work for speech recognition |
---|
0:15:26 | [question from the audience, largely inaudible, |
---|
0:15:38 | apparently about how well the source separation works |
---|
0:16:02 | and what the separated sources sound like] |
---|
0:16:04 | subjectively, we have |
---|
0:16:05 | listened to some |
---|
0:16:08 | samples, and there are some |
---|
0:16:10 | cases of overlapping in which you can still hear the other speaker in the background |
---|
0:16:16 | but not like the kind of musical noise that we expect from |
---|
0:16:20 | binary |
---|
0:16:20 | masking |
---|
0:16:22 | because the sparse recovery could in some sense be |
---|
0:16:26 | looked at as a kind of soft masking |
---|
0:16:28 | that avoids that kind of artifact |
---|
0:16:33 | [question from the audience, largely inaudible, |
---|
0:16:49 | apparently about the conditioning of the measurement matrix] |
---|
0:16:50 | and the measurement matrix, it depends on the richness |
---|
0:16:55 | of the environment, the inter-element spacing, many factors that have been considered in great detail |
---|
0:17:02 | in the Carin paper |
---|
0:17:05 | but in our case the RIP was in fact |
---|
0:17:08 | satisfied, or else we applied some kind of preconditioning |
---|
0:17:12 | by orthogonalization; the details are in the paper |
---|
0:17:19 | but in theory, |
---|
0:17:21 | for instance for |
---|
0:17:24 | very specific acoustic conditions, |
---|
0:17:28 | we could still show that the RIP also holds |
---|