0:00:13 | i |
---|
0:00:14 | yes |
---|
0:00:15 | factorization |
---|
0:00:18 | i |
---|
0:00:18 | for a task |
---|
0:00:20 | i |
---|
0:00:21 | paper have proposed a |
---|
0:00:23 | channel well |
---|
0:00:24 | and she |
---|
0:00:27 | i |
---|
0:00:29 | Good morning everyone, and welcome to my talk. |
---|
0:00:34 | I am very pleased to present our recent work, |
---|
0:00:38 | titled "Speaker and Noise Factorisation on the AURORA4 Task". |
---|
0:00:43 | This is joint work with my supervisor, |
---|
0:00:48 | Mark Gales. |
---|
0:00:49 | and |
---|
0:00:50 | So here is the outline of the talk. |
---|
0:00:52 | First, |
---|
0:00:54 | I will say something about the |
---|
0:00:57 | model-based approach for robust |
---|
0:01:00 | speech recognition. |
---|
0:01:02 | There are a lot of schemes that have been |
---|
0:01:07 | developed over the years that handle |
---|
0:01:09 | a specific acoustic |
---|
0:01:11 | distortion, including speaker adaptation and noise robustness. |
---|
0:01:16 | Then we will discuss the options we have to handle multiple acoustic factors. |
---|
0:01:22 | In this context, the concept of acoustic factorisation is introduced, |
---|
0:01:26 | and as an example |
---|
0:01:29 | we derive a new adaptation scheme, |
---|
0:01:31 | which we call Joint, that handles speaker and noise distortions. |
---|
0:01:36 | And then |
---|
0:01:37 | experiments and conclusions. |
---|
0:01:39 | So first we start from the acoustic environment. As we all know, the speech signal can be influenced by |
---|
0:01:45 | several factors. |
---|
0:01:46 | As shown in this diagram, we have speaker differences, |
---|
0:01:50 | channel mismatch, |
---|
0:01:52 | and also some sort of background noise and room reverberation. |
---|
0:01:57 | All of these factors can affect the speech and introduce unwanted variations |
---|
0:02:03 | that degrade the speech signals. |
---|
0:02:07 | So this makes |
---|
0:02:09 | robust speech recognition a very challenging task. |
---|
0:02:12 | and |
---|
0:02:13 | So in this work we consider using the model-based approach |
---|
0:02:17 | to handle |
---|
0:02:18 | multiple acoustic factors. |
---|
0:02:21 | so |
---|
0:02:22 | In this framework we have a |
---|
0:02:25 | canonical acoustic model, |
---|
0:02:26 | a canonical model able to model the desired variations, |
---|
0:02:30 | and we use a set of transforms |
---|
0:02:35 | to adapt the canonical model to different acoustic conditions. |
---|
0:02:39 | and |
---|
0:02:41 | Over the years, different transforms |
---|
0:02:43 | have been developed to handle specific |
---|
0:02:46 | single acoustic factors, including speaker adaptation and noise compensation schemes. |
---|
0:02:52 | and |
---|
0:02:53 | so yeah |
---|
0:02:54 | How to combine |
---|
0:02:57 | these transforms to handle multiple acoustic factors |
---|
0:03:00 | in an effective and efficient way is the central topic of this talk. |
---|
0:03:06 | oh |
---|
0:03:07 | Let's first look at speaker adaptation. We all know that linear transforms can be |
---|
0:03:12 | used to |
---|
0:03:13 | adapt acoustic models. |
---|
0:03:15 | This is the MLLR mean transform. |
---|
0:03:18 | This linear transform is very simple but very effective in practice. |
---|
0:03:23 | But there is a limitation of this linear transform, which is that |
---|
0:03:28 | we have a relatively large number of parameters to estimate, so we cannot do |
---|
0:03:33 | robust estimation on a single utterance. |
---|
0:03:36 | this |
---|
0:03:36 | So this linear transform |
---|
0:03:39 | is not suitable for very rapid adaptation. |
---|
0:03:43 | Another interesting point to make is that |
---|
0:03:47 | this linear transform was originally designed for speaker adaptation, but, |
---|
0:03:53 | since it is a general linear transform, |
---|
0:03:56 | it can also be extended to environment adaptation. |
---|
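A minimal sketch of the linear mean transform described above, assuming the usual affine form in which each Gaussian mean `mu` is mapped to `A @ mu + b`. The function names, the toy values and the 39-dimensional feature size used for the parameter count are illustrative assumptions, not taken from the talk.

```python
def mllr_adapt_mean(A, b, mu):
    """Apply an affine (MLLR-style) mean transform: A @ mu + b."""
    return [sum(a * m for a, m in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]


def full_transform_param_count(dim):
    """Free parameters in one full matrix A (dim x dim) plus bias b (dim)."""
    return dim * dim + dim


# Hypothetical 2-dimensional Gaussian mean and speaker transform.
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]
mu = [2.0, 1.0]

adapted = mllr_adapt_mean(A, b, mu)

# Assumed 39-dimensional features: 39 * 39 + 39 parameters per transform.
n_params = full_transform_param_count(39)
```

The parameter count illustrates why the speaker says a single utterance is not enough for robust estimation of a full transform.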
0:04:01 | Next we look at noise compensation schemes. |
---|
0:04:05 | Normally a mismatch function is defined for the impact of the environment. |
---|
0:04:10 | This |
---|
0:04:11 | first equation is the nonlinear |
---|
0:04:15 | uh |
---|
0:04:15 | mismatch function, which |
---|
0:04:18 | describes how the channel distortion and additive noise affect the clean speech. |
---|
0:04:24 | Based on this mismatch function, |
---|
0:04:26 | the model-based approach |
---|
0:04:29 | modifies the acoustic models |
---|
0:04:31 | to better represent the noisy speech distributions. |
---|
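The slide equation is not captured in the transcript. As a hedged sketch, the mismatch function commonly used with model-based noise compensation can be written per log-spectral channel as y = x + h + log(1 + exp(n - x - h)), where x is clean speech, h the channel distortion and n the additive noise. The cepstral-domain version adds a DCT matrix, which is omitted here; the function name and the toy values are assumptions.

```python
import math


def mismatch(x, h, n):
    """Noisy log-spectrum y from clean speech x, channel h and additive noise n.

    Per channel: y = x + h + log(1 + exp(n - x - h)).
    """
    return [xi + hi + math.log1p(math.exp(ni - xi - hi))
            for xi, hi, ni in zip(x, h, n)]


# Hypothetical log-spectral values for three filter-bank channels.
x = [5.0, 3.0, 1.0]      # clean speech
h = [0.2, 0.2, 0.2]      # channel (convolutional) distortion
n = [-20.0, 3.2, 10.0]   # additive noise: negligible, equal, dominant

y = mismatch(x, h, n)
```

When the noise is negligible the output is close to x + h; when the noise dominates the output is close to n, which is the nonlinearity the speaker refers to.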
0:04:35 | VTS is used here. |
---|
0:04:38 | uh |
---|
0:04:39 | The second equation |
---|
0:04:40 | shows how we can adapt acoustic models using VTS-based |
---|
0:04:46 | model |
---|
0:04:47 | compensation schemes. |
---|
0:04:49 | You can see from the equations that, |
---|
0:04:52 | due to the use of the |
---|
0:04:54 | mismatch function, this |
---|
0:04:56 | transform is highly constrained and nonlinear, |
---|
0:04:59 | so |
---|
0:05:00 | there is a relatively small |
---|
0:05:03 | number of parameters to estimate. |
---|
0:05:06 | So we can do very rapid adaptation, since |
---|
0:05:10 | the noise transform can be estimated from a single utterance. |
---|
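A minimal sketch of first-order VTS model compensation for one diagonal Gaussian, under the same simplifying assumption of a log-spectral domain mismatch function with no DCT. The mean is passed through the mismatch function and the variance is combined using the Jacobian at the expansion point. The function name and all values are hypothetical; only the channel mean, noise mean and noise variance would be estimated per condition, which is why the speaker describes the parameter count as small.

```python
import math


def vts_compensate(mu_x, var_x, mu_h, mu_n, var_n):
    """First-order VTS compensation of one diagonal Gaussian (log-spectral domain)."""
    mu_y, var_y = [], []
    for mx, vx, mh, mn, vn in zip(mu_x, var_x, mu_h, mu_n, var_n):
        g = math.exp(mn - mx - mh)
        j = 1.0 / (1.0 + g)              # dy/dx at the expansion point
        mu_y.append(mx + mh + math.log1p(g))
        var_y.append(j * j * vx + (1.0 - j) * (1.0 - j) * vn)
    return mu_y, var_y


# Hypothetical clean-speech Gaussian and noise model (two channels).
mu_x, var_x = [4.0, 1.0], [1.0, 1.0]
mu_h = [0.0, 0.0]
mu_n, var_n = [-30.0, 50.0], [0.25, 0.25]

mu_y, var_y = vts_compensate(mu_x, var_x, mu_h, mu_n, var_n)
```

In the first channel the noise is negligible, so the Gaussian stays close to the clean one; in the second the noise dominates, so the compensated Gaussian moves to the noise statistics.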
0:05:14 | so |
---|
0:05:15 | Having talked about speaker adaptation and |
---|
0:05:21 | noise |
---|
0:05:21 | compensation schemes, |
---|
0:05:22 | how do we combine them? |
---|
0:05:24 | In practice we have a very simple, very |
---|
0:05:27 | straightforward combination scheme; we call this |
---|
0:05:32 | VTS-MLLR combination. |
---|
0:05:33 | The equation here describes how we do this: |
---|
0:05:39 | we |
---|
0:05:41 | first adapt the acoustic models using VTS transforms, |
---|
0:05:45 | and following that we apply a linear transform to reduce the residual mismatch. |
---|
0:05:49 | and this |
---|
0:05:51 | The diagram shows |
---|
0:05:54 | how |
---|
0:05:55 | we do this: |
---|
0:05:56 | given an acoustic condition, we estimate both the noise transform and the speaker transform |
---|
0:06:01 | and update the acoustic models. |
---|
0:06:02 | If either the speaker or the noise changes, both transforms need to be re-estimated. |
---|
0:06:08 | So |
---|
0:06:10 | we can see that the limitation is obvious: |
---|
0:06:15 | the linear transform has to be estimated |
---|
0:06:17 | on a block of data, so this |
---|
0:06:19 | kind of combination cannot do very rapid adaptation, since |
---|
0:06:24 | it requires a block of data. |
---|
0:06:28 | Alternatively, we can do this in another way, which we call |
---|
0:06:32 | acoustic factorisation. |
---|
0:06:34 | Here |
---|
0:06:36 | we decompose the transform |
---|
0:06:38 | and constrain each transform to model |
---|
0:06:42 | a specific acoustic factor. |
---|
0:06:45 | In this case we have a speaker transform and a noise transform, |
---|
0:06:48 | which are orthogonal to each other. |
---|
0:06:50 | This gives us some freedom |
---|
0:06:53 | in using these transforms. For example, if we know that the same speaker is |
---|
0:06:58 | speaking |
---|
0:06:59 | in changing noise conditions, |
---|
0:07:01 | and we want to |
---|
0:07:03 | update the noise condition from environment one to environment two, we just |
---|
0:07:08 | keep the speaker transform, as we know the speaker has not changed, |
---|
0:07:11 | and we just do the noise adaptation. |
---|
0:07:17 | and |
---|
0:07:17 | In a similar way, |
---|
0:07:19 | uh |
---|
0:07:20 | if the environment has not changed |
---|
0:07:22 | but the speaker has |
---|
0:07:23 | changed, |
---|
0:07:24 | changed to another speaker, we can |
---|
0:07:27 | update |
---|
0:07:28 | the speaker transform without |
---|
0:07:29 | updating |
---|
0:07:30 | the noise transform. |
---|
0:07:33 | So this |
---|
0:07:34 | kind of |
---|
0:07:34 | acoustic factorisation |
---|
0:07:37 | allows the speaker transform to be used in a range of noise conditions, |
---|
0:07:41 | and similarly for the noise transform. |
---|
0:07:43 | One issue with this approach is that, |
---|
0:07:47 | although we use the transforms in a factorised fashion, |
---|
0:07:52 | to estimate the transforms we need to jointly estimate both the speaker and noise transforms, since |
---|
0:07:58 | the data are affected |
---|
0:08:01 | by |
---|
0:08:03 | uh |
---|
0:08:04 | they are affected by the two |
---|
0:08:06 | acoustic factors simultaneously. |
---|
0:08:10 | Based on this concept, we derive a new adaptation scheme, which we call Joint adaptation. |
---|
0:08:16 | The diagram on the right-hand side shows how we manipulate the transforms. |
---|
0:08:21 | In contrast to the previous scheme, VTS-MLLR, |
---|
0:08:27 | this approach |
---|
0:08:28 | uses a reversed order of transforms: we apply the linear transform first |
---|
0:08:34 | and then transform |
---|
0:08:36 | uh |
---|
0:08:37 | the clean speech to noisy speech. By doing so, |
---|
0:08:41 | the |
---|
0:08:42 | linear transform is acting on the clean speech, |
---|
0:08:46 | that is, |
---|
0:08:47 | the speaker-independent clean speech, and the VTS transform is acting on the |
---|
0:08:54 | speaker-dependent |
---|
0:08:56 | speech distribution. |
---|
0:08:57 | This |
---|
0:08:58 | uh |
---|
0:08:59 | order matches |
---|
0:09:00 | the properties we |
---|
0:09:02 | expect in speaker adaptation and |
---|
0:09:05 | noise compensation, so we expect these |
---|
0:09:08 | two transforms |
---|
0:09:10 | to be |
---|
0:09:12 | in some sense factorised, orthogonal to each other, so we can apply |
---|
0:09:17 | them independently; |
---|
0:09:18 | we can |
---|
0:09:19 | reuse them. |
---|
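The difference between the two orderings described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the noise transform is reduced to the log-spectral mismatch function with the channel term omitted, only means are adapted, and all names and values are hypothetical.

```python
import math


def affine(A, b, mu):
    """Speaker (linear) transform of a mean vector: A @ mu + b."""
    return [sum(a * m for a, m in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]


def add_noise(mu, mu_n):
    """Noise transform of a mean (log-spectral mismatch, channel term omitted)."""
    return [m + math.log1p(math.exp(n - m)) for m, n in zip(mu, mu_n)]


def vts_mllr_mean(mu_x, A, b, mu_n):
    """VTS-MLLR order: noise compensation first, linear transform on top."""
    return affine(A, b, add_noise(mu_x, mu_n))


def joint_mean(mu_x, A, b, mu_n):
    """Joint order: linear transform on clean speech, then noise compensation."""
    return add_noise(affine(A, b, mu_x), mu_n)


# Hypothetical speaker transform, estimated once, and two noise conditions.
A = [[1.1, 0.0], [0.0, 0.9]]
b = [0.3, -0.1]
mu_x = [2.0, 1.0]
noise_1 = [-30.0, -30.0]   # effectively clean
noise_2 = [3.0, 3.0]       # noisy

# With Joint, the same (A, b) is reused and only the noise mean changes.
joint_clean = joint_mean(mu_x, A, b, noise_1)
joint_noisy = joint_mean(mu_x, A, b, noise_2)
vts_mllr_noisy = vts_mllr_mean(mu_x, A, b, noise_2)
```

In the Joint order the speaker transform always sees clean-speech means, so it does not depend on the noise condition; in the VTS-MLLR order the linear transform is applied to noise-compensated means and is therefore tied to the condition it was estimated in.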
0:09:20 | This diagram shows how we evaluate the |
---|
0:09:26 | Joint transforms in the |
---|
0:09:29 | experiments. |
---|
0:09:30 | First we have some |
---|
0:09:33 | data from acoustic condition one, that is, noise one and |
---|
0:09:37 | speaker k. We estimate the noise transform and the speaker transform jointly. |
---|
0:09:41 | Then, for the same speaker in another noise condition, |
---|
0:09:46 | we just update the noise transform and use the speaker transform we obtained |
---|
0:09:51 | in |
---|
0:09:52 | the previous estimation |
---|
0:09:55 | and |
---|
0:09:55 | then |
---|
0:09:56 | adapt the acoustic models. |
---|
0:09:58 | This gives us the freedom that, |
---|
0:10:00 | uh |
---|
0:10:01 | since only the noise transform |
---|
0:10:03 | is required to |
---|
0:10:05 | be updated, |
---|
0:10:06 | this can be done on a single utterance. So we can do |
---|
0:10:10 | joint speaker and noise |
---|
0:10:12 | adaptation |
---|
0:10:14 | on a single utterance, |
---|
0:10:15 | which is very flexible. |
---|
0:10:18 | So let's go to the experiments. |
---|
0:10:20 | We ran the experiments on the AURORA4 task. |
---|
0:10:24 | This is derived from the Wall Street Journal task, |
---|
0:10:29 | and we have four test sets defined. |
---|
0:10:32 | These are, first, set A: |
---|
0:10:35 | set A is test set 01, which is the clean set; |
---|
0:10:39 | and then, |
---|
0:10:39 | in set B, we have six |
---|
0:10:41 | different types of noise added; |
---|
0:10:44 | and set C and set D |
---|
0:10:46 | come from the far-field microphones. |
---|
0:10:49 | For the acoustic model training we do some pretty standard things. |
---|
0:10:53 | and |
---|
0:10:54 | These are the experiments in batch mode. |
---|
0:10:58 | In batch mode, I mean, |
---|
0:11:00 | the speaker and noise transforms are estimated |
---|
0:11:03 | for each of the test sets, |
---|
0:11:06 | so |
---|
0:11:07 | there is no sharing of speaker transforms. |
---|
0:11:10 | We can see that, |
---|
0:11:12 | by doing speaker and noise adaptation, |
---|
0:11:14 | combining the speaker and noise adaptation, |
---|
0:11:17 | we achieve significant gains over |
---|
0:11:22 | noise adaptation, |
---|
0:11:23 | noise adaptation only. |
---|
0:11:25 | We can also see that, since Joint just reverses the order of the VTS and |
---|
0:11:30 | MLLR transforms, the order |
---|
0:11:32 | here is not something performance is very sensitive to; |
---|
0:11:36 | it really |
---|
0:11:39 | does not impact performance too much. |
---|
0:11:44 | so |
---|
0:11:44 | But we want to emphasise that these are batch-mode experiments, |
---|
0:11:49 | which |
---|
0:11:51 | require a block of data to estimate the transforms, |
---|
0:11:56 | so this is not very flexible to use. |
---|
0:12:00 | What is more interesting is the factorisation experiments. |
---|
0:12:04 | In these experiments |
---|
0:12:06 | we estimate the speaker transform from the clean set, which is test set 01, |
---|
0:12:13 | and we apply this speaker transform in all the other noise conditions. |
---|
0:12:17 | We can see from |
---|
0:12:19 | this row of the table |
---|
0:12:21 | that, |
---|
0:12:22 | with Joint, |
---|
0:12:23 | the speaker transform estimated from clean speech also helps on sets B, C and D, |
---|
0:12:29 | the noisy sets. |
---|
0:12:31 | In contrast, for VTS-MLLR that does not |
---|
0:12:35 | generalise; it actually decreases performance, |
---|
0:12:38 | because the MLLR transform in this case |
---|
0:12:42 | is |
---|
0:12:43 | acting on the VTS-adapted models, so |
---|
0:12:48 | it is |
---|
0:12:49 | associated with a specific noise condition and |
---|
0:12:55 | cannot be used in other noise conditions. |
---|
0:12:58 | What is more interesting is that if we estimate the speaker transform from a noisy |
---|
0:13:04 | set, |
---|
0:13:05 | test set 04, which is restaurant noise, |
---|
0:13:09 | then we have |
---|
0:13:10 | the |
---|
0:13:11 | Joint adaptation scheme actually |
---|
0:13:15 | uh |
---|
0:13:16 | getting some |
---|
0:13:18 | better results. |
---|
0:13:22 | This is interesting, |
---|
0:13:23 | and |
---|
0:13:25 | it might indicate that |
---|
0:13:29 | the MLLR transform in Joint |
---|
0:13:33 | may be modelling something that should be modelled by |
---|
0:13:36 | the VTS transform, which means that our factorisation may not be perfect. |
---|
0:13:43 | But the number obtained |
---|
0:13:45 | by adaptation using the speaker transform estimated |
---|
0:13:51 | from 04 achieves |
---|
0:13:53 | a result that is very close to the batch-mode |
---|
0:13:59 | experiments. |
---|
0:14:00 | This demonstrates |
---|
0:14:01 | that |
---|
0:14:02 | we can factorise the speaker transform and use this speaker transform in |
---|
0:14:08 | very different noise conditions. |
---|
0:14:11 | so |
---|
0:14:13 | So here I arrive at my conclusions. |
---|
0:14:16 | In this talk we argued that |
---|
0:14:19 | handling multiple acoustic factors is important in |
---|
0:14:23 | very complex, realistic acoustic environments, |
---|
0:14:26 | and we presented a powerful and flexible approach based on |
---|
0:14:30 | acoustic factorisation. We derived a new adaptation scheme called Joint, |
---|
0:14:36 | and this allows very rapid speaker and noise adaptation. |
---|
0:14:41 | The |
---|
0:14:41 | speaker transform can be used across multiple acoustic conditions. |
---|
0:14:45 | And just a little bit about some new experiments: |
---|
0:14:49 | we have also compared our approach |
---|
0:14:52 | with feature-enhancement-style |
---|
0:14:57 | noise robustness schemes |
---|
0:15:01 | combined with linear-transform |
---|
0:15:04 | speaker adaptation. |
---|
0:15:05 | We observed that Joint outperformed this combination of feature enhancement and MLLR |
---|
0:15:12 | um |
---|
0:15:13 | in both batch and factorisation mode. |
---|
0:15:15 | This demonstrates |
---|
0:15:17 | the power of |
---|
0:15:18 | the model-based framework, |
---|
0:15:20 | and inside this framework we have a very powerful and flexible tool, |
---|
0:15:27 | that is, acoustic factorisation. |
---|
0:15:29 | Thank you. |
---|
0:15:36 | We do have time for a couple of questions, but you have to go to the microphones, which are |
---|
0:15:42 | both behind the projectors. |
---|
0:16:04 | so i have the |
---|
0:16:06 | so questions |
---|
0:16:07 | to use a speaker |
---|
0:16:11 | a to the factorization so |
---|
0:16:14 | quote a close at all |
---|
0:16:16 | three |
---|
0:16:17 | to that extent you so the assumption is about |
---|
0:16:22 | oh |
---|
0:16:24 | uh |
---|
0:16:25 | Yes, I think you are right that this actually is |
---|
0:16:29 | um |
---|
0:16:30 | not a perfect factorisation, since, |
---|
0:16:35 | as you can see, |
---|
0:16:37 | uh |
---|
0:16:38 | we have a bias in the linear transform and we also have a channel distortion, which actually |
---|
0:16:44 | is also a bias. |
---|
0:16:46 | That we |
---|
0:16:48 | cannot separate. |
---|
0:16:49 | uh |
---|
0:16:49 | But |
---|
0:16:51 | for the main part, |
---|
0:16:54 | the speaker transform is a linear transform and the noise transform is a nonlinear transform, |
---|
0:16:59 | so these two different types of transform, when combined, |
---|
0:17:03 | can actually be |
---|
0:17:04 | uh |
---|
0:17:06 | factorised. |
---|
0:17:09 | And the experiments demonstrated that |
---|
0:17:13 | uh |
---|
0:17:14 | we can use them this way, |
---|
0:17:15 | because the factorisation |
---|
0:17:18 | property is quite good. |
---|
0:17:20 | um |
---|
0:17:21 | We cannot say that, mathematically, |
---|
0:17:24 | they are orthogonal to each other, but we can see |
---|
0:17:28 | from this that we can use the speaker transform in various conditions. |
---|
0:17:32 | So that is, |
---|
0:17:33 | that is |
---|
0:17:34 | a kind of factorisation, |
---|
0:17:35 | a kind of orthogonality. |
---|
0:17:40 | Any other questions? |
---|
0:17:45 | Then let's thank the speaker. |
---|