0:00:14 | Hello everyone, today my report is |
---|
0:00:18 | "Joint Training of End-to-End Speech Recognition Systems with Speech Attributes". |
---|
0:00:24 | I and my colleagues |
---|
0:0026 | all work for an |
---|
0:00:29 | advanced technology lab located in Japan. |
---|
0:00:39 | In this paper, our motivation is as follows: we focus on improving the performance of the |
---|
0:00:46 | state-of-the-art Transformer-based end-to-end speech recognition system. |
---|
0:00:51 | As we know, there are multilingual speech-to-speech translation systems that |
---|
0:00:58 | have metadata about their speakers, |
---|
0:01:01 | and how to improve the end-to-end speech recognition system |
---|
0:01:07 | to deal with such diverse input is one focus of this paper. |
---|
0:01:20 | The Transformer-based end-to-end speech recognition system we are using |
---|
0:01:25 | is the state of the art, |
---|
0:01:28 | but its parameter size is ten times larger than that of the traditional |
---|
0:01:33 | deep neural network / hidden Markov model hybrids. How to compress this model |
---|
0:01:39 | to a relatively small size is another focus of this paper. Actually, this is from |
---|
0:01:45 | our previous |
---|
0:01:48 | Interspeech paper in 2019; we also introduce it in this paper as |
---|
0:01:54 | a summary. |
---|
0:02:02 | So this paper tries to solve these problems using the following two |
---|
0:02:08 | techniques. The first is recurrent stacked layers; |
---|
0:02:13 | the second is speech attributes and their combinations. Recurrent stacked layers try |
---|
0:02:20 | to compress the model size, |
---|
0:02:23 | while each attribute |
---|
0:02:24 | serves as label-level augmentation to train the model; I mean, it trains the compressed |
---|
0:02:31 | model explicitly, |
---|
0:02:34 | actually doing something like speaker adaptive training, |
---|
0:02:41 | to improve the result. |
---|
0:02:47 | In this slide we introduce how we compress the whole model |
---|
0:02:52 | using the recurrent stacked |
---|
0:02:55 | layers. |
---|
0:02:56 | For a conventional Transformer-based model, |
---|
0:03:00 | each layer is independent of the rest: |
---|
0:03:06 | there are, for example, six layers, |
---|
0:03:10 | six encoding layers and six decoding layers, |
---|
0:03:14 | so the parameter size is very large. If we |
---|
0:03:18 | use the same parameters for all layers in the encoder, and |
---|
0:03:22 | likewise the same parameters in the decoder, |
---|
0:03:26 | we can compress the model to |
---|
0:03:29 | one sixth of its original size, taking the example of a six-and-six-layer |
---|
0:03:34 | Transformer-based model. |
---|
0:03:39 | This idea is simple but very effective. |
---|
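The layer-sharing idea above can be sketched as follows (a minimal illustration in plain Python; the `Layer` class and the one-million-parameter figure are made-up stand-ins, not numbers from the talk):

```python
# Minimal sketch: reusing one layer object at every depth stores its
# weights only once, while independent layers each store their own.

class Layer:
    """Stand-in for one Transformer block holding `n_params` weights."""
    def __init__(self, n_params):
        self.n_params = n_params

def build_stack(depth, n_params, shared):
    if shared:
        layer = Layer(n_params)
        return [layer] * depth                       # same weights reused
    return [Layer(n_params) for _ in range(depth)]   # independent weights

def count_params(stack):
    # Sum each *distinct* layer's weights exactly once.
    return sum(l.n_params for l in {id(l): l for l in stack}.values())

full   = build_stack(6, 1_000_000, shared=False)
shared = build_stack(6, 1_000_000, shared=True)
print(count_params(full), count_params(shared))  # 6000000 1000000
```

With six encoder layers sharing one weight set (and likewise for the decoder), the layer parameters shrink to one sixth of the independent-layer count.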
0:03:47 | This is our experimental setting. As the dataset, we use a Japanese speech recognition corpus, the CSJ corpus. |
---|
0:03:54 | The training set sums up to about 500 hours, |
---|
0:04:00 | plus a development set and three test sets. |
---|
0:04:06 | As for the model training settings, we use eight attention heads with 512 |
---|
0:04:12 | units, |
---|
0:04:13 | six blocks for the encoder and six blocks for the decoder. |
---|
0:04:20 | For the experimental settings, we use word pieces, i.e. subwords, |
---|
0:04:27 | as the training unit, |
---|
0:04:31 | and 40-dimensional filterbank features as the input. |
---|
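A rough sketch of the 40-dimensional log-mel filterbank front end mentioned above (NumPy only; the 16 kHz sample rate, 512-point FFT, 10 ms hop, and HTK-style mel formula are conventional assumptions, not settings confirmed in the talk):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def logmel(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Frame the signal and take the magnitude spectrum of each frame.
    frames = [signal[i:i + n_fft] for i in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames) * np.hanning(n_fft), axis=1))
    # Build triangular mel filters between 0 Hz and the Nyquist frequency.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(spec @ fbank.T + 1e-10)   # shape: (n_frames, n_mels)

rng = np.random.default_rng(0)
feats = logmel(rng.standard_normal(16000))  # one second of fake audio
print(feats.shape)                          # (97, 40)
```

Each utterance thus becomes a sequence of 40-dimensional frames that the Transformer encoder consumes.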
0:04:48 | These are our experimental |
---|
0:04:50 | results, |
---|
0:04:52 | for the shared model and the full model: the shared model uses shared layers, |
---|
0:05:00 | while the full model uses independent layers. |
---|
0:05:06 | For the full models, we tried different numbers of layers, |
---|
0:05:13 | where M and N here are the number of encoder layers |
---|
0:05:19 | and the number of decoder layers. |
---|
0:05:22 | We find that |
---|
0:05:24 | for the full models, |
---|
0:05:26 | the six-encoder and six-decoder structure achieves the best |
---|
0:05:32 | result; |
---|
0:05:33 | even much deeper models show no significant improvement. |
---|
0:05:40 | For the shared model, we also observed that six encoders and six |
---|
0:05:45 | decoders |
---|
0:05:47 | achieve the best result. However, there is a performance gap for |
---|
0:05:52 | the shared model: |
---|
0:05:55 | I mean, the performance gap is about one percent absolute, |
---|
0:06:01 | one percent absolute of performance gap. |
---|
0:06:06 | How to minimize this performance gap is our |
---|
0:06:11 | other focus. |
---|
0:06:26 | As a summary of the first experimental results of this paper, |
---|
0:06:32 | our observation is that the shared model with recurrent stacked layers |
---|
0:06:38 | is six times smaller than the original model with independent layers, the full |
---|
0:06:45 | model. |
---|
0:06:46 | And we can speed up decoding to twice the original |
---|
0:06:53 | decoding speed, with ten percent faster training; for decoding we use |
---|
0:07:00 | standard autoregressive decoding, and for training we use the |
---|
0:07:07 | GPU. |
---|
0:07:09 | Also, increasing the depth further is not beneficial, |
---|
0:07:13 | I mean, more than six layers; |
---|
0:07:15 | in our experiments we therefore use |
---|
0:07:18 | six layers. |
---|
0:07:21 | Experimental setting: |
---|
0:07:23 | in the following experiments we only use the six-encoder, six-decoder |
---|
0:07:28 | structure. |
---|
0:07:31 | The non-linear operations are more important than the number of parameters, |
---|
0:07:36 | I mean, |
---|
0:07:39 | than the parameters themselves, and that's why the shared, |
---|
0:07:42 | I mean the recurrent stacked, layers work. |
---|
0:07:51 | Why do we propose the speech attribute augmentation in training? Because |
---|
0:07:59 | the autoregressive model has a nature: the first recognized words will |
---|
0:08:03 | influence the later recognized words, although its decoding speed is very slow. |
---|
0:08:09 | Using this nature, we can adopt something like speaker adaptive |
---|
0:08:15 | training, by introducing the attributes at the beginning of the label sequence. |
---|
0:08:29 | This is the definition of the speech attributes: they include the speakers' information |
---|
0:08:35 | and also include the |
---|
0:08:37 | speech segments' information. |
---|
0:08:40 | We give a formal definition of the speech attributes. First, we use the dialect of the speakers; |
---|
0:08:48 | I mean, because we are using a corpus of the Japanese language, we use the |
---|
0:08:53 | region where the speakers were born, |
---|
0:08:57 | I mean, the prefectures |
---|
0:09:00 | and so on; the places we put here. |
---|
0:09:04 | We also have the duration of the utterances; this attribute |
---|
0:09:10 | we find very useful, although it has nothing to do with |
---|
0:09:14 | the speakers' information. Also, the corpus is from several |
---|
0:09:21 | different |
---|
0:09:24 | resources: academic, simulated dialogue, read speech, |
---|
0:09:28 | miscellaneous, and others. |
---|
0:09:30 | And the fourth is the sex of the speakers: female, male, and |
---|
0:09:36 | unknown, |
---|
0:09:37 | I mean unregistered information. |
---|
0:09:40 | For the age, |
---|
0:09:42 | we group all the speakers into four groups: |
---|
0:09:47 | young, middle-aged, old, and unregistered information. And as the last group we use |
---|
0:09:55 | the educational level of the speakers: middle school, high school, university |
---|
0:10:01 | or higher, and unregistered information. |
---|
0:10:04 | With these definitions, we |
---|
0:10:07 | use individual attributes |
---|
0:10:10 | and different numbers of combinations |
---|
0:10:13 | to train; |
---|
0:10:15 | that is, I mean, we put those |
---|
0:10:18 | attributes as tags in the model training. |
---|
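The label-level augmentation described above can be sketched as follows (illustrative Python; the tag spellings are hypothetical, not the talk's actual vocabulary): the attribute tags are simply prepended to the target token sequence, so the autoregressive decoder conditions the following words on them.

```python
# Sketch of label-level augmentation: turn attribute values into
# special tokens and place them before the transcription labels.

def prepend_attributes(tokens, attributes):
    """Prepend hypothetical <name:value> tags to the label sequence."""
    tags = [f"<{name}:{value}>" for name, value in attributes]
    return tags + tokens

labels = ["こんにちは", "世界"]                    # target word pieces
attrs = [("sex", "female"), ("duration", "long"), ("age", "young")]
print(prepend_attributes(labels, attrs))
# → ['<sex:female>', '<duration:long>', '<age:young>', 'こんにちは', '世界']
```

At training time the model learns to emit the tags first; at test time they can be supplied or predicted before the words are decoded.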
0:10:22 | These are the experiments utilizing the speech attributes as labels. |
---|
0:10:27 | So the first line is |
---|
0:10:29 | without attributes; |
---|
0:10:32 | I mean, it is the conventional method, |
---|
0:10:36 | but we provide it to compare as a baseline. |
---|
0:10:39 | And we also use the speakers, |
---|
0:10:42 | the speaker ID, |
---|
0:10:44 | as an attribute; there are one thousand |
---|
0:10:48 | five hundred or more speakers. |
---|
0:10:51 | I mean, we put the speaker's ID at the beginning of the sentence as a label; |
---|
0:10:55 | we can see it is not effective at all. |
---|
0:10:59 | And we also tried |
---|
0:11:01 | individual |
---|
0:11:03 | groups of tags, individual groups of tags at the beginning of the, |
---|
0:11:07 | that is, of the label, |
---|
0:11:09 | and find that the sex |
---|
0:11:11 | information is the best, and also duration; |
---|
0:11:14 | these, |
---|
0:11:16 | when selected, will be effective individually. We also tried |
---|
0:11:21 | two-attribute combinations, three-attribute combinations, four-attribute combinations, |
---|
0:11:26 | and even five-attribute combinations; it is not the case that |
---|
0:11:31 | simply |
---|
0:11:32 | combining them all together |
---|
0:11:34 | will work |
---|
0:11:36 | the best. |
---|
0:11:38 | To be more specific, we find that for two groups, |
---|
0:11:43 | sex and duration works the best; for three groups, the combination |
---|
0:11:49 | of duration, |
---|
0:11:50 | sex, and age works the best; and |
---|
0:11:54 | for four groups, duration, sex, and age with one more attribute works the best of all. |
---|
0:12:01 | Still, this four-group, the overall attribute combination, can be compared to |
---|
0:12:09 | the full model's baseline, |
---|
0:12:12 | I mean the uncompressed larger network; |
---|
0:12:16 | the performance is comparable. |
---|
0:12:20 | We also find that using these tags, |
---|
0:12:23 | using the speech attributes, |
---|
0:12:25 | to augment, to train the |
---|
0:12:27 | full model, meaning the uncompressed big model, is not effective, because the |
---|
0:12:33 | model size is large enough to learn this by itself. |
---|
0:12:46 | As the observation of this part of the experiments, we find the full model can |
---|
0:12:51 | learn the speech attributes by itself, I mean implicitly, |
---|
0:12:56 | because the model size is large; it can learn it all |
---|
0:13:00 | by itself, |
---|
0:13:02 | so only the shared model needs these explicit signals. |
---|
0:13:07 | We also find the four-attribute combination of duration, sex, age, and |
---|
0:13:13 | one more attribute, |
---|
0:13:14 | and the single sex |
---|
0:13:17 | tag, the most effective. |
---|
0:13:21 | These attributes can be predicted, or obtained from |
---|
0:13:24 | other resource information, |
---|
0:13:27 | beforehand, |
---|
0:13:29 | when doing real ASR recognition at test time. |
---|
0:13:39 | To conclude: |
---|
0:13:40 | the end-to-end Transformer-based speech recognition models |
---|
0:13:46 | have many layers in the encoder and the decoder, and this makes the models |
---|
0:13:50 | very large. |
---|
0:13:51 | In the conventional model, each layer has independent parameters; |
---|
0:13:57 | however, this makes the model too large, so we propose recurrent stacked |
---|
0:14:03 | layers, |
---|
0:14:04 | which use the same parameters for all layers in the encoder and the decoder individually. |
---|
0:14:10 | We also propose speech attributes as input signals |
---|
0:14:15 | to perform augmentation, to train |
---|
0:14:19 | the recurrent stacked layers |
---|
0:14:24 | explicitly. |
---|
0:14:26 | To verify this, |
---|
0:14:27 | we did extensive experiments on the CSJ corpus. |
---|
0:14:37 | We achieved a model size reduction of up to |
---|
0:14:41 | ninety-three percent, |
---|
0:14:43 | and also ten percent |
---|
0:14:45 | faster training by using one GPU, |
---|
0:14:48 | with no significant increase in character error rate in the speech recognition |
---|
0:14:53 | when using the speech attributes. |
---|
0:14:56 | We also find increased attention entropy when visualizing the feature maps. |
---|
0:15:04 | For future work, |
---|
0:15:05 | we will maximize the compression with the recurrent stacked layers, with |
---|
0:15:12 | only a small introduced degradation. |
---|
0:15:15 | We will also develop more flexible decoding |
---|
0:15:22 | by choosing the depth dynamically. |
---|
0:15:25 | And we can use lower precision |
---|
0:15:30 | for the model storage and the parameters, for the attention and the soft- |
---|
0:15:34 | max side, |
---|
0:15:37 | as technologies to make the model smaller. |
---|
0:15:40 | And we will also investigate the model representations at each layer, |
---|
0:15:45 | to see how much useful information comes from |
---|
0:15:49 | each layer. |
---|
0:15:51 | Thank you so much for listening. |
---|
0:15:55 | This is the end of the presentation, and any questions are welcome. |
---|