0:00:14 Hello everyone. Today my report is on jointly training end-to-end speech recognition systems with speech attributes. My co-authors and I work at an advanced technology lab located in Japan.
0:00:39 First, our motivation in this paper: we focus on improving the performance of the state-of-the-art Transformer-based end-to-end speech recognition system. As you may know, our lab runs a multilingual speech-to-speech translation system that has metadata about all of its speakers, and how to improve the end-to-end speech recognition system to deal with such diverse input is one focus of this paper.
0:01:20 Since we are using a Transformer-based end-to-end speech recognition system, which is the state of the art, its parameter size is about ten times larger than that of the traditional deep-neural-network/hidden-Markov-model hybrids. How to compress this model to a relatively small size is the other focus of this paper. The compression method actually comes from our previous Interspeech paper in 2019; we also introduce it in this paper as a summary.
0:02:02 This paper tries to solve these two problems with the following two techniques. The first is recurrent stacked layers; the second is speech attributes and their combinations. The recurrent stacked layers compress the model size. The speech attributes act as a label-level augmentation to train the model; that is, they train the compressed model explicitly, doing something like speaker adaptive training, to improve the result.
0:02:47 In this slide we introduce how we compress the whole model using the recurrent stacked layers. In a conventional Transformer-based model, each layer is independent of the others: for example, six encoder layers and six decoder layers, so the parameter size is very large. If we use the same parameters for all layers in the encoder, and likewise the same parameters for all layers in the decoder, we can compress the model to one sixth of its original size, taking the example of a Transformer-based model with six encoder and six decoder layers. This idea is simple but very effective.
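The parameter saving from reusing one set of layer weights can be sketched with a simple count. This is a minimal illustration; the layer sizes (model dimension 512, feed-forward dimension 2048) are typical Transformer values assumed for the example, and decoder cross-attention is ignored for simplicity:

```python
# Rough per-layer parameter count for one Transformer layer:
# self-attention (four d_model x d_model projections) plus a
# position-wise feed-forward network (d_model x d_ff and back).
def layer_params(d_model: int, d_ff: int) -> int:
    attention = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    feed_forward = 2 * d_model * d_ff   # two linear maps
    return attention + feed_forward

def stack_params(n_layers: int, d_model: int, d_ff: int, shared: bool) -> int:
    # With recurrent stacked layers, one layer's weights are reused
    # n_layers times, so the stack costs only a single layer's parameters.
    unique_layers = 1 if shared else n_layers
    return unique_layers * layer_params(d_model, d_ff)

full = stack_params(6, 512, 2048, shared=False)    # conventional 6-layer stack
shared = stack_params(6, 512, 2048, shared=True)   # recurrent stacked layers
print(full // shared)  # -> 6: the stack shrinks by the number of layers
```

The same counting applies to the decoder stack, which is why sharing in a six-plus-six model yields roughly a one-sixth model size.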
0:03:47 This is our experimental setting. As the dataset we use a Japanese speech recognition corpus, the CSJ corpus (Corpus of Spontaneous Japanese); its training set sums up to about five hundred hours, and there are a development set and three test sets. For the model training settings we use eight attention heads with 512 units, six blocks for the encoder and six blocks for the decoder. We use a wordpiece model as the training unit and 40-dimensional filterbank features for feature extraction.
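The exact front-end is not detailed in the talk; a common way to obtain 40-dimensional filterbank features is a triangular mel filterbank applied to the power spectrum. A minimal numpy sketch, where the FFT size and sample rate are assumed values rather than figures from the paper:

```python
import numpy as np

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft // 2 + 1)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # n_mels + 2 equally spaced points on the mel scale, mapped to FFT bins
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)

    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (40, 257): one row of triangular weights per mel band
```

Multiplying this matrix by a frame's power spectrum and taking the log gives the 40-dimensional filterbank feature for that frame.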
0:04:48 These are our experimental results, comparing the shared model and the full model: "shared" means the recurrent stacked layers are used, and "full" means every layer has its own parameters. For both models we tried different numbers of layers, where N_enc is the number of encoder layers and N_dec is the number of decoder layers. We find that for the full model, the six-encoder, six-decoder structure achieves the best result; making it much deeper brings no significant improvement. For the shared model we also observe that six encoder layers and six decoder layers give the best result. However, there is a performance gap for the shared model of about one percent absolute. How to minimize this performance gap is the other focus.
0:06:26 As a summary of the first experiment of this paper, our observation is that the shared model with recurrent stacked layers is six times smaller than the original full model with independent layers. We can also speed up decoding to twice the original decoding speed, and training is about ten percent faster when we use one GPU. Going deeper is not beneficial, meaning more than six layers, so from our experiments we draw the line at six layers, and in the following experiments we only use the six-encoder, six-decoder structure. This also suggests that the non-linear operations matter more than the number of independent parameters, which is why the shared (recurrent stacked) layers work.
0:07:51 Why do we propose the speech attributes augmentation in training? Because the autoregressive model has the nature that it must first recognize the earlier words before it recognizes the later ones; although this makes its decoding speed slow, we can use this nature to adopt something like speaker adaptive training by introducing the attributes at the beginning of the label sequence.
0:08:29 Here is the definition of the speech attributes: they include the speakers' information and also information about the speech segments themselves. We give a formal definition of the speech attributes. First we use the dialect of the speakers; because we are using a Japanese-language corpus, we tag the place where each speaker was born. We also use the duration of the utterances; we found this attribute very useful, and it has nothing to do with the speaker information. The corpus also comes from several different resources: academic, simulated dialogue, read speech, miscellaneous, and so on. The fourth attribute is the sex of the speakers: female, male, and unregistered. For age, we group all the speakers into four groups: young, middle-aged, aged, and unregistered. As the last group we use the educational level of the speakers: middle school, high school, university degrees, and unregistered.
0:10:04 With these attribute definitions, we use the individual attributes and different numbers of attribute combinations in training; that is, we prepend the corresponding tags to the label sequence during model training.
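The label-level augmentation described above amounts to prepending one special tag per attribute to each target transcription before training. A minimal sketch, where the tag bracket notation and attribute names are illustrative rather than the paper's exact tokens:

```python
def add_attribute_tags(transcript_tokens, attributes,
                       order=("duration", "sex", "age")):
    """Prepend one special tag per selected speech attribute to the labels.

    The autoregressive decoder then emits (and conditions on) these tags
    before the first real word, in the spirit of speaker adaptive training.
    """
    tags = [f"<{name}:{attributes[name]}>" for name in order if name in attributes]
    return tags + list(transcript_tokens)

labels = add_attribute_tags(
    ["konnichiwa", "sekai"],
    {"sex": "female", "age": "young", "duration": "short"},
)
print(labels)
# ['<duration:short>', '<sex:female>', '<age:young>', 'konnichiwa', 'sekai']
```

At test time the same tags can be predicted or supplied from metadata, so no extra acoustic information is required.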
0:10:22 Here are the experiments utilizing the speech attributes as labels. The first line is without attributes, meaning the conventional method, which we use as the baseline for comparison. We also tried the speaker ID as the attribute; the corpus has about one and a half thousand speakers, and when we put the speaker ID at the beginning of the sentence as a label, we can see it is not effective at all. We then tried the individual groups of tags at the beginning of the label, and found that the sex information is the best; duration and the speech resource type are also effective individually.
0:11:21 We also tried two-attribute combinations, three-attribute combinations, four-attribute combinations, and the five-attribute combination, combining the tags together to see which combination works the best.
0:11:38 Looking more closely, among two groups, sex and duration works the best; among three groups, duration, sex and age works the best; and the four-group combination built on duration, sex and age works the best of all.
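Sweeping over all two-, three-, four-, and five-attribute combinations, as in this table, is a small enumeration. A sketch with a hypothetical attribute list standing in for the paper's tag groups:

```python
from itertools import combinations

# Hypothetical attribute names standing in for the paper's tag groups.
ATTRIBUTES = ["dialect", "resource", "duration", "sex", "age"]

def attribute_subsets(names, min_size=2, max_size=None):
    """Yield every combination of attribute tags to try as a training condition."""
    max_size = max_size or len(names)
    for k in range(min_size, max_size + 1):
        for subset in combinations(names, k):
            yield subset

subsets = list(attribute_subsets(ATTRIBUTES))
print(len(subsets))  # C(5,2)+C(5,3)+C(5,4)+C(5,5) = 10+10+5+1 = 26
```

Each subset defines one training condition: the chosen tags are prepended to every label sequence and a model is trained and scored per condition.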
0:12:01 This four-group attribute combination can be compared with the full model baseline, that is, the uncompressed larger network, and the performance is comparable. We also find that using these tags, the speech attributes, to train the full model (the uncompressed big model) is not effective, because that model's size is large enough for it to learn this information by itself.
0:12:46 As the observation from this part of the experiments: the full model learns the speech attributes by itself, implicitly, because its model size is large enough to learn on its own, while the shared model needs these explicit signals. We also find the combination of duration, sex and age, as well as the single sex tag, to be the most effective. This information can be predicted beforehand, or taken from the resource information, before doing real ASR recognition at test time.
0:13:39 To conclude: end-to-end Transformer-based speech recognition models have many layers in the encoder and decoder, which makes the model very large. Conventionally each layer has independent parameters, which makes the model too large, so we propose recurrent stacked layers, which use the same parameters for all layers in the encoder and in the decoder, respectively. We also propose speech attributes as input signals to perform augmentation, so as to train the recurrent stacked layers explicitly. In extensive experiments on the CSJ corpus, we see a model size reduction of up to ninety-three percent and ten percent faster training using one GPU, with no significant change in character error rate when we use the speech attributes. We also observe increased attention entropy when visualizing the feature maps.
0:15:04 For future work, we will maximize the compression with the recurrent stacked layers while introducing only a small degradation. We will also develop more flexible decoding by choosing the layers dynamically, and we can use lower precision for the model storage and the parameters, in the fast attention and the softmax side, as technologies to make the model smaller. We will also investigate the model representations at each layer, to see how much useful information comes from each layer.
0:15:51 Thank you so much for listening. This is the end of the presentation, and any questions are welcome.