Hello everyone. Today I will present our work on jointly training end-to-end speech recognition systems with speech attributes.
We work at the NICT Advanced Speech Technology Laboratory, located in Kyoto, Japan.
First, the motivation of this paper: we focus on improving the performance of the state-of-the-art transformer-based end-to-end speech recognition system.
As we know, NICT runs a multilingual speech-to-speech translation system that has to deal with data from all kinds of speakers.
How to improve the end-to-end speech recognition system to deal with such diverse input is one focus of this paper.
We use the transformer-based end-to-end speech recognition system because it is the state of the art.
However, its parameter size is about ten times larger than that of the traditional deep neural network / hidden Markov model hybrid, so how to compress this model to a relatively small size is the other focus of this paper.
We already introduced these ideas in our previous Interspeech paper in 2019, and this paper serves as a summarization.
This paper tries to solve these problems using the following two techniques.
The first is recurrently stacked layers; the second is speech attribute combinations.
The recurrently stacked layers compress the model size, while the speech attributes serve as label-level augmentation to train the model, that is, they train the compressed model explicitly.
This is actually something like speaker adaptive training, and it improves the results.
In this slide we introduce how we compress the whole model using the recurrently stacked layers.
In the conventional transformer-based model, each layer is independent of the others; with, for example, six encoding layers and six decoding layers, the parameter size is very large.
If we instead use the same parameters for all layers in the encoder, and likewise for all layers in the decoder, we can compress the model to about one sixth of its original size, taking the example of the six-and-six-layer transformer-based model.
this idea is simple but very effective
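As a minimal sketch, with illustrative layer sizes rather than the exact configuration from the talk, the compression from tying one set of parameters across the stacked layers can be counted directly:

```python
# Sketch: parameter counts for independent vs. recurrently stacked
# transformer layers. d_model and d_ff are illustrative values only.

def layer_params(d_model, d_ff):
    # One transformer layer: self-attention (Q, K, V, output
    # projections) plus a two-matrix feed-forward block; biases and
    # layer norms are omitted for simplicity.
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return attention + feed_forward

def stack_params(n_layers, d_model, d_ff, shared):
    # Independent layers multiply the cost by n_layers; a recurrently
    # stacked (shared) model reuses one layer's parameters every pass.
    per_layer = layer_params(d_model, d_ff)
    return per_layer if shared else n_layers * per_layer

independent = stack_params(6, 512, 2048, shared=False)
shared = stack_params(6, 512, 2048, shared=True)
print(independent // shared)  # the 6-layer independent stack is 6x larger
```

This ignores embeddings and output projections, which are shared in both settings, so the end-to-end reduction of a full model is somewhat less than the per-stack factor.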
This is our experimental setting.
As the dataset we use a Japanese speech recognition corpus, the CSJ corpus.
The training set sums up to about five hundred hours, and there are one development set and three test sets.
For model training we use eight attention heads with 512 hidden units, six blocks for the encoder, and six blocks for the decoder.
We use word pieces as the training units and 40-dimensional filterbank features for feature extraction.
These are our experimental results for the shared model and the full model.
In the shared model the same parameters are reused across layers, while in the full model different layers have their own parameters.
For both models we tried different numbers of layers, that is, different numbers of encoder layers and decoder layers.
We find that for the full model, the six-encoder and six-decoder structure achieves the best result; making it much deeper brings no significant improvement.
For the shared model we also observe that six encoder layers and six decoder layers achieve the best result.
However, there is a performance gap for the shared model of about one percent absolute.
How to minimize this performance gap is the other focus.
As a summarization of the first part of the experimental results, our observation is that the shared model with recurrently stacked layers is six times smaller than the original model with independent layers in the six-and-six configuration.
We can also speed up decoding to twice the original decoding speed, and training is about ten percent faster when we use the GPU.
Going deeper, that is, using more than six layers, is not beneficial, so in the following experiments we only use the six-and-six-layer structure.
We believe the non-linear operations matter more than the number of distinct parameters, and that is why sharing parameters across the layers still works.
Why do we propose speech attribute augmentation for training?
The autoregressive model has the nature that the first recognized words influence the subsequently recognized words.
Although this makes decoding slow, we can use this nature to adopt something similar to speaker adaptive training by putting the attribute tags at the beginning of the label.
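The mechanism just described can be written as a factorization; this is a sketch under the assumption that the attribute tags a_1, ..., a_K are simply prepended to the label sequence y_1, ..., y_T:

```latex
% Every output token is conditioned on all previously emitted tokens,
% so placing the attribute tags first conditions the whole transcript
% on them:
P(a_{1:K}, y_{1:T} \mid x)
  = \prod_{k=1}^{K} P(a_k \mid a_{<k}, x)
    \prod_{t=1}^{T} P(y_t \mid a_{1:K}, y_{<t}, x)
```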
The definition of the speech attributes is that they include the speakers' information and also information about the speech segments.
We give a formal definition of the speech attributes.
First, we use the dialect of the speakers: since we are using a corpus of the Japanese language, we use the place where the speaker was born, for example Hokkaido, Kansai, and so on.
Second, we have the duration of the utterances; we found this attribute very useful, even though it has nothing to do with speaker information.
Third, the corpus comes from several different sources, namely academic, simulated dialogue, read speech, and miscellaneous, so the source type is also an attribute.
Fourth is the sex of the speakers: female, male, and unregistered.
For the age, we group all the speakers into four groups: young, middle-aged, elderly, and unregistered.
As the last group we use the educational level of the speakers: middle school, high school, university and above, and unregistered.
With all these definitions, we use individual attributes and different numbers of attribute combinations as tags, and put those tags into the model training.
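As a rough sketch, with hypothetical tag names, metadata fields, and group boundaries rather than the exact ones used in the experiments, the attribute tags can be derived from the corpus metadata and prepended to each transcript:

```python
# Sketch: derive attribute tags from (hypothetical) utterance metadata
# and prepend them to the label sequence used for training.

def age_group(age):
    # Group raw speaker ages into coarse classes; the boundaries are
    # illustrative, with "unk" standing in for unregistered values.
    if age is None:
        return "<age:unk>"
    if age < 30:
        return "<age:young>"
    if age < 60:
        return "<age:middle>"
    return "<age:elderly>"

def attribute_tags(meta):
    # Combine a selected attribute group (duration, sex, age) into tags.
    dur = "<dur:long>" if meta.get("duration", 0.0) > 10.0 else "<dur:short>"
    sex = "<sex:%s>" % meta.get("sex", "unk")
    return [dur, sex, age_group(meta.get("age"))]

def augment_label(meta, tokens):
    # Training target = attribute tags followed by the transcript tokens.
    return attribute_tags(meta) + tokens

meta = {"duration": 12.3, "sex": "female", "age": 24}
print(augment_label(meta, ["konnichiwa"]))
# -> ['<dur:long>', '<sex:female>', '<age:young>', 'konnichiwa']
```

At decoding time the same tags would be consumed first, so the rest of the hypothesis is conditioned on them, in the spirit of speaker adaptive training.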
Here are the experiments utilizing the speech attributes as labels.
The first line is without attributes, that is, the conventional method, which we keep as the baseline for comparison.
We also use the speaker ID as an attribute; there are more than a thousand speakers, and when we put the speaker ID at the beginning of the sentence label, we can see it is not effective at all.
We also tried individual groups of tags at the beginning of the label, and find that the sex information works the best, while duration and speech source type are also effective individually.
We further tried two-attribute combinations, three-attribute combinations, four-attribute combinations, and the five-attribute combination; you might expect that combining all of them together would work the best.
More precisely, we find that for two groups, the combination of sex and duration works the best; for three groups, the combination of duration, sex, and age works the best; and a four-group combination works the best of all.
With four groups, the overall attribute combination becomes comparable to the full-model baseline, that is, the uncompressed larger network: the performance is comparable.
We also find that using these tags, the speech attributes, to augment the training of the full model, that is, the uncompressed big model, is not effective, because its model size is large enough to learn this by itself.
So the observation from this part of the experiments is that the full model learns the speech attributes by itself, implicitly, because its model size is large, whereas the shared model needs these explicit signals.
We also find that the combination of duration, sex, and age, as well as the single sex tag, are the most effective.
This information can also be predicted or obtained from the source information beforehand when doing real ASR recognition at test time.
To conclude: end-to-end transformer-based speech recognition models have many layers in the encoder and decoder, and this makes the model very large.
The conventional encoder and decoder layers each have independent parameters, and this makes the model too large, so we propose recurrently stacked layers, which use the same parameters for all layers in the encoder and in the decoder individually.
We also propose speech attributes as input signals to perform augmentation, which trains the recurrently stacked layers explicitly.
We verified this with extensive experiments on the CSJ corpus.
We see a model size reduction of up to 93 percent, and ten percent faster training using one GPU, with no significant increase in the character error rate of the speech recognition when we use the speech attributes.
We also find increased attention entropy when we visualize the feature maps.
For future work, we will maximize the compression with the recurrently stacked layers while keeping the performance degradation small.
We will also develop more flexible decoding by choosing the layers dynamically, and we can use low-precision techniques for model storage and for the parameters during the attention and softmax computation to make the model smaller.
We will also investigate the model representations at each layer, to see how much useful information comes from each layer.
Thank you so much for listening.
This is the end of the presentation, and questions are welcome.