Hello everyone. Today I will present our report on joint compression and training of end-to-end speech recognition systems, on behalf of my co-authors. We all work at an advanced technology lab located in Japan.

In this paper, our motivation is as follows: we focus on improving the performance of a state-of-the-art Transformer-based end-to-end speech recognition system. As you may know, our lab has a multilingual speech-to-speech translation system that must handle many different kinds of speakers. How to improve the end-to-end speech recognition system to deal with such diverse input is one focus of this paper.

The Transformer-based end-to-end speech recognition system we use is the state of the art, but its parameter size is about ten times larger than that of the traditional deep neural network / hidden Markov model hybrids. How to compress this model to a relatively small size is the other focus of this paper. This builds on our previous Interspeech paper from 2019, which we also summarize in this paper.

So this paper tries to solve these problems using the following two techniques. The first is recurrently stacked layers; the second is speech attributes and their combinations. The recurrently stacked layers compress the model size, while the speech attributes act as label-level augmentation to train the compressed model explicitly. This is actually doing something like speaker-adaptive training to improve the result.

In this slide, we introduce how we compress the whole model using the recurrently stacked layers. In a conventional Transformer-based model, each layer has independent parameters; with, for example, six encoding layers and six decoding layers, the parameter size is very large. If we use the same parameters for all layers in the encoder, and likewise the same parameters for all layers in the decoder, we can compress the model to one sixth of its original size, taking the example of a six-and-six-layer Transformer-based model.

This idea is simple but very effective.
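To make the "one sixth" claim concrete, here is a small back-of-the-envelope sketch. This is my own illustration, not the paper's code, and the per-block formulas are rough approximations; it counts the encoder/decoder layer parameters with and without sharing:

```python
def layer_params(d, ff_mult=4, cross_attention=False):
    """Rough per-block parameter count: attention projections + feed-forward."""
    attn = 4 * d * d                  # Q, K, V, and output projection matrices
    if cross_attention:               # decoder blocks also attend to the encoder
        attn *= 2
    ff = 2 * d * (ff_mult * d)        # two feed-forward weight matrices
    return attn + ff

def stack_params(d, n_enc=6, n_dec=6, shared=False):
    """Total layer parameters for the encoder-decoder stack."""
    enc = layer_params(d)
    dec = layer_params(d, cross_attention=True)
    if shared:                        # recurrently stacked: one set of weights,
        return enc + dec              # reused for every pass through the stack
    return n_enc * enc + n_dec * dec

full = stack_params(512)              # independent parameters per layer
shared = stack_params(512, shared=True)
print(full // shared)                 # → 6
```

The count ignores embeddings, layer norms, and biases, which is why the shared stack comes out exactly six times smaller here; the real reduction for the whole model is slightly less.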

This is our experimental setting. As the dataset, we use a Japanese speech recognition corpus, the CSJ corpus. The training set sums up to about five hundred hours, plus a development set and three test sets. For the model training settings, we use eight attention heads with five hundred and twelve hidden units, six blocks for the encoder, and six blocks for the decoder. We use word-piece units as the training unit and forty-dimensional filterbank features for feature extraction. Here are our experimental results.

We compare the shared model and the full model: "shared" means the same parameters are reused across layers, and "full" means each layer has its own parameters. For the full model, we tried different numbers of layers, where N is the number of encoder layers and M is the number of decoder layers. We find that for the full model, the six-encoder, six-decoder structure achieves the best result; making it much deeper brings no significant improvement.

For the shared model, we also observed that six encoders and six decoders achieve the best result. However, there is a performance gap between the shared model and the full model of about one percent absolute. How to minimize this performance gap is the other focus.

As a summary of the first set of experimental results in this paper, our observation is that the shared model with recurrently stacked layers is six times smaller than the original model with independent layers. We can also decode twice as fast as the original decoding speed, using standard autoregressive decoding, and train about ten percent faster on the GPU. Going deeper, i.e., more than six layers, is not beneficial.

Based on these experiments, in the following experiments we only use the six-encoder, six-decoder structure. These results also suggest that the non-linear operations are more important than the number of independent parameters, which is why sharing the parameters across layers works.

Why do we propose speech-attribute augmentation in training? Because for an autoregressive model, the words recognized first condition the words recognized later. Although this makes decoding slow, we can exploit this nature to do something similar to speaker-adaptive training, by introducing the attributes at the beginning of the label sequence.
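The prepending idea can be sketched in a few lines. This is a minimal illustration under my own naming (the tag format is hypothetical, not the paper's), showing how attribute tags become the first tokens the decoder conditions on:

```python
def add_attribute_tags(label_tokens, attributes):
    """Prefix the label sequence with one tag token per attribute group."""
    tags = [f"<{group}:{value}>" for group, value in attributes]
    return tags + label_tokens

# The autoregressive decoder emits the tags before any real word,
# so every later output token is conditioned on them.
tagged = add_attribute_tags(
    ["hello", "world"],
    [("sex", "female"), ("duration", "short")],
)
print(tagged)  # → ['<sex:female>', '<duration:short>', 'hello', 'world']
```

At training time, the tags are part of the target sequence; the tag tokens themselves are simply added to the output vocabulary.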

The definition of the speech attributes includes the speakers' information and also the speech segments' information.

Now we give a formal definition of the speech attributes. First, we use the dialect of the speakers; since we are using a Japanese-language corpus, we use the place where the speaker was born as this attribute. We also use the duration of the utterance; we found this attribute very useful, even though it has nothing to do with the speaker information. The corpus also comes from several different resources, such as academic, simulated, dialogue, read, and miscellaneous. The fourth attribute is the sex of the speakers: female, male, and unregistered. For the age, we group all the speakers into four groups: young, middle-aged, old, and unregistered. As the last group, we use the educational level of the speakers, such as middle school, high school, and higher education, plus unregistered.

With these definitions, we use individual attributes and different combinations of them as tags in the model training.
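Enumerating the attribute-group combinations tried in the experiments is straightforward; here is an illustrative sketch (the group names are my paraphrase of the talk, not identifiers from the paper):

```python
from itertools import combinations

# The six attribute groups described above (names are illustrative).
GROUPS = ["dialect", "duration", "resource", "sex", "age", "education"]

def tag_combinations(groups, k):
    """Enumerate every k-way combination of attribute groups as a tag set."""
    return [list(c) for c in combinations(groups, k)]

pairs = tag_combinations(GROUPS, 2)
print(len(pairs))  # → 15 candidate two-group tag sets
```

Each combination defines one training condition: the tags for all groups in the set are prepended to every label sequence.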

Here are the experiments utilizing the speech attributes as labels. The first line is without any attribute, i.e., the conventional method, which we use as the baseline for comparison. We also tried using the speaker ID as the attribute; the corpus itself has one thousand five hundred or more speakers. When we put the speaker's ID at the beginning of the sentence as a label, we can see it is not effective at all.

We also tried individual groups of tags at the beginning of the label. We find that the sex information is the best, and duration and age are also effective individually. We then tried two-attribute combinations, three-attribute combinations, four-attribute combinations, and even five-attribute combinations; intuitively, attributes that are effective individually should work best when combined together.

More specifically, we find that for two groups, the combination of sex and duration works the best; for three groups, the combination of duration, sex, and age works the best; and the four-group combination works the best of all. With the four-group attribute combination, the performance is comparable to the full-model baseline, i.e., the uncompressed, larger network.

We also find that using these tags, the speech attributes, to train the full model, meaning the uncompressed big model, is not effective, because that model is large enough to learn this information by itself. So the observation from this part of the experiments is that the full model learns the speech attributes by itself, i.e., implicitly, because its size is large enough to learn them on its own, while the shared model needs these explicit signals.

We also find that the combination of duration, sex, and age, and the single sex tag, are the most effective. This information can also be predicted automatically before doing the real ASR recognition at test time.

To conclude: the end-to-end Transformer-based speech recognition model has many layers in the encoder and decoder, which makes the model very large. Conventionally, each layer has independent parameters, and this is what makes the model too large, so we propose recurrently stacked layers: we use the same parameters for all layers in the encoder and in the decoder, individually. We also propose speech attributes as input signals to perform augmentation, to train the recurrently stacked layers explicitly.

In extensive experiments on the CSJ corpus, we saw a model size reduction of up to ninety-three percent and ten percent faster training on the GPU, with no significant worsening of the character error rate in speech recognition when we use the speech attributes. We also observed increased attention entropy when visualizing the feature maps.

As future work, we will maximize the compression with the recurrently stacked layers while introducing only a small degradation. We will also develop more flexible decoding by choosing the number of layers dynamically. We can further use lower precision for model storage and parameters, together with fast attention and softmax techniques, to make the model smaller. Finally, we will investigate the model representations at each layer, to see how much useful information each layer carries.

Thank you so much for listening. This is the end of the presentation, and any questions are welcome.