Thank you very much for watching this video presentation. My name is [inaudible], from [inaudible]. Today I will present our work on feature compensation for short-utterance spoken language identification.
I organize this presentation as follows. First, we introduce the short-utterance language identification task. Then I will describe the neural network based embedding extraction technique, the x-vector extractor, and show how the x-vector is used for the LID task. After that, the feature compensation learning will be introduced. Then our experimental setup and results will be presented, and finally the summary and conclusions.
Language identification techniques are typically used as a pre-processing stage in multilingual speech recognition and translation systems. For real-time speech processing systems, improving the performance on short utterances is an important task, because it helps to reduce the real-time factor and the latency of the overall system.
One of the state-of-the-art methods is the i-vector based method, which is very effective for relatively long utterances. Recently, most researchers have moved to neural network based approaches, because LID is a classification task, and therefore a neural network model can be directly used for classification. Neural network embeddings have shown good performance on the short-utterance LID task. The x-vector was initially proposed for the speaker verification task, and recent studies have also successfully applied it to the LID task.
In this work, we focus on the x-vector based method. The x-vector is a neural network based utterance representation. Note that x-vectors have been applied to many tasks, such as speaker recognition and, more recently, language identification.
The network for extracting x-vectors consists of three modules: a frame-level feature extractor, a statistics pooling layer, and utterance-level representation layers.
The frame-level feature extractor module outputs frame-level representations of the utterance. It operates over a sequence of acoustic features. For this module, a time-delay neural network (TDNN) or a convolutional neural network is commonly used.
Then, the statistics pooling layer converts the variable-length sequence of frame-level features into a fixed-dimensional vector by computing the mean and the standard deviation.
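The pooling step just described can be sketched in plain Python (the `statistics_pooling` name is my own, for illustration; real systems implement this as a batched neural network layer):

```python
import math

def statistics_pooling(frames):
    """Pool a variable-length list of frame-level feature vectors into one
    fixed-dimensional vector: the per-dimension means followed by the
    per-dimension standard deviations."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    return means + stds  # fixed 2*dim output regardless of n

# Utterances of different lengths map to vectors of the same size.
short = statistics_pooling([[1.0, 2.0], [3.0, 4.0]])
long_ = statistics_pooling([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
assert len(short) == len(long_) == 4
```

This is what makes the later utterance-level layers possible: whatever the input duration, the pooled vector has the same size.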
Finally, fully connected layers are used to process the utterance-level representation, and a final softmax layer outputs the class posterior probabilities.
The x-vector extractor is mostly used for the speaker verification task. In the verification task, the extractor works as a front-end that is used to extract utterance-level representations, and a back-end such as PLDA or cosine similarity is used for scoring. For the LID task, the front-end plus back-end approach can also be used; for example, a logistic regression back-end is widely used for the classification. However, because LID is a closed-set classification task, we can also directly use the network outputs for classification.
This work focuses on the short-utterance LID task. When the test utterances become shorter, the performance decreases. This degradation is mainly because the short duration causes a large variation in the representations of short utterances. To reduce this variation, normalization methods that use the corresponding long utterances were investigated for i-vectors, and it is known that they can be applied in the i-vector extractor. Therefore, we think that a similar idea can improve the short-utterance performance of the x-vector network.
The feature compensation is done by reducing the distance between the representation of a short utterance and that of its corresponding long utterance. Here, E_s is the representation of the short utterance and E_l is the representation of the corresponding long utterance in the embedding space, and the compensation loss is the squared distance ||E_s - E_l||^2. This equation can be rewritten as the one shown on the slide.
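The distance term just described can be sketched as follows (`compensation_loss` is an illustrative name, not from the presentation):

```python
def compensation_loss(e_short, e_long):
    """Squared Euclidean distance between the embedding of a short
    utterance and the embedding of its corresponding long utterance."""
    return sum((s - l) ** 2 for s, l in zip(e_short, e_long))

# The embeddings differ only in the second component, by 2.0.
loss = compensation_loss([1.0, 2.0, 3.0], [1.0, 0.0, 3.0])
assert loss == 4.0
```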
For training, we first train the x-vector network using long-duration inputs. Then the short inputs are used to learn the compensation, modeling the mapping function by considering the difference between the long and the short utterance.
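As a toy illustration of this second stage (a single weight vector stands in for the real network, with plain gradient descent on the compensation loss; everything here is a simplification for exposition, not the actual training recipe):

```python
def embed(w, x):
    """Toy 'network': an element-wise linear embedding of input x."""
    return [w[i] * x[i] for i in range(len(x))]

def train_compensation(w, x_short, e_long, lr=0.1, steps=100):
    """Stage two: update the weights so the embedding of the short input
    moves toward the fixed embedding of the long utterance."""
    for _ in range(steps):
        e_short = embed(w, x_short)
        # gradient of sum((e_s - e_l)**2) with respect to each weight
        grad = [2 * (e_short[i] - e_long[i]) * x_short[i]
                for i in range(len(w))]
        w = [w[i] - lr * grad[i] for i in range(len(w))]
    return w

w = train_compensation([0.0, 0.0], [1.0, 2.0], e_long=[0.5, 1.0])
e_final = embed(w, [1.0, 2.0])
# the short-input embedding has converged onto the long-utterance target
assert all(abs(e_final[i] - [0.5, 1.0][i]) < 1e-3 for i in range(2))
```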
Short utterances contain very limited information. Therefore, to improve the performance on short utterances, both high-level language information and local phonetic information are important. We suppose that the mean components of the pooled vector capture the language information, while the standard-deviation components describe the information related to the local phonetic information. Based on this consideration, we propose to normalize only the mean components of the vector toward the representation of the long utterance, and to keep the standard-deviation components, so that the frame-level phonetic information, which provides discriminative features for language identification, is retained.
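Assuming the pooled vector is laid out as the mean statistics followed by the standard-deviation statistics, the mean-only variant can be sketched as (again an illustrative function name):

```python
def mean_only_compensation_loss(e_short, e_long):
    """Apply the compensation term only to the first half of the pooled
    vector (the mean statistics); the standard-deviation half, assumed to
    carry the local phonetic information, is left untouched."""
    half = len(e_short) // 2
    return sum((s - l) ** 2
               for s, l in zip(e_short[:half], e_long[:half]))

e_s = [1.0, 2.0, 0.5, 0.9]   # [mean_1, mean_2, std_1, std_2]
e_l = [1.5, 2.0, 0.2, 0.4]
# only the mean components contribute: (1.0 - 1.5)**2 = 0.25;
# the std difference is ignored by design
assert mean_only_compensation_loss(e_s, e_l) == 0.25
```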
The point of the proposed method is that only the mean statistics are compensated toward the representation of the long utterance, which is obtained by a neural network. In the first proposed method, we use the x-vector network to supply this representation, and in the second proposed method, we use a ResNet with global average pooling to obtain the representation.
We evaluated the proposed method on the NIST Language Recognition Evaluation 2017 set. As the training data, we used the training and development data provided for the evaluation, together with the telephone data. For the test set, we used the closed-set condition of the evaluation data. We also prepared one-second, 1.5-second, and two-second test conditions from this set.
We used sixty-dimensional acoustic features, and the average cost (Cavg) was used as the evaluation metric.
For this analysis, we trained both a ResNet system and an x-vector system. The ResNet system uses residual network blocks, global pooling, and fully connected layers, while the x-vector system uses the TDNN-based frame-level feature extractor described earlier.
For the training examples, the long examples have durations between five and ten seconds, and the short-utterance examples are cut back to two seconds.
This slide shows the results of the baseline systems without compensation. We also break down the results by the duration of the test utterances. As you can see, the x-vector system is more effective on longer utterances, while on short utterances the ResNet system shows better performance. Because of the duration mismatch, the model trained with long-duration samples performs worse on short-duration test data; this confirms the duration mismatch problem for short utterances.
This table shows the results with the feature compensation method. Here, the baseline is the x-vector network trained with the short examples, and the results are shown for the compensation learning and the two proposed methods. For the evaluation, we compare the baseline, the compensation using both the mean and the standard deviation, and the proposed method. From the results, we can see that the compensation using both the mean and the standard deviation could improve the performance, but not on all utterances, and it did not yield the best results. On the other hand, the compensation using the mean only significantly improved the performance.
To conclude: in this work, we investigated an improvement of the neural network based embedding technique, the x-vector, for the short-utterance LID task. We compared the compensation using both the mean and the standard deviation with the proposed compensation using the mean only. The mean-only compensation is expected to capture the high-level language information while retaining the standard-deviation components, because they describe the frame-level phonetic information. The results show that the proposed method is more effective for the short-utterance LID task.
Thank you for your attention.