So first, thank you very much to the Odyssey conference for giving us the chance to present our language recognition system. My name is Raymond, and I am from the University of Sheffield and the Chinese University of Hong Kong.
As most of you will know, language recognition is a pretty fundamental task for this community. The aim of the paper, and of the talk today, is basically to go through the key points of our component systems, as well as the system fusion and the calibration.
A bit of background: language recognition is about recognising the language spoken in a speech segment. If we go through the classical literature on language recognition, we can see researchers working with acoustic or phonotactic features. Then there are shifted delta cepstral features, which take a longer temporal span of the signal and help language recognition. More recently, i-vectors, DNNs, and combinations of all of these methods have proved useful in language recognition.
For the NIST Language Recognition Evaluation last year, we submitted a combination of three systems. The first one is a standard i-vector system, the second is a phonotactic system, and the third is a frame-based DNN system. After the evaluation, we got a little bit of further improvement by combining bottleneck features with i-vectors; we will go through the details of that later.
This is just a brief recap of the training data and the target languages. We have the Switchboard data as telephone speech training data, and also some multilingual LRE training data from past evaluations; together these form our training sets.
There are twenty target languages in the language recognition evaluation, divided into six language clusters, and the task is to identify languages within clusters of closely related languages.
The training data for language recognition comes as a raw set of files, about seven to eight hundred hours. To start the training, we run voice activity detection. To train our voice activity detector, we use the transcriptions that come with the Switchboard speech: we take our Switchboard phone tokenizer, run a forced alignment on the data, and then simply treat the silence labels as non-speech and the non-silence labels as speech. We also take some of the training data from Voice of America broadcast speech to train the voice activity detector for that channel; for this data, we just take the raw speech/non-speech labels.
The table shows the amount of voiced and unvoiced speech in the different corpora. For the VAD, we train a two-layer DNN. This is a standard DNN trained on thirty-dimensional filter-bank features, with feature splicing of fifteen frames to the left and to the right. The output of the DNN is two neurons, which give the voiced and unvoiced posterior probabilities. We then run sequence smoothing using a two-state HMM, enforcing a minimum duration of twenty frames for both voiced and unvoiced segments. On top of that, we have a heuristic to bridge non-speech gaps shorter than two seconds.
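As a rough illustration of that last heuristic, here is a minimal sketch in Python, assuming binary frame labels at one hundred frames per second; the function name and the frame rate are my own assumptions, not part of the actual system:

```python
import numpy as np

def bridge_short_gaps(frame_labels, frame_rate=100, max_gap_sec=2.0):
    """Relabel non-speech gaps shorter than max_gap_sec as speech."""
    labels = np.asarray(frame_labels, dtype=int).copy()
    max_gap = int(max_gap_sec * frame_rate)
    padded = np.concatenate(([1], labels, [1]))  # pad so edge gaps are also seen
    edges = np.flatnonzero(np.diff(padded))      # alternating gap starts / ends
    for start, end in zip(edges[::2], edges[1::2]):
        if end - start < max_gap:                # gap shorter than two seconds
            labels[start:end] = 1                # bridge it with "speech"
    return labels

# 0.5 s of speech, a 1 s gap, 0.5 s of speech -> the gap gets bridged
vad = np.array([1] * 50 + [0] * 100 + [1] * 50)
smoothed = bridge_short_gaps(vad)
```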
For the results: on the Switchboard test data, we get miss and false alarm rates of around two percent. For the VOA broadcast data, the error rates are much higher. We did an aural inspection of that data, and we believe it comes down to inaccuracies in the reference labels. So we froze this first system and continued building our language recognition system.
We defined and refined training sets over the course of system development. These are the two corpora we use: v1 and v3. The v1 data is an early version of the training data: we take the VAD results directly, extract whole segments whose duration lies between twenty and forty-five seconds, and train specifically for the thirty-second condition. So in the development, from the very beginning, we divided the test and training data into three-second, ten-second and thirty-second durations; we are not sure whether that was the right choice, and I will come back to it.
For the v3 data, we ran a different tokenizer over the whole training set again. For that, we reused the v1 segmentation first, so that we had shorter segments for decoding, to speed up the decoding process in the first round. Then we ran re-segmentation with different silence thresholds, and derived three training sets matching the evaluation conditions of thirty seconds, ten seconds and three seconds. These are nested sets with a little bit of overlap.
For the data partitions: for each of the sets, we use eighty percent of the data for training and ten percent for development, and the internal test results I report in the early parts of the experiments are on the remaining ten percent internal test set.
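To illustrate the v1 selection criterion and the 80/10/10 partition just described, here is a small sketch; the segment representation and the function name are hypothetical, and a real split would also be stratified by language:

```python
import random

def filter_and_split(segments, min_dur=20.0, max_dur=45.0, seed=0):
    """Keep segments whose duration is 20-45 s, then split 80/10/10."""
    kept = [s for s in segments if min_dur <= s["duration"] <= max_dur]
    random.Random(seed).shuffle(kept)
    n_train = int(0.8 * len(kept))
    n_dev = int(0.1 * len(kept))
    train = kept[:n_train]
    dev = kept[n_train:n_train + n_dev]
    internal_test = kept[n_train + n_dev:]
    return train, dev, internal_test

segs = [{"id": "utt1", "duration": 31.2}, {"id": "utt2", "duration": 7.5}]
train, dev, test = filter_and_split(segs)  # utt2 is dropped as too short
```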
This is the system diagram of our language recognition system. On the left you can see the i-vector system, and then there is the phonotactic system. The phonotactic system also generates the bottleneck features that feed into the DNN system, which is the frame-based language recognition system.
The i-vector system follows the standard Kaldi recipe: mean normalization of the features, shifted delta cepstra, and frame-based VAD to start with. We trained a 2048-component UBM and the total variability matrix to extract 600-dimensional i-vectors. We tried two language classifiers: a support vector machine and logistic regression. The focus of the study here is to compare the use of different datasets in the training of the UBM, the total variability matrix and the language classifier, and also to compare global and cluster-dependent classifiers. By a global classifier, I mean a classifier which classifies all twenty languages in one go.
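To make the two back-ends concrete, here is a minimal sketch using scikit-learn on pre-extracted i-vectors; the random arrays are placeholders for Kaldi-extracted 600-dimensional i-vectors, and the UBM and total variability training are not shown:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Random placeholders stand in for Kaldi-extracted 600-dim i-vectors
# (one row per segment) and the 20 target-language labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 600))
y = rng.integers(0, 20, size=1000)

svm = LinearSVC().fit(X, y)                       # SVM back-end
lr = LogisticRegression(max_iter=1000).fit(X, y)  # logistic regression back-end

# LR yields per-language posteriors, convenient for later calibration/fusion.
posteriors = lr.predict_proba(X)
```

A cluster-dependent variant would simply train one such classifier per language cluster, restricted to the languages inside that cluster.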
We have four configurations here. From condition A to condition B, we increase the amount of data for UBM and total variability matrix training. From B to C, we replace the SVM with a logistic regression classifier, and from C to D, we further increase the amount of training data for the logistic regression classifier. The bars on the right show the minimum average cost for the different configurations of the i-vector system; the results are reported on the internal test v1 data, which has thirty-second duration.
When we look at the two red bars in the middle, we can see the comparison between using a smaller and a larger amount of training data for the UBM; more data gives some improvement. We also see a difference between having a global classifier and within-cluster classifiers. We did not manage to try all the combinations listed here, simply because of time constraints. From this set of experiments, what we conclude is that we should use the full set of raw training data, segmented, for the training of the UBM and the total variability matrix, and also that within-cluster classifiers outperform the global classifiers.
As our training progressed, we moved to the v3 data. We reached similar conclusions to those I just mentioned, and we then tried different amounts of training data for the logistic regression classifier, shown as the three bars here. The left bar uses a small amount of training data, only one hundred hours; the middle one uses three hundred hours; and for the third one we use the raw set of data, which comprises about eight hundred hours. So this shows a trade-off between using more data and whether the data are well structured and segmented or not. We ended up using three hundred hours of segmented data to train the logistic regression classifier. The two red bars on the far left and right compare the use of the SVM with the use of logistic regression in language recognition; again, they show the improvement from using the logistic regression classifier.
That brings us to our second system, the phonotactic language recognition system. There are two components in the phonotactic system: first the phone tokenizer, and second the language classifier. The phone tokenizer is based on the standard Kaldi setup: we have LDA, MLLT and fMLLR speaker adaptation, followed by a DNN with six layers, each layer containing around two thousand neurons. We used a phone bigram language model with a very low grammar scale factor of 0.5; we also tried a higher scale factor of two, but the lower value gave better results on our internal test sets. Optionally, we also tried sequence training on the Switchboard training data, but bear in mind this is English training data, so we were not sure whether discriminative training would over-train the network and hurt the results.
For the language classifier, we designed SVM classifiers trained on the TF-IDF statistics of the phone n-grams, where we tried both bigrams and trigrams. The reason we backed off to bigrams is that we trained on position-dependent phones and ended up with roughly five million dimensions of trigram statistics, so we were wary of sparsity issues.
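The backed-off bigram statistics can be sketched as follows; the phone strings and labels here are toy stand-ins for the position-dependent phone sequences produced by the Kaldi tokenizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Each training utterance is one decoded phone string (toy examples here).
phone_strings = ["sil ah n t iy sil", "sil o la k e sil"]
labels = ["english", "spanish"]

# token_pattern=r"\S+" keeps whole phone symbols as tokens;
# ngram_range=(2, 2) builds bigram counts, weighted by TF-IDF.
vectorizer = TfidfVectorizer(token_pattern=r"\S+", ngram_range=(2, 2))
X = vectorizer.fit_transform(phone_strings)  # sparse bigram TF-IDF matrix
clf = LinearSVC().fit(X, labels)
```

Switching ngram_range to (3, 3) gives the trigram variant, which is exactly where the dimensionality blows up with position-dependent phones.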
This is the performance on the internal test sets with the different setups. As expected, the trigram setup gives better performance in terms of the minimum average cost. This holds for the thirty-second data, but you will see in a while that it breaks down when it comes to very short duration segments. The purple bars are the results with the discriminatively trained DNN phone tokenizers; again, they show that the DNN is over-trained here, and it gives a higher average cost.
The third system is the frame-based DNN system for language recognition. We take 64-dimensional bottleneck features from the Switchboard tokenizer, and the features are spliced with four frames on the left and four frames on the right. The DNN is a four-layer DNN with seven hundred neurons per layer. We apply prior normalization, where we multiply the output probability by the inverse of the language prior, and the decision of the language recognition system comes from averaging the frame-based language posterior probabilities.
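A minimal sketch of that decision rule follows; combining the prior-normalized frame posteriors by a plain arithmetic mean is my assumption about the averaging domain:

```python
import numpy as np

def segment_score(frame_posteriors, language_priors):
    """frame_posteriors: (n_frames, n_languages) DNN outputs per frame.
    language_priors: (n_languages,) class priors from the training data."""
    normalized = frame_posteriors / language_priors  # divide out the prior
    return normalized.mean(axis=0)                   # average over the frames

posts = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
recognized = int(np.argmax(segment_score(posts, priors)))
```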
This is a summary of the frame-based language recognition system on the different test sets. We observed two trends: first, quite obviously, when the duration is shorter, the average cost is higher; second, the error of this system is generally higher than that of the phonotactic and i-vector systems, but it becomes more robust in the very short duration condition.
After the evaluation, we built an enhanced system, which we call the bottleneck i-vector system; it is also a fairly basic system. We take the bottleneck features from the Switchboard tokenizer, replace the MFCCs in the i-vector system with these bottleneck features, and build another system for language recognition. A bit of detail: we take the 64-dimensional bottleneck features; there is no VTLN and no normalization or shifted delta cepstra, but there is frame-based VAD.
This is a side-by-side comparison between the i-vector system and the bottleneck system, where the MFCC features are replaced by the bottleneck features. We see roughly a fifteen to twenty-five percent relative improvement from the bottleneck features.
For system calibration and fusion, we train target-language-dependent Gaussian back-ends, where each Gaussian mixture has sixteen components; these are trained on the thirty-second training data. Then, for system fusion, we run logistic regression, which comprises the log-likelihood ratio conversion and the system combination.
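A rough sketch of that fusion stage, under stated assumptions: each system's per-language scores on the development data are stacked, and a multi-class logistic regression learns the combination. The random arrays are placeholders, and the Gaussian back-end scoring that would precede this step is not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-system score matrices on a development set: (n_segments, n_languages).
# Random placeholders stand in for the back-end scores of the three systems.
rng = np.random.default_rng(0)
n_seg, n_lang = 500, 20
system_scores = [rng.standard_normal((n_seg, n_lang)) for _ in range(3)]
y = rng.integers(0, n_lang, size=n_seg)

stacked = np.hstack(system_scores)                 # concatenate the systems
fusion = LogisticRegression(max_iter=1000).fit(stacked, y)

# Calibrated log posteriors; subtracting the log prior would give
# log-likelihood ratios for the detection decision.
log_posts = fusion.predict_log_proba(stacked)
```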
We applied this separately to the three systems: the i-vector system, the DNN system and the phonotactic system. We found that the Gaussian back-end did not work for the i-vector system, so we did not use it there in the final evaluation. For the DNN and phonotactic systems, the technique gives a significant improvement.
This is the fusion result on our internal test set. For the thirty-second data, the i-vector system gives the best results among the three submission systems, and the DNN and phonotactic systems have roughly the same performance. System fusion gives some performance improvement, actually quite a noticeable improvement on our internal test sets. The bottleneck system did not give better results on its own, but when we incorporate all four systems, we obtain our best results.
When it comes down to three seconds, as I said, the phonotactic system behaves much worse. That may be because of the sparsity issue in our particular setup of the n-gram statistics. When we compare the i-vector system and the bottleneck system, we see a significant improvement for the bottleneck system, and a further improvement from fusion.
Here we show the results on the formal evaluation dataset. The i-vector system, the phonotactic system and the DNN system perform roughly as expected. The bottleneck system again gives more than ten percent relative improvement on top of the i-vector system, and system fusion gives a marginal improvement on top of the best single system.
Finally, I am going to show a bit about pairwise system combination, to see how each system contributes to the component systems in our language recognition setup. You now see clusters of bars; for each cluster, the very left bar is a single system. We then fuse this single system, for example the i-vector system, with one of the other systems; the order is that we take the worst system to fuse with first, then the second worst, and so on.
The interesting thing here is that, in general, apart from fusion with the DNN system, which is the worst single system, pairwise fusion works in every case. You could argue that we may be in a different operating region of the error curve, and that may be why it ceases to work there. Another interesting observation is that the performance of the fused system is basically in proportion to the performance of the single systems, which means that when we fuse the better systems, we get the better results.
As a summary, we introduced the three language recognition component systems submitted to the NIST LRE 2015, with a description of the segmentation, data selection and classifier training we did, and then the enhanced bottleneck i-vector system, which demonstrated a performance improvement. For future work, we want to work a bit more on data selection and augmentation. We are also interested in multilingual neural networks, adaptation of them, and maybe some unsupervised training as well, to improve the bottleneck features, and also in some variability compensation to deal with the huge mismatch between the training and development datasets and the evaluation dataset.
Any suggestions or maybe collaborations are all welcome, and thank you very much for your attention. Do we have any questions?
Thanks. When you were talking about the language clusters, the clusters are defined according to linguists, yes? In our own small experiments, we compared the linguistically defined clusters with clusters derived from features of the data, and we found the data-driven clusters to be comparable to the clusters made by linguists. Did you try clustering the languages yourselves?
Yes, I think that is a scientific and interesting question. We followed the language clusters basically by a narrow definition, simply following what the NIST language recognition evaluation told us to do. And you are absolutely right: there are some cases where, in training, what you learn just becomes a distinction between dialects or other unwanted factors which are not directly related to the language clusters at all. So yes, definitely this is something we want to look at, particularly for some dialects we are interested in; for example, for Chinese data we are interested and we want to do more.
Any other questions?
One quick question. In NIST LRE, most teams typically use sixty percent of the data for training, maybe going to seventy percent to use a little bit more; you went to eighty percent. So my question is: once you had done your development and actually submitted the final results, did you retrain with all the data, or did you just stick with the original system trained on eighty percent?
We submitted the original system trained on eighty percent, and we now have doubts about whether that was the right choice. We also lost a little bit of data because, even at the very early stage, we divided the data into three-second, ten-second and thirty-second sets, which again reduced the training set size. In retrospect, that may not have been the best decision, and we could have tried using the full set. So any suggestions on how to handle the data are welcome; I think we need to work a bit more on the data segmentation and selection.
Are there any other questions? Okay, let's thank the speaker again.