So first, thank you very much to the Odyssey conference for giving us the chance to present our language recognition system. My name is Raymond, and I am from the University of Sheffield and the Chinese University of Hong Kong.
As most of you will know, language recognition is a pretty fundamental task for this community. The aim of the paper, and of the talk today, is basically to go through the key points of our component systems, as well as the system fusion and the calibration.
A bit of background: language recognition is about recognising the language spoken in a speech segment. If we go through the classical literature on language recognition, we can see researchers working with acoustic or phonotactic features. Then there are shifted delta cepstral features, which take a longer temporal span of the signal and help language recognition. More recently, i-vectors, DNNs, and combinations of all of these methods have proved useful in language recognition.
For the NIST Language Recognition Evaluation last year, we submitted a combination of three systems. The first one is a standard i-vector system, the second is a phonotactic system, and the third is a frame-based DNN system. After the evaluation, we got a little bit of further improvement by combining bottleneck features with i-vectors; we will go through the details of that later.
This is just a brief recap of the training data and the target languages. We have the Switchboard data as telephone speech training data, and also some multilingual LRE training data from past evaluations; together these form our training sets.
There are twenty target languages in the language recognition evaluation, divided into six language clusters, and the task is to identify languages within clusters of closely related languages.
The training data for language recognition comes as a raw set of files, about seven to eight hundred hours. To start the training, we run voice activity detection. To train our voice activity detector, we use the transcriptions that come with the Switchboard speech: we take our Switchboard phone tokenizer, run a forced alignment on the data, and then simply treat the silence labels as non-speech and the non-silence labels as speech. We also take some of the training data from Voice of America broadcast speech to train the voice activity detector for that channel; for this data, we just take the raw speech/non-speech labels.
The table shows the amount of voiced and unvoiced speech in the different corpora. For the VAD, we train a two-layer DNN. This is a standard DNN trained on thirty-dimensional filter-bank features, with feature splicing of fifteen frames to the left and to the right. The output of the DNN is two neurons, which give the voiced and unvoiced posterior probabilities. We then run sequence smoothing using a two-state HMM, enforcing a minimum duration of twenty frames for both voiced and unvoiced segments. On top of that, we have a heuristic to bridge non-speech gaps shorter than two seconds.
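As a rough illustration of that last heuristic, here is a minimal sketch in Python, assuming binary frame labels at one hundred frames per second; the function name and the frame rate are my own assumptions, not part of the actual system:

```python
import numpy as np

def bridge_short_gaps(frame_labels, frame_rate=100, max_gap_sec=2.0):
    """Relabel non-speech gaps shorter than max_gap_sec as speech."""
    labels = np.asarray(frame_labels, dtype=int).copy()
    max_gap = int(max_gap_sec * frame_rate)
    padded = np.concatenate(([1], labels, [1]))  # pad so edge gaps are also seen
    edges = np.flatnonzero(np.diff(padded))      # alternating gap starts / ends
    for start, end in zip(edges[::2], edges[1::2]):
        if end - start < max_gap:                # gap shorter than two seconds
            labels[start:end] = 1                # bridge it with "speech"
    return labels

# 0.5 s of speech, a 1 s gap, 0.5 s of speech -> the gap gets bridged
vad = np.array([1] * 50 + [0] * 100 + [1] * 50)
smoothed = bridge_short_gaps(vad)
```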
For the results: on the Switchboard test data, we get miss and false alarm rates of around two percent. For the VOA broadcast data, the error rates are much higher. We did an aural inspection of that data, and we believe it comes down to inaccuracies in the reference labels. So we froze this first system and continued building our language recognition system.
We defined and refined training sets over the course of system development. These are the two corpora we use: v1 and v3. The v1 data is an early version of the training data: we take the VAD results directly, extract whole segments whose duration lies between twenty and forty-five seconds, and train specifically for the thirty-second condition. So in the development, from the very beginning, we divided the test and training data into three-second, ten-second and thirty-second durations; we are not sure whether that was the right choice, and I will come back to it.
For the v3 data, we ran a different tokenizer over the whole training set again. For that, we reused the v1 segmentation first, so that we had shorter segments for decoding, to speed up the decoding process in the first round. Then we ran re-segmentation with different silence thresholds, and derived three training sets matching the evaluation conditions of thirty seconds, ten seconds and three seconds. These are nested sets with a little bit of overlap.
For the data partitions: for each of the sets, we use eighty percent of the data for training and ten percent for development, and the internal test results I report in the early parts of the experiments are on the remaining ten percent internal test set.
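To illustrate the v1 selection criterion and the 80/10/10 partition just described, here is a small sketch; the segment representation and the function name are hypothetical, and a real split would also be stratified by language:

```python
import random

def filter_and_split(segments, min_dur=20.0, max_dur=45.0, seed=0):
    """Keep segments whose duration is 20-45 s, then split 80/10/10."""
    kept = [s for s in segments if min_dur <= s["duration"] <= max_dur]
    random.Random(seed).shuffle(kept)
    n_train = int(0.8 * len(kept))
    n_dev = int(0.1 * len(kept))
    train = kept[:n_train]
    dev = kept[n_train:n_train + n_dev]
    internal_test = kept[n_train + n_dev:]
    return train, dev, internal_test

segs = [{"id": "utt1", "duration": 31.2}, {"id": "utt2", "duration": 7.5}]
train, dev, test = filter_and_split(segs)  # utt2 is dropped as too short
```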
This is the system diagram of our language recognition system. On the left you can see the i-vector system, and then there is the phonotactic system. The phonotactic system also generates the bottleneck features that feed into the DNN system, which is the frame-based language recognition system.
The i-vector system follows the standard Kaldi recipe: mean normalization of the features, shifted delta cepstra, and frame-based VAD to start with. We trained a 2048-component UBM and the total variability matrix to extract 600-dimensional i-vectors. We tried two language classifiers: a support vector machine and logistic regression. The focus of the study here is to compare the use of different datasets in the training of the UBM, the total variability matrix and the language classifier, and also to compare global and cluster-dependent classifiers. By a global classifier, I mean a classifier which classifies all twenty languages in one go.
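To make the two back-ends concrete, here is a minimal sketch using scikit-learn on pre-extracted i-vectors; the random arrays are placeholders for Kaldi-extracted 600-dimensional i-vectors, and the UBM and total variability training are not shown:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Random placeholders stand in for Kaldi-extracted 600-dim i-vectors
# (one row per segment) and the 20 target-language labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 600))
y = rng.integers(0, 20, size=1000)

svm = LinearSVC().fit(X, y)                       # SVM back-end
lr = LogisticRegression(max_iter=1000).fit(X, y)  # logistic regression back-end

# LR yields per-language posteriors, convenient for later calibration/fusion.
posteriors = lr.predict_proba(X)
```

A cluster-dependent variant would simply train one such classifier per language cluster, restricted to the languages inside that cluster.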
We have four configurations here. From condition A to condition B, we increase the amount of data for UBM and total variability matrix training. From B to C, we replace the SVM with a logistic regression classifier, and from C to D, we further increase the amount of training data for the logistic regression classifier. The bars on the right show the minimum average cost for the different configurations of the i-vector system; the results are reported on the internal test v1 data, which has thirty-second duration.
When we look at the two red bars in the middle, we can see the comparison between using a smaller and a larger amount of training data for the UBM; more data gives some improvement. We also see a difference between having a global classifier and within-cluster classifiers. We did not manage to try all the combinations listed here, simply because of time constraints. From this set of experiments, what we conclude is that we should use the full set of raw training data, segmented, for the training of the UBM and the total variability matrix, and also that within-cluster classifiers outperform the global classifiers.
As our training progressed, we moved to the v3 data. We reached similar conclusions to those I just mentioned, and we then tried different amounts of training data for the logistic regression classifier, shown as the three bars here. The left bar uses a small amount of training data, only one hundred hours; the middle one uses three hundred hours; and for the third one we use the raw set of data, which comprises about eight hundred hours. So this shows a trade-off between using more data and whether the data are well structured and segmented or not. We ended up using three hundred hours of segmented data to train the logistic regression classifier. The two red bars on the far left and right compare the use of the SVM with the use of logistic regression in language recognition; again, they show the improvement from using the logistic regression classifier.
That brings us to our second system, the phonotactic language recognition system. There are two components in the phonotactic system: first the phone tokenizer, and second the language classifier. The phone tokenizer is based on the standard Kaldi setup: we have LDA, MLLT and fMLLR speaker adaptation, followed by a DNN with six layers, each layer containing around two thousand neurons. We used a phone bigram language model with a very low grammar scale factor of 0.5; we also tried a higher scale factor of two, but the lower value gave better results on our internal test sets. Optionally, we also tried sequence training on the Switchboard training data, but bear in mind this is English training data, so we were not sure whether discriminative training would over-train the network and hurt the results.
For the language classifier, we designed SVM classifiers trained on the TF-IDF statistics of the phone n-grams, where we tried both bigrams and trigrams. The reason we backed off to bigrams is that we trained on position-dependent phones and ended up with roughly five million dimensions of trigram statistics, so we were wary of sparsity issues.
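The backed-off bigram statistics can be sketched as follows; the phone strings and labels here are toy stand-ins for the position-dependent phone sequences produced by the Kaldi tokenizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Each training utterance is one decoded phone string (toy examples here).
phone_strings = ["sil ah n t iy sil", "sil o la k e sil"]
labels = ["english", "spanish"]

# token_pattern=r"\S+" keeps whole phone symbols as tokens;
# ngram_range=(2, 2) builds bigram counts, weighted by TF-IDF.
vectorizer = TfidfVectorizer(token_pattern=r"\S+", ngram_range=(2, 2))
X = vectorizer.fit_transform(phone_strings)  # sparse bigram TF-IDF matrix
clf = LinearSVC().fit(X, labels)
```

Switching ngram_range to (3, 3) gives the trigram variant, which is exactly where the dimensionality blows up with position-dependent phones.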
This is the performance on the internal test sets with the different setups. As expected, the trigram setup gives better performance in terms of the minimum average cost. This holds for the thirty-second data, but you will see in a while that it breaks down when it comes to very short duration segments. The purple bars are the results with the discriminatively trained DNN phone tokenizers; again, they show that the DNN is over-trained here, and it gives a higher average cost.
The third system is the frame-based DNN system for language recognition. We take 64-dimensional bottleneck features from the Switchboard tokenizer, and the features are spliced with four frames on the left and four frames on the right. The DNN is a four-layer DNN with seven hundred neurons per layer. We apply prior normalization, where we multiply the output probability by the inverse of the language prior, and the decision of the language recognition system comes from averaging the frame-based language posterior probabilities.
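A minimal sketch of that decision rule follows; combining the prior-normalized frame posteriors by a plain arithmetic mean is my assumption about the averaging domain:

```python
import numpy as np

def segment_score(frame_posteriors, language_priors):
    """frame_posteriors: (n_frames, n_languages) DNN outputs per frame.
    language_priors: (n_languages,) class priors from the training data."""
    normalized = frame_posteriors / language_priors  # divide out the prior
    return normalized.mean(axis=0)                   # average over the frames

posts = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
recognized = int(np.argmax(segment_score(posts, priors)))
```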
This is a summary of the frame-based language recognition system on the different test sets. We observed two trends: first, quite obviously, when the duration is shorter, the average cost is higher; second, the error of this system is generally higher than that of the phonotactic and i-vector systems, but it becomes more robust in the very short duration condition.
After the evaluation, we built an enhanced system, which we call the bottleneck i-vector system; it is also a fairly basic system. We take the bottleneck features from the Switchboard tokenizer, replace the MFCCs in the i-vector system with these bottleneck features, and build another system for language recognition. A bit of detail: we take the 64-dimensional bottleneck features; there is no VTLN and no normalization or shifted delta cepstra, but there is frame-based VAD.
This is a side-by-side comparison between the i-vector system and the bottleneck system, where the MFCC features are replaced by the bottleneck features. We see roughly a fifteen to twenty-five percent relative improvement from the bottleneck features.
For system calibration and fusion, we train target-language-dependent Gaussian back-ends, where each Gaussian mixture has sixteen components; these are trained on the thirty-second training data. Then, for system fusion, we run logistic regression, which comprises the log-likelihood ratio conversion and the system combination.
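A rough sketch of that fusion stage, under stated assumptions: each system's per-language scores on the development data are stacked, and a multi-class logistic regression learns the combination. The random arrays are placeholders, and the Gaussian back-end scoring that would precede this step is not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-system score matrices on a development set: (n_segments, n_languages).
# Random placeholders stand in for the back-end scores of the three systems.
rng = np.random.default_rng(0)
n_seg, n_lang = 500, 20
system_scores = [rng.standard_normal((n_seg, n_lang)) for _ in range(3)]
y = rng.integers(0, n_lang, size=n_seg)

stacked = np.hstack(system_scores)                 # concatenate the systems
fusion = LogisticRegression(max_iter=1000).fit(stacked, y)

# Calibrated log posteriors; subtracting the log prior would give
# log-likelihood ratios for the detection decision.
log_posts = fusion.predict_log_proba(stacked)
```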
We applied this separately to the three systems: the i-vector system, the DNN system and the phonotactic system. We found that the Gaussian back-end did not work for the i-vector system, so we did not use it there in the final evaluation. For the DNN and phonotactic systems, the technique gives a significant improvement.
This is the fusion result on our internal test set. For the thirty-second data, the i-vector system gives the best results among the three submission systems, and the DNN and phonotactic systems have roughly the same performance. System fusion gives some performance improvement, actually quite a noticeable improvement on our internal test sets. The bottleneck system did not give better results on its own, but when we incorporate all four systems, we obtain our best results.
When it comes down to three seconds, as I said, the phonotactic system behaves much worse. That may be because of the sparsity issue in our particular setup of the n-gram statistics. When we compare the i-vector system and the bottleneck system, we see a significant improvement for the bottleneck system, and a further improvement from fusion.
Here we show the results on the formal evaluation dataset. The i-vector system, the phonotactic system and the DNN system perform roughly as expected. The bottleneck system again gives more than ten percent relative improvement on top of the i-vector system, and system fusion gives a marginal improvement on top of the best single system.
Finally, I am going to show a bit about pairwise system combination, to see how each system contributes to the component systems in our language recognition setup. You now see clusters of bars; for each cluster, the very left bar is a single system. We then fuse this single system, for example the i-vector system, with one of the other systems; the order is that we take the worst system to fuse with first, then the second worst, and so on.
The interesting thing here is that, in general, apart from fusion with the DNN system, which is the worst single system, pairwise fusion works in every case. You could argue that we may be in a different operating region of the error curve, and that may be why it ceases to work there. Another interesting observation is that the performance of the fused system is basically in proportion to the performance of the single systems, which means that when we fuse the better systems, we get the better results.
As a summary, we introduced the three language recognition component systems submitted to the NIST LRE 2015, with a description of the segmentation, data selection and classifier training we did, and then the enhanced bottleneck i-vector system, which demonstrated a performance improvement. For future work, we want to work a bit more on data selection and augmentation. We are also interested in multilingual neural networks, adaptation of them, and maybe some unsupervised training as well, to improve the bottleneck features, and also in some variability compensation to deal with the huge mismatch between the training and development datasets and the evaluation dataset.
Any suggestions or maybe collaborations are all welcome, and thank you very much for your attention. Do we have any questions?
Thanks. When you were talking about the language clusters, the clusters are defined according to linguists, yes? In our own small experiments, we compared the linguistically defined clusters with clusters derived from features of the data, and we found the data-driven clusters to be comparable to the clusters made by linguists. Did you try clustering the languages yourselves?
Yes, I think that is a scientific and interesting question. We followed the language clusters basically by a narrow definition, simply following what the NIST language recognition evaluation told us to do. And you are absolutely right: there are some cases where, in training, what you learn just becomes a distinction between dialects or other unwanted factors which are not directly related to the language clusters at all. So yes, definitely this is something we want to look at, particularly for some dialects we are interested in; for example, for Chinese data we are interested and we want to do more.
Any other questions?
One quick question. In NIST LRE, most teams typically use sixty percent of the data for training, maybe going to seventy percent to use a little bit more; you went to eighty percent. So my question is: once you had done your development and actually submitted the final results, did you retrain with all the data, or did you just stick with the original system trained on eighty percent?
We submitted the original system trained on eighty percent, and we now have doubts about whether that was the right choice. We also lost a little bit of data because, even at the very early stage, we divided the data into three-second, ten-second and thirty-second sets, which again reduced the training set size. In retrospect, that may not have been the best decision, and we could have tried using the full set. So any suggestions on how to handle the data are welcome; I think we need to work a bit more on the data segmentation and selection.
Are there any other questions? Okay, let's thank the speaker again.