okay thank you very much, so the 276 system was a collaborative work between the brno university of technology and partner sites
so let's start, as was already introduced we had twenty-four languages to deal with, and we had a new metric; the list of the languages complicated things somewhat
so the new metric covers the 276 language pairs, and that's where the name of the system comes from: we had to select the twenty-four worst pairs in terms of minDCF and then compute the average, as we'll see
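The metric can be sketched as follows; this is a toy illustration of my own, where the priors and costs in `min_dcf` are placeholders rather than the official evaluation parameters:

```python
import numpy as np

def min_dcf(tgt, non, p_tgt=0.5, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all score thresholds for one pair."""
    scores = np.concatenate([tgt, non])
    labels = np.concatenate([np.ones(len(tgt)), np.zeros(len(non))])
    labels = labels[np.argsort(scores)]
    # sweep the threshold from below the lowest score upward:
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / len(tgt)])
    p_fa = np.concatenate([[1.0], 1 - np.cumsum(1 - labels) / len(non)])
    dcf = p_tgt * c_miss * p_miss + (1 - p_tgt) * c_fa * p_fa
    return dcf.min()

def avg_worst_pairs(pair_scores, n_worst=24):
    """Average minDCF over the n_worst hardest of the 276 language pairs.

    pair_scores: dict mapping a language pair to its (target, non-target)
    score arrays for that pair's detection task.
    """
    costs = sorted((min_dcf(t, n) for t, n in pair_scores.values()),
                   reverse=True)
    return float(np.mean(costs[:n_worst]))
```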
in order to be able to deal with those languages we had to collect data, so this is basically the list of data that we used: from the past evaluations there are some CallFriend, Fisher, and Mixer data from the SRE evaluations, previous LRE evaluations, OGI data of foreign-accented English, some speech data for the widely used European languages, Switchboard, and some broadcast data from the Voice of America and Radio Free Europe, and then some Iraqi Arabic conversational speech and Arabic broadcast speech
as was shown earlier, there were some languages for which we didn't have enough data, so what we did is that we added additional radio data from public sources: we used Radio Free Europe, Radio Free Asia, some Czech broadcasts, and others
the list of languages we covered this way included Czech, Farsi, Lao and Panjabi, Levantine and Maghrebi Arabic, Mandarin, and I guess there were a couple more
so what we did is phone-call detection: we detected the parts of the broadcasts where telephone conversations were present
and for each language we ran automatic speaker labeling, because when building the train and test sets we didn't want the speakers to overlap
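A speaker-disjoint split of this kind can be sketched like so; this is a minimal illustration of my own, where the speaker labels stand in for the output of the automatic speaker labeling:

```python
import random

def speaker_disjoint_split(utterances, train_frac=0.7, seed=0):
    """Split utterances so no speaker appears in both train and test.

    utterances: list of (utterance_id, speaker_label) pairs, where the
    speaker labels come from automatic speaker labeling/clustering.
    """
    speakers = sorted({spk for _, spk in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    cut = int(train_frac * len(speakers))
    train_spk = set(speakers[:cut])
    # assign whole speakers, never individual utterances, to a side
    train = [u for u, s in utterances if s in train_spk]
    test = [u for u, s in utterances if s not in train_spk]
    return train, test
```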
this is the scheme for our development set, so we used the LRE11 development data
we made two sets actually: the first one was the trusted data, which was based on the NIST thirty-second cut definition; we again ran automatic speaker labeling and split it into non-overlapping training and test parts
and then we took the entire conversations as well as the thirty-second excerpts, so a conversation side was represented by thirty-second segments, and all splits from one conversation side went to either the train or the test set, never both; again there was automatic speaker labeling
of course this gave us more data, but it was less reliable; we'll see in our contrastive system that doing this, having all the data at our disposal, helped a little bit
so to give a little bit of statistics on our dataset: we had the train set of about sixty-six thousand segments, which was based on all kinds of sources; the test set, which was about thirty-eight thousand segments, was based basically on the previous LRE evaluations; and then the dev set, which was based on the LRE11 development data
so, a little overview of our systems: we had a submission of three systems, one primary and two contrastive
the primary system consisted of one acoustic subsystem, which was based on i-vectors (you'll see the description later), and then three phonotactic subsystems, so it was a diverse set: a binary decision tree system based on an English tokenizer, a PCA-reduction system based on a Russian tokenizer, and a multinomial subspace i-vector system based on a Hungarian tokenizer
the first contrastive system was the same as the primary, except that we excluded the DEV2 data, that means the entire conversations; we'll see the results later
and the second contrastive system was just a fusion of the two best systems, the acoustic and the English phonotactic; the problem with the latter was that on the development data it gave very good results, but on the evaluation it ran into some kind of problem, as we'll see
so here is a little diagram of our system
at the very left we have the front ends, the acoustic i-vector extractor, the phonotactic i-vector extractor and the PCA, which basically convert the input into fixed-length vectors: i-vectors for the acoustic i-vector extractor, and what we also call i-vectors for the phonotactic i-vector extractor and the PCA; after that we do the scoring
then we have the binary decision tree model, which is basically based on a log-likelihood evaluation of the n-gram counts themselves, so it directly produces scores, which can then go to the pre-calibration
both the scoring and the pre-calibration are based on logistic regression, and the fusion is also based on logistic regression; out of it we get twenty-four scores, the log-likelihoods, and then we compute the pair-wise log-likelihood ratio for each of the pairs
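This last step, turning the twenty-four per-language log-likelihoods into pair-wise detection scores, can be sketched as a minimal illustration:

```python
import itertools

def pairwise_llrs(loglikes, languages):
    """Convert per-language log-likelihoods to pair-wise LLRs.

    For a language pair (a, b), the detection log-likelihood ratio is
    simply loglike_a - loglike_b, decided against a threshold of 0.
    The difference is invariant to adding a common offset to all
    log-likelihoods, and the sign of the LLR (the decision) is
    invariant to a positive rescaling of them.
    """
    return {(a, b): loglikes[i] - loglikes[j]
            for (i, a), (j, b) in
            itertools.combinations(enumerate(languages), 2)}
```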
this just shows how the data described in the previous section were used: the train database was used for the front-end training and for the scoring classifier training, while the dev and test databases were used for the backend, that is the pre-calibration and the fusion
so for the acoustic system we used a Hungarian phoneme recognizer based VAD, basically to remove the silence; then we used VTLN, feature dithering, and cepstral mean and variance normalization with RASTA processing, basically similar to what was presented earlier
the modeling was based on a full-covariance UBM with 2048 components, and the i-vector size was six hundred
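One step of this front end, the per-utterance cepstral mean and variance normalization, can be sketched as follows; a minimal version of my own, without the RASTA filtering or VTLN:

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization over one utterance.

    features: (num_frames, num_coeffs) array of cepstral features,
    ideally restricted to the speech frames found by the VAD.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    # each coefficient track becomes zero-mean, unit-variance
    return (features - mu) / (sigma + eps)
```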
for the phonotactic systems we used a diversity of techniques on top of the tokenization
the PCA-based feature extraction was based on the Hungarian tokenizer: what we do is take the n-gram counts, take the square root of the counts, run PPCA on top of that, and reduce the dimensionality to six hundred; the result is then basically used in the same way as the acoustic i-vectors
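The count pre-processing can be sketched like this; note that I use a plain SVD-based PCA here as a stand-in for the probabilistic PCA used in the system, and the square root acts as a variance-stabilizing transform of the counts:

```python
import numpy as np

def fit_pca(counts, dim=600):
    """Fit PCA on square-rooted n-gram count vectors.

    counts: (num_utterances, num_ngrams) matrix of tokenizer n-gram
    counts.  Returns the mean and a (num_ngrams, dim) projection basis.
    """
    x = np.sqrt(counts)
    mean = x.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:dim].T

def project(counts, mean, basis):
    """Map raw counts to the low-dimensional i-vector-like features."""
    return (np.sqrt(counts) - mean) @ basis
```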
and then we had the multinomial subspace modeling of the trigram counts, which was based on the Russian tokenizer
this is something slightly newer, something that was presented earlier: it basically models the n-gram counts in a subspace of the simplex
the output of such an approach is also an i-vector-like feature, which we then again process in the same way as the i-vectors
and then we had the binary decision tree system, which is basically a novel technique where decision trees are used to cluster the n-gram counts, and a likelihood evaluation on this clustering is used to get the score
now the scoring for the acoustic i-vector and the two phonotactic i-vector systems: the input was the i-vector, six hundred dimensional, or one thousand dimensional in the case of the PCA
we performed length normalization and then within-class covariance normalization
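These two normalization steps can be sketched as follows; a minimal version of my own, where the WCCN transform is estimated on labeled training i-vectors:

```python
import numpy as np

def length_norm(ivectors):
    """Scale each i-vector to unit Euclidean length."""
    return ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)

def wccn_transform(ivectors, labels):
    """Cholesky-based within-class covariance normalization.

    Returns a matrix B such that x @ B whitens the average
    within-class covariance of the training data.
    """
    dim = ivectors.shape[1]
    w = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        w += np.cov(ivectors[labels == c], rowvar=False)
    w /= len(classes)
    # B B^T = W^{-1}, so cov(x @ B) within classes becomes identity
    return np.linalg.cholesky(np.linalg.inv(w))
```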
after that, as the classifier we used regularized multiclass logistic regression with a cross-entropy objective function; the regularizer was an L2 regularizer, and the penalty was chosen without cross-validation
it was trained on the train database, and the output was a set of scores
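The scoring classifier can be sketched as a small softmax regression trained by gradient descent; this is a stand-in for the real trainer, and the hyperparameters here are placeholders, since the talk only specifies an L2 penalty and the cross-entropy objective:

```python
import numpy as np

def train_mclr(x, y, num_classes, l2=1.0, lr=0.1, iters=500):
    """Multiclass logistic regression, cross-entropy + L2 penalty."""
    n, d = x.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(iters):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad_w = x.T @ (p - onehot) / n + l2 * w      # L2 on weights only
        grad_b = (p - onehot).mean(axis=0)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def score(x, w, b):
    """Per-class scores (unnormalized log posteriors)."""
    return x @ w + b
```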
what we did with each set of twenty-four scores is pre-calibration of each system: that was a full affine transform, and we used regularized logistic regression, which was trained on the test and dev databases
and in the end we fused the four systems with a constrained affine transform: instead of assigning each of the twenty-four scores of each of the systems an individual scale constant, we had one scale constant per system, and we had a single vector of offsets
this logistic regression was also a Bayesian, regularized logistic regression, and it was trained on the test and dev databases
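The constrained fusion then looks like this; the weights are shown as given here, whereas in the system they were trained by the regularized logistic regression:

```python
import numpy as np

def fuse(system_scores, alphas, offset):
    """Constrained affine fusion of several systems' score vectors.

    system_scores: list of (num_trials, num_languages) arrays, one per
    subsystem (24 languages in this evaluation)
    alphas: one scalar weight per subsystem, not per individual score
    offset: a single (num_languages,) offset vector shared by all systems
    """
    fused = sum(a * s for a, s in zip(alphas, system_scores))
    return fused + offset
```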
as I said, the decisions were done using the log-likelihood ratios that came out: the decisions for the 276 pairs were obtained by converting the twenty-four scores into log-likelihood ratios among all those pairs
for each pair this is just a subtraction of scores, which gives the detection score, and the decisions were all made at a threshold of zero
just a little comment: these decisions are invariant to scaling of the log-likelihoods, and calibration errors shared by both languages of a pair cancel out, so only the relative calibration matters
now a note on the analysis and the numbers we used when assessing the system: we fixed the twenty-four worst pairs on our thirty-second evaluation, and we compare three different numbers, the actual DCF, the minimum DCF, and a "star" DCF, which is based on the recipe mentioned on Monday, based on the log-likelihood pre-calibration
we will compare the development and evaluation sets, and we will show the comparison for eight systems: the four individual systems, that is the phonotactic subsystems and the acoustic i-vectors, and then four fusions, the primary, the first contrastive, the second contrastive, and also a three-system fusion which excluded the English phonotactic system, which somehow misbehaved, as we will see
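The calibration loss discussed below is simply the gap between the DCF at the fixed threshold and the minimum DCF; a minimal sketch of my own, with placeholder prior and costs:

```python
import numpy as np

def actual_dcf(tgt, non, threshold=0.0, p_tgt=0.5, c_miss=1.0, c_fa=1.0):
    """DCF at a fixed decision threshold (0 for calibrated LLRs).

    The gap between this value and the minimum DCF over all thresholds
    is the calibration loss.
    """
    p_miss = np.mean(np.asarray(tgt) < threshold)
    p_fa = np.mean(np.asarray(non) >= threshold)
    return p_tgt * c_miss * p_miss + (1 - p_tgt) * c_fa * p_fa
```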
so these are the results for the thirty-second condition, the one on which we fixed the worst pairs
here we can see the misbehavior of the English system: the left bars are for the development set and the right ones for the evaluation set, and while the trend looks as expected on the development data, the English phonotactic system falls apart on the evaluation
this is for the ten-second condition
so the contrastive-one system is the system where we excluded the DEV2 data, which comprised the entire segments, and we see that there is a slight hit compared to the primary system; these two systems, to remind you, are the very same except that in the calibration and scoring some data were left out
again, the English system badly misbehaves here
if we look at the difference between the blue and the right bar, which is the difference between the minimum and the actual DCF, we see the miscalibration; it was not a tragedy, but we didn't do the calibration very well
on the thirty-second condition, where the calibration is even simpler, the miscalibration is much more reasonable, especially for the fusions
here, comparing the contrastive-one system with the primary, we see that excluding the data really hurts; that's one thing
then there is the three-system fusion, which is equivalent to the primary with the English system excluded
we see that on the development set this didn't do much; in fact the fusion is slightly worse when excluding the English system, because the English performed very well on the development set, but the English system hurts really badly on the evaluation, so excluding it helps there
so again, just to summarize the observations: there was a big deterioration between the minDCF on the development data and on the evaluation data
there were no calibration disasters, but on the thirty seconds, as I pointed out, we could have done better
the binary tree system was kind of broken, and what we found out later is that if we applied similar dimensionality-reduction and scoring techniques to the English tokens as well, the system was good again; so the problem was due to the plain likelihood evaluation
and the acoustic system outperforms the phonotactic ones almost everywhere; there were only a couple of language pairs where the phonotactic was better
we also did an analysis of our system versus the MIT system, since MIT's was the best one
there was a weak correlation between the sites in the difficulty of pairs
the minDCFs were very similar, with the minDCF for the worst twenty-four pairs slightly worse for us than for MIT, while on the actual DCF we had a big calibration hit
there's an interesting plot here which compares some of the selected Arabic dialects versus the Slavic languages, where we suspected MIT might have had more data for the Arabic dialects
so we see that we do very poorly on some of the pairs, Iraqi Arabic versus Pashto et cetera, mostly due to the lack of data, while on the Slavic languages we do better on some selected pairs
so this is just to show that the amount of data really matters
here is a correlation plot between the pair scores of the BUT and the MIT systems, with BUT on one axis and MIT on the other
if the two systems behaved the same, all the points would be aligned on the diagonal, but we see that on some of the pairs we did very differently
and this shows the minDCF versus the actual DCF of the worst pairs for the MIT and the BUT systems
these are the averages at each point, and we see that on average MIT did better: the MIT points lie more or less on a single line, while our system's points are more scattered, so this again shows the calibration hit
so, to conclude: we built several systems, but we selected only four for the primary fusion
the acoustic system outperforms the phonotactic ones
for the phonotactic systems we tried different backends and saw that the dimensionality reduction really helps
we took a big hit on the English phonotactic system, and at the time of the evaluation we did not know why; we have since fixed the scoring
and probably we could use specialized detectors for selected pairs
yes, we used the shifted deltas: six MFCCs plus C0, with the shifted delta cepstra
as for the regularization, we used it both in our scoring and in our pre-calibration: a little L2 regularization