Thank you.
So the Language Recognition i-Vector Challenge had three main goals. First, to attract people from outside the regular community and to make this work that we do more accessible to them. The idea behind that was for people to explore new approaches and methods from machine learning in language recognition, with the overall goal of improving language recognition performance.
The task was open-set language identification: given an audio segment, which language is the audio segment spoken in, or is it an unknown language?
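As a rough illustration of what open-set identification means operationally, here is a minimal sketch. The language names, score values, and threshold below are all made up for illustration; the idea is simply that a system may decline to name any known language.

```python
# Hypothetical per-language scores for one audio segment; names, values,
# and the threshold are illustrative, not from the actual challenge.
scores = {"arabic": 2.1, "mandarin": -0.4, "spanish": 0.7}
threshold = 1.0  # below this, we decline to name any known language

# Open-set rule: pick the best-scoring language, unless even the best
# score is too weak, in which case answer "out_of_set".
best_lang = max(scores, key=scores.get)
decision = best_lang if scores[best_lang] >= threshold else "out_of_set"
print(decision)  # prints "arabic" since 2.1 >= 1.0
```

The same rule with all scores below the threshold would instead return "out_of_set", which is what distinguishes this from closed-set identification.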
The data used was from previous NIST LREs as well as from the IARPA Babel program. The data was selected in such a manner that multiple sources were used for each language, in order to reduce the source-language effect, and languages were also selected so that highly confusable languages were included in the dataset.
As for the size of the data, there were fifty languages in train and sixty-five in dev and test, with about three hundred segments per language in the training and about a hundred in the dev and test. We see the total number of segments in the rightmost column: about fifteen thousand for training, about sixty-four hundred for dev, and about sixty-five hundred for test.
The training set did not include any out-of-set data. The development set included unlabeled out-of-set data, and the test set was divided into progress and evaluation subsets, which we'll cover in just a moment. Participants were able to upload their system outputs and receive some feedback on how they did, and that was done using the progress set. Then at the end of the evaluation period, feedback was given on the evaluation set; it was a partition, so there's no overlap.
Here we see data sources for each language. I'm sure it's noisy and hard to see, but on the right-hand side you can see the different corpora labels. I think at a high level we can say blue is conversational telephone speech, green includes broadcast narrowband speech, and yellow is a combination of the two. One thing to say is that if you look across the training data, which is I guess the leftmost column, the dev data, which is in the middle, and the test data, which is on the right, the distribution across sources is very similar per language; there are a few exceptions. And as we mentioned, there was no out-of-set data in the training.
And here we see speech duration, in train, dev, and test. Training is green and test is blue, and we see again a similar distribution. One item of interest: this distribution was log-normal.
The performance metric was error rate, split into out-of-set languages and in-set languages, where the prior probability of an out-of-set language was 0.23.
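If the metric works the way a simple blend of error rates would, it can be sketched as follows. This is my reading of the description, not the official scoring code, and the example error rates are invented.

```python
def challenge_cost(in_set_error_rates, oos_error_rate, p_oos=0.23):
    """Weighted blend of the average in-set error rate and the
    out-of-set error rate, using the stated out-of-set prior of 0.23.
    A sketch of the metric as described, not the official scorer."""
    avg_in_set = sum(in_set_error_rates) / len(in_set_error_rates)
    return (1 - p_oos) * avg_in_set + p_oos * oos_error_rate

# Invented example: three in-set languages at 10%, 20%, and 30% error,
# and a 40% error rate on out-of-set segments -> about 0.246.
print(challenge_cost([0.10, 0.20, 0.30], 0.40))
```

The 0.23 prior means roughly a quarter of the weight rides on out-of-set behavior, so a system that ignores the unknown-language case pays a real penalty.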
Participation was wonderful, more than what we'll typically see in an LRE. It came from international sites across six continents and thirty-one countries. About eighty participants downloaded the data, and a little over fifty-five percent submitted results, from forty-four unique organizations. During the evaluation period a little over thirty-seven hundred submissions were made, and that number continues to grow afterwards.
As mentioned, we had more participation in the i-vector challenge than we typically see with an LRE, and we can see some other comparisons. As I said, one of the main differences between the i-vector challenge and a traditional LRE is in the data that we distribute: in the traditional LRE we send audio segments as input to systems, and in the i-vector challenge we send i-vectors instead.
The task was different: in the i-vector challenge it was open-set identification instead of detection. In the i-vector challenge the cost was based on a kind of total error rate per language, while in the traditional LRE it's based on miss and false alarm rates. There was a larger number of target languages and a different distribution of speech duration; as mentioned, it was log-normal in the i-vector challenge, whereas in the traditional LRE it's three-, ten-, and thirty-second bins. The i-vector challenge lasted much longer than a traditional LRE. Also, in the i-vector challenge feedback on results was given during the challenge period, which is not something we do in traditional evaluations.
And last, there was an evaluation platform that was online, and this was something that we focused on for the i-vector challenge. In particular, the goal was to facilitate the evaluation process with limited human involvement. All evaluation activities were conducted via this platform, including receiving the data, uploading submissions, and being able to see how things went.
Now looking at some results: on the y-axis we see cost and on the x-axis time. The first dip, I think, is around May seventeenth, and the second large dip is on May twenty-first. So roughly half of the progress made during the evaluation took place during the first two or three weeks or so, and then the rest of the progress was made during the remaining four months.
Here we also see cost on the y-axis, and on the x-axis we see participant ID, so these are discrete, sorted by the best cost obtained on the evaluation subset. We see most of the sites beat the baseline, and a few sites beat the oracle system. Speaking to both of these: the baseline, I believe, is a simple system that used cosine distance, and the oracle system used PLDA. It's called oracle because unlabeled data were distributed to the participants, but the oracle system used those labels.
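A cosine-distance baseline of the kind described can be sketched in a few lines. The vectors below are tiny made-up examples (real i-vectors have a few hundred dimensions), and scoring against a mean i-vector per language is an assumption about how such a baseline is typically built, not a description of the actual system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "i-vectors"; the language means and the test vector
# are made-up illustrative values.
language_means = {"lang_a": [1.0, 0.0, 0.0], "lang_b": [0.0, 1.0, 0.0]}
test_ivector = [0.9, 0.1, 0.0]

scores = {lang: cosine(test_ivector, m) for lang, m in language_means.items()}
print(max(scores, key=scores.get))  # prints "lang_a"
```

PLDA replaces this geometric similarity with a probabilistic model of within- and between-language variability, which is why labeled data helps it.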
And here we see the number of submissions per participant. In general, participants who did well submitted more systems, but there were a few exceptions. I think now is a reasonable time to mention the distinction between participant IDs and site IDs: a participant is someone who signed up, and there may have been multiple participants per site, so participant IDs are not necessarily unrelated; for example, two different participant IDs may belong to the same site.
And here we see results by target language: we have error rate on the y-axis and language on the x-axis. The lowest error rate was received on [unintelligible] and the highest on Hindi. What was surprising was that English also had a high error rate, second from the right actually. And the blue bar, the out-of-set languages, was somewhere in the middle of the pack.
And here we see results by speech duration. I guess it's no surprise that as you get more audio, you tend to do better. One thing that may also be interesting is that there seem to be diminishing marginal returns: if, for example, you had three seconds and you could get ten, you'd do maybe a point or two better, but if you went from ten to twenty, the difference is not so great, just as an example.
So, some lessons learned. Wonderful participation; we're all very grateful to those of you in the audience who participated, as without you we couldn't have done this. A number of systems beat the baseline, and surprisingly, six were actually better than the oracle system; we're hoping to learn more about those. About half of the improvement was made early on, which may lead us to reconsider the timeline. Surprisingly, top systems did not all do so well on English. Performance on the out-of-set languages also was not what we might have expected. We did not receive many system descriptions, so it's unclear what approaches many of the participants attempted, although later in the session we'll hear from the team that created the top system, which did develop novel techniques, and we'll see more of that. And the web platform is still up, so please feel free to visit, participate in the challenge now, and see how you're doing.
And a quick plug for upcoming activities: there's SRE sixteen and its workshop, where the task is speaker detection on telephone speech recorded over a variety of handsets. Similar to LRE fifteen, there's now a fixed training condition as well as an open condition; you can see some other details about the evaluation there. There's also a twenty-sixteen LRE analysis workshop, and all of this will be co-located with SLT sixteen.
So it looks like we have time for questions.