This work may or may not be well known to you, and you may not know whether you want to pay attention to it, so let me just summarise it first.

Okay, so let me start. In speech processing it is well known that some knowledge about the characteristics present in the acoustic signal, such as the duration or the channel, could be helpful for improving the performance of a recognition system; in our case we are interested in speaker recognition systems.

We have already seen before that using such information, for example as side information in fusion and calibration, can help quite well.

Most of the recent approaches, like what people do for the NIST evaluations, usually build independent detectors or estimators for the various types of information: estimators of signal-to-noise ratio or reverberation, detectors of language, and so on.

What we propose in this work is to detect the acoustic condition directly from i-vectors, which everybody uses nowadays. We are going to show that you can detect all kinds of acoustic condition information from the i-vector — or nuisance conditions, call it what you like — and then use this information in a quite simple way in calibration and fusion for speaker recognition.

Let me mention maybe the most important previous work on making use of acoustic condition detection. In the past — you may still remember feature mapping — features were compensated based on detecting the acoustic condition, or more specifically the channel of the signal, which was detected using channel-specific Gaussian mixture models. Then we thought that we don't need to detect the channel or acoustic conditions anymore, because we got the wonderful joint factor analysis and i-vectors, and we thought that we don't have to explicitly detect the condition: the channel compensation scheme will account for this inter-session variability directly. But again, we saw that using some side information in calibration or fusion was actually helping, even though we compensate using the subspace methods.

Again, the side information that we have been using recently was extracted by language identification systems or signal-to-noise ratio estimators; we collect all kinds of information about the signal and try to make use of it to improve the speaker identification system. There have been several different ways of using such information. Probably the simplest was to build evaluation-condition-specific fusion or calibration: you train a different calibration for a specific condition, like a specific duration, or English-only trials versus trials spoken in different languages. A more general approach — I think Niko Brümmer actually started with that in the FoCal toolkit — was the bilinear score and side-information combination, where a bilinear form models the interaction between the scores and the side information itself.

So far I have mentioned approaches using side information collected about the trial itself, rather than about the individual recordings in the trial. Such side information describes the trial: do the recordings in the trial come from different languages, are both of short duration, are both of long duration. What we have tried in a recent evaluation was instead to get side information from the individual segments and combine the side information from the individual segments in a certain way, to again improve calibration and fusion. This is the direction we will also be following in this talk.

Let me now explain our approach to detecting these acoustic conditions more closely.

As I said, we are going to use i-vectors as the input to a classifier that detects a predefined set of various audio characteristics. In our case we use just a simple linear Gaussian classifier, which is similar to what people use for i-vector based language identification. The way we are going to represent the audio characteristics of the signal is simply the vector of posterior probabilities of the individual classes. And I would like to show that we can use this vector of posteriors as side information for fusion and calibration of a speaker recognition system and get quite substantial improvements in performance.

In this work we are actually using exactly the same i-vectors for both the characterization of the audio segment and the speaker recognition. The justification, or the reasoning behind it, is that the nuisance characteristics included in the i-vector itself are the ones that affect speaker ID performance; so if we detect those characteristics from the i-vector itself, they should be the ones most important for improving speaker recognition performance, for compensating the effects that are still there in the i-vectors.

Before I get into more detail on what exactly we do and on the actual results, let me introduce the development and evaluation data used in this work, which should give you some idea of what kinds of variability and conditions we actually address.

The data that we have used is the PRISM evaluation set, which is something we presented during the last NIST workshop. It is a database that we put together for the BEST project, and its design draws on many existing datasets. The data itself comes from Fisher, the SRE evaluations, and Switchboard — basically the data that everybody uses for training systems for the NIST evaluations — but we tried to build an evaluation set that accounts for different kinds of variability. It was a huge investment in normalizing all the metadata of all the files, and we tried to include as many trials as we could.

We tried to create trials for specific types of variability, so we defined lots of evaluation conditions, each targeting a specific type of variability: different speaking styles, different vocal effort, language variability, and so on. Each condition isolates a specific type of variability; we didn't try to mix different types, so that we can look at the results and see what degradation each type of variability causes.

We also always tried to create more trials compared to what has been defined for the NIST evaluations.

We also tried to introduce new types of variability into the data, specifically noise and reverberation, so we artificially added noise and reverberation — I will say a few more words about this on the next slide. There are also duration conditions.

The PRISM set consists of two parts, development and evaluation. At the moment the evaluation part has about one thousand speakers and around thirty thousand audio files, giving more than seventy million trials. The development part then has tens of thousands of speakers coming from around a hundred thousand sessions.

Designing this set, just to give you some idea, wasn't really as easy as saying we use Fisher, Switchboard, and the earlier SREs for training and the rest for testing; we really paid attention to how the data from the different sets were divided between training and testing. For example, to get some language variability we used evaluation data from several of the SRE collections for testing, but at the same time we wanted some of the channels — some of the microphones from those collections — to be covered in training also, so we had to pay attention when splitting the database.

You can see the resulting large numbers of trials on the slide.

Now let me quickly summarize the procedure for designing the noisy and reverberated data: what exactly we did and how the design can be extended. We defined the way the noise should be added, and we tried to use open-source tools and publicly available noises, so if other people are interested in adding new noises, it should be straightforward to extend the set with new types of noises and also new reverberations.

The blue box pretty much summarizes the additive noise part: we used an open-source tool to add noise to the data at a specific signal-to-noise ratio. We used different kinds of noises, and of course we used different noises for the training data, for the enrollment segments, and for the test segments: we made sure that we never train and test with the same noise — not even with noise taken from the same file, and not even from exactly the same time.

So if, say, there was cocktail-party noise, there would also be noise from a restaurant or noise from a bar — different kinds of babble noises — and we made sure that these very similar noises never mix between training and test.

We added the noise to the data at different SNRs, specifically 20, 15, and 8 dB. The noise was added only to clean data — microphone data that we had verified to be clean beforehand.
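
Just to make concrete what adding noise at a specific SNR means, here is a minimal numpy sketch; the helper name and the plain energy-based SNR computation are my assumptions — the actual tooling typically measures the active speech level rather than overall energy.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`
    and add it to `speech` (1-D float arrays at the same sample rate)."""
    # Loop the noise if it is shorter than the speech.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain such that 10 * log10(p_speech / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# e.g. noisy = add_noise_at_snr(clean, babble, snr_db=8)
```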

Similarly, we defined a reverberant subset of the data, for which we again used an open-source tool to simulate impulse responses of rectangular rooms, and then added reverberation to the data at different reverberation times; again we paid attention not to apply the same reverberation to the training and test data.
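
As with the noise, a minimal sketch of the reverberation step, assuming the room impulse response has already been produced by some room simulator; the level rescaling is my own choice, not necessarily what the actual pipeline does.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Convolve speech with a simulated room impulse response and
    rescale so the output level roughly matches the dry signal."""
    wet = fftconvolve(speech, rir, mode="full")[:len(speech)]
    wet *= np.sqrt(np.mean(speech ** 2) / (np.mean(wet ** 2) + 1e-12))
    return wet
```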

Okay, so now let me give you more details on the audio characterization system itself.

The system itself is based, as I said, on i-vectors. The i-vector extractor is pretty much the standard one that everybody uses nowadays, with a GMM-based UBM. The extractor is actually exactly the same as for speaker identification: it is trained only on speech frames, not on silence.

But it's quite possible that for detecting the different conditions it would be better to adjust the features — for example, maybe not apply the feature normalization.

The usual 600-dimensional i-vectors are extracted from the standard total variability space, which is assumed — or expected — to contain information about both the speaker and the acoustic conditions. In this case we didn't do anything to suppress the speaker information; we just apply length normalization to the i-vector before the condition detection.

As the classifier we use a linear Gaussian classifier trained on these i-vectors. It is trained to classify the conditions that I'm going to show on the next slide, and the final characterization presented in this paper is the vector of posterior probabilities of the specified classes. In fact, obtaining this vector is as simple as applying an affine transformation to the i-vector followed by a softmax function. This slide summarizes how the whole system works.
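
As a rough illustration of such a linear Gaussian classifier (the names here are mine, and this is a minimal sketch rather than our actual implementation): with class-dependent means and a single shared covariance, the class log-likelihoods are affine in the i-vector, so the posteriors come out of a softmax over affine scores, exactly as stated above.

```python
import numpy as np

class LinearGaussianClassifier:
    """Class-dependent means, one shared covariance: log-likelihoods are
    affine in the input, posteriors are a softmax over affine scores."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.means = np.stack([X[y == c].mean(axis=0) for c in self.classes])
        # Pooled within-class (shared) covariance.
        centered = X - self.means[np.searchsorted(self.classes, y)]
        prec = np.linalg.inv(centered.T @ centered / len(X))
        # Affine form: score_c(x) = W[c] @ x + b[c] (uniform class priors).
        self.W = self.means @ prec
        self.b = -0.5 * np.sum(self.W * self.means, axis=1)
        return self

    def posteriors(self, x):
        s = self.W @ x + self.b
        s -= s.max()                 # for numerical stability
        e = np.exp(s)
        return e / e.sum()           # the vector of condition posteriors

# i-vectors would be length-normalized first:
# w = w / np.linalg.norm(w); q = clf.posteriors(w)
```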

As you can see, the same training data is used throughout: for training the UBM, for training the subspace matrix, and also for training the classifier.

These are the conditions that we train our system for. We try to distinguish telephone data; microphone data; noisy data — in this case data where noise was added to the originally clean microphone recordings, where we distinguish three conditions, namely 8 dB, 15 dB, and 20 dB SNR; and reverberated data, where we define the conditions according to the reverberation time: 0.3, 0.5, and 0.7 seconds. In the table you can see how much data was used for training and test.

As you can see, there is always the same number of training and test files for the noisy conditions, because those are actually the same files with noise added at different levels.

Note the way we defined those classes: since we represent each segment just by the vector of posteriors of those classes, we assume that the classes are mutually exclusive. This agrees exactly with our training and evaluation data, because this is exactly how our evaluation set was designed: we never have reverberation and noise in the same recording.

But of course this is unrealistic; in real life you can have reverberation combined with background noise.

Still, we believe that the approach in this paper would be useful even for such conditions, because the vector of posteriors can account for a mix of conditions in the data. If we have a recording that comes from, say, a noisy and reverberant environment, then the estimation would probably give us a vector of posteriors that somehow reflects both — how much noise and how much reverberation there is.

But of course we could go for a more principled way: we could even build independent classifiers for the independent types of variability — one classifier for the noise level, one for the kind or level of reverberation, one for telephone versus microphone — which would be trained on data that contains a mix of such conditions.

This table summarizes what performance we obtained in terms of detecting these conditions. The table shows the true classes against the detected classes.

If we had perfect classification, we should see hundreds on the diagonal and zeros elsewhere; this is just the confusion matrix, normalized so that each row sums to one hundred percent.
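
Just to make the normalization precise, a one-line sketch (my own helper, not anything from our tools):

```python
import numpy as np

def confusion_percent(counts):
    """Row-normalize a confusion matrix of raw counts so that each
    true-class row sums to 100 percent."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)
```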

You can see that the classification isn't perfect. What we were pleased with was that at least the discrimination between microphone and telephone data is almost perfect. For the microphone data we get some confusion with the 20 dB noisy data — and as I told you, we actually created the noisy data by taking this microphone data and adding noise at exactly 20 dB SNR. If you listen to the clean data, some of it actually contains some noise already, so it's quite natural that some of the segments from the clean microphone class get classified as 20 dB, which is the neighbouring condition.

Also, if you look at the different noise levels, we see quite reasonable performance: reasonably large numbers on the diagonal — again, 20 dB is recognised nicely — and some confusion, but this is again something that would be expected, especially for signal-to-noise ratios which are close to each other and therefore easily confused.

One thing that we actually saw was that most of the confusion comes from types of noise that don't really affect the i-vector much: such a noise resulted in almost exactly the same i-vector as the clean recording. It was also noticeable where we don't do very well: the conditions where we try to detect the reverberation time. We see that those detections come out all over the place and are easily confused with the noisy data.

The main reason, we believe, is that defining the conditions by reverberation time is not actually a good thing to do. If you play the reverberated recordings, you can actually hear that one type of reverberation at one reverberation time can be perceptually much more similar to another recording with a completely different reverberation time; so the reverberation time is probably not the right criterion. Still, as we will see, using this detector actually improves the speaker recognition performance, so it looks like the classification itself does a good job of putting recordings into classes that are useful, even if they are not exactly the classes we defined.

So finally, how do we use this information about the acoustic condition in calibration to improve the speaker recognition system? We use the approach that Niko Brümmer proposed when we built our ABC system for NIST SRE 2010, and I believe this is the approach implemented in the freely available BOSARIS toolkit.

The idea is to train just a single calibration. If you look at the formula, the first part is the standard linear calibration, where the score is multiplied by some scale factor and a bias is added. But you can see that we add an additional bias term, which is a bilinear combination of the vector of posteriors from the first segment and the vector of posteriors from the second segment of the trial: between the two vectors there is a matrix of trainable weights, so this bilinear form provides a condition-dependent bias, and the result is the final calibrated score.
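
In symbols, the calibrated score is something like s' = alpha * s + beta + w·(q1 + q2) + q1ᵀ B q2. A minimal sketch of this, with the caveat that the exact parameterization — for instance, whether the linear term is shared between the enrollment and test side — is my assumption; the trained version is what the BOSARIS toolkit provides.

```python
import numpy as np

def calibrated_score(s, q1, q2, alpha, beta, w, B):
    """Single trial-dependent calibration: the standard alpha * s + beta,
    plus a bias derived from the condition posteriors q1, q2 of the two
    sides of the trial (a linear term and a bilinear interaction term).
    alpha, beta, w, B would all be trained jointly, e.g. by logistic
    regression on a development set."""
    side_bias = w @ (q1 + q2) + q1 @ B @ q2
    return alpha * s + beta + side_bias
```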

As I mentioned before, we use the same i-vectors and the same list of conditions as just described.

Let me just say briefly that we are presenting results on a list of conditions that is a subset of all the PRISM conditions: telephone and microphone trials, different vocal effort, different languages in the recordings, different noises in the recordings, and room reverberation. The system that we use for speaker ID uses exactly the same i-vectors, as I said, with length normalization and LDA, followed by PLDA.

This slide shows the results, and you can see that we actually get nice improvements. These are the DCF and EER for the individual conditions; let me point out the ones that are most relevant.

For the conditions which are actually telephone conversations recorded over a microphone, our approach does a very good job. Somewhat surprisingly, we also get some improvements on other conditions, which probably again comes from detecting telephone versus microphone. The detector does a good job on the noise conditions with the different noise levels, and a reasonable job on room reverberation. We don't get any improvement at all where we don't have a corresponding condition in the detector to tell the calibration how to adapt: we do not get improvements for the language and vocal effort conditions, because again we didn't really have such condition classes.

The next slide is actually just showing the same thing when we fuse two systems, a cepstral and a prosodic one; we still get pretty much the same gains.

So this pretty much summarizes it; the conclusions are summarized on the slide.


Question: When it doesn't classify well, is that due to issues between training and test — for example, that the reverberation differs between them — given that you defined the classes just by reverberation time?

What we have seen is that if you listen to the test recordings, you can find two recordings that sound perceptually similar but come from different classes, so the way we defined the classes probably wasn't right. The problematic part is how the classes were defined; a more natural clustering would account for the type of reverberation — I mean early reflections, which do little harm, versus the late reverberation that is spread over time and really affects the speech.

Recordings that behave similarly would then be considered to come from the same class, and you would probably get classes which are better related to speaker recognition performance. In the end it helps the speaker recognition performance anyway, even though the classification looks poor — but it looks poor only with respect to the classes as we defined them.
