hello everybody

i welcome you to my tutorial on spoofing in automatic speaker recognition

i'm a similarity score on assistant professor at your local news data

frames you're looks cool

and there's some other regions

at a low overall difference was moving detection rate of cognition we first you all

speaker verification

giving more attention to current research plan and progress

in the middle and all this information for a speech systems

but also we don't to the cost

automatic speaker verification is one of the most convenient and natural means of biometric

person recognition

this is why this technology is used in various applications and services such as smart phones

smart speakers and call centers

this technology has evolved a lot over the last years

thanks to the increasing use of deep neural network solutions

such as x-vectors

which to some extent have replaced traditional gaussian mixture models

or the so-called i-vectors

and end-to-end approaches are also emerging

we can say that speaker recognition technology has probably reached the level of performance required

for practical use

the question now is whether or not the asv system is vulnerable to spoofing

the answer is yes

the reality of voice biometric technology can be compromised by political status namely born and

ability to the technology external

one of the major threats to the security of biometric systems is spoofing attacks

there is there are four

spoofing attacks are carried out to fool a biometric system into recognising an

illegitimate user as a genuine user or to avoid being recognised

this is achieved by presenting to the system a synthetic forged or manipulated

version

of the biometric trait

but before we locate is a are the second walk ons this system is processed

there is this is then try to answer this question is that there's on what

they say the are

this means that a target trial as well as a

non-target trial

can be accepted or rejected by the speaker verification system

this results in two different types of errors namely false alarms and false rejections

as shown in table

only if this user used a and a change dataset or that this user is

an bolster the challenge i

there is a v

system based

according to their change

here target speaks when they are now available is whining boxers makes no f or

when there's anything about

so

given a test trial the system provides a score where the higher the score the greater the

confidence that the speaker matches the claimed identity

a good system should discriminate between target trial

and non-target trial scores by selecting a threshold between the two score distributions

however as shown in the figure the target and non-target score distributions

usually overlap

this is captured by the detection error tradeoff curve

on the right where the point at which the false alarm rate is equal to

the false rejection rate

is called the equal error rate
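
The threshold sweep described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the synthetic Gaussian scores and the rule "accept when score >= threshold" are choices made for this example, not something specified in the talk, and the equal error rate is approximated at the observed score values rather than interpolated.

```python
import numpy as np


def error_rates(target_scores, nontarget_scores, threshold):
    """Return (false rejection rate, false alarm rate) at one threshold.

    Assumption for this sketch: a trial is accepted when score >= threshold.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    false_rejection = float(np.mean(target_scores < threshold))
    false_alarm = float(np.mean(nontarget_scores >= threshold))
    return false_rejection, false_alarm


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    thresholds = np.sort(
        np.concatenate([target_scores, nontarget_scores]).astype(float)
    )
    best_gap, best_eer, best_threshold = None, None, None
    for threshold in thresholds:
        frr, far = error_rates(target_scores, nontarget_scores, threshold)
        gap = abs(frr - far)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (frr + far) / 2.0
            best_threshold = float(threshold)
    return best_eer, best_threshold


if __name__ == "__main__":
    # Hypothetical overlapping score distributions, for illustration only.
    rng = np.random.default_rng(0)
    targets = rng.normal(2.0, 1.0, 1000)
    nontargets = rng.normal(0.0, 1.0, 1000)
    eer, threshold = equal_error_rate(targets, nontargets)
    print(f"EER = {eer:.3f} at threshold {threshold:.3f}")
```

Moving the threshold up trades false alarms for false rejections, which is the tradeoff a DET curve plots; the EER is just the single point on that curve where the two rates coincide.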

is this really realistic

though the impostor may can you have for performing system

or they can implement it is if you is my task

so the aim of the attacker is to provoke false alarms by increasing the asv classifier

scores towards the target while avoiding detection

we can distinguish a spoofing impostor from a naive impostor

the latter are also called zero-effort impostors

the processing to create fake speech signal you know it down for let's see that

the challenge here is to find a solution to that there are many valuable and

involving this process and there are still menu question to ask

do their car from linear earlier processing due only receive you know part of the

spectrum should be able to look also and the phase signal

but something this question later when we have more element goods you are

there are many a general approaches for the measures improving the easily robustness for example

by speech or the u r c d this is an invasion that action

or winded executive countermeasures for example based on that for sure

this is and its energy detection

this figure shows an example det plot illustrating the baseline performance when

the impostors are zero-effort impostors

that is the black line

the performance degradation when the system is being spoofed

is shown by the red line

and the improvement of the performance when a countermeasure is applied

this is the one dimensional fashion

rule i

and note that even with perfect countermeasures the best performance the system can

reach is its baseline performance

nobody six including voice volume it is becoming an instance

many speaker pointed out there is usually issues

can think speech

this can undermine confidence in asv and it is important to have an additional level of

countermeasures or presentation attack detection to reduce false acceptances

due to spoofing attacks

does that this additional tasks can be originated from more efficient synthesis

or voice

in unlogical system old or just we recording related approach you know basic process

well

where we enjoy directly the audio stream in the easy my

these four percent the measured rates

another attack is impersonation which consists of imitating a human voice

also the tree to but this condition is not only inter school and twenty minutes

studies

involving small datasets

it is not surprising

that

there is no previous work investigating countermeasures against impersonation

the possible location of attack points in a typical asv system may be before or

after the microphone as illustrated in the figure

corresponding to physical access and logical access

asv is more vulnerable to spoofing than other biometric systems based on different biometric traits

just consider that samples of a given person's voice can be collected readily

by bystanders through face to face or telephone conversation

and then replayed in order to manipulate an asv system

or more advanced voice conversion or speech synthesis algorithms

can be used to generate particularly

effective spoofing attacks

using only modest amounts of voice data collected from a target person

this table summarize the for splitting and that's in terms of us a single decreases

and in we will consider measures

except for the impersonation at time so that have a menu model i s is

unity

and i freeze

especially for text event is the scenario and the error of intermediate of dimension

that's the use of for scroll

generalization it is the meeting to the different

or unseen i

so this is the timeline which the task

two days visible units you

and is studies on speaker and feasible thing where and are on me now speech

for were created using a limited number or something

it is clear that the development of countermeasures using only a

small number of spoofing attacks

offers limited generalization

moreover

there was a lack of publicly available corpora and evaluation protocols to allow

for the comparison of results obtained by different researchers

the asvspoof initiative aims to address these issues by making

available standard speech corpora

with a large number of spoofing attacks

evaluation protocols and metrics

to support a common evaluation and the benchmarking of different systems

the asvspoof challenge has been organised three times so far

the first was in two thousand fifteen

the second in two thousand seventeen and the third in two thousand nineteen

they were presented at the corresponding special sessions during the interspeech conference

is actually current own analyses of this visible for you as well as the their

finish definition to partition your see the company around the work

the first asvspoof challenge involved detection of spoofed speech

created using a mixture of voice conversion and speech synthesis techniques

it was presented during a special session at interspeech two thousand fifteen

and sixteen organisations participated in this challenge

asvspoof two thousand fifteen involved only logical access attacks and

the data was generated with ten different spoofed speech generation algorithms

well based on a large collections accordingly scolding this of course

version well

and consist of but not without and t v show that a speech

which was recorded using a high quality microphone

and without significant channel or background noise effects

the database was divided into three subsets called

the training development and evaluation sets in a speaker disjoint manner

five attacks from s one to s five denoted as known

were used

in the training development and evaluation sets

and the other five attacks from s six to s ten denoted

as unknown or unseen attacks

were used only in the evaluation set along with the known attacks

based on the dimension and of the bias the or on what it used for

voice conditions speech synthesis

nine of them are we'll database and the hmm of gmm based addition model

while only one the s ten is the unit selection based

speech synthesis implemented with the open source mary

text-to-speech system

the banana but all of easy system based the on the i-vector but the is

pretty clear

except for the i guess who

well that that's are very effective with importantly reasoning

greece all equal error rate

in the worst case

that is s then

i don't to one

directly to fifty one will ones

it is seventeen

so the table on the left shows here the challenge results

in terms of the average equal error rate across all the attacks of the

evaluation set

for no one and i do not

the exactly a lack of a generalization these results

over the table on the left to sure that

i'm sorry the table on the right shows the top

performing systems evaluated only

on s ten

the unit selection based speech synthesis attack

s ten is the most difficult attack to detect

and the most dangerous for the speaker verification system as shown previously

so s ten is effectively the biggest threat for the asv system in

this case

and used in one is on the

the front end of a against the door for a performing system

on the challenge

it will not the to read for the in this challenge is related to the

two features

and the level of the low end of the front and

other people between if the in the v a dynasty the use cochlear filter a

cepstral coefficients

that are related to the human auditory system

possible these something that john it problem

so no less and i don't know are most the challenge evaluation on the is

v is of two thousand fifteen

we proposed a new feature named constant q cepstral coefficients

it is based on the constant q transform which is an alternative to the fourier transform

and which employs a variable time-frequency resolution that means

greater time resolution for higher frequencies

and greater frequency resolution for lower frequencies

so the constant q transform has a time-frequency resolution which reflects more

closely human perception

and to obtain cqcc features we combine the constant q transform

with traditional cepstral analysis
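
To make the variable time-frequency resolution concrete, here is a small sketch of the standard constant Q relations: bin centre frequencies are geometrically spaced, every bin has the same ratio of centre frequency to bandwidth (the Q factor), and the analysis window length is inversely proportional to frequency. The frequency range, the number of bins per octave and the sampling rate below are assumptions picked for the example, not values taken from the talk, and this is not the full CQCC extraction pipeline.

```python
import math


def cqt_bins(f_min, f_max, bins_per_octave, sample_rate):
    """List (centre frequency, bandwidth, window length in samples) per bin.

    Uses the textbook constant Q relations:
      f_k = f_min * 2^(k / B),  Q = 1 / (2^(1/B) - 1),
      bandwidth_k = f_k / Q,    window_k = Q * sample_rate / f_k.
    """
    q_factor = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    n_bins = int(math.floor(bins_per_octave * math.log2(f_max / f_min))) + 1
    bins = []
    for k in range(n_bins):
        centre = f_min * 2.0 ** (k / bins_per_octave)
        bandwidth = centre / q_factor          # wider bands at high frequency
        window = q_factor * sample_rate / centre  # shorter windows at high frequency
        bins.append((centre, bandwidth, window))
    return q_factor, bins


if __name__ == "__main__":
    # Hypothetical settings: 12 bins per octave, 62.5 Hz to 8 kHz, 16 kHz audio.
    q_factor, bins = cqt_bins(62.5, 8000.0, 12, 16000)
    for centre, bandwidth, window in (bins[0], bins[len(bins) // 2], bins[-1]):
        print(f"f={centre:8.1f} Hz  bandwidth={bandwidth:7.2f} Hz  window={window:9.1f} samples")
```

Low-frequency bins get long windows and narrow bands (fine frequency resolution), while high-frequency bins get short windows and wide bands (fine time resolution), which is the behaviour described above. A Fourier transform, by contrast, uses one window length and one bandwidth for every bin.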

i should be for that the only thing started in the challenge

where only able to the test then i probably

so cqcc features

obtained competitive results for known attacks and the best results for

unknown attacks with an eighty seven percent relative improvement on s ten

and seventy two percent overall

so to summarize asvspoof two thousand fifteen focused on voice conversion and speech

synthesis attacks so no replay

it is about spoofing detection only so no asv

and a text independent scenario

the participants invested effort to develop features using mostly simple classifiers

and generalisation is missing

even if

it has been mitigated with some post-evaluation improvements

unlike the two thousand fifteen edition which used very high quality speech material the two thousand

seventeen edition aimed to assess spoofing attack detection

in what we call in the wild

conditions

and focused exclusively on replay attacks

the second automatic speaker verification spoofing and countermeasures challenge was presented during

a special session

at interspeech two thousand seventeen

and forty nine organisations participated in the challenge

cost function if this were from the riesz a text

that adults

course

whose purpose was to collect speech data over mobile devices

in the form of smart phones or tablet computers

by volunteers from across the globe

we collected the asvspoof two thousand seventeen database using a playback device

and a recording device in different acoustic environments

we did not to use a realistic scenario using core the recording but we made

actually got

and do the you don't call me all the target speakers voice

to create the plane data collection

this is the worst case scenario that of those the use of x sixteen speech

were to be linear access

the corpus is divided into three subsets for training development and evaluation

with different speakers replay sessions and replay configurations

the training and development subsets were collected at three different sites

and the evaluation subset was collected at the same three sites and also includes data

from a new site

this is the loudest most the inverse italy that

in terms of a basically a wider meeting t s for the challenge also here

is a clear

the this is m is based on the a gmm

and the really that's a big effect you

with an important case of the equal error rate

for all

one point eight fifty one point five

on these evaluation set

the primary evaluation metric is the equal error rate but in contrast to the asvspoof two thousand fifteen

challenge

the equal error rate is computed from scores pooled across all trial segments rather than

by condition averaging
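
The difference between a pooled equal error rate and a condition-averaged one is easy to miss, so here is a toy sketch. The two hypothetical conditions below are each perfectly separable on their own, but their score ranges are shifted relative to each other, so no single threshold works once the scores are pooled. The data, condition names and function names are invented for this illustration, and the EER is approximated at observed score values.

```python
import numpy as np


def eer(bona_fide_scores, spoof_scores):
    """Approximate EER, accepting a trial as bona fide when score >= threshold."""
    bona_fide_scores = np.asarray(bona_fide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona_fide_scores, spoof_scores]))
    frr = np.array([np.mean(bona_fide_scores < t) for t in thresholds])
    far = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    i = int(np.argmin(np.abs(frr - far)))
    return float((frr[i] + far[i]) / 2.0)


def condition_averaged_eer(scores_by_condition):
    """Compute one EER per condition, then average them."""
    return float(np.mean([eer(b, s) for b, s in scores_by_condition.values()]))


def pooled_eer(scores_by_condition):
    """Pool all scores first, then compute a single EER with a shared threshold."""
    bona = np.concatenate([np.asarray(b, dtype=float) for b, _ in scores_by_condition.values()])
    spoof = np.concatenate([np.asarray(s, dtype=float) for _, s in scores_by_condition.values()])
    return eer(bona, spoof)


if __name__ == "__main__":
    # Hypothetical scores: (bona fide, spoof) per condition.
    scores = {
        "condition_a": ([10.0, 11.0], [8.0, 9.0]),
        "condition_b": ([0.0, 1.0], [-2.0, -1.0]),
    }
    print("condition-averaged EER:", condition_averaged_eer(scores))
    print("pooled EER:", pooled_eer(scores))
```

Pooling rewards systems whose scores are consistent across conditions, because a single threshold has to serve every trial; averaging per-condition rates hides any mismatch between conditions.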

why fourteen estimation

perform the baseline while existing three and their the

at a performance is the old in more than seven percent relative improvement we used

a dismissal a

the baseline system is based on a gmm classifier with constant q cepstral coefficient features

it was provided to the participants

comparing the baseline mean zero one thing to do

it is important performance improvement when using wondering plus their the three

this is this idea of the parameter submission to residuals

it doesn't seventy

i don't training refer to the bar all the time for training

a sense for three and a reasonable

most all the systems a lower bound for the features

this call mom for all the systems to build a gmm classifier

single cost you as you can see

the invariant use whatever means of all around solution is twenty five one ninety one

understand

where s the best single system result show

and average detection whatever in

or

only six point seven percent

the results of this challenge show that

the detection of replay attacks is more difficult than the detection of speech synthesis and voice

conversion

countermeasure generalization also remains a problem

after the challenge some data anomalies were detected

namely zero valued samples present at the beginning of genuine speech utterances

is zero really running by for the easy to be a

but maybe but i for a modified versions for speech detection

to address these issues asvspoof version two point zero was released to correct the

anomalies

detected after the evaluation

in addition metadata which describes the recording and playback devices and the acoustic

environments was released along with a new baseline

the new metadata along with the data by ching as there is the number uterrances

as well as the a population or the evaluation set

remember when i'm better than for each other

for a better understanding of the outcomes we can rewrite the square the regulation terms

of the speaker measurement recording playback devices

acoustic environment is a physical spacing which original stage the that basically then here or

it is reasonable because seventeen database was collected you have a different environment

the evaluation meeting there about the accent level over even more controlled noise

the

for example can be in we model noise and balcony are assumed to be noisy

all these

all right are assumed to be maybe which in your oracle room huh

are assumed to be are actually

there are under the of a twenty six a little better prices

a smart phones the lower bound we

if we the we fifteen this moral speakers

are assumed to be all over the

well e

a little larger lot of speakers are assumed to be your mean you rightly

and the professional or do we managed are assumed to be i

assuming only there are a total twenty five recording devices

some are ones that are the weights for my from source would be a little

windy and it's where a microphone are assumed to be over the medium by i

and the again the regression your and b i

this figure shows the impact of different replay configurations on asv performance measured in

terms of equal error rate

we have sent over a zero for impostor trials are replaced with a replaceable by

iteratively the each other little degradation

the control the demo on the right shows the resulting legal regulations sort of according

to the easy equal error rate in the

all pole a core also reflect the supposed to be a is the

where we are in this a little degradation

this is done

they higher than one at a very little degradation the motive for effect in a

the three years

it is this detection performance of a gmm robot

and i-vectors read about smoking the dimension

for this thing that a little degradation

also expressing that all the equal error rate

the first edition these results is that the recently the correlation between the specifically to

the thing

detection or everybody detection or

this is a fine reflect the final complex of overwhelmingly device

there was to get about a man and the recording right

the control on the right a to see the results in terms of the all

only a in a environment going back and replay value

results show the number of a single element of the little degradation for all i

trials this was all we trials corresponding with either one of the

i in my all their acoustic environment a system we need the effect of the

playback and recording device

to summarise asvspoof two thousand seventeen focused on replay

so no speech synthesis or voice conversion

performances are reminding

even for the worst case scenarios

analysis is a very difficult since the data collection was the whole roll

remote control data collection mean thing to ensure a which is one recognition or the

that is useful to doesn't matter the in

so again it is about spoofing detection only so no asv

and a text independent scenario was used

a there is no gave a database that for a little features and classifiers

generalisation is still missing even if

it has been mitigated with some post-evaluation improvements

so let's go to the third automatic speaker verification spoofing and countermeasures challenge

asvspoof two thousand nineteen focused on both

speech synthesis and replay

as for the previous editions it was presented during an asvspoof special session at

interspeech two thousand nineteen

and forty and fifty organisation there are basically the of the challenge order to standards

the asvspoof two thousand nineteen database is designed with

two different use case scenarios

logical access and physical access

also different is the strategy of assessing spoofing countermeasure performance on

asv

instead of as a

stand-alone countermeasure

for this reason for asvspoof two thousand nineteen we have provided the

asv

scores to the participants

so we have adopted as primary metric the minimum normalized tandem detection

cost

function

and as secondary metric the equal error rate

also for most discrimination

use of the t-dcf means that the asvspoof two thousand nineteen database is designed

not for the standalone assessment of countermeasures

but for assessing their impact on the reliability of an asv system when subjected to spoofing

the new normalized t-dcf is inspired by the detection cost function

the

d c f

used in the nist sre challenges

i in a this it is

aims to assess is the this is the last to make sure

to all formalize assessment

so long format or by rate

or you really motivation for a four

okay and the a whole basically

countermeasures system

there are a total of four possible errors

where

a bona fide

target user is rejected by the countermeasure

a bona fide target is rejected by the asv system

non-target trials are accepted

and spoofed trials are accepted

the for possible errors in be formally describe so it is for the costs and

priors are this i mean that one

and the classification tree

it

are computed be taken

the roadie dcf a venue a can be difficult to either us or forming the

formation of the well in the nist speaker recognition issue

it is useful to normalize the cost

the normalized t-dcf is a function of the countermeasure threshold

a similar to the bus the challenge efficient

is useful for those online dating does not goals of pressure of the set in

that means that the calibration

so we think source in this case the traditional or mutually the standard measure to

install involve a corresponding to go for calibration

that correspond to the remaining on remote i

in this

in by fitting the all my racial the to mine

for from the evaluation set using the
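
The idea of a normalized cost that depends on the countermeasure threshold, and of taking its minimum over all thresholds, can be sketched as follows. The talk does not spell out the formula, so this example assumes a generic weighted sum of the countermeasure miss rate and false alarm rate; the weights `miss_weight` and `fa_weight` are hypothetical placeholders standing in for whatever the costs, priors and asv error rates combine to, and the normalization divides by the cost of the better of the two trivial systems (accept everything or reject everything).

```python
import numpy as np


def cm_error_rates(bona_fide_scores, spoof_scores, threshold):
    """Countermeasure miss and false alarm rates at one threshold.

    Assumption: a trial is passed on as bona fide when score >= threshold.
    """
    bona_fide_scores = np.asarray(bona_fide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    p_miss = float(np.mean(bona_fide_scores < threshold))  # bona fide rejected
    p_fa = float(np.mean(spoof_scores >= threshold))        # spoof accepted
    return p_miss, p_fa


def min_normalized_cost(bona_fide_scores, spoof_scores, miss_weight, fa_weight):
    """Minimum over thresholds of a normalized two-term detection cost.

    miss_weight and fa_weight are hypothetical inputs for this sketch.
    """
    all_scores = np.concatenate(
        [np.asarray(bona_fide_scores, dtype=float), np.asarray(spoof_scores, dtype=float)]
    )
    # Add +inf so the "reject everything" operating point is also considered.
    thresholds = np.append(np.unique(all_scores), np.inf)
    default_cost = min(miss_weight, fa_weight)  # best trivial system
    best_cost, best_threshold = None, None
    for threshold in thresholds:
        p_miss, p_fa = cm_error_rates(bona_fide_scores, spoof_scores, threshold)
        cost = (miss_weight * p_miss + fa_weight * p_fa) / default_cost
        if best_cost is None or cost < best_cost:
            best_cost, best_threshold = cost, float(threshold)
    return best_cost, best_threshold


if __name__ == "__main__":
    # Hypothetical countermeasure scores and weights, for illustration only.
    cost, threshold = min_normalized_cost([2.0, 3.0, 1.5], [0.0, 1.0, 1.8], 1.0, 2.0)
    print(f"minimum normalized cost = {cost:.3f} at threshold {threshold}")
```

With this normalization a value of 1 means the countermeasure is no better than a trivial accept-all or reject-all system, and 0 means no errors at all. Picking the minimum over thresholds on the evaluation scores corresponds to an oracle threshold, so it measures discrimination rather than calibration.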

so the asvspoof two thousand nineteen database for both scenarios

is based on the vctk corpus

a multi-speaker english speech database recorded in a hemi-anechoic chamber

charmer still clearly all these things

either

before weights

so it was created using utterances from one hundred and seven speakers

forty six male and sixty one female

the data are downsampled to sixteen khz at sixteen bits per sample

a collection of course uses colour that these in baseball problem in this analysis

it is divided in three subsets

for training development and evaluation in a speaker disjoint manner

for the logical access scenario there are six

text-to-speech and voice conversion attacks

for training and there's fifteen

yes and b c score evaluations that

what the physical analysis

there are then these a holes the

environment

and i sleepily calculation of training

they're an imbalanced

we yes

the two is then of the double doors to provide state-of-the-art yes this is this

if you show a lot of assigning all over the course

this table summarize this system which are fundamentally you go first

the known

spoofing systems are from a zero one to a zero six

including

two v c and four t t s systems

then

from a zero seven to a nineteen except for a sixteen and a nineteen

are the eleven unknown spoofing systems

and a sixteen and a nineteen are the reference

systems using the same algorithms

as

a zero four and a zero six

the l a verification is the lattice

most of our database for speech synthesis and was version is moving the results

this is this ensemble of problem a the weather

two

so

we did not complete with any of the local form

what if i

no

the a

we did not completely of any of the local phone

is you know there speaker one of i

employees are entitled to follow that contract to the latter

a data

employees are entitled followed by a contract so the latter

another speaker who finished

at that time it's telling faction like and five miles

a

i at time m is now and faction within five miles

as you can hear the quality of the synthesized speech is quite

impressive

this is the size of a

a subset evaluations and session

results in terms of a it is for a little baseline we are provided

first of all shows the results for two categories of the us to the speech

yes we see

yes and v c you might

and i saw show results for types of models

there are neural network based

i one

a neural network based and where

yes

neural network based itsy a statistical model based p c

last rule

shows the results from different with for generation that the

in that are

their own where for model classical speech moreover

with four combinations

spectral filtering with typically and orders

in the testing is the complementary you of your over the baseline

otherwise dishonest users you features and the idiot there is a someone else

sdc features

the two thousand seventeen challenge data was created from real replay recordings and the quality of

the recordings was somewhat uncontrolled

aiming to improve upon the last challenge the two thousand nineteen edition is instead based on

simulated and controlled

acoustic and replay configurations

once we use these two similarly enrollment listings and devices we establish right

the remainder of this work are similarly directly on that

we choose a the one sure on the slide

realistic environment winkler only holding the noise putting aside for now the additive noise

we really a decision we consider perfect microphones

and

only at the recording this meeting about a five user

and for variability representation

we can see the that there are

it's carry out that the single session as that used a

and will only of the device quite in this case the last speaker

the physical access scenario assumes use in it is the leading to convey such as

illustrated in fig

there was a single iteration which please this is then this it will it is

also s

is this the data will environment distinction room size or categorize in two different

in the remote's label

i will rule

we may be able

and see that actual

the position of the asv microphone is illustrated by the yellow crossed

circle in the figure and the position of the talker is illustrated by the

blue star

well i assess it is harder

maybe by the okay well we'll see change a distance yes for the microphone

it is also illustrated in the table environment definition there are three categories or at

least and

and unlabeled a short distance be making this that and see that at least

each physical space is assumed to exhibit reverberation variability according to the differences

between spaces

such as the wall ceiling and floor absorption coefficients

as well as the position in the room

the level of reverberation is specified in terms of the t sixty reverberation time

categorised as illustrated in the table

it's fifty whatever item of definition

they are the result is six is the u

a little i shall we menu and

see i recognition

it is this is the microphone and that okay or writing reading the visual speech

there was a shown are so well

we think that although there is an environment as

you can see that symbol on the right

the man and language for the that's a month it is also illustrated in this

paper

but something that is modeled by making and then recording over one of five as

this

and but are sending their according to be is the microphone

according are assumed to be made in one over the three zones used to people

each representing a different vowel the oldest the problem or

in the state in table are a definition if they are labeled character i shows

this task of the medium distance and

largest

in addition to the variation lately we release let us define the means for recording

and presentation devices

we can see that only the presentation

no speaker

encoding only and better living in the last speaker if there are four selected

we use the categorisation

and without any

but if there

that would be

i and it

currency one

this case we or they have online replaying configuration as you can see and the

table

on the right

the simulation once either two containers all the speakers

each with a different range of the whole by about we mean frequency and maybe

a linear calibration

the first

a typical vector category represent the mean dillydallying in full band lot speaker

i one last speaker and a megabyte bound we the icsi and units

and the being able to more linear or racial a study

and one hundred

addition

and if you're you can see an illustration of set of the higher money frequency

responses

for i don't be noise model

the replay devices were estimated using the synchronized swept sine nonlinear system identification

based on nonlinear convolution

h one in the figure is the linear component

while from h two to h five

are the higher order nonlinear components

the blue where the shaded region represent the right boundary

is it is still real devices from which measurement where the again for simulation or

a clear presentation

the first table on the left indicates a multi device is why on the right

in the case of interest

device that will signifies which type of the magazines

what are some all but is a little speaker

right most column in the case

if the device were used for the simulation of dance in the training and development

sets were not devices

or evaluations and i don't devices

this figure shows again at least commission for the different laws speakers

device

used for this evaluation

the top plot shows a by means of the glottal sure the lower one of

the binary but we are the mean and frequency

the bottom plot the should ideally a linear calibration

in the range of the d

or by about

devices are sort the wheat the wideband

this figure shows baseline results for the p a scenario of the asvspoof two

thousand nineteen database

results are used to read and fourteen you important to be in configuration

one you acoustic environments

and for to monitor a standard on arrays here's something german equal error rate between

target and zero for impostor trials that is the blood spatter

and target and replaceable from the area they leave are

i mean don't wanna mixture on the stand-alone replace moving in terms of equal error

rate

for baseline a be one and b two

and the bottom panel there is a combine is the and cm results use created

in terms of the me

e it is yes

for this result we guess they the to the is anyone interview medium

as for the previous challenges expecting clear

and moreover the worst the screens are

two or swings high when the device scenes and a little darker to talk be

stuff

its own can now the challenge results this figure shows the profiles for the baseline

this system b

zero two

and the best the

performing primary system for the in the means you're fine

and the seen teams single system

it is also shown the second best performing the single system for a in the

for immorality

forty five

so the lowest equal error rate is zero point two

percent

that is a great result

however from these results it is clear that there is a substantial gap between

primary and single systems

so this means that fusion is important

is line shows the one the mean the team this year and equal error rate

the results from one before you conditions

to the in the age scenario

the first screening feel boring the on the x-axis and then don't whether or not

the system are the nn based or three systems

while the second denotes whether or not the systems are instance systems

which combine more all

so systems

or single system

we cannot the for really there is a manager you all the n and beast

and the in symbol systems

in addition it is also clear that the equal error rate and the min t-dcf

are measurements that are not correlated

as you can see in these two are red and blue

in this like the it is shown all the results for the thirty nine hour

in the evaluation set for the top then brown many solutions

first of all we can see that the baseline asv equal error rate

that means with no spoofing

is two point five percent

when we need class i think moving at a the is this is then becomes

what inaudible

again if the individual tax someone else a degree is the performance

that are easy to detect

there reminding the against you

us some degree the easy performance

and i difficult to the data they want in the or ranch a physical

and one only one that is the a seventeen

as in this entire on the knees the but is very difficult to detect

that is the one in the utterance to scroll

so let's move on now to the challenge results for pa this figure shows the profiles

for the baseline system b01

the best performing primary system for the pa scenario and the same team's single system

the lowest equal error rate here is zero point four percent

this is indeed a good result

was it to invade

here there is less of a discrepancy between primary and single systems

so fusion it seems is not so important

this slide shows the min t-dcf and equal error rate results

for the pa scenario

the groupings as before on the x-axis denote nn based and non nn based

ensemble and

non ensemble systems

note that as for la

clearly there is an advantage for nn based and ensemble systems

in this slide are shown the results for all the nine attacks

in the evaluation set for the top ten primary submissions

and we can see the baseline asv equal error rate

well seems keys needs solos moving is he going for example for stack

when we include spoofing attacks the asv system

becomes

vulnerable

so looking at these attacks

we can see that the performance increases

when

the attacker to talker distance becomes greater

so there are very fancy one

and decreases when the quality of the device gets better

so real routes suitable

in this slide are shown the results now for all the twenty seven

environments in the evaluation set again for the top ten primary submissions

so looking at the individual environments we can see that the performance

the graces where the room i recall

really

so the received go

in case is when they are very the given variational model because higher

c

and increases when the talker to asv distance becomes higher

getting

see what

so to summarise asvspoof 2019 focused on

replay tts and voice conversion

a simple even if one would be evaluated

we have shown that the asv system is vulnerable

to spoofing attacks

we have adopted the min t-dcf to measure countermeasure performance

on asv

so instead of a doing these the on the standard on one dimensional

we have seen a transition from features to classifiers so and unit order to into

and that

and one double the fused system with the biggest challenges

don't demand countermeasures are very

how to the speech sounds are

very natural

is the recognition accuracy very clear by detection again be proven to work this time

of by only and stage

generalization is still missing

much more has to be done

so i don't to the union a and for decision

asvspoof two thousand

twenty one

so but for finnish thing to do not i like to wish to some softer

each for speaker recognition grunting using from us at all

it appears to keep the from a is

and my results to identically to overcome my as well

currently silently to from the university

you can finally two databases for easy and the disposing

i thought winter the is additional database misleading

and nist and the are star burst in that the speaker recognition database

a right don't

and the text dependent speaker recognition database

we also the a e for it is simply a database from

and the speaker wire a new speech and boxers

so here you can find some of the for this thing

matlab implementation of training and the scope of this common conditions

this is used as you features

and the three these coding systems that an easy to a last challenge

on the website you can find the matlab and python implementation of the t-dcf

and the in your with the regarding the is a you please easy the a

one website

we need you are cool

last but not least i would like to mention two projects

where i'm the principal investigator region two d measurement recognition also not only speech

a disapproving and

closing phase information

classifiers and respect

thus nazis ultimately increase the number eighty three networks

and the domain instruments increment because representing volume i mean and he uttering networks

and the second respect

is a french german project

and is completely means he or more secure and presenter's the remote embodiment person authentication

thank you for listening and see you at the q and a session