Hello everybody,
and welcome to my tutorial on spoofing in automatic speaker recognition.
I'm [name unclear], an assistant professor at [affiliation unclear].
I will give an overview of spoofing and spoofing detection for automatic
speaker verification,
giving more attention to current research and progress
in the development of countermeasures for ASV systems.
but also we don't to the cost
Automatic speaker verification is one of the most convenient and natural means of
biometric person recognition.
This is why the technology is found in a growing range of applications and services, such as smartphones
and smart speakers.
The technology has advanced a lot over the last years, driven by the
increasing use of deep neural network solutions
such as x-vectors,
which have to some extent replaced traditional Gaussian mixture models
and the so-called i-vectors,
and end-to-end approaches are also emerging.
We can say that speaker recognition technology has probably reached the level of performance required
for practical use.
The question now is whether or not an ASV system is vulnerable to spoofing.
Unfortunately the answer is yes.
The reliability of voice biometric technology can be compromised by spoofing attacks, namely by the
vulnerability of the technology to external manipulation.
One of the major threats to the security of biometric systems is spoofing attacks,
also known as presentation attacks.
A fraudster carries out a spoofing attack to fool a biometric system into recognising an
illegitimate user as a genuine user, or in order to avoid being recognised.
This is achieved by presenting to the system a synthetic forgery
of the biometric trait.
But before we look at spoofing attacks, let us recall how an ASV system is assessed.
An ASV system tries to answer this question: is the speaker who
they say they are?
This means that a target trial, where the claimed identity is genuine, as well as a
non-target trial
can each be accepted or rejected by the speaker verification system.
This results in two different types of errors, namely false alarms and false rejections,
as shown in the table.
The system decides either that the user is the claimed target speaker or that the user is
an impostor.
Here impostors have no knowledge of the target speaker's voice and make no effort
to imitate it.
so
Given a test trial, the ASV system provides a score: the higher the score, the greater the
confidence that the speaker's voice matches the claimed identity.
A decision is made, in order to discriminate between target trial
and non-target trial scores, by selecting a threshold between the two score distributions.
However, the target and non-target score distributions
usually overlap.
This trade-off can be seen in the detection error trade-off (DET) curve
on the right. The point where the false alarm rate is equal to
the false rejection rate,
at a certain threshold, is called the equal error rate.
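As a minimal sketch of the threshold sweep just described, the following Python computes false rejection and false alarm rates over candidate thresholds and picks the point where they are closest. The function names and the toy scores are illustrative assumptions, not part of the talk or of any challenge toolkit.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores):
    """Sweep a threshold over all observed scores and return
    (thresholds, false rejection rates, false alarm rates)."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # A trial is accepted when its score is >= the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return thresholds, frr, far

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER as the operating point where the two
    error rates are closest (no interpolation between thresholds)."""
    thresholds, frr, far = error_rates(target_scores, nontarget_scores)
    idx = int(np.argmin(np.abs(frr - far)))
    return (frr[idx] + far[idx]) / 2.0, thresholds[idx]

# Toy scores (hypothetical): the two distributions overlap.
eer, threshold = equal_error_rate([0.0, 2.0], [1.0, -1.0])
```

With well-separated distributions the same function returns an equal error rate of zero; the more the distributions overlap, the higher the value.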
Is this really realistic?
An impostor may know how the ASV system performs,
or they can manipulate the input to the ASV system.
The aim of an attack is to provoke false alarms by increasing the ASV classifier
scores towards those of the target, while avoiding detection.
We can distinguish a spoofing impostor from a zero-effort impostor.
Unlike zero-effort impostors, spoofing impostors use
some processing to create a fake speech signal.
The challenge here is to find a solution to detect it: there are many variables
involved in this process, and there are still many questions to ask.
Do the artefacts come from linear or non-linear processing? Do they reside in only part of the
spectrum? Should we look also at the phase of the signal?
We will come back to these questions later, when we have more elements to go on.
There are two general approaches to countermeasures: improving the ASV robustness itself,
or adding dedicated countermeasures,
that is, spoofing detection.
In this slide you see an example DET plot illustrating the baseline performance when
the impostors are zero-effort impostors,
the black line;
the performance degradation when the system is attacked by spoofing impostors,
the red line;
and the improvement in performance when countermeasures are applied.
Note that, even with perfect countermeasures, the best performance we can
reach is the baseline performance.
Nowadays biometrics, including voice biometrics, is becoming mainstream.
Many studies have pointed out its vulnerability issues.
Spoofing can undermine confidence in ASV, and it is important to integrate some level of countermeasure,
or presentation attack detection, to reduce false acceptances
due to spoofing attacks.
These spoofing attacks can be generated through speech synthesis
or voice conversion
in logical access scenarios, or through recording-related approaches, namely replay,
where we inject the audio directly into the ASV microphone.
These represent the major threats.
A fourth attack is impersonation, which consists in imitating a human voice.
The threat posed by impersonation is not entirely clear, since there are only a few
studies
involving small datasets.
It is not surprising
that
there is no previous work investigating countermeasures against impersonation.
The possible locations of attack points in a typical ASV system may be before or
after the microphone, as illustrated in the figure,
corresponding to physical access and logical access respectively.
ASV is more vulnerable to spoofing than other biometric systems based on different biometric traits.
Just consider that samples of a person's voice can be collected readily
by standing close to a face-to-face or telephone conversation,
and then replayed in order to mount a replay attack;
or more advanced voice conversion or speech synthesis algorithms
can be used to generate particularly
effective spoofing attacks
using only modest amounts of voice data collected from a target person.
This table summarises the four spoofing attacks in terms of accessibility, effectiveness
and the availability of countermeasures.
Except for the impersonation attack, they all pose a high risk to ASV,
especially in text-independent scenarios, and the availability of countermeasures
raises the question of
generalisation to different
or unseen attacks.
So this is the timeline of the ASVspoof initiative.
Before ASVspoof,
earlier studies on speaker verification spoofing and countermeasures
were carried out using a limited number of spoofing attacks.
It is clear that the development of countermeasures using only a
small number of spoofing attacks
offers no guarantee of generalisation to unseen attacks.
Moreover,
there was a lack of publicly available corpora and evaluation protocols, so it was not possible
to compare the results obtained by different researchers.
ASVspoof aims to establish and steer the research effort by making
available standard speech corpora
with a large number of spoofing attacks,
evaluation protocols and metrics,
to support common evaluation and the benchmarking of different systems.
The ASVspoof challenge has been organised three times so far:
the first in 2015,
the second in 2017 and the third in 2019.
Each was presented at a corresponding special session at the Interspeech conference.
is actually current own analyses of this visible for you as well as the their
finish definition to partition your see the company around the work
The first ASVspoof challenge involved the detection of artificial speech
created using a mixture of voice conversion and speech synthesis techniques.
It was organised as a special session at Interspeech 2015,
and sixteen organisations participated in the challenge.
ASVspoof 2015 involved only logical access attacks, and
the spoofed speech was generated with ten different speech generation algorithms.
It was based on a large collection of recordings called the SAS corpus,
and consists of both natural and spoofed speech.
The natural speech was recorded using a high-quality microphone
and without significant channel or background noise effects.
The ASVspoof 2015 database was divided into three subsets,
the training, development and evaluation sets, in a speaker-disjoint manner.
Five attacks, from S1 to S5, referred to as known attacks,
were used
in the training, development and evaluation sets.
The other five attacks, from S6 to S10, referred to
as unknown or unseen attacks,
were used only in the evaluation set, along with the known attacks.
Looking at the algorithms used for
voice conversion and speech synthesis,
nine of them are vocoder-based, with HMM- or GMM-based models,
while only one, S10, is unit-selection-based
speech synthesis, implemented with the open-source MaryTTS
text-to-speech system.
The vulnerability of an ASV system based on i-vectors is
pretty clear.
Except for [attack name unclear],
the attacks are very effective, with an important increase
of the equal error rate.
The worst case
is S10,
i don't to one
directly to fifty one will ones
it is seventeen
The table on the left shows the challenge results
in terms of the average equal error rate across all attacks in the
evaluation set,
for known and unknown attacks.
There is a clear lack of generalisation in these results.
The table on the right shows the top-performing
systems evaluated only
on S10,
the unit-selection-based speech synthesis attack.
It is the most difficult to detect,
and it is also the most dangerous for speaker verification systems, as shown previously.
So S10 is effectively the biggest threat for the ASV system in
this case.
and used in one is on the
the front end of a against the door for a performing system
on the challenge
What I want to stress about this challenge is related to the
features,
and the level of attention given to the front end.
The top-performing system used cochlear filter
cepstral coefficients,
which are related to the human auditory system.
possible these something that john it problem
So, not long after the challenge evaluation of
ASVspoof 2015,
we proposed a new feature named constant Q cepstral coefficients (CQCC).
It is based on the constant Q transform, which is an alternative to the Fourier transform
and which employs a variable time-frequency resolution. That means
greater time resolution for higher frequencies
and greater frequency resolution for lower frequencies.
So the constant Q transform was the first ingredient of an idea to reflect more
closely human perception.
To obtain CQCC features we combine the constant Q transform
with traditional cepstral analysis.
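The variable time-frequency resolution can be made concrete with a small sketch. It only lists the bin centre frequencies, bandwidths and analysis window lengths of a generic constant Q analysis; it does not extract CQCC features. The parameter values (`f_min`, `f_max`, `bins_per_octave`, `sample_rate`) are illustrative assumptions, not the settings used in the work described in the talk.

```python
import numpy as np

def constant_q_bins(f_min=15.0, f_max=8000.0, bins_per_octave=12,
                    sample_rate=16000):
    """Return centre frequencies (Hz), bandwidths (Hz), window lengths
    (samples) and the constant quality factor Q of a constant Q analysis."""
    n_bins = int(np.floor(bins_per_octave * np.log2(f_max / f_min))) + 1
    k = np.arange(n_bins)
    # Geometrically spaced centre frequencies.
    centre = f_min * 2.0 ** (k / bins_per_octave)
    # Q is the ratio of centre frequency to bandwidth, identical for all bins.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bandwidth = centre / q
    # Analysis window length needed to resolve each bin.
    window = q * sample_rate / centre
    return centre, bandwidth, window, q

centre, bandwidth, window, q = constant_q_bins()
# Low bins: narrow bandwidth, long window (fine frequency resolution).
# High bins: wide bandwidth, short window (fine time resolution).
```

A Fourier transform, by contrast, uses one window length and one bandwidth for every bin.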
i should be for that the only thing started in the challenge
where only able to the test then i probably
So CQCC
obtained comparable results for known attacks and the best results for
unknown attacks, with an 87 percent relative improvement on S10
and 72 percent overall.
So, to summarise, ASVspoof 2015 focused on voice conversion and speech
synthesis attacks, so-called logical access;
ASV-independent spoofing detection, that is,
a standalone countermeasure;
and a text-independent scenario.
The participants invested effort to develop features, using mostly simple classifiers,
and the challenge found generalisation to be missing.
Let's move on to the second challenge, with some possible improvements.
While the 2015 edition used very high quality speech material, the
2017 edition aimed to assess replay attack detection
in what we call "in the wild"
conditions,
and focused exclusively on replay attacks.
The second automatic speaker verification spoofing and countermeasures challenge was presented
at a special session
at Interspeech 2017,
and a number of organisations participated in the challenge.
ASVspoof 2017 was derived from the RedDots text-dependent
corpus,
whose purpose was to collect speech data over mobile devices,
in the form of smartphones or tablet computers,
by volunteers from across the globe.
We collected the ASVspoof 2017 database using a playback device
and a recording device in different acoustic environments.
We did not use a realistic scenario based on covert recording; instead we replayed
the actual
digital copy of the target speaker's voice
to create the replay data collection.
This is the worst-case scenario, which assumes the attacker has access
to the original speech.
The corpus is divided into three subsets for training, development and evaluation,
with different speakers, replay sessions and replay configurations.
The training and development subsets were collected at three different sites,
and the evaluation subset was collected at the same three sites and also includes data
from a new site.
this is the loudest most the inverse italy that
In terms of ASV vulnerability to replay attacks for this challenge, also here
it is clear.
The ASV system is based on a GMM,
and the replay attacks are very effective,
with an important increase of the equal error rate
for all
one point eight fifty one point five
on these evaluation set
The primary evaluation metric is the equal error rate. In contrast to the ASVspoof 2015
challenge,
the equal error rate is computed from scores pooled across all trial segments rather than
by condition averaging.
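The difference between the two ways of summarising can be sketched as follows. The helper names and the toy scores are hypothetical; the point is only that countermeasure scores which separate bona fide from spoofed speech within each condition may still overlap once all trials are pooled under a single threshold.

```python
import numpy as np

def eer(bona_fide, spoofed):
    """Equal error rate at the threshold where miss and false alarm
    rates are closest (no interpolation)."""
    bona_fide = np.asarray(bona_fide, dtype=float)
    spoofed = np.asarray(spoofed, dtype=float)
    thresholds = np.sort(np.concatenate([bona_fide, spoofed]))
    miss = np.array([(bona_fide < t).mean() for t in thresholds])
    fa = np.array([(spoofed >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(miss - fa)))
    return float((miss[idx] + fa[idx]) / 2.0)

def pooled_and_averaged_eer(conditions):
    """conditions maps a condition name to (bona fide scores, spoofed scores).
    Returns (EER over pooled scores, mean of per-condition EERs)."""
    averaged = float(np.mean([eer(b, s) for b, s in conditions.values()]))
    pooled_bona = np.concatenate([np.asarray(b, float) for b, _ in conditions.values()])
    pooled_spoof = np.concatenate([np.asarray(s, float) for _, s in conditions.values()])
    return eer(pooled_bona, pooled_spoof), averaged

# Hypothetical scores: each condition is separable on its own,
# but the score ranges are shifted relative to each other.
toy = {
    "condition_a": ([5.0, 6.0], [3.0, 4.0]),
    "condition_b": ([1.0, 2.0], [-1.0, 0.0]),
}
pooled, averaged = pooled_and_averaged_eer(toy)
```

In this toy case the condition-averaged value is zero while the pooled value is not, which is why pooling is the stricter measure: it also penalises scores that are not comparable across conditions.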
why fourteen estimation
perform the baseline while existing three and their the
at a performance is the old in more than seven percent relative improvement we used
a dismissal a
The baseline system is based on a GMM classifier with constant Q cepstral coefficient features.
It was provided to the participants.
Comparing the two baselines,
there is an important performance improvement when using training plus development data for training.
this is this idea of the parameter submission to residuals
it doesn't seventy
i don't training refer to the bar all the time for training
a sense for three and a reasonable
most all the systems a lower bound for the features
this call mom for all the systems to build a gmm classifier
single cost you as you can see
the invariant use whatever means of all around solution is twenty five one ninety one
understand
whereas the best single-system result shows
an average detection equal error rate
of
only 6.7 percent.
The results of the ASVspoof 2017 challenge show that
the detection of replay attacks is more difficult than the detection of speech synthesis and voice
conversion.
Countermeasure generalisation also remains a problem.
After the challenge, some data anomalies were detected,
such as zero-valued samples present at the beginning of some speech utterances.
is zero really running by for the easy to be a
but maybe but i for a modified versions for speech detection
To fix these issues, version 2.0 was released to correct the
anomalies
detected after the evaluation.
In addition, metadata which describes the recording and playback devices and the acoustic
environments was released, along with a new baseline.
the new metadata along with the data by ching as there is the number uterrances
as well as the a population or the evaluation set
remember when i'm better than for each other
for a better understanding of the outcomes we can rewrite the square the regulation terms
of the speaker measurement recording playback devices
acoustic environment is a physical spacing which original stage the that basically then here or
it is reasonable because seventeen database was collected you have a different environment
the evaluation meeting there about the accent level over even more controlled noise
the
for example can be in we model noise and balcony are assumed to be noisy
all these
all right are assumed to be maybe which in your oracle room huh
are assumed to be are actually
there are under the of a twenty six a little better prices
a smart phones the lower bound we
if we the we fifteen this moral speakers
are assumed to be all over the
well e
a little larger lot of speakers are assumed to be your mean you rightly
and the professional or do we managed are assumed to be i
assuming only there are a total twenty five recording devices
some are ones that are the weights for my from source would be a little
windy and it's where a microphone are assumed to be over the medium by i
and the again the regression your and b i
This figure shows the impact of different replay configurations on ASV performance, measured in
terms of equal error rate.
Zero-effort impostor trials are replaced with replay trials,
in turn for each replay configuration.
The table on the right shows the replay configurations sorted according
to the ASV equal error rate.
all pole a core also reflect the supposed to be a is the
where we are in this a little degradation
this is done
they higher than one at a very little degradation the motive for effect in a
the three years
it is this detection performance of a gmm robot
and i-vectors read about smoking the dimension
for this thing that a little degradation
also expressing that all the equal error rate
the first edition these results is that the recently the correlation between the specifically to
the thing
detection or everybody detection or
this is a fine reflect the final complex of overwhelmingly device
there was to get about a man and the recording right
the control on the right a to see the results in terms of the all
only a in a environment going back and replay value
results show the number of a single element of the little degradation for all i
trials this was all we trials corresponding with either one of the
i in my all their acoustic environment a system we need the effect of the
playback and recording device
To summarise, ASVspoof 2017 focused only on replay attacks,
in so-called "in the wild" conditions.
Performance is promising,
even for the worst-case scenario.
Analysis is very difficult, since the data collection was not controlled.
remote control data collection mean thing to ensure a which is one recognition or the
that is useful to doesn't matter the in
so again is related to smoking detection so nicely where
text independent scenario will use
a there is no gave a database that for a little features and classifiers
it generalisation is even missing giving me a
it's been mitigated i mean green post evaluation improvement
So let's move to the third automatic speaker verification spoofing and countermeasures challenge,
which focused on both
speech synthesis and voice conversion, and replay.
As for the previous editions, it was presented at an ASVspoof special session at
Interspeech 2019,
and a number of organisations participated in the challenge.
The ASVspoof 2019 database is divided into
two different use case scenarios:
logical access and physical access.
Also different is the strategy of assessing spoofing countermeasure performance on the
ASV system,
instead of testing a
standalone countermeasure.
For this reason, for ASVspoof 2019 we provided the
ASV
scores to the participants.
So we adopted as the primary metric the minimum normalised tandem detection
cost
function, the t-DCF,
and as the secondary metric the equal error rate,
which also measures discrimination.
Use of the t-DCF means that the ASVspoof 2019 database is designed
not for the standalone spoofing detection task in isolation,
but with a focus on the reliability of an ASV system when subjected to spoofing attacks.
The t-DCF is inspired by the detection cost function,
the
DCF,
used in the NIST SRE challenges.
The t-DCF
aims to assess the ASV system and the countermeasure together,
to allow a formalised assessment.
This is the motivation for the t-DCF.
In a cascaded
countermeasure and ASV system
there are a total of four possible errors,
where:
one, a target is rejected by the countermeasure;
two, a target is rejected by the ASV system;
three, a non-target trial is accepted;
and four, a spoofed trial is accepted.
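A small sketch can make the four error types concrete. It assumes, purely for illustration, that the countermeasure runs first, that a trial is accepted only if it passes both the countermeasure and the ASV threshold, and that both thresholds are zero; the class and function names are hypothetical.

```python
from enum import Enum

class Trial(Enum):
    TARGET = "target"
    NONTARGET = "non-target"
    SPOOF = "spoof"

def tandem_outcome(trial, cm_score, asv_score,
                   cm_threshold=0.0, asv_threshold=0.0):
    """Label the outcome of one trial passed through a cascaded
    countermeasure (CM) and ASV system."""
    passes_cm = cm_score >= cm_threshold
    passes_asv = asv_score >= asv_threshold
    accepted = passes_cm and passes_asv
    if trial is Trial.TARGET:
        if not passes_cm:
            return "error: target rejected by countermeasure"
        if not passes_asv:
            return "error: target rejected by ASV"
        return "correct accept"
    if accepted:
        if trial is Trial.NONTARGET:
            return "error: non-target accepted"
        return "error: spoof accepted"
    return "correct reject"

# A spoofed trial that fools both subsystems is the fourth error type.
outcome = tandem_outcome(Trial.SPOOF, cm_score=1.2, asv_score=0.8)
```

Counting these labels over a set of trials gives the error rates that a tandem cost then weights with costs and priors.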
the for possible errors in be formally describe so it is for the costs and
priors are this i mean that one
and the classification tree
it
are computed be taken
The raw t-DCF value can be difficult to interpret, so, following the
normalisation of the DCF in the NIST speaker recognition evaluations,
it is useful to normalise the cost.
The normalised t-DCF is a function of the countermeasure threshold.
Similar to the past challenge editions,
ASVspoof 2019 does not focus on threshold setting,
that is, on calibration.
So we report in this case the minimum normalised t-DCF,
which corresponds to oracle calibration,
that is, to the minimum of the normalised cost
obtained
by fixing the optimal countermeasure threshold
from the evaluation set using the ground truth.
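The idea of a minimum normalised cost can be sketched as below. This is not the official t-DCF implementation: the weights `c1` and `c2` are hypothetical placeholders for the constants that the evaluation plan derives from the costs, the priors and the ASV error rates, and dividing by the smaller weight (the cost of a system that accepts or rejects everything) is an assumption made for this sketch.

```python
import numpy as np

def min_normalised_cost(bona_fide_cm, spoof_cm, c1, c2):
    """Sweep the countermeasure (CM) threshold and return the minimum
    normalised cost together with the threshold that attains it.

    c1 weights CM misses (bona fide rejected), c2 weights CM false
    alarms (spoof accepted). Both are assumed inputs here."""
    bona = np.asarray(bona_fide_cm, dtype=float)
    spoof = np.asarray(spoof_cm, dtype=float)
    # Candidate thresholds: every observed score, plus +inf (reject all).
    thresholds = np.concatenate([np.sort(np.concatenate([bona, spoof])), [np.inf]])
    p_miss = np.array([(bona < t).mean() for t in thresholds])
    p_fa = np.array([(spoof >= t).mean() for t in thresholds])
    cost = c1 * p_miss + c2 * p_fa
    normalised = cost / min(c1, c2)
    idx = int(np.argmin(normalised))
    return float(normalised[idx]), float(thresholds[idx])

# Hypothetical CM scores and weights.
value, threshold = min_normalised_cost([2.0, 3.0], [0.0, 1.0], c1=1.0, c2=2.0)
```

Because the threshold is chosen with knowledge of the labels, the result is an optimistic, calibration-free figure; a deployed system would have to fix its threshold in advance.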
The ASVspoof 2019 database is based on
the VCTK corpus,
a multi-speaker English speech database recorded in a hemi-anechoic
chamber.
It was created using utterances from 107 speakers,
46 male and 61 female.
The data are downsampled to 16 kHz at 16 bits per sample.
It is divided into three subsets
for training, development and evaluation, in a speaker-disjoint manner.
For logical access there are six
text-to-speech and voice conversion attacks
for training and development, and thirteen
TTS and VC attacks for the evaluation set.
what the physical analysis
there are then these a holes the
environment
and i sleepily calculation of training
they're an imbalanced
we yes
ASVspoof 2019 used state-of-the-art TTS and VC
systems.
This table summarises the systems.
The known
spoofing systems, A01 to A06,
include
two VC and four TTS systems.
Then
A07 to A19, except for A16 and A19,
are the eleven unknown spoofing systems,
and A16 and A19 are the reference
systems, using the same algorithms
as
A04 and A06.
the l a verification is the lattice
most of our database for speech synthesis and was version is moving the results
this is this ensemble of problem a the weather
two
so
we did not complete with any of the local form
what if i
no
the a
we did not completely of any of the local phone
is you know there speaker one of i
employees are entitled to follow that contract to the latter
a data
employees are entitled followed by a contract so the latter
another speaker who finished
at that time it's telling faction like and five miles
a
i at time m is now and faction within five miles
As you can hear, the quality of the synthesised speech is quite
impressive.
this is the size of a
a subset evaluations and session
Here are results for the two baselines we provided.
The first row shows the results for two categories of spoofed speech:
TTS and VC.
The second row shows results by type of model:
neural-network-based
TTS,
neural-network-based VC,
and statistical-model-based VC.
The last row
shows the results for different waveform generation methods:
neural waveform models, classical speech vocoders,
waveform concatenation
and spectral filtering.
In general there is complementarity between the two baselines:
one uses CQCC features and the other uses
LFCC features.
it doesn't say challenge data was created from the rio your presentation visual quality of
the score was somewhat cold or
leading to improve upon the last challenge it doesn't line in addition to this once
you weighted and all
acoustic and global calibration
once we use these two similarly enrollment listings and devices we establish right
the remainder of this work are similarly directly on that
we choose a the one sure on the slide
realistic environment winkler only holding the noise putting aside for now the additive noise
we really a decision we consider perfect microphones
and
only at the recording this meeting about a five user
and for variability representation
we can see the that there are
it's carry out that the single session as that used a
and will only of the device quite in this case the last speaker
The physical access scenario assumes the use of ASV in a physical space, such as the one
illustrated in the figure.
there was a single iteration which please this is then this it will it is
also s
is this the data will environment distinction room size or categorize in two different
in the remote's label
i will rule
we may be able
and see that actual
The position of the ASV microphone is illustrated by the yellow crossed
circle, and the position of the talker is illustrated by the
blue star.
well i assess it is harder
maybe by the okay well we'll see change a distance yes for the microphone
it is also illustrated in the table environment definition there are three categories or at
least and
and unlabeled a short distance be making this that and see that at least
each physical space system to explain that in addition variability are according to the difference
between space
which can be seen as a wall ceiling and the for submission coefficients
as well as the position interval
the level overrated variation used busy fighting the or the is sixty two variation by
the by are
it's fifty whatever item of definition
they are the result is six is the u
a little i shall we menu and
see i recognition
it is this is the microphone and that okay or writing reading the visual speech
there was a shown are so well
we think that although there is an environment as
you can see that symbol on the right
the man and language for the that's a month it is also illustrated in this
paper
but something that is modeled by making and then recording over one of five as
this
and but are sending their according to be is the microphone
according are assumed to be made in one over the three zones used to people
each representing a different vowel the oldest the problem or
in the state in table are a definition if they are labeled character i shows
this task of the medium distance and
largest
in addition to the variation lately we release let us define the means for recording
and presentation devices
we can see that only the presentation
no speaker
encoding only and better living in the last speaker if there are four selected
we use the categorisation
and without any
but if there
that would be
i and it
currency one
this case we or they have online replaying configuration as you can see and the
table
on the right
the simulation once either two containers all the speakers
each with a different range of the whole by about we mean frequency and maybe
a linear calibration
the first
a typical vector category represent the mean dillydallying in full band lot speaker
i one last speaker and a megabyte bound we the icsi and units
and the being able to more linear or racial a study
and one hundred
addition
and if you're you can see an illustration of set of the higher money frequency
responses
for i don't be noise model
the little device estimated using desynchronized we design a linear system identification
based on a linear convolution
each one in the finger is the a linear component
while from age to if i
i the higher wouldn't nonlinear components
the blue where the shaded region represent the right boundary
is it is still real devices from which measurement where the again for simulation or
a clear presentation
the first table on the left indicates a multi device is why on the right
in the case of interest
device that will signifies which type of the magazines
what are some all but is a little speaker
right most column in the case
if the device were used for the simulation of dance in the training and development
sets were not devices
or evaluations and i don't devices
this figure shows again at least commission for the different laws speakers
device
used for this evaluation
the top plot shows a by means of the glottal sure the lower one of
the binary but we are the mean and frequency
the bottom plot the should ideally a linear calibration
in the range of the d
or by about
devices are sort the wheat the wideband
this figure shows baseline results for maybe a scenario of the is useful to two
thousand nineteen database
results are used to read and fourteen you important to be in configuration
one you acoustic environments
and for to monitor a standard on arrays here's something german equal error rate between
target and zero for impostor trials that is the blood spatter
and target and replaceable from the area they leave are
i mean don't wanna mixture on the stand-alone replace moving in terms of equal error
rate
for baseline a be one and b two
and the bottom panel there is a combine is the and cm results use created
in terms of the me
e it is yes
for this result we guess they the to the is anyone interview medium
as for the previous challenges expecting clear
and moreover the worst the screens are
two or swings high when the device scenes and a little darker to talk be
stuff
Let's move now to the challenge results. This figure shows the DET profiles for the baseline
system
B02
and the best
performing primary system for the LA scenario,
and the same team's single system.
It also shows the second-best performing single system for the LA
scenario,
T45.
So the lowest equal error rate is 0.2
percent.
That is a great result.
However, from these results it is clear that there is a substantial gap between
primary and single systems,
so this means that fusion is important.
This slide shows the min t-DCF and equal error rate
results for the submissions
to the LA scenario.
The first label on the x-axis denotes whether or not
the systems are NN-based,
while the second denotes whether the systems are ensemble systems,
which combine multiple
subsystems,
or single systems.
We can observe that there is a majority of NN-based
and ensemble systems.
In addition, it is also clear that the equal error rate and the min t-DCF
are measurements that are not perfectly correlated,
as you can see in the red and blue curves.
In this slide are shown the results for the thirteen attacks
in the evaluation set, for the top ten primary submissions.
First of all we can see that the baseline ASV equal error rate,
that means with no spoofing,
is 2.5 percent.
When we include spoofing attacks, the ASV system becomes
vulnerable.
Some of the individual attacks degrade the ASV performance
but are easy to detect.
The remaining attacks
degrade the ASV performance to some degree
and are difficult to detect,
and there is one, A17,
that is very difficult to detect.
So let's move now to the challenge results for PA. These figures show the DET profiles
for the baseline system B01,
the best performing primary system and the same team's single system.
The lowest equal error rate here is 0.4 percent.
These are indeed impressive results.
Here there is less of a discrepancy between primary and single systems,
so fusion seems not so important.
this is my shoulder while the mean dcf decoder ring the results for one if
is shown that to the each and you
and anything point as before on the x-axis denote a unit based in the nn
three or and channel and
the known in some other systems
not of the to as for any
cole p the there is a manager he or and bees and the instance systems
it is like a this on or on the results for all the nine a
single evaluation set for the door then primary submission
and we can see that the baseline is the query
well seems keys needs solos moving is he going for example for stack
when we in class looking at a this is then
because
wouldn't it
So looking at these results,
we can see that the performance increases
when
the attacker-to-talker distance becomes greater,
and decreases when the quality of the replay device gets better.
it is nice on all of the silence now four or other than twenty seven
that environments and evaluation sets again for the a the parameter estimation
so looking at least and over individual environments we can see that the performance is
the graces where the room i recall
really
so the received go
in case is when they are very the given variational model because higher
c
and increase when the to go to easily distance becomes higher
getting
see what
So to summarise, ASVspoof 2019 focused on
both replay and TTS or voice conversion attacks;
simulated replay attacks were evaluated.
We have shown that ASV systems are vulnerable
to spoofing attacks.
We have defined the min t-DCF to assess spoofing countermeasure performance on
ASV,
instead of doing this on the standalone countermeasure.
We have seen a transition from features to classifiers, with end-to-end
approaches
and fused systems. As for the biggest challenges:
spoofed speech sounds
very natural,
and generalisation is still missing.
Much more has to be done.
So I invite you to participate in the next edition,
ASVspoof 2021.
But before finishing, I'd like to point you to some software
resources for speaker recognition and anti-spoofing.
it appears to keep the from a is
and my results to identically to overcome my as well
currently silently to from the university
you can finally two databases for easy and the disposing
i thought winter the is additional database misleading
and nist and the are star burst in that the speaker recognition database
a right don't
and the text dependent speaker recognition database
we also the a e for it is simply a database from
and the speaker wire a new speech and boxers
Here you can find some software for anti-spoofing:
a Matlab implementation of training and scoring,
an implementation of CQCC features,
and the baseline systems provided for the last challenge.
On the ASVspoof website you can find the Matlab and Python implementations of the t-DCF.
one website
we need you are cool
last time at least i like to shoot due to budget
where i'm the principal investigator region two d measurement recognition also not only speech
a disapproving and
closing phase information
classifiers and respect
thus nazis ultimately increase the number eighty three networks
and the domain instruments increment because representing volume i mean and he uttering networks
and the second respect
use a friend gentlemen project
and is completely means he or more secure and presenter's the remote embodiment person authentication
Thank you for listening, and see you at the Q&A session.