0:00:01hello everybody
0:00:02i welcome you to my tutorial on spoofing in automatic speaker recognition
0:00:08i'm a similarity score on assistant professor at your local news data
0:00:12frames you're looks cool
0:00:14and there's some other regions
0:00:18at a low overall difference was moving detection rate of cognition we first you all
0:00:26speaker verification
0:00:28giving more attention to current research plan and progress
0:00:32in the middle and all this information for a speech systems
0:00:37but also we don't to the cost
0:00:43automatic speaker verification is one of the most convenient and natural means of
0:00:48biometric person recognition
0:00:51this is why this technology is used in various application services such as smartphones,
0:00:56smart speakers and call centers
0:01:00this technology has evolved a lot over the last years,
0:01:06increasingly with the help of deep neural network solutions
0:01:09such as x-vectors,
0:01:11which to some extent have replaced traditional gaussian mixture models
0:01:16or the so-called i-vectors
0:01:18and end-to-end approaches are also emerging
0:01:22we can say that speaker recognition technology has probably reached the level of performance required
0:01:29for practical use
0:01:33the question now is whether or not the resulting system is vulnerable to
0:01:37attack, and the answer is yes
0:01:41the reliability of voice biometric technology can be compromised by malicious attacks, namely the
0:01:48vulnerability of the technology to external manipulation
0:01:51one of the major threats to the security of biometric systems is spoofing attacks
0:01:57there is there are four
0:01:59a fraudster carries out an attack to fool a biometric system into recognising an
0:02:04illegitimate user as a genuine user, or in order to avoid being recognised
0:02:10this is achieved by presenting to this is a synthetic for all the money we
0:02:16bash
0:02:18or the volume at least eight
0:02:20but before we locate is a are the second walk ons this system is processed
0:02:29there is this is then try to answer this question is that there's on what
0:02:34they say the are
0:02:37this means that a target trial, as well as a
0:02:41non-target trial,
0:02:44can be accepted or rejected by the speaker verification system
0:02:50this results in two different types of errors, namely false alarms and false rejections,
0:02:56as shown in the table
0:02:59only if this user used a and a change dataset or that this user is
0:03:06an bolster the challenge i
0:03:09there is a v
0:03:10system based
0:03:12according to their change
0:03:14here target speaks when they are now available is whining boxers makes no f or
0:03:20when there's anything about
0:03:25so
0:03:26given a test trial, the ASV system provides a score: the higher the score, the greater the
0:03:32confidence that the speaker's voice matches the claimed identity
0:03:36a decision is made in order to discriminate between target trial
0:03:41and non-target trial scores by selecting a threshold between the two score distributions
0:03:47however, as shown in the figure, the target and non-target score distributions
0:03:53usually overlap
0:03:55this can be seen in the detection error tradeoff (DET) plot
0:04:00on the right, where the point at which the false alarm rate is equal to
0:04:05the false
0:04:06rejection rate is called the equal error rate
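A minimal sketch of how the operating point on the detection error tradeoff curve where the false alarm and false rejection rates meet could be estimated from scores. It assumes that higher scores favour the target hypothesis; the function name and the toy scores are hypothetical, and the simple threshold sweep is an approximation rather than the tooling used in any evaluation.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # false rejection rate: fraction of target trials scoring below the threshold
    frr = np.array([(target < t).mean() for t in thresholds])
    # false alarm rate: fraction of non-target trials scoring at or above it
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))  # point where the two rates are closest
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# toy scores (hypothetical), overlapping like the two score distributions
eer, threshold = equal_error_rate([0.0, 1.0, 2.0, 3.0], [-2.0, -1.0, 0.5, 1.5])
print(f"EER = {eer:.2f} at threshold {threshold:.1f}")
```

With perfectly separated score distributions the sweep finds a threshold where both rates are zero; the more the distributions overlap, the higher the value returned.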
0:04:09is this really realistic
0:04:11though the impostor may can you have for performing system
0:04:15or they can implement it is if you is my task
0:04:20so the aim of the attacker is to provoke false alarms by increasing the ASV classifier
0:04:26scores towards those of the target while avoiding detection
0:04:30we can distinguish such a targeted impostor from a naive impostor
0:04:34the latter are also referred to as zero-effort impostors
0:04:41the processing to create fake speech signal you know it down for let's see that
0:04:47the challenge here is to find a solution to detect it: there are many variables
0:04:52involved in this process and there are still many questions to ask
0:04:58do the artefacts come from linear or nonlinear processing, do they only affect the magnitude part of the
0:05:04spectrum, or should we look also at the phase of the signal
0:05:09we will answer these questions later, when we have more elements
0:05:16there are many a general approaches for the measures improving the easily robustness for example
0:05:23by speech or the u r c d this is an invasion that action
0:05:28or winded executive countermeasures for example based on that for sure
0:05:33this is and its energy detection
0:05:37this figure shows an example DET plot illustrating baseline performance when
0:05:43the impostors are zero-effort impostors,
0:05:47the baseline black line,
0:05:49the performance degradation when spoofing attacks are presented
0:05:53to the system,
0:05:55which is the red line,
0:05:57and the improvement in performance when a countermeasure is applied
0:06:03this is the one dimensional fashion
0:06:05rule i
0:06:06and note that even with perfect countermeasures the best performance
0:06:12reachable is the baseline performance
0:06:18nobody six including voice volume it is becoming an instance
0:06:22many speaker pointed out there is usually issues
0:06:26can think speech
0:06:29this can undermine confidence in ASV, and it is important to have a reasonable level of countermeasures,
0:06:36or presentation attack detection, to reduce false acceptances
0:06:42due to spoofing attacks
0:06:46spoofing attacks can originate from speech synthesis
0:06:51or voice conversion
0:06:52in unlogical system old or just we recording related approach you know basic process
0:06:58well
0:07:00where we enjoy directly the audio stream in the easy my
0:07:06these represent the major threats
0:07:09and a fourth attack is impersonation, which consists in imitating a human voice
0:07:17also, the threat posed by impersonation is not fully understood, with only limited
0:07:23studies
0:07:24involving small datasets
0:07:26it is not surprising
0:07:28that
0:07:29there is no previous work investigating countermeasures against impersonation
0:07:37possible locations of attack points in a typical ASV system may be before or
0:07:45after the microphone, as illustrated,
0:07:48corresponding to physical access and logical access
0:07:53ASV is more vulnerable to spoofing than other biometric systems based on different biometric traits
0:07:59just consider that samples of a person's voice can be collected readily
0:08:04through face-to-face or telephone conversations
0:08:09and then replayed in order to manipulate an ASV system
0:08:14or more advanced voice conversion or speech synthesis algorithms
0:08:19can be used to generate particularly
0:08:22effective spoofing attacks
0:08:24using only modest amounts of voice data collected from a target person
0:08:32this table summarize the for splitting and that's in terms of us a single decreases
0:08:37and in we will consider measures
0:08:40except for the impersonation at time so that have a menu model i s is
0:08:45unity
0:08:47and i freeze
0:08:48especially for text event is the scenario and the error of intermediate of dimension
0:08:55that's the use of for scroll
0:08:58generalization it is the meeting to the different
0:09:02or unseen i
0:09:07so this is the timeline which the task
0:09:10two days visible units you
0:09:13and is studies on speaker and feasible thing where and are on me now speech
0:09:19for were created using a limited number or something
0:09:23it is clear that the development of countermeasures using only a
0:09:29small number of spoofing attacks
0:09:31may not generalise well
0:09:35moreover
0:09:36there was a lack of publicly available corpora and evaluation protocols, which made it hard
0:09:42to compare the results obtained by different researchers
0:09:48the ASVspoof initiative aims to address these issues by making
0:09:56available standard speech corpora
0:09:58with a large variety of spoofing attacks,
0:10:01evaluation protocols and metrics
0:10:04to support common evaluation and the benchmarking of different systems
0:10:10the ASVspoof challenge has been organised three times so far
0:10:16the first was held in two thousand fifteen
0:10:18the second in two thousand seventeen and the third in two thousand nineteen
0:10:23they were presented at the corresponding special sessions during the INTERSPEECH conference
0:10:32is actually current own analyses of this visible for you as well as the their
0:10:39finish definition to partition your see the company around the work
0:10:47the first ASVspoof challenge involved detection of artificial speech
0:10:51created using a mixture of voice conversion and speech synthesis techniques
0:10:57it was presented at a special session at INTERSPEECH two thousand
0:11:02fifteen
0:11:03and sixteen organisations participated in this challenge
0:11:08ASVspoof two thousand fifteen involved only logical access attacks, and
0:11:16the data was generated with ten different artificial speech generation algorithms
0:11:23it was based on a large collection of recordings called the SAS corpus,
0:11:29version one,
0:11:31and consists of both genuine and spoofed speech
0:11:37genuine speech was recorded using a high-quality microphone
0:11:41and without significant channel or background noise effects
0:11:48the ASVspoof two thousand fifteen database was divided into three subsets, called
0:11:53the training, development and evaluation sets, in a speaker-disjoint manner
0:11:58five attacks, from S1 to S5, designated as known attacks,
0:12:05were used
0:12:07in the training, development and evaluation sets
0:12:11and another five attacks, from S6 to S10, designated
0:12:17as unknown or unseen attacks,
0:12:20were used only in the evaluation set, along with the known attacks
0:12:27based on the algorithms used for
0:12:33voice conversion and speech synthesis,
0:12:36nine of them are vocoder-based, with HMM or GMM based statistical models,
0:12:43while only one, S10, is unit selection based
0:12:46speech synthesis, implemented with the open-source MaryTTS
0:12:50text-to-speech system
0:12:56the vulnerability of an ASV system based on i-vectors is
0:13:02pretty clear
0:13:05except for the i guess who
0:13:08the attacks are very effective, with a significant increase
0:13:13in the equal error rate
0:13:16in the worst case,
0:13:17that is S10
0:13:20i don't to one
0:13:21directly to fifty one will ones
0:13:24it is seventeen
0:13:28the table on the left shows the challenge results
0:13:33in terms of the average equal error rate across all attacks of the
0:13:39evaluation set,
0:13:40for known and unknown attacks
0:13:44there is clearly a lack of generalisation in these results
0:13:48over the table on the left to sure that
0:13:55i'm sorry, the table on the right instead shows the top
0:14:00performing systems evaluated only
0:14:04on S10,
0:14:07the unit selection based speech synthesis
0:14:11this attack is the most difficult to detect
0:14:13and the most dangerous for the speaker verification system, as shown previously
0:14:20so S10 is effectively the biggest threat for the ASV system in
0:14:27this case
0:14:31and used in one is on the
0:14:33the front end of a against the door for a performing system
0:14:39on the challenge
0:14:40it will not the to read for the in this challenge is related to the
0:14:45two features
0:14:47and the level of the low end of the front and
0:14:51other people between if the in the v a dynasty the use cochlear filter a
0:14:58cepstral coefficients
0:14:59that are related to the human auditory system
0:15:02possible these something that john it problem
0:15:10so, not long after the challenge evaluation, on the
0:15:16ASVspoof two thousand fifteen database
0:15:19we proposed a new feature named constant Q cepstral coefficients (CQCC)
0:15:23it is based on the constant Q transform, which is an alternative to Fourier-based approaches
0:15:28and which employs a variable time-frequency resolution, that means
0:15:34greater time resolution for higher frequencies
0:15:37and greater frequency resolution for lower frequencies
0:15:42so the constant Q transform gives a time-frequency representation which reflects more
0:15:46closely human perception
0:15:49and to obtain CQCC features we combine the constant Q transform
0:15:54with traditional cepstral analysis
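The variable time-frequency resolution mentioned here can be illustrated numerically. The sketch below assumes the textbook constant Q definition (geometrically spaced centre frequencies and a fixed ratio Q between centre frequency and bandwidth); the function name and the parameter values (15 Hz minimum frequency, 12 bins per octave, 9 octaves) are illustrative assumptions, not the settings used for the features in the talk.

```python
import numpy as np

def constant_q_resolution(f_min=15.0, bins_per_octave=12, n_octaves=9):
    """Centre frequency, bandwidth and window duration of each constant Q bin."""
    k = np.arange(bins_per_octave * n_octaves)
    centre_hz = f_min * 2.0 ** (k / bins_per_octave)   # geometrically spaced bins
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # same Q factor for every bin
    bandwidth_hz = centre_hz / q                       # narrow at low frequencies
    window_s = q / centre_hz                           # short at high frequencies
    return centre_hz, bandwidth_hz, window_s

centre, bandwidth, window = constant_q_resolution()
for i in (0, len(centre) - 1):
    print(f"{centre[i]:8.1f} Hz  bandwidth {bandwidth[i]:7.2f} Hz  "
          f"window {window[i] * 1e3:8.2f} ms")
```

The lowest bin gets a narrow bandwidth and a long analysis window, while the highest bin gets a wide bandwidth and a short window, which is the trade-off described above.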
0:16:02i should be for that the only thing started in the challenge
0:16:07where only able to the test then i probably
0:16:12so CQCC features
0:16:15obtain comparable results for known attacks and the best results for
0:16:21unknown attacks, with an eighty seven percent relative improvement on S10
0:16:26and an overall seventy two percent relative improvement
0:16:34so to summarise, ASVspoof two thousand fifteen focused on voice conversion and speech
0:16:40synthesis attacks, so logical access,
0:16:44on standalone spoofing detection, so no ASV,
0:16:48and on a text-independent scenario
0:16:51the participants mostly invested effort to develop features, using simple classifiers
0:16:59and countermeasure generalisation is still missing
0:17:04any of
0:17:06i think meet again we the some possible mission improvements
0:17:18while the two thousand fifteen edition used very high quality speech material, the two thousand
0:17:23seventeen edition aims to assess replay attack detection
0:17:27in so-called in the wild
0:17:29conditions
0:17:31and focused exclusively on replay attacks
0:17:34the second automatic speaker verification spoofing and countermeasures challenge was presented at
0:17:41a special session
0:17:42at INTERSPEECH two thousand seventeen
0:17:45and fourteen now consider shows a distributed of the challenge
0:17:52the ASVspoof two thousand seventeen data were derived from the text-dependent
0:17:58RedDots
0:17:58corpus, which
0:18:00was designed to collect speech data over mobile devices
0:18:05in the form of smartphones or tablet computers
0:18:10by volunteers from across the globe
0:18:14we collected the ASVspoof two thousand seventeen database using a playback device
0:18:20and a recording device in different acoustic environments
0:18:27we did not to use a realistic scenario using core the recording but we made
0:18:34actually got
0:18:35and do the you don't call me all the target speakers voice
0:18:40to create the plane data collection
0:18:44this is the worst case scenario that of those the use of x sixteen speech
0:18:50were to be linear access
0:18:56the corpus is divided into three subsets for training, development and evaluation
0:19:05with different speakers, replay sessions and replay configurations
0:19:11the training and development subsets were collected at three different sites
0:19:16and the evaluation subset was collected at the same three sites and also includes data
0:19:23from new sites
0:19:27this is the loudest most the inverse italy that
0:19:34in terms of a basically a wider meeting t s for the challenge also here
0:19:41is a clear
0:19:44the this is m is based on the a gmm
0:19:48and the really that's a big effect you
0:19:52with an important case of the equal error rate
0:19:55for all
0:19:55one point eight fifty one point five
0:19:59on these evaluation set
0:20:04the primary evaluation is only whether they can rest of this additional two thousand fifty
0:20:10challenge
0:20:12the equal error rate is computed from scores pooled across all trial segments rather than
0:20:17by condition averaging
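The difference between pooling scores and averaging per-condition results can be made concrete with a toy example. This is a sketch under stated assumptions: the two conditions and all scores are hypothetical, and the `eer` helper is a simple threshold sweep, not the official scoring tool. Each condition is perfectly separable on its own, but the score ranges are shifted relative to each other.

```python
import numpy as np

def eer(genuine, spoof):
    """Approximate equal error rate from a sweep over all observed scores."""
    genuine = np.asarray(genuine, dtype=float)
    spoof = np.asarray(spoof, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    miss = np.array([(genuine < t).mean() for t in thresholds])
    false_alarm = np.array([(spoof >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(miss - false_alarm)))
    return (miss[i] + false_alarm[i]) / 2.0

# hypothetical countermeasure scores: (genuine, replayed) per condition
conditions = {
    "condition_a": ([5.0, 6.0], [3.0, 4.0]),
    "condition_b": ([1.0, 2.0], [-1.0, 0.0]),
}

averaged = float(np.mean([eer(g, s) for g, s in conditions.values()]))
pooled = eer(
    np.concatenate([g for g, _ in conditions.values()]),
    np.concatenate([s for _, s in conditions.values()]),
)
print(f"condition-averaged EER = {averaged:.2f}, pooled EER = {pooled:.2f}")
```

Because a single threshold has to serve every condition, the pooled figure penalises scores that are not comparable across conditions, whereas condition averaging hides that mismatch.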
0:20:20why fourteen estimation
0:20:22perform the baseline while existing three and their the
0:20:28at a performance is the old in more than seven percent relative improvement we used
0:20:33a dismissal a
0:20:35the baseline system is based on a GMM back-end classifier with constant Q cepstral coefficient features
0:20:42it was provided to the participants
0:20:45comparing the baseline mean zero one thing to do
0:20:49it is important performance improvement when using wondering plus their the three
0:20:57this is this idea of the parameter submission to residuals
0:21:02it doesn't seventy
0:21:04i don't training refer to the bar all the time for training
0:21:09a sense for three and a reasonable
0:21:14most all the systems a lower bound for the features
0:21:19this call mom for all the systems to build a gmm classifier
0:21:24single cost you as you can see
0:21:27the invariant use whatever means of all around solution is twenty five one ninety one
0:21:33understand
0:21:34whereas the best single system result shows
0:21:39an average detection equal error rate
0:21:41of
0:21:42only six point seven percent
0:21:47the results of the ASVspoof two thousand seventeen challenge show that
0:21:52the detection of replay attacks is more difficult than the detection of speech synthesis and voice
0:21:58conversion
0:22:01countermeasure generalisation also remains a problem
0:22:07after the challenge we detected some data anomalies,
0:22:10e.g. zero-valued samples present at the beginning of many speech utterances
0:22:17is zero really running by for the easy to be a
0:22:23but maybe but i for a modified versions for speech detection
0:22:29to fix these issues, version two point zero was released to correct the
0:22:35anomalies
0:22:37detected after the evaluation
0:22:39in addition, metadata which describes the recording and playback devices and the acoustic
0:22:45environments was released along with an updated baseline
0:22:51the new metadata along with the data by ching as there is the number uterrances
0:22:58as well as the a population or the evaluation set
0:23:02remember when i'm better than for each other
0:23:07for a better understanding of the outcomes we can rewrite the square the regulation terms
0:23:13of the speaker measurement recording playback devices
0:23:17acoustic environment is a physical spacing which original stage the that basically then here or
0:23:25it is reasonable because seventeen database was collected you have a different environment
0:23:32the evaluation meeting there about the accent level over even more controlled noise
0:23:38the
0:23:39for example can be in we model noise and balcony are assumed to be noisy
0:23:46all these
0:23:46all right are assumed to be maybe which in your oracle room huh
0:23:53are assumed to be are actually
0:23:58there are under the of a twenty six a little better prices
0:24:02a smart phones the lower bound we
0:24:07if we the we fifteen this moral speakers
0:24:11are assumed to be all over the
0:24:14well e
0:24:15a little larger lot of speakers are assumed to be your mean you rightly
0:24:20and the professional or do we managed are assumed to be i
0:24:27assuming only there are a total twenty five recording devices
0:24:32some are ones that are the weights for my from source would be a little
0:24:36windy and it's where a microphone are assumed to be over the medium by i
0:24:43and the again the regression your and b i
0:24:50this figure shows the impact of different replay configurations on ASV performance measured in
0:24:56terms of equal error rate
0:24:58we have sent over a zero for impostor trials are replaced with a replaceable by
0:25:04iteratively the each other little degradation
0:25:09the control the demo on the right shows the resulting legal regulations sort of according
0:25:15to the easy equal error rate in the
0:25:19all pole a core also reflect the supposed to be a is the
0:25:25where we are in this a little degradation
0:25:29this is done
0:25:30they higher than one at a very little degradation the motive for effect in a
0:25:35the three years
0:25:39it is this detection performance of a gmm robot
0:25:44and i-vectors read about smoking the dimension
0:25:48for this thing that a little degradation
0:25:52also expressing that all the equal error rate
0:25:56the first edition these results is that the recently the correlation between the specifically to
0:26:02the thing
0:26:03detection or everybody detection or
0:26:08this is a fine reflect the final complex of overwhelmingly device
0:26:15there was to get about a man and the recording right
0:26:19the control on the right a to see the results in terms of the all
0:26:24only a in a environment going back and replay value
0:26:32results show the number of a single element of the little degradation for all i
0:26:39trials this was all we trials corresponding with either one of the
0:26:46i in my all their acoustic environment a system we need the effect of the
0:26:51playback and recording device
0:26:57to summarise it is able to go seventeen false own regalia
0:27:02so not at a slow was commission
0:27:05performances are reminding
0:27:07even for the worst case scenarios
0:27:10analysis is a very difficult since the data collection was the whole roll
0:27:17remote control data collection mean thing to ensure a which is one recognition or the
0:27:24that is useful to doesn't matter the in
0:27:27so again is related to smoking detection so nicely where
0:27:32text independent scenario will use
0:27:35a there is no gave a database that for a little features and classifiers
0:27:41it generalisation is even missing giving me a
0:27:45it's been mitigated i mean green post evaluation improvement
0:27:53so let's go to the third automatic speaker verification spoofing and countermeasures challenge
0:27:58ASVspoof two thousand nineteen focused on both
0:28:00speech synthesis and replay
0:28:09as for the because efficient it was examined everything is feasible for special session in
0:28:16their speech goes on a in
0:28:17and forty and fifty organisation there are basically the of the challenge order to standards
0:28:26the ASVspoof two thousand nineteen database is designed to reflect
0:28:30two different use case scenarios,
0:28:32logical access and physical access
0:28:35also different is the strategy of assessing spoofing countermeasure performance on ASV
0:28:42systems
0:28:42instead of testing
0:28:44standalone countermeasures
0:28:46for this reason, for ASVspoof two thousand nineteen we have provided the
0:28:52ASV
0:28:52scores to the participants
0:28:55so we have adopted as primary metric the minimum normalised tandem detection
0:29:00cost
0:29:01function (t-DCF)
0:29:02and as secondary metric the equal error rate
0:29:06also for most discrimination
0:29:10use of the a dcf means that the these this design database is this i'm
0:29:17not for the standard on this task will commercial
0:29:21but they are on the availability in is very system where subject to scooping up
0:29:34the minimum normalised t-DCF is inspired by the detection cost function,
0:29:41the
0:29:42DCF,
0:29:43used in the NIST SRE challenges
0:29:47i in a this it is
0:29:51aims to assess is the this is the last to make sure
0:29:55to all formalize assessment
0:29:59so long format or by rate
0:30:02or you really motivation for a four
0:30:09in a tandem of ASV and
0:30:14countermeasure systems
0:30:17there are a total of four possible errors
0:30:20where a
0:30:21bona fide
0:30:23target user is rejected by the countermeasure system,
0:30:27a bona fide target is rejected by the ASV system,
0:30:31non-target trials are accepted
0:30:34and spoofed trials are accepted
0:30:40the for possible errors in be formally describe so it is for the costs and
0:30:46priors are this i mean that one
0:30:49and the classification tree
0:30:51it
0:30:52are computed be taken
0:30:55the raw t-DCF value can be difficult to interpret, as for the
0:31:02detection cost function in the NIST speaker recognition evaluations
0:31:08it is useful to normalize the cost
0:31:11the normalised t-DCF is a function of the countermeasure threshold
0:31:18a similar to the bus the challenge efficient
0:31:22ASVspoof two thousand nineteen does not focus on threshold setting,
0:31:27that means calibration
0:31:30so we think source in this case the traditional or mutually the standard measure to
0:31:34install involve a corresponding to go for calibration
0:31:39that correspond to the remaining on remote i
0:31:43in this
0:31:44in by fitting the all my racial the to mine
0:31:48for from the evaluation set using the
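A minimal sketch of taking the minimum of a normalised cost over countermeasure thresholds. It assumes the normalised cost can be written as `beta * P_miss_cm(s) + P_fa_cm(s)`, where `beta` is a single weight standing in for the ASV error rates, costs and priors; how `beta` is derived is not reproduced here, and the value and scores used below are hypothetical.

```python
import numpy as np

def min_normalised_cost(bonafide_scores, spoof_scores, beta):
    """Return (minimum cost, threshold) for cost(s) = beta * P_miss(s) + P_fa(s)."""
    bona = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # candidate thresholds: every observed score, plus +inf (reject everything)
    thresholds = np.append(np.sort(np.concatenate([bona, spoof])), np.inf)
    best_cost, best_threshold = np.inf, None
    for s in thresholds:
        p_miss_cm = (bona < s).mean()   # bona fide trials rejected by the countermeasure
        p_fa_cm = (spoof >= s).mean()   # spoofed trials accepted by the countermeasure
        cost = beta * p_miss_cm + p_fa_cm
        if cost < best_cost:
            best_cost, best_threshold = cost, s
    return float(best_cost), float(best_threshold)

# hypothetical countermeasure scores and weight
cost, threshold = min_normalised_cost(
    [0.0, 1.0, 2.0, 3.0], [-2.0, -1.0, 0.5, 1.5], beta=2.0
)
print(f"minimum normalised cost = {cost:.2f} at threshold {threshold:.1f}")
```

Picking the threshold that minimises the cost on the scored data itself amounts to assuming ideal calibration, so the reported value is a lower bound on what a fixed, pre-set threshold would achieve.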
0:31:56so the ASVspoof two thousand nineteen database is based on the
0:32:01VCTK corpus,
0:32:04a multi-speaker english speech database recorded in a hemi-anechoic
0:32:10charmer still clearly all these things
0:32:15either
0:32:16before weights
0:32:19so it was created using utterances from one hundred and seven speakers,
0:32:25forty six male and sixty one female
0:32:27they are downsampled to sixteen kHz at sixteen bits per sample
0:32:36a collection of course uses colour that these in baseball problem in this analysis
0:32:44it is divided in three,
0:32:46for training, development and evaluation, in a speaker-disjoint manner
0:32:52for logical access there are six
0:32:55text-to-speech and voice conversion attacks
0:32:58for training, and thirteen
0:33:00TTS and VC attacks for the evaluation set
0:33:05what the physical analysis
0:33:06there are then these a holes the
0:33:09environment
0:33:10and i sleepily calculation of training
0:33:13they're an imbalanced
0:33:17we yes
0:33:18the two is then of the double doors to provide state-of-the-art yes this is this
0:33:24if you show a lot of assigning all over the course
0:33:31this table summarize this system which are fundamentally you go first
0:33:36the known
0:33:37spoofing systems are A01 to A06,
0:33:41including
0:33:42two VC and four TTS systems
0:33:46then
0:33:46A07 to A19, except for A16 and A19,
0:33:55are the eleven unknown spoofing systems
0:33:59and A16 and A19 are the reference
0:34:04systems using the same algorithms
0:34:07as
0:34:07A04 and A06
0:34:11the l a verification is the lattice
0:34:14most of our database for speech synthesis and was version is moving the results
0:34:23this is this ensemble of problem a the weather
0:34:29two
0:34:31so
0:34:37we did not complete with any of the local form
0:34:41what if i
0:34:42no
0:34:43the a
0:34:47we did not completely of any of the local phone
0:34:51is you know there speaker one of i
0:34:55employees are entitled to follow that contract to the latter
0:34:59a data
0:35:02employees are entitled followed by a contract so the latter
0:35:06another speaker who finished
0:35:09at that time it's telling faction like and five miles
0:35:13a
0:35:15i at time m is now and faction within five miles
0:35:20as you can see, the quality of the synthesised speech is quite
0:35:24impressive
0:35:30this is the size of a
0:35:33a subset evaluations and session
0:35:36results in terms of a it is for a little baseline we are provided
0:35:44first of all shows the results for two categories of the us to the speech
0:35:51yes we see
0:35:53yes and v c you might
0:35:56and i saw show results for types of models
0:36:01there are neural network based
0:36:03i one
0:36:05a neural network based and where
0:36:08yes
0:36:09neural network based itsy a statistical model based p c
0:36:14last rule
0:36:16shows the results from different with for generation that the
0:36:22in that are
0:36:24their own where for model classical speech moreover
0:36:28with four combinations
0:36:29spectral filtering with typically and orders
0:36:34in the testing is the complementary you of your over the baseline
0:36:39otherwise dishonest users you features and the idiot there is a someone else
0:36:45sdc features
0:36:50the ASVspoof two thousand seventeen challenge data was created from real replay recordings, and the quality of
0:36:56the recordings was somewhat uncontrolled
0:37:00aiming to improve upon the last challenge, the two thousand nineteen edition instead uses
0:37:05simulated and controlled
0:37:07acoustic and replay configurations
0:37:10once we use these two similarly enrollment listings and devices we establish right
0:37:19the remainder of this work are similarly directly on that
0:37:24we choose a the one sure on the slide
0:37:28realistic environment winkler only holding the noise putting aside for now the additive noise
0:37:35we really a decision we consider perfect microphones
0:37:39and
0:37:41only at the recording this meeting about a five user
0:37:47and for variability representation
0:37:51we can see the that there are
0:37:53it's carry out that the single session as that used a
0:37:57and will only of the device quite in this case the last speaker
0:38:07the physical access scenario assumes use in it is the leading to convey such as
0:38:13illustrated in fig
0:38:16there was a single iteration which please this is then this it will it is
0:38:20also s
0:38:22is this the data will environment distinction room size or categorize in two different
0:38:28in the remote's label
0:38:30i will rule
0:38:32we may be able
0:38:33and see that actual
0:38:36the position of the aec easily see that by the yellow cross
0:38:41circle in the three or whatever position of the to go is illustrated by the
0:38:46blue star
0:38:48well i assess it is harder
0:38:51maybe by the okay well we'll see change a distance yes for the microphone
0:39:00it is also illustrated in the table environment definition there are three categories or at
0:39:06least and
0:39:07and unlabeled a short distance be making this that and see that at least
0:39:15each physical space system to explain that in addition variability are according to the difference
0:39:20between space
0:39:22which can be seen as a wall ceiling and the for submission coefficients
0:39:28as well as the position interval
0:39:31the level overrated variation used busy fighting the or the is sixty two variation by
0:39:37the by are
0:39:40it's fifty whatever item of definition
0:39:42they are the result is six is the u
0:39:46a little i shall we menu and
0:39:49see i recognition
0:39:52it is this is the microphone and that okay or writing reading the visual speech
0:39:58there was a shown are so well
0:40:02we think that although there is an environment as
0:40:06you can see that symbol on the right
0:40:12the man and language for the that's a month it is also illustrated in this
0:40:17paper
0:40:19but something that is modeled by making and then recording over one of five as
0:40:25this
0:40:26and but are sending their according to be is the microphone
0:40:31according are assumed to be made in one over the three zones used to people
0:40:38each representing a different vowel the oldest the problem or
0:40:45in the state in table are a definition if they are labeled character i shows
0:40:52this task of the medium distance and
0:40:56largest
0:40:58in addition to the variation lately we release let us define the means for recording
0:41:04and presentation devices
0:41:08we can see that only the presentation
0:41:11no speaker
0:41:12encoding only and better living in the last speaker if there are four selected
0:41:19we use the categorisation
0:41:21and without any
0:41:24but if there
0:41:25that would be
0:41:26i and it
0:41:27currency one
0:41:30this case we or they have online replaying configuration as you can see and the
0:41:36table
0:41:37on the right
0:41:40the simulation once either two containers all the speakers
0:41:44each with a different range of the whole by about we mean frequency and maybe
0:41:49a linear calibration
0:41:52the first
0:41:53a typical vector category represent the mean dillydallying in full band lot speaker
0:42:00i one last speaker and a megabyte bound we the icsi and units
0:42:07and the being able to more linear or racial a study
0:42:12and one hundred
0:42:14addition
0:42:15and if you're you can see an illustration of set of the higher money frequency
0:42:21responses
0:42:23for i don't be noise model
0:42:25the little device estimated using desynchronized we design a linear system identification
0:42:33based on a linear convolution
0:42:36each one in the finger is the a linear component
0:42:40while from age to if i
0:42:43i the higher wouldn't nonlinear components
0:42:48the blue where the shaded region represent the right boundary
0:42:57this slide lists the real devices from which measurements were taken for the simulation of
0:43:03replay presentation
0:43:05the first table on the left lists mobile devices, while the one on the right
0:43:10lists the other devices
0:43:13the device type signifies which type of device it is
0:43:17for example a smartphone, a tablet or a loudspeaker
0:43:21the rightmost column indicates
0:43:24if the device was used for the simulation of data in the training and development
0:43:30sets, the known devices
0:43:32or in the evaluation set, the unknown devices
0:43:39this figure shows the characterisation of the different loudspeaker
0:43:45devices
0:43:46used for this evaluation
0:43:49the top plot shows the occupied bandwidth, where the lower bound of
0:43:54the bandwidth is the minimum frequency
0:43:58the bottom plot shows the linear-to-nonlinear power ratio
0:44:02in the range of the occupied
0:44:04bandwidth
0:44:07devices are sorted by bandwidth
0:44:15this figure shows baseline results for the PA scenario of the ASVspoof two
0:44:21thousand nineteen database
0:44:23results are illustrated for the different replay configurations
0:44:27and acoustic environments
0:44:29the top panel shows standalone ASV results in terms of equal error rate between
0:44:35target and zero-effort impostor trials, that is the blue bar
0:44:40and between target and replay spoofing trials
0:44:44the middle panel shows the standalone replay spoofing detection in terms of equal error
0:44:49rate
0:44:51for baselines B01 and B02
0:44:55and in the bottom panel there are the combined ASV and CM results, illustrated
0:45:00in terms of the min
0:45:03t-DCF
0:45:05from these results we can see that the ASV system is vulnerable to replay attacks
0:45:10as for the previous challenges, as expected
0:45:14and moreover the worst results are
0:45:18obtained when the device quality is high and the attacker-to-talker
0:45:23distance is short
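The equal error rate used throughout these results is the operating point at which the miss rate and the false alarm rate coincide. The following is a minimal sketch of how it could be computed from two sets of scores, assuming that higher scores mean "more likely target (or bona fide)". The function name `compute_eer` and the toy scores are illustrative, and this is not the challenge's official scoring code.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Return (eer, threshold) for two score sets (illustrative sketch)."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss: a target trial scoring below the threshold.
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: a non-target trial scoring at or above the threshold.
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[idx] + p_fa[idx]) / 2.0, thresholds[idx]

# Toy usage: perfectly separated scores and fully overlapping scores.
eer_separated, _ = compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
eer_overlap, _ = compute_eer([0.0, 1.0], [0.0, 1.0])
```

For the ASV panels the two score sets would be target versus impostor (or spoofed) trials; for the countermeasure panel they would be bona fide versus replayed trials.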
0:45:29let's come now to the challenge results, this figure shows the DET profiles for the baseline
0:45:36system
0:45:37B02
0:45:39and the best
0:45:41performing primary system, from team T05
0:45:46and the same team's single system
0:45:49also shown is the second best performing single system for LA
0:45:56from team
0:45:58T45
0:45:59so the lowest equal error rate is zero point two
0:46:03percent
0:46:05that is a great result
0:46:08however from these results it is clear that there is a substantial gap between
0:46:14primary and single systems
0:46:17for LA
0:46:19so this means that fusion is important
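The gap between primary (fused) and single systems can be illustrated with the simplest form of score-level fusion: normalise each system's scores and take a weighted average. This is only a sketch under stated assumptions. The talk does not say how the teams fused their systems, and the function name `fuse_scores`, the z-normalisation and the equal default weights are all illustrative choices.

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Weighted average of z-normalised scores (illustrative sketch).

    score_lists : array-like of shape (n_systems, n_trials)
    weights     : optional per-system weights; defaults to equal weights
    """
    scores = np.asarray(score_lists, dtype=float)
    mean = scores.mean(axis=1, keepdims=True)
    std = scores.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant scores
    normed = (scores - mean) / std
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    # One fused score per trial.
    return weights @ normed

# Toy usage: two systems that agree up to a change of scale.
fused = fuse_scores([[0.0, 1.0, 2.0], [0.0, 10.0, 20.0]])
```

Normalising first keeps a system with a large score range from dominating the average; in practice the weights would be tuned on development data.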
0:46:25this slide shows the min t-DCF and equal error rate
0:46:30results for the submissions
0:46:33to the LA scenario
0:46:36the first string on the x-axis denotes whether or not
0:46:41the systems are NN-based systems
0:46:45while the second denotes whether or not the systems are ensemble systems
0:46:50which combine multiple
0:46:52sub-systems
0:46:53or single systems
0:46:56we can see that for LA there is a majority of NN-based
0:47:01and ensemble systems
0:47:04in addition it is also clear that the equal error rate and min t-DCF
0:47:07are measurements that are not correlated
0:47:12as you can see in these two, red and blue
0:47:19in this slide are shown all the results for the thirteen attacks
0:47:24in the evaluation set for the top ten primary submissions
0:47:28first of all we can see that the baseline ASV equal error rate
0:47:33that means with no spoofing
0:47:35is two point five percent
0:47:37when we include spoofing attacks the ASV system then becomes
0:47:43vulnerable
0:47:45looking at the individual attacks, some degrade the ASV performance
0:47:52but are easy to detect
0:47:54the remaining attacks
0:47:57degrade the ASV performance to some degree
0:48:03and are difficult to detect
0:48:08and only one, that is A17
0:48:12is not a strong threat to the ASV but is very difficult to detect
0:48:16that is the one in the utterance to scroll
0:48:22so let's look now at the challenge results for PA, this figure shows the DET profiles
0:48:29for the baseline system B01
0:48:33the best performing primary system and the same team's single system
0:48:41the lowest equal error rate here is zero point four percent
0:48:45this is indeed a good result
0:48:50with respect to LA
0:48:52here there is less of a discrepancy between primary and single systems
0:48:57so it seems that fusion is not so important
0:49:03this slide shows the min t-DCF and equal error rate results for the
0:49:08submissions to the PA scenario
0:49:12and the strings, as before, on the x-axis denote NN-based or non-NN
0:49:18systems
0:49:20and ensemble or single systems
0:49:23note that, as for LA
0:49:26for PA there is a majority of NN-based and ensemble systems
0:49:37in this slide are shown the results for all the nine replay configurations
0:49:42in the evaluation set for the top ten primary submissions
0:49:47and we can see the baseline ASV equal error rate
0:49:51well seems keys needs solos moving is he going for example for stack
0:49:57when we include spoofing attacks the ASV system then
0:50:01becomes
0:50:01vulnerable
0:50:03so looking at these
0:50:06we can see that the performance increases
0:50:10where
0:50:12the attacker-to-talker distance becomes greater
0:50:15so there are very fancy one
0:50:18and decreases when the quality of the device gets better
0:50:22so real routes suitable
0:50:29in this slide are shown all the results for the twenty-seven
0:50:34acoustic environments in the evaluation set, again for the top ten primary submissions
0:50:41so looking at these individual environments we can see that the performance is
0:50:46the graces where the room i recall
0:50:50really
0:50:51so the received go
0:50:53increases when the reverberation becomes higher
0:50:58and increases when the talker-to-ASV distance becomes higher
0:51:04getting
0:51:05see what
0:51:09so to summarise, ASVspoof two thousand nineteen focused on
0:51:14replay, TTS and voice conversion
0:51:20a simple even if one would be evaluated
0:51:23we have shown that the ASV system is vulnerable
0:51:29to spoofing attacks
0:51:32we have defined the min t-DCF to measure performance on
0:51:38ASV
0:51:39so instead of doing this on the standalone countermeasure
0:51:45we have seen a transition from features to classifiers, so NN, end-to-end
0:51:53and
0:51:54ensemble or fused systems, with the biggest challenges
0:52:00don't demand countermeasures are very
0:52:03how to the speech sounds are
0:52:08very natural
0:52:10is the recognition accuracy very clear by detection again be proven to work this time
0:52:16of by only and stage
0:52:20generalization is still missing
0:52:22much more has to be done
0:52:26so i invite you to the new edition
0:52:30ASVspoof two thousand
0:52:32twenty-one
0:52:38so before finishing i'd like to show you some software
0:52:42for speaker recognition
0:52:50it appears to keep the from a is
0:52:54and my results to identically to overcome my as well
0:52:58currently silently to from the university
0:53:03you can also find databases for ASV and anti-spoofing
0:53:10i thought winter the is additional database misleading
0:53:13and nist and the are star burst in that the speaker recognition database
0:53:23a right don't
0:53:24and the text dependent speaker recognition database
0:53:29we also the a e for it is simply a database from
0:53:36and the speaker wire a new speech and boxers
0:53:45so here you can find some software for anti-spoofing
0:53:49a matlab implementation of training and scoring
0:53:54CQCC features
0:53:56and the baseline systems of the last ASVspoof challenge
0:54:07on the website you can find the matlab and python implementation of the t-DCF
0:54:15and for any query regarding ASVspoof, please visit the ASVspoof
0:54:22website
0:54:23we need you are cool
0:54:27last but not least i'd like to show you two projects
0:54:31where i'm the principal investigator region two d measurement recognition also not only speech
0:54:39a disapproving and
0:54:40closing phase information
0:54:43classifiers and respect
0:54:45thus nazis ultimately increase the number eighty three networks
0:54:49and the domain instruments increment because representing volume i mean and he uttering networks
0:54:56and the second, RESPECT
0:54:58is a French-German project
0:55:01and its complete name is reliable, secure and privacy preserving multi-biometric person authentication
0:55:11thank you for listening and see you at the Q&A session