0:00:14the title of my presentation is what are we missing with i-vectors
0:00:19a perceptual analysis of i-vector based falsely accepted trials and this is a collaboration with people
0:00:26from the phonetics lab of the csic which is the research
0:00:34institution for many years established in spain
0:00:37so let's start talking about
0:00:42i-vectors
0:00:46yes we will not i i-vectors but tones
0:00:50and
0:00:51those i-vectors give us a compact and elegant solution where every utterance can be represented
0:00:57in a fixed dimension vector
0:01:01they have also given us great and efficient performance over a wide range
0:01:06of conditions and allow us to
0:01:09apply state-of-the-art pattern recognition techniques
0:01:14and more the recently we are able to perform speaker recognition without point it is
0:01:20a really great
0:01:24we can avoid a lot of problems
0:01:28and especially and i think that in the point
0:01:32we don't produce calibrated likelihood ratios to forensic speaker recognition when we have lots of
0:01:38i think that in accumulating a for this we have seen a nice paper we
0:01:43wanted from that if i own do that's not that's
0:01:48in this paper some but what if you feel you just a and have
0:01:52wally score be in this paper has gone given a step farther when they have
0:01:56not only over a being able to calculate an icon regularly richer when they have
0:02:03i have recordings from the of these channel intercept assistant but they also have obtain
0:02:08the day i select aggregation for to collapse the all that was to do so
0:02:14they have an assessed not just a little bit about all the pros to sell
0:02:18this is a
0:02:20a great
0:02:22starting point but we have to look a little more in detail
0:02:27about
0:02:28i-vectors
0:02:30and they explicitly choose to ignore high-level and source-level information
0:02:37so the speaker information
0:02:39is reduced to the short term but this has a lot of
0:02:44advantages for features to for conditional for real points
0:02:49some users and imitate also so
0:02:52but be still a spectral only detection decisions
0:02:57probably will be uncorrelated with human perception this morning joe highlighted this issue
0:03:04of a possible loss of credibility of the system if the everyday user
0:03:09if i boardman ldc perceive great disagreements between what the system is doing and what
0:03:17what they can see that humans are perceiving
0:03:22moreover we have almost total ignorance on the origin of those
0:03:27detection errors
0:03:29and when we have also you know system we are simply trying to restore system
0:03:33probabilistic but we don't do not fit the specific with them
0:03:37and we can have a transparent system which is very good but finally
0:03:42if we have an error we cannot explain at all what's the reason of
0:03:47the
0:03:49of the error
0:03:49and it's very important to be able to provide explanations of
0:03:53why the system is working in a specific way
0:03:57and just a final reminder we usually assess systems on average error rates
0:04:04but from the user's perspective
0:04:06they perceive performance case by case so if there is a
0:04:10large error in even a single trial the system will be affected as a
0:04:15as a whole
0:04:17so what we did for the paper was to select a set of i-vector
0:04:24based false acceptance trials from
0:04:28sorry ten and it's a eight sre ten
0:04:32and we're gonna some a team of us find useful additions force the not english
0:04:38a great
0:04:40and
0:04:41the objective was to explore to better understand what they do with their with that
0:04:47a date down and all
0:04:50that it just a and sre that's
0:04:54as we have a and we might with that of data what target type of
0:05:00types of differences they could find and also the number of different
0:05:06types of differences that they can find in a single trial
0:05:11and first of all a disclaimer this is not a paper on
0:05:16speaker recognition by humans
0:05:17both of them know in advance that the speakers in every trial are
0:05:22different
0:05:23so
0:05:23all we are asking them is to highlight the differences that they find
0:05:29between the two utterances but without any decision we just want to
0:05:36see what they can find
0:05:39and
0:05:40in trials where the i-vector system has provided a
0:05:44likelihood ratio greater than one
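A minimal sketch of the criterion just described, that is, different-speaker trials for which the system still reported a likelihood ratio greater than one. The tuple layout `(trial_id, is_target, likelihood_ratio)`, the function name and the toy values are assumptions made for illustration; they are not taken from the paper or from the talk.

```python
def falsely_accepted(trials):
    """Return the ids of non-target trials whose likelihood ratio exceeds one.

    trials: iterable of (trial_id, is_target, likelihood_ratio) tuples
            (hypothetical layout, for illustration only).
    """
    return [
        trial_id
        for trial_id, is_target, likelihood_ratio in trials
        if not is_target and likelihood_ratio > 1.0
    ]


# Toy values, not real scores.
toy_trials = [
    ("trial_a", False, 54.0),  # different speakers, LR > 1: falsely accepted
    ("trial_b", False, 0.2),   # different speakers, LR < 1: correctly rejected
    ("trial_c", True, 12.0),   # same speaker: not a false acceptance
]

selected = falsely_accepted(toy_trials)
```

Only the first toy trial would be passed on for perceptual analysis under this rule.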
0:05:48as they have a limited time for analysis we needed to select a subset
0:05:53of trials
0:05:55so we used the scores from our submission to nist two thousand
0:06:01and ten
0:06:02and what we did was a outlier proper selection
0:06:06first of all we took the sixteen false acceptances that we actually
0:06:12had
0:06:12and with the it to eight
0:06:15with the eight is a set
0:06:17but also as those trials were specifically selected to be especially difficult for
0:06:25humans just in case that was at peace stuff on it for that for the
0:06:30analysis we also selected fifty different false acceptance trials from the sre ten
0:06:38and in that case we had thousands of different
0:06:44trials so we selected just those with log likelihood ratios in the range
0:06:49from three to five which translates into likelihood ratios between twenty and
0:06:54one hundred and fifty also so to those were a big are for example systems
0:06:59that we usually
0:07:01and how when we use our i-vector systems with
0:07:05and eight now with the real by a lot of availability
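The conversion mentioned above can be checked with a short sketch. It assumes natural logarithms, which is consistent with the quoted range of roughly twenty to one hundred and fifty; the function name and the `(trial_id, llr)` layout are hypothetical.

```python
import math


def select_by_llr(trials, low=3.0, high=5.0):
    """Keep trials whose natural-log likelihood ratio lies in [low, high].

    trials: iterable of (trial_id, llr) tuples (hypothetical layout).
    """
    return [(trial_id, llr) for trial_id, llr in trials if low <= llr <= high]


# Bounds of the selected band expressed as plain likelihood ratios.
lr_low = math.exp(3.0)   # about 20.1
lr_high = math.exp(5.0)  # about 148.4

# Toy scores, not real evaluation data.
toy_scores = [("t1", 2.4), ("t2", 3.7), ("t3", 5.0), ("t4", 6.1)]
band = select_by_llr(toy_scores)
```

With these toy scores only `t2` and `t3` fall inside the band.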
0:07:10and after those we yes end and all sixty six trials and they are there
0:07:15are short rehearing not the about the mean this but trial they select it does
0:07:21for a little work and eighteen trials nine male and female for them probably it's
0:07:26a it's a and fourteen from a test everything
0:07:37this is the final list which is in the paper i just want to show it
0:07:41because we will refer to every trial using the
0:07:46number of the target id
0:07:49ability of which one of the speakers
0:07:53second disclaimer i'm not a phonetician and i even have problems with english so
0:07:59okay i would be talking about but of things that my colleagues is therefore that
0:08:04takes a lot declared it so yes
0:08:06my apologies to phoneticians if i say something not right
0:08:12and this is the range of features that they will explore they will be
0:08:17denoted by phonation type temporal characteristics what extent means that what the characteristics degree
0:08:23of nasality or something like that also the type of non-linguistic features or
0:08:28overall impressions
0:08:31so that they will just
0:08:33what they will extend
0:08:36what they did with the selected trials was to perform a detailed hearing of
0:08:41about one hour per each one of the trials and we focus on features
0:08:46which are present all along the conversation
0:08:50i will show some samples
0:08:52but that is
0:08:54the differences that they are finding are present along
0:08:58the whole conversation
0:09:02and those comparison will be maybe linguistically k compare compatible segment example select you think
0:09:08that set consisting of motown and finally some of the observation would be confidence through
0:09:15acoustically or estimate a and then
0:09:20by seasonal i used in mentioning that might expect so you don't seem a spectrogram
0:09:27so the last part of my presentation will be simply showing some of the
0:09:32issues found
0:09:34in every case i will show the number of the trial where
0:09:43the audio came from and also the likelihood ratio so you know the
0:09:48degree of support that the i-vector system has given to
0:09:52the same speaker hypothesis so we know in advance they are different
0:09:56this the i-vectors is that we say
0:09:59and then the same of these c same speaker and we will see it for
0:10:04every trial
0:10:05and the that the that fault
0:10:08degree of support of that are that can easily and english
0:10:12all possible this is a case without a very high misleading value on the three
0:10:19just and the operator what we use an obtain even for targets
0:10:25and in that case for example what they found is that this for speech a
0:10:31lot of the whole conversation is
0:10:33and not different
0:10:35no but we do you wanna go well
0:10:39the it's for the blue line
0:10:42for the right one
0:10:44i really but i four
0:10:48a sound like different by the that are over a regular or you are well
0:10:55i really i four
0:11:02another set of features that they found
0:11:05is about loudness variability
0:11:08in declarative sentences people usually tend to decrease the energy at the end
0:11:13at least that's what happens with the first speaker in that case
0:11:24while the second speaker in that trial is
0:11:27keeping the same stress and making a special effort to keep that
0:11:34loudness
0:11:42and this is consistently repeated during the whole conversation
0:11:48in this case and which has which had a celebration of at a smaller value
0:11:53obviously value and there's a
0:11:57only dysphonic voice you once only one of the sides of the conversation is that
0:12:09they have no idea what are okay
0:12:15they have no idea what like are okay
0:12:22is that is for the one
0:12:24well there are no but neural network grammar
0:12:30well there are no but you'll never bigger
0:12:34for example you that are compared to the one light both phase right
0:12:39but
0:12:40and this is the spectral analysis of the of that powering latt uses a
0:12:46without hi everyone would ratio on you know we have
0:12:50much lower
0:12:54another type of situation that was found is the presence of creaky voice
0:12:58for example it is not very usual to find a speaker like the second one
0:13:03here who just speaks during all the conversation with creaky voice
0:13:08i
0:13:11i normal rate and this
0:13:15second one no you know
0:13:18no you know
0:13:22this is not very frequent but it is present in this case and it's very
0:13:25creaky but what is quite usual is the presence of creaky voice at
0:13:32the end of the phrase like here
0:13:35we will play a sample here in that case the likelihood ratio the misleading likelihood
0:13:42ratio was about fifty
0:13:47two
0:13:51one segment well
0:13:54well
0:13:56well
0:13:58well
0:14:02we also found issues about tremor voices where the difficulty is
0:14:08to hold the pitch it's a similar segment with that type of speech you
0:14:15can see the
0:14:16tendency of the mean value is quite similar however in the second one we have
0:14:21huge oscillations problems to maintain that
0:14:27i together i
0:14:32i get a very i
0:14:36second one
0:14:37we will be known
0:14:42no
0:14:51also a feature that was found was about the speech rate
0:14:56for example in that case there are two different speakers which showed different
0:15:02levels of activity
0:15:04what about how would be better marketing
0:15:08moreover
0:15:12it was bigger really
0:15:14we were able to leave
0:15:20there are also issues of hyperarticulation for example
0:15:24the phase
0:15:25really different see if you're selling you know
0:15:29one the other one i for like you know
0:15:33well
0:15:34almost basis some
0:15:36also this can be found in other cases with the
0:15:42without using any of a key and where the formant a three of on here
0:15:47it's much more the about more standard for speaker
0:15:52your
0:15:55second
0:15:56huh
0:15:58the second formant is much lower than the
0:16:04one for the first speaker
0:16:06also there may be found differences in the specific patterns of realisation of some
0:16:12sounds for example one pretty frequent is finding differences in the type of s that the
0:16:17speaker produces
0:16:19for example in that case the s in that speaker starts at five
0:16:27hundred hertz while the s in the second speaker
0:16:32starts above
0:16:34three thousand which is a more standard s
0:16:38i
0:16:45also cases where there are problems or differences in the degree of nasalisation
0:16:53sample here i
0:16:56this is like that
0:16:57you don't want together
0:17:01and that of kind of nice of voice when and in this case the other
0:17:06one is hypernasal as if he has a cold or something
0:17:15also that uses about impaired melodic voices
0:17:18so regular
0:17:23no we in
0:17:27what is the one i know you know
0:17:35in some cases they found extralinguistic issues for example the noisy breathing of
0:17:40that speaker
0:17:42you can hear
0:17:44that you are construction some parties
0:17:49for
0:17:58for example
0:18:01well as well
0:18:08while the second one has no noisy breathing at all
0:18:12they're also presents all squats or
0:18:16strong not control of the o
0:18:21g
0:18:26e
0:18:29or not the case of some of the presence of rectly voice
0:18:33e and o
0:18:42i go off all gonna
0:18:47so finally this is the summary of
0:18:51this work where the idea is that if you look at the rows of the
0:18:55table you can find the number of times that one given feature is
0:19:02found
0:19:03but it's more relevant to look trial by trial at the columns of the table
0:19:09and see that
0:19:11for every trial there are
0:19:13there's an average of about four different types of differences that are found
0:19:19of special interest to us if we want to make an automatic process to detect
0:19:25some kinds of features are the features related to phonation type well phone
0:19:33creaky also and those
0:19:37related to specific patterns of realisation of specific sounds
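The two ways of reading the table, by feature and by trial, can be sketched as follows. The three features and the 0/1 flags below are toy values chosen only to show the arithmetic; they are not the figures from the paper, and the function names are hypothetical.

```python
# Toy feature-by-trial table: each list holds one flag per trial,
# 1 meaning that a difference of that type was noted in the trial.
table = {
    "phonation type": [1, 0, 1],
    "speech rate": [0, 1, 1],
    "creaky voice": [1, 1, 1],
}


def count_per_feature(table):
    """Row view: in how many trials was each feature found."""
    return {feature: sum(flags) for feature, flags in table.items()}


def count_per_trial(table):
    """Column view: how many types of differences were found in each trial."""
    return [sum(column) for column in zip(*table.values())]


per_feature = count_per_feature(table)
per_trial = count_per_trial(table)
mean_per_trial = sum(per_trial) / len(per_trial)
```

The column view is the one behind the "about four types of differences per trial" figure quoted in the talk; with the toy table above the corresponding mean is 7/3.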
0:19:42so to summarise
0:19:44yes we have shown that perceptual analysis shows no correlation with the i-vector
0:19:51false acceptances
0:19:53and
0:19:54there is detectable a useful information goals trials that just produce away from poland uses
0:20:01what one bs recognition rate is
0:20:04furthermore there's like
0:20:06a relational
0:20:07and specifically the but the realisation that bit of a specific cells
0:20:12but also at would that those could provide an
0:20:17we try to reach no signals transcription of the whole utterances and they could be
0:20:22used to provide some kind of soft information or
0:20:26and
0:20:28this what specific highlight the inter some provide an objective measurements about this for you
0:20:34not the spectral features especially for speaker
0:20:38thank you
0:21:00just listening to the
0:21:01second creaky one it sounded like there was actually clipping happening in the first
0:21:06creaky voice
0:21:08solves one it
0:21:09perhaps the reason the system to see the same because audio clip like three
0:21:16was there any analysis on when you when people listening to these false like taking
0:21:22part of the audio acquisition and one that was quality as well
0:21:28there was no
0:21:30special analysis or pre-processing of the data we just selected the data
0:21:34as it was and what is given to them and it what have phone from
0:21:40the phone at finding just what the what they what they did so
0:21:45how can you tell them so
0:21:57what's the variance from the sets consist of experts on
0:22:02to ten
0:22:05that's good
0:22:06there was a very high actually the second one was a student of the rate
0:22:12was just you from one to and
0:22:15maybe they provide for they come from the same school of listening
0:22:19and then the degree of agreement ones
0:22:24impressing we will be working completely separate
0:22:29i we have to say that there were no this is chosen there were no
0:22:32scoring sorry what does i found difference on i five difference on but the degree
0:22:37of
0:22:38but i can say that it was almost exactly the same maybe there was one
0:22:43of the differences and that one of the informant the of the on
0:22:53i was wondering
0:22:54since you only used
0:22:57non-target trials
0:22:59if you had conducted the same experiment with target trials
0:23:04how many of those differences would they also find especially the prosodic differences
0:23:12of course they will find a lot of them what
0:23:14we are trying to do is to look for clues on where to
0:23:20look for
0:23:21for different information and of course that prosodic information
0:23:28is very easily modified and can depend a
0:23:34lot on the type of conversation
0:23:36that's why i stress the idea of the issues of
0:23:41voice production and specific patterns of realisation which can be much more dependent upon
0:23:46the speaker but
0:23:48of course this part of the word that could be don't and of course they
0:23:52would because when
0:23:55i suppose like then participate in that kind of a humans just this evaluation they
0:24:01also did not
0:24:09yes as the result of this analysis the use it just the but kind of
0:24:14features that we used
0:24:17system the future
0:24:20so you mentioned prosody and duration what do you suggest
0:24:28that we look at for improving the system
0:24:33i'm not suggesting anything special i'm just giving the information on what they found but what
0:24:38i'm saying is that for example the one noise
0:24:41those voice quality features and
0:24:43specific patterns of realisation of some sounds have a good degree of
0:24:49parameters that can be
0:24:50properly detected
0:24:52let's see if they can improve the overall system