0:00:14 | the title of my presentation is what are we missing with i-vectors |
---|
0:00:19 | a perceptual analysis of i-vector based falsely accepted trials and it is done in collaboration with people |
---|
0:00:26 | from the phonetics lab of the csic which is the research |
---|
0:00:34 | institution for many years established in spain |
---|
0:00:37 | so plus not talking about |
---|
0:00:42 | i-vectors |
---|
0:00:46 | yes we will not i i-vectors but tones |
---|
0:00:50 | and |
---|
0:00:51 | those i-vectors give us a compact and elegant solution where every utterance can be represented |
---|
0:00:57 | in a fixed dimension vector |
---|
0:01:01 | they have also given us a great and efficient performance over a wide range |
---|
0:01:06 | of conditions and allow us to |
---|
0:01:09 | apply state-of-the-art pattern recognition techniques |
---|
0:01:14 | and more the recently we are able to perform speaker recognition without point it is |
---|
0:01:20 | a really great |
---|
0:01:24 | we have we can avoid a lot of problems |
---|
0:01:28 | and especially and i think that in the point |
---|
0:01:32 | we don't produce calibrated likelihood ratios to forensic speaker recognition when we have lots of |
---|
0:01:38 | i think that in accumulating a for this we have seen a nice paper we |
---|
0:01:43 | wanted from that if i own do that's not that's |
---|
0:01:48 | in this paper some but what if you feel you just a and have |
---|
0:01:52 | wally score be in this paper has gone given a step farther when they have |
---|
0:01:56 | not only over a being able to calculate an icon regularly richer when they have |
---|
0:02:03 | i have recordings from the of these channel intercept assistant but they also have obtain |
---|
0:02:08 | the day i select aggregation for to collapse the all that was to do so |
---|
0:02:14 | they have an assessed not just a little bit about all the pros to sell |
---|
0:02:18 | this is a |
---|
0:02:20 | a great |
---|
0:02:22 | as a starting point but we have to look a little more in detail |
---|
0:02:27 | about |
---|
0:02:28 | i-vectors |
---|
0:02:30 | and they explicitly courses lead to ignore a high-level and source little information |
---|
0:02:37 | so the speaker information |
---|
0:02:39 | is reduced to the short term but this has a lot of |
---|
0:02:44 | advantages for features to for conditional for real points |
---|
0:02:49 | some users and imitate also so |
---|
0:02:52 | but still spectral only detection decisions |
---|
0:02:57 | probably will be uncorrelated with human perception this morning joe highlighted this issue |
---|
0:03:04 | of a possible loss of credibility of the system if the it's a very user |
---|
0:03:09 | if i boardman ldc perceive rate disagreements between what the system is doing and what |
---|
0:03:17 | what they can see that humans are perceiving |
---|
0:03:22 | moreover we have almost total ignorance on the origin of those |
---|
0:03:27 | detection errors |
---|
0:03:29 | and when we have also you know system we are simply trying to restore system |
---|
0:03:33 | probabilistic but we don't do not fit the specific with them |
---|
0:03:37 | and we can have a transparent system which is very good but finally |
---|
0:03:42 | if we have an error we cannot explain at all what's the reason of |
---|
0:03:47 | the |
---|
0:03:49 | of the error |
---|
0:03:49 | and it's very important to be able to provide explanations of why the |
---|
0:03:53 | system is working in a specific way |
---|
0:03:57 | and just a final reminder we usually assess systems on average error rate |
---|
0:04:04 | but from the user's perspective |
---|
0:04:06 | they perceive performance case by case so if there is a |
---|
0:04:10 | large error even in a single trial the system will be affected as a |
---|
0:04:15 | as a whole |
---|
0:04:17 | so what we did for the paper was to select a set of i-vector |
---|
0:04:24 | based false acceptance trials from |
---|
0:04:28 | sre ten and hasr sre ten |
---|
0:04:32 | and we're gonna some a team of us find useful additions force the not english |
---|
0:04:38 | a great |
---|
0:04:40 | and |
---|
0:04:41 | the objective was to explore to better understand what they do with their with that |
---|
0:04:47 | a date down and all |
---|
0:04:50 | that it just a and sre that's |
---|
0:04:54 | as we have a and we might with that of data what target type of |
---|
0:05:00 | types of differences they could find and also the number of different |
---|
0:05:06 | types of differences that they can find in a single trial |
---|
0:05:11 | and first of all a disclaimer this is not a paper on |
---|
0:05:16 | speaker recognition by humans |
---|
0:05:17 | both of them know in advance that the speakers in every trial are |
---|
0:05:22 | different |
---|
0:05:23 | so |
---|
0:05:23 | all we are asking them is to highlight the differences that they find in the |
---|
0:05:29 | and between the two utterances but without any a decision then used fourteen yes to |
---|
0:05:36 | see what they can find a in a |
---|
0:05:39 | and |
---|
0:05:40 | in trials where the i-vector system has provided a |
---|
0:05:44 | likelihood ratio greater than one |
---|
0:05:48 | as they need a long time for the analysis we had to select a subset |
---|
0:05:53 | of trials |
---|
0:05:55 | so we used the scores from our submission to nist two thousand |
---|
0:06:01 | and ten |
---|
0:06:02 | and what we did was a outlier proper selection |
---|
0:06:06 | first of all we took the sixteen false acceptances that we actually |
---|
0:06:12 | had |
---|
0:06:12 | with the hasr |
---|
0:06:15 | with the hasr set |
---|
0:06:17 | but also as those trials were specifically selected to be specially difficult for |
---|
0:06:25 | humans just in case that was a bias for the |
---|
0:06:30 | analysis we also selected fifty different false acceptance trials from the sre ten |
---|
0:06:38 | and in that case we had thousands of different |
---|
0:06:44 | trials and the condition was to select just those with log likelihood ratios in the range |
---|
0:06:49 | from three to five which translates into likelihood ratios between twenty and |
---|
0:06:54 | one hundred and fifty so those were big errors for the systems that we usually |
---|
0:06:59 | that we usually |
---|
0:07:01 | and how when we use our i-vector systems with |
---|
0:07:05 | and eight now with the real by a lot of availability |
---|
0:07:10 | and after those we ended with sixty six trials and after |
---|
0:07:15 | a short hearing of about one minute per trial they selected those |
---|
0:07:21 | for a detailed work eighteen trials nine male and nine female four of them from |
---|
0:07:26 | hasr and fourteen from sre ten |
---|
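
The selection rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the trial identifiers and scores are invented, the function name `select_false_acceptances` is hypothetical, and natural logarithms are assumed because exp(3) is about 20 and exp(5) is about 148, which matches the range of twenty to one hundred and fifty mentioned in the talk.

```python
import math

# Hypothetical non-target trials: (trial_id, natural-log likelihood ratio).
# These values are invented for illustration only.
trials = [("t01", 0.4), ("t02", 3.2), ("t03", 4.9), ("t04", 6.1), ("t05", -1.5)]


def select_false_acceptances(trials, low=3.0, high=5.0):
    """Keep non-target trials whose log-LR lies in [low, high]."""
    return [(tid, llr) for tid, llr in trials if low <= llr <= high]


selected = select_false_acceptances(trials)

# Convert the log-LR bounds back to plain likelihood ratios.
lr_low, lr_high = math.exp(3.0), math.exp(5.0)

print(selected)
print(round(lr_low, 1), round(lr_high, 1))
```

Any non-target trial with a likelihood ratio above one already supports the wrong (same speaker) hypothesis; the band used here simply keeps the strongly misleading ones.
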
0:07:37 | this is the final list which is in the paper i just want to show it |
---|
0:07:41 | because we will refer to every trial using the |
---|
0:07:46 | the number of the target id |
---|
0:07:49 | ability of which one of the speakers |
---|
0:07:53 | second disclaimer i'm not a phonetician and i even have problems with english so |
---|
0:07:59 | okay i would be talking about but of things that my colleagues is therefore that |
---|
0:08:04 | takes a lot declared it so yes |
---|
0:08:06 | my apologies to the phoneticians if i say something not right |
---|
0:08:12 | and this is the rate of features that they will explore they will we be |
---|
0:08:17 | noted by really deformation type temporal characteristics what extent means that what the characteristics degree |
---|
0:08:23 | of the solid deep or something like than all the type of non-linguistic features or |
---|
0:08:28 | what robert was impressions of |
---|
0:08:31 | so that they will just |
---|
0:08:33 | what they will extend |
---|
0:08:36 | what they did with the selected trials was to perform a detailed hearing of |
---|
0:08:41 | about one hour per trial and we focus on the features |
---|
0:08:46 | which are present all along the conversation |
---|
0:08:50 | i will show some samples |
---|
0:08:52 | but that is |
---|
0:08:54 | the differences that they are finding are present along |
---|
0:08:58 | the whole conversation |
---|
0:09:02 | and those comparison will be maybe linguistically k compare compatible segment example select you think |
---|
0:09:08 | that set consisting of motown and finally some of the observation would be confidence through |
---|
0:09:15 | acoustically or estimate a and then |
---|
0:09:20 | by seasonal i used in mentioning that might expect so you don't seem a spectrogram |
---|
0:09:27 | so the last part of my presentation will be simply showing some of the |
---|
0:09:32 | audio files |
---|
0:09:34 | in every case i will show the number of the trial and where |
---|
0:09:43 | the audio came from and also the likelihood ratio that is the |
---|
0:09:48 | degree of support that the i-vector system has given to |
---|
0:09:52 | the same speaker hypothesis although we know in advance they are different |
---|
0:09:56 | this the i-vectors is that we say |
---|
0:09:59 | and then the same of these c same speaker and we will see it for |
---|
0:10:04 | every trial |
---|
0:10:05 | and the that the that fault |
---|
0:10:08 | degree of support of that are that can easily and english |
---|
0:10:12 | all possible this is a case without a very high misleading value on the three |
---|
0:10:19 | just and the operator what we use an obtain even for targets |
---|
0:10:25 | and in that case for example what they found is that this for speech a |
---|
0:10:31 | lot of the whole conversation is |
---|
0:10:33 | and not different |
---|
0:10:35 | no but we do you wanna go well |
---|
0:10:39 | the it's for the blue line |
---|
0:10:42 | for the right one |
---|
0:10:44 | i really but i four |
---|
0:10:48 | a sound like different by the that are over a regular or you are well |
---|
0:10:55 | i really i four |
---|
0:11:02 | and a set of features that they then used |
---|
0:11:05 | you just about the long as variability |
---|
0:11:08 | in declarative sentences people usually tend to decrease the energy at the end |
---|
0:11:13 | at least that's what happens with the first speaker in that case |
---|
0:11:24 | while the second speaker in that trial is |
---|
0:11:27 | keeping the same stress can do you and we'll especially for to keep that log |
---|
0:11:34 | in this |
---|
0:11:42 | and this is consistently repeated during the whole conversation |
---|
0:11:48 | in this case and which has which had a celebration of at a smaller value |
---|
0:11:53 | obviously value and there's a |
---|
0:11:57 | dysphonic voice in only one of the sides of the conversation that is |
---|
0:12:09 | they have no idea what are okay |
---|
0:12:15 | they have no idea what like are okay |
---|
0:12:22 | is that is for the one |
---|
0:12:24 | well there are no but neural network grammar |
---|
0:12:30 | well there are no but you'll never bigger |
---|
0:12:34 | for example you that are compared to the one light both phase right |
---|
0:12:39 | but |
---|
0:12:40 | and this is the spectral analysis of the of that powering latt uses a |
---|
0:12:46 | without hi everyone would ratio on you know we have |
---|
0:12:50 | much lower |
---|
0:12:54 | another type of situation that was found is the presence of creaky voice |
---|
0:12:58 | for example this is not very usual to find in a speaker but the second one |
---|
0:13:03 | here just speaks all the conversation with creaky voice |
---|
0:13:08 | i |
---|
0:13:11 | i normal rate and this |
---|
0:13:15 | second one no you know |
---|
0:13:18 | no you know |
---|
0:13:22 | this is not very frequent but it is present in this case and it's very |
---|
0:13:25 | clear but what is quite usual is the presence of creaky voice at |
---|
0:13:32 | the end of the phrase |
---|
0:13:35 | we will pop up a sample here in that case work like ratio measly like |
---|
0:13:42 | results about fifty |
---|
0:13:47 | two |
---|
0:13:51 | one segment well |
---|
0:13:54 | well |
---|
0:13:56 | well |
---|
0:13:58 | well |
---|
0:14:02 | we also found issues about sorry more boys system where you the voice difficult is |
---|
0:14:08 | to haul the bit it's a similar segment with and that type of speech you |
---|
0:14:15 | can see the |
---|
0:14:16 | tendency of the mean value is quite similar however the second one has |
---|
0:14:21 | oscillation problems to maintain that |
---|
0:14:27 | i together i |
---|
0:14:32 | i get a very i |
---|
0:14:36 | second one |
---|
0:14:37 | we will be known |
---|
0:14:42 | no |
---|
0:14:51 | also a feature that was found was about the speech rate |
---|
0:14:56 | for example in that case there are two different speakers which showed different |
---|
0:15:02 | levels of activity |
---|
0:15:04 | what about how would be better marketing |
---|
0:15:08 | moreover |
---|
0:15:12 | it was bigger really |
---|
0:15:14 | we were able to leave |
---|
0:15:20 | there are also issues of hyperarticulation for example |
---|
0:15:24 | the phase |
---|
0:15:25 | really different see if you're selling you know |
---|
0:15:29 | one the other one i for like you know |
---|
0:15:33 | well |
---|
0:15:34 | almost basis some |
---|
0:15:36 | also this can be found in other cases with the |
---|
0:15:42 | without using any of a key and where the formant a three of on here |
---|
0:15:47 | it's much more the about more standard for speaker |
---|
0:15:52 | your |
---|
0:15:55 | second |
---|
0:15:56 | huh |
---|
0:15:58 | the second formant is much lower than the |
---|
0:16:04 | signal for one for speaker |
---|
0:16:06 | also there may be found differences in the specific patterns of realisation of some |
---|
0:16:12 | sounds for example one pretty clear is finding differences in the type of s that the |
---|
0:16:17 | speaker realises |
---|
0:16:19 | for example in that case the s in the first speaker starts at five |
---|
0:16:27 | hundred hertz while the s in the second speaker |
---|
0:16:32 | starts above |
---|
0:16:34 | three thousand which is a more standard s |
---|
0:16:38 | i |
---|
0:16:45 | also cases where there are differences in the degree of nasalisation |
---|
0:16:53 | sample here i |
---|
0:16:56 | this is like that |
---|
0:16:57 | you don't want together |
---|
0:17:01 | and that is a kind of nasal voice and in this case the other |
---|
0:17:06 | one is hyponasal as when we have a cold or something |
---|
0:17:15 | also that uses about impaired melodic voices |
---|
0:17:18 | so regular |
---|
0:17:23 | no we in |
---|
0:17:27 | what is the one i know you know |
---|
0:17:35 | in some cases they found extralinguistic features for example the noisy breathing in |
---|
0:17:40 | that speaker |
---|
0:17:42 | you can hear |
---|
0:17:44 | that you are construction some parties |
---|
0:17:49 | for |
---|
0:17:58 | for example |
---|
0:18:01 | well as well |
---|
0:18:08 | while the second one doesn't show any noisy breathing at all |
---|
0:18:12 | they're also presents all squats or |
---|
0:18:16 | strong not control of the o |
---|
0:18:21 | g |
---|
0:18:26 | e |
---|
0:18:29 | or not the case of some of the presence of rectly voice |
---|
0:18:33 | e and o |
---|
0:18:42 | i go off all gonna |
---|
0:18:47 | so finally this is the summary of |
---|
0:18:51 | this work and the idea is that if you look at the rows |
---|
0:18:55 | you can find the number of times that one given feature is |
---|
0:19:02 | found |
---|
0:19:03 | but it's more relevant to look trial by trial at the columns of the table |
---|
0:19:09 | and see that |
---|
0:19:11 | for every trial there are |
---|
0:19:13 | there's an average of about four different types of differences that are found |
---|
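
A minimal sketch of how such a summary table can be read, by rows (how often one feature was found) and by columns (how many types of difference were found in one trial). The feature names echo the talk, but the 0/1 values are invented for illustration and are not the data from the paper.

```python
# Hypothetical feature-by-trial table: 1 means a consistent difference of
# that type was noted in that trial. All values are invented.
table = {
    "phonation type":    [1, 0, 1, 1],
    "speech rate":       [0, 1, 1, 0],
    "creaky voice":      [1, 1, 0, 0],
    "sound realisation": [1, 1, 1, 1],
}

# Row view: number of trials in which each feature was found.
per_feature = {name: sum(row) for name, row in table.items()}

# Column view: number of different feature types found in each trial.
per_trial = [sum(col) for col in zip(*table.values())]

# Average number of difference types per trial.
mean_per_trial = sum(per_trial) / len(per_trial)

print(per_feature)
print(per_trial, mean_per_trial)
```

With the real table from the paper, the column view is what gives the figure of about four types of difference per trial quoted in the talk.
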
0:19:19 | of special interest to us if we want to make automatic processes to detect |
---|
0:19:25 | some kind of features are the features related to phonation type like |
---|
0:19:33 | creaky voice and also those |
---|
0:19:37 | related to specific patterns of realisation of specific sounds |
---|
0:19:42 | so do you might well |
---|
0:19:44 | yes we have shown that perceptual analyses show null correlation with the i-vector |
---|
0:19:51 | false acceptances |
---|
0:19:53 | and |
---|
0:19:54 | there is detectable a useful information goals trials that just produce away from poland uses |
---|
0:20:01 | what one bs recognition rate is |
---|
0:20:04 | furthermore there's like |
---|
0:20:06 | a relational |
---|
0:20:07 | and specifically the but the realisation that bit of a specific cells |
---|
0:20:12 | but also at would that those could provide an |
---|
0:20:17 | we try to reach no signals transcription of the whole utterances and they could be |
---|
0:20:22 | used to provide some kind of soft information or |
---|
0:20:26 | and |
---|
0:20:28 | this what specific highlight the inter some provide an objective measurements about this for you |
---|
0:20:34 | not the spectral features especially for speaker |
---|
0:20:38 | thank you |
---|
0:21:00 | just listening to |
---|
0:21:01 | second creaky one it sounded like there was actually clipping happening in the first |
---|
0:21:06 | creaky voice |
---|
0:21:08 | solves one it |
---|
0:21:09 | perhaps the reason the system to see the same because audio clip like three |
---|
0:21:16 | was there any analysis on when you when people listening to these false like taking |
---|
0:21:22 | part of the audio acquisition and one that was quality as well |
---|
0:21:28 | there was no it's okay |
---|
0:21:30 | special analysis or preprocessing of the data we just selected the data |
---|
0:21:34 | as it was and what is given to them and it what have phone from |
---|
0:21:40 | the phone at finding just what the what they what they did so |
---|
0:21:45 | how can you tell them so |
---|
0:21:57 | what's the variance from the sets consist of experts on |
---|
0:22:02 | to ten |
---|
0:22:05 | that's good |
---|
0:22:06 | there was a very high actually the second one was a student of the rate |
---|
0:22:12 | was just you from one to and |
---|
0:22:15 | maybe they come from the same school of listening |
---|
0:22:19 | and then the degree of agreement was |
---|
0:22:24 | impressive though they were working completely separately |
---|
0:22:29 | i we have to say that there were no this is chosen there were no |
---|
0:22:32 | scoring sorry what does i found difference on i five difference on but the degree |
---|
0:22:37 | of |
---|
0:22:38 | but i can say that it was almost exactly the same maybe there was one |
---|
0:22:43 | of the differences and that one of the informant the of the on |
---|
0:22:53 | i was wondering |
---|
0:22:54 | since you only used |
---|
0:22:57 | non-target trials |
---|
0:22:59 | if you had conducted the same experiment with the same phoneticians on target trials |
---|
0:23:04 | how many of those differences would they also find especially the prosodic differences |
---|
0:23:12 | of course they will find a lot of them what |
---|
0:23:14 | we are trying to do is to look for clues we rolled analysis nowhere to |
---|
0:23:20 | look for |
---|
0:23:21 | for a different kind of information and of course prosodic information that |
---|
0:23:28 | prosodic information is very easily modified and it can depend a |
---|
0:23:34 | lot on the type of conversation |
---|
0:23:36 | that's why i stress the idea of the issues of |
---|
0:23:41 | voice production and specific patterns of realisation which can be much more dependent upon |
---|
0:23:46 | the speaker but |
---|
0:23:48 | of course this is part of the work that could be done and of course they |
---|
0:23:52 | would because when |
---|
0:23:55 | i suppose like then participate in that kind of a humans just this evaluation they |
---|
0:24:01 | also did not |
---|
0:24:09 | yes as the result of this analysis the use it just the but kind of |
---|
0:24:14 | features that we used |
---|
0:24:17 | system the future |
---|
0:24:20 | so you mention the prosody and duration what do you suggest |
---|
0:24:28 | that we look at for improving system |
---|
0:24:33 | i'm not suggesting anything special i'm just giving the information of what they found but what |
---|
0:24:38 | i'm saying is that the for example the one noise |
---|
0:24:41 | those voice quality features and |
---|
0:24:43 | specific patterns of realisation of some sounds have a good degree of |
---|
0:24:49 | parameters that can be |
---|
0:24:50 | properly detected |
---|
0:24:52 | let's see if they can improve the overall system |
---|