0:00:14 | This work was done in collaboration with a very large number of colleagues, and they did the better part of the work.
0:00:24 | I'd like to thank Désiré, George, Daniel, Jack, Tommy, Alvin, Alan, Mark, and Doug.
0:00:37 | So the goal of the challenge was to support and encourage development of new methods for speaker detection utilizing i-vectors. The intent was to explore new ideas in machine learning for use in speaker recognition, to try to make the field more accessible to people outside of the audio processing community, and to improve the performance of the technology.
0:01:05 | The challenge format, for people who don't know, was to use i-vectors: rather than audio, it was the i-vectors themselves that were distributed. And it was all hosted on a web platform, so it was entirely online; the registration, the system submission, and receiving results were all online.
0:01:26 | The reason for using i-vectors and the web platform was to attempt to expand the number and types of participants, including ones from the ML community, and to allow iterative submissions with fast turnaround in order to support research progress during the actual evaluation.
0:01:48 | Another thing that was different from what people may be accustomed to with the regular SRE was that a large development set of unlabeled i-vectors was distributed to be used as dev data. The intent there was to encourage new, creative approaches to modeling, and in particular the use of clustering to improve performance.
0:02:12 | In addition to these things, one thing we were hoping to do was to set a precedent, or at least have a proof of concept, for future evaluations where there can be web-based registration, potentially data distribution, and results submission, trying to make this more efficient and more user-friendly for the community.
0:02:36 | So the objectives driving the data selection were to include multiple training sessions for each target speaker in the main evaluation test. In recent SREs an optional test has involved multiple training sessions, but in this challenge we wanted to include that for everyone as the main focus.
0:03:01 | We also used same-handset target trials and cross-sex non-target trials, both of which are unusual for the regular SRE.
0:03:12 | Also something different was drawing the speech durations for the i-vectors from a log-normal distribution, as opposed to some discrete uniform durations. The reason for this was that it is more realistic, and it's a challenge that people seemed eager to address; and also, varying the duration allows us to do post-evaluation analysis.
0:03:43 | So the task is speaker detection, which hopefully everybody here knows by the third day of the conference.
0:03:49 | Each system was evaluated over a set of trials, where each trial compared a target speaker model, in this case a set of five i-vectors, and a test speech segment comprising a single i-vector.
0:04:05 | The system determines whether or not the speaker in the test segment is the target speaker, outputting a single real number; no decision was necessary.
0:04:12 | The trial outputs are then compared to ground truth to compute a performance measure, which for the i-vector challenge was the DCF.
0:04:24 | Hopefully people know what target trials, non-target trials, misses, and false alarms are. Does anyone not know that? Okay, if not, come see me afterwards.
0:04:33 | The measure was the DCF, which is essentially just the miss rate plus one hundred times the false alarm rate, and the official overall measure was the minDCF, seen here.
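As a concrete illustration of the metric just described, here is a minimal sketch (not the official NIST scoring tool) of computing DCF = P_miss + 100 · P_fa over a sweep of thresholds and taking the minimum:

```python
import numpy as np

# Illustrative sketch of the challenge metric: for each threshold,
# DCF = P_miss + 100 * P_fa, and minDCF is the minimum over all
# thresholds. This is not the official NIST scoring code.

def min_dcf(scores, labels, fa_weight=100.0):
    """scores: real-valued trial outputs; labels: True for target trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_tar = labels.sum()
    n_non = (~labels).sum()
    best = np.inf
    # Sweep thresholds at every observed score, plus one above the max
    # so the "reject everything" operating point is included.
    for t in np.append(np.unique(scores), scores.max() + 1.0):
        p_miss = np.sum(labels & (scores < t)) / n_tar
        p_fa = np.sum(~labels & (scores >= t)) / n_non
        best = min(best, p_miss + fa_weight * p_fa)
    return best
```

With perfectly separated scores the minDCF is zero; with fully swapped scores it is one (the best the sweep can do is miss every target).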
0:04:57 | So the challenge i-vectors were produced with a system developed jointly between Johns Hopkins and MIT Lincoln Labs; it uses standard MFCCs and deltas as the acoustic features, along with a trained GMM.
0:05:17 | The source data were the LDC Mixer corpora, in particular Mixers 1, 3, and 7 among others, and included around sixty thousand telephone call sides from about six thousand speakers.
0:05:28 | The durations of these calls were up to five minutes, drawn from a log-normal distribution with a mean of nearly forty seconds.
0:05:39 | For each selected segment, participants were provided with a 600-dimensional i-vector, as well as the duration of the speech from which the i-vector was extracted.
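A minimal sketch of how such durations could be drawn: log-normal samples truncated at five minutes. The talk gives only the mean (~40 s); the spread (`sigma`) used here is an assumption for illustration, not the value used for the challenge.

```python
import numpy as np

# Sketch of drawing call-side speech durations: log-normal with a
# target arithmetic mean of ~40 s, truncated at 300 s (five minutes).
# sigma is an assumed value; only the mean is stated in the talk.

def sample_durations(n, mean_s=40.0, sigma=1.0, max_s=300.0, seed=0):
    """Draw n durations (seconds), rejecting samples longer than max_s."""
    rng = np.random.default_rng(seed)
    # For a log-normal: E[X] = exp(mu + sigma^2/2)  =>  mu = ln(mean) - sigma^2/2
    mu = np.log(mean_s) - sigma**2 / 2
    out = []
    while len(out) < n:
        d = rng.lognormal(mu, sigma, size=n)
        out.extend(d[d <= max_s])  # reject anything over five minutes
    return np.array(out[:n])

durations = sample_durations(5000)
```

Truncation pulls the sample mean slightly below the nominal 40 s, which is one reason a post-hoc analysis by duration band (as mentioned later in the talk) is informative.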
0:05:52 | So this is the data, and then the data was partitioned into a development set and an enrollment/test set.
0:05:59 | For the development partition, the calls were from speakers without test data; it consisted of around thirty-six thousand telephone call sides from around five thousand speakers. And as I said earlier, it was unlabeled, so no speaker labels were given with the development partition.
0:06:21 | For the enrollment and test partition, calls were from speakers with at least five calls from different phone numbers and at least eight calls from a single phone number. It consisted of about thirteen hundred target speakers, and so target models, and almost ten thousand test i-vectors.
0:06:38 | The target trials were limited to ten same and ten different phone number calls per speaker, and the non-target trials came from other target speakers as well as five hundred speakers who were not target speakers: two hundred fifty males and two hundred fifty females.
0:07:00 | The trials consisted of all possible pairs of a target speaker and a test i-vector, about twelve and a half million trials, and included cross-sex non-target trials as well as same-phone-number target trials.
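The quoted trial count follows directly from the cross product of models and test segments. The exact counts are not given in the talk, so the ~1,300 models and ~9,600 test i-vectors below are assumptions based on the rounded figures above:

```python
# Rough sanity check of the quoted trial count: every pair of a
# target model and a test i-vector is a trial. The counts below are
# approximations, since the talk gives only rounded figures.
n_models = 1300   # "about thirteen hundred target speakers"
n_test = 9600     # "almost ten thousand test i-vectors"
n_trials = n_models * n_test
print(n_trials)   # 12,480,000 -- about twelve and a half million
```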
0:07:16 | The trials were divided into two randomly selected subsets; someone asked about this, and yes, the speakers did overlap between the progress subset and the evaluation subset.
0:07:29 | Forty percent was used for the progress subset, which was what was used to monitor progress. For people maybe not familiar with the challenge, there was a progress board where people could see how they were doing and how other people were doing, and that was updated using the progress set.
0:07:59 | Sixty percent of the data was held out until the end of the evaluation period, and then the system submissions were scored for the official results using this remaining sixty percent.
0:08:18 | So, some structure to the evaluation. First, system output for each trial could be based only on the trial's model and test i-vectors, as well as the durations provided and the provided development data. Second, normalization over multiple test segments or target speakers was not allowed, and use of evaluation data for non-target speaker modeling was not allowed. Third, training system parameters using data not provided as part of the challenge was also ruled out.
0:08:51 | Rules one and two are pretty typical for the NIST evaluations, but rule three is actually new. The intent was to remove data engineering and also to encourage participation from sites that don't have a lot of their own speech data.
0:09:05 | So in terms of participation, there were about three hundred registrants from about fifty countries, and one hundred forty of the registrants, from one hundred five unique sites, made at least one valid submission; so some number of people registered but weren't able to submit a system.
0:09:25 | The number of submissions actually exceeded eight thousand. If we compare these numbers to the SREs, we see a really large increase in participation, which we were excited to see.
0:09:38 | In addition to receiving data, a baseline system was distributed with the evaluation.
0:09:47 | It used a variant of cosine scoring, essentially in five steps: estimate a global mean and covariance on the unlabeled data; use that mean and covariance to center and whiten the i-vectors, and project them onto the unit sphere; then, for each model, average its five i-vectors and project those onto the unit sphere; and then compute the inner product.
0:10:12 | One thing to note is that because the dev data was unlabeled, WCCN and LDA were not possible to use.
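The five steps just described can be sketched roughly as follows. This is a minimal NumPy illustration under assumed array shapes (unlabeled dev i-vectors, models of five i-vectors each, single test i-vectors), not the baseline code that was actually distributed:

```python
import numpy as np

# Sketch of a cosine-scoring baseline of the kind described above.
# Shapes assumed: dev (n_dev, dim), models (n_models, 5, dim),
# tests (n_tests, dim). Illustrative only.

def whiten_and_normalize(X, mean, W):
    """Center, whiten, and project rows of X onto the unit sphere."""
    X = (X - mean) @ W
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def score_trials(dev, models, tests):
    # Step 1: estimate a global mean and covariance on the unlabeled dev data.
    mean = dev.mean(axis=0)
    cov = np.cov(dev, rowvar=False)
    # Whitening transform from the Cholesky factor of the inverse covariance.
    W = np.linalg.cholesky(np.linalg.inv(cov))
    # Step 2: center, whiten, and length-normalize the test i-vectors.
    tests_n = whiten_and_normalize(tests, mean, W)
    # Step 3: average each model's five i-vectors, then normalize likewise.
    models_n = whiten_and_normalize(models.mean(axis=1), mean, W)
    # Steps 4-5: inner products of unit vectors give cosine scores
    # for all model/test trial pairs at once.
    return models_n @ tests_n.T   # shape (n_models, n_tests)
```

Because both sides are projected onto the unit sphere, every score is a cosine similarity bounded by one in absolute value.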
0:10:21 | In addition to that, there was an oracle system that was not provided but kept at JHU, which had access to the development data speaker labels.
0:10:35 | That system was gender dependent, with a 400-dimensional speaker space. All of the i-vectors for each model were length-normalized and then averaged, and it discarded i-vectors with duration less than thirty seconds, which actually reduced the development set quite a bit.
0:11:00 | And here we see our first result. The red line is the oracle system and the blue line is the baseline system; the solid lines are on the evaluation set of trials, i.e. the sixty percent that were held out, and the dotted lines are on the progress set.
0:11:22 | So basically, the gap between these lines indicates the potential value of having speaker labels, and the hope was to be able to use clustering techniques on the development set to close this gap.
0:11:42 | Here we see results. Here is the minDCF of the oracle system and of the baseline system; the blue line is the progress set and the red line is the eval set. And here we see the top ten performing systems and how they did on the progress set and on the eval set.
0:12:08 | Performance on the eval set was consistently better than on the progress set; I'm not exactly sure why, other than some random variation.
0:12:16 | And seventy-five percent of participants submitted a system that outperformed the baseline, which we were really pleased to see as well.
0:12:23 | Are we doing okay on time?
0:12:26 | Okay, great. Oops, let's skip this.
0:12:34 | Here we see progress over time. The green line is on the eval subset, the blue line is on the progress set, and the red line is on the progress set too. Basically, the green line is the very best score observed to date, same with the blue line, and the red line is for the system that ended up with the top performance at the end. So we see a history of the performance over time.
0:13:09 | A couple of things that we noted. The performance leveled off after about six weeks: we ran this from December through April, and basically after six weeks not much further progress was observed.
0:13:24 | Also interesting to note was that the leading system did not lead from December till February; it was only after that period that it took the lead and stayed there.
0:13:42 | Here we see performance by gender. On the left of each of these is the leading system, and on the right is the baseline system.
0:13:54 | One thing kind of interesting to note is that the leading system did worse on same-sex trials than on male-only and female-only trials, which might be unexpected, but I think an explanation for this is that there were calibration issues.
0:14:14 | Here we see performance by same and different phone number. Here the blue is the baseline; on the left is same number, and on the right is different number.
0:14:24 | And here, as with gender, we see limited degradation in performance due to the change in phone number for the leading system; it was very close, even compared to the baseline, which was fairly close.
0:14:42 | So there's some additional information available. You can see the Odyssey paper for more results, for example more information about the progress over time and gender effects, as well as same and different phone numbers.
0:14:57 | We also have an Interspeech paper that does some analysis of participation; it gives some of these same results, but on the progress set, while the Odyssey paper focuses entirely on the eval set.
0:15:08 | And there's lots of work to do, so we may have future papers on duration, age, and other results. For additional information, please feel free to contact us.
0:15:20 | So, some conclusions. We thought that the process worked, which was very exciting for us: the website was brought up and stayed up, which was good.
0:15:33 | Participation exceeded that of prior SREs, which was one of the goals, and many sites significantly improved on the baseline system.
0:15:44 | Further investigation and feedback will be needed in order to determine the extent to which the participation was from outside of the audio processing community. People who signed up were eventually asked if they were from the audio processing community, but we didn't think to do that during the initial sign-up, so in all other cases we don't know whether the additional participation came from outside the audio processing community or not.
0:16:18 | Thousands of submissions provide data for further analysis, which we look forward to doing. These include things like clustering of unlabeled data, gender differences across and within trials, effects of handsets, and the role of duration.
0:16:41 | And speaking of future work, we plan to enhance the online platform; for example, we would like to put analysis tools on the platform for participants to use.
0:16:54 | We expect to offer further online challenges, in part because they're more readily organized and also because it's possible to efficiently reuse test data.
0:17:09 | But we expect that these results will affect the full-fledged evaluations, the typical SREs, as well. For example, we'd like to have increasingly web-based and user-friendly procedures for registration and for data distribution.
0:17:28 | And it's possible that we'll use separate evaluation datasets: one for iterations, to graph performance, and another held out with limited exposure. We've seen this used in past NIST evaluations, and it may see renewed use in the SREs.
0:17:55 | Thank you very much.
0:18:07 | [Question] Craig, can you go back to slide twenty-one? I'm wondering: is this the same leading system in both of those conditions, on the same sets?
0:18:22 | [Answer] I'm sure that it is the same system in both.
0:18:33 | [Question] The dimension used in the oracle was different from what you distributed: one was six hundred and one was four hundred. So why didn't you keep the same i-vectors for the two? Maybe Lincoln could address that.
0:19:02 | [Question] Craig, in your final slide you mentioned, as the last point, a dataset for iterated use. Are you thinking of something similar to what you have now? The point I'm getting at is: if you want to train, for example, calibration or fusion, then it's very nice to have as feedback, for example, the derivatives of your system parameters with respect to those scores. So do you think it would be possible to do that?
0:19:45 | [Answer] I'm not sure what the question is; is this an issue of not having speaker labels for development, or...?
0:20:00 | [Question] Well, we want to be able to train a fusion score on that kind of data, so can you see that happening? Because if you would just give us the data we could do that, but if the data stays on the other side, on your side, then that's more difficult and, I'm sure, more complex.
0:20:26 | [Answer] Right. Yes, and one thing that maybe I should clarify: this was really meant in the context of the SRE. In other NIST evaluations they sometimes reuse a dataset from one year to another, and they also have what I guess is called a progress set, but they use it in a different sense than we are using it here: people won't get the key for that, but they will have the key for the reused set. Does that address your question? Okay.
0:21:09 | [Question] One more question; it's probably not relevant to the rules, but those models: are all the models from different speakers, or were there some speakers with more than one model, so that the distribution would be weighted toward them or not?