0:00:06 | okay um |
---|
0:00:08 | as he said my name is laura so i'm a phd student at the university of california berkeley |
---|
0:00:14 | and i also work at the international |
---|
0:00:16 | computer science |
---|
0:00:17 | institute or icsi um as many of you know |
---|
0:00:20 | and i would like to |
---|
0:00:20 | just uh |
---|
0:00:21 | also acknowledge my co-author george doddington |
---|
0:00:24 | who was uh |
---|
0:00:25 | a fundamental part of of this work |
---|
0:00:28 | um |
---|
0:00:30 | so just a |
---|
0:00:31 | quick overview it's pretty standard i'll start out with |
---|
0:00:35 | what we're trying to do and why |
---|
0:00:37 | uh then discuss |
---|
0:00:38 | um related work |
---|
0:00:40 | um and go over our approach |
---|
0:00:42 | to the problem uh give you the results |
---|
0:00:45 | uh |
---|
0:00:45 | do a little bit of additional analysis |
---|
0:00:47 | and conclude um |
---|
0:00:49 | with a summary and |
---|
0:00:51 | future work |
---|
0:00:55 | so i think we can |
---|
0:00:56 | all agree that uh automatic speaker recognition |
---|
0:00:59 | system performance |
---|
0:01:00 | depends on a number of factors |
---|
0:01:02 | uh one of which uh are intrinsic speaker characteristics |
---|
0:01:07 | um |
---|
0:01:07 | so there's no surprise that |
---|
0:01:10 | you know as humans we notice that certain speakers sound more alike |
---|
0:01:13 | similarly there's |
---|
0:01:14 | uh |
---|
0:01:16 | and you have that |
---|
0:01:17 | uh |
---|
0:01:17 | system's automatic systems will perform better or worse |
---|
0:01:21 | for different speakers |
---|
0:01:22 | um |
---|
0:01:23 | so the goal of this work |
---|
0:01:25 | is to predict which |
---|
0:01:27 | speaker pairs |
---|
0:01:28 | will be difficult for automatic |
---|
0:01:30 | speaker recognition systems to distinguish |
---|
0:01:32 | uh we did some preliminary work um |
---|
0:01:35 | which showed that |
---|
0:01:37 | speaker pairs that are are hard for one system are also hard for others |
---|
0:01:41 | it's and |
---|
0:01:42 | um |
---|
0:01:43 | and of course you can use the system and |
---|
0:01:45 | select speaker pairs and you'll probably do really well |
---|
0:01:48 | but uh we wanted to |
---|
0:01:49 | stay away from using any one system and instead um |
---|
0:01:53 | have a general approach |
---|
0:01:55 | and just use features that will hopefully capture uh |
---|
0:01:59 | oh |
---|
0:02:00 | degree of |
---|
0:02:00 | uh |
---|
0:02:01 | speaker similarity |
---|
0:02:03 | um |
---|
0:02:04 | the motivation |
---|
0:02:05 | uh |
---|
0:02:06 | besides |
---|
0:02:07 | being an interesting task |
---|
0:02:09 | it is |
---|
0:02:09 | to potentially better focus |
---|
0:02:11 | uh |
---|
0:02:12 | the research and reduce the amount of data needed to estimate |
---|
0:02:15 | system performance but |
---|
0:02:21 | so there's a couple types of |
---|
0:02:22 | related work um the first has to do with the idea of different speakers uh causing different problems |
---|
0:02:28 | um the infamous |
---|
0:02:30 | uh |
---|
0:02:30 | george doddington |
---|
0:02:31 | zoo paper um |
---|
0:02:33 | categorises speakers based on system performance |
---|
0:02:37 | so you have |
---|
0:02:39 | goats |
---|
0:02:40 | um |
---|
0:02:41 | who cause a large number of false rejections as target speakers |
---|
0:02:46 | you have lambs |
---|
0:02:47 | uh who cause a large number of false acceptances as target speakers |
---|
0:02:53 | wolves who cause a large number of false acceptances as impostor speakers |
---|
0:02:58 | and finally the default well behaved |
---|
0:03:01 | sheep |
---|
0:03:02 | um |
---|
0:03:03 | in this work |
---|
0:03:04 | uh we don't actually distinguish between lambs and wolves |
---|
0:03:07 | uh |
---|
0:03:08 | since we're looking at speaker pairs |
---|
0:03:09 | um |
---|
0:03:11 | but |
---|
0:03:11 | we went with wolves in the title because |
---|
0:03:13 | hunting for lambs didn't |
---|
0:03:15 | didn't sound so good |
---|
0:03:18 | um a couple other |
---|
0:03:20 | there's been other work done on uh dealing with these speakers |
---|
0:03:24 | that may be difficult |
---|
0:03:25 | um there's been work that has shown that |
---|
0:03:28 | uh there are performance differences |
---|
0:03:30 | between high and low pitch |
---|
0:03:31 | speakers |
---|
0:03:32 | um and then there's been some uh work done |
---|
0:03:35 | that |
---|
0:03:36 | uh tried |
---|
0:03:37 | to |
---|
0:03:37 | uh |
---|
0:03:38 | develop methods to deal with these |
---|
0:03:40 | problem speakers |
---|
0:03:44 | the other element of related work |
---|
0:03:46 | uh that's relevant |
---|
0:03:47 | is |
---|
0:03:47 | um |
---|
0:03:48 | in the |
---|
0:03:49 | features that are used |
---|
0:03:50 | to |
---|
0:03:51 | describe speakers or characterise speakers |
---|
0:03:53 | so you can draw these from a lot of different types of work obviously uh |
---|
0:03:57 | speaker recognition approaches have used a variety of features |
---|
0:04:01 | uh certainly not an exhaustive list here but things like pitch and energy distributions or dynamics |
---|
0:04:06 | um |
---|
0:04:07 | prosodic statistics |
---|
0:04:09 | uh jitter and shimmer |
---|
0:04:11 | and |
---|
0:04:12 | in looking at |
---|
0:04:13 | perceptual speaker characterisation or discrimination you'll find |
---|
0:04:16 | a lot of formant frequencies and bandwidths and dynamic features |
---|
0:04:20 | and um |
---|
0:04:23 | other acoustic parameters that influence voice individuality include the pitch frequency contour and fluctuation |
---|
0:04:29 | again the formant frequencies and long term average spectrum |
---|
0:04:36 | so our approach |
---|
0:04:37 | is |
---|
0:04:38 | fairly straightforward |
---|
0:04:39 | um basically we compute feature values over some speech data |
---|
0:04:44 | uh |
---|
0:04:44 | corresponding to every speaker |
---|
0:04:46 | and then using these feature values compute a measure of similarity uh for all speaker pairs |
---|
0:04:53 | and |
---|
0:04:54 | then looking at these measures look at the uh |
---|
0:04:58 | speaker pairs |
---|
0:04:59 | that have the highest and lowest |
---|
0:05:01 | um |
---|
0:05:02 | values |
---|
0:05:03 | in terms of these |
---|
0:05:04 | uh similarity measures |
---|
0:05:05 | and compare performance on those speaker pairs |
---|
0:05:08 | uh to all |
---|
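The selection step described above can be sketched as follows. This is a minimal illustration rather than the speaker's actual code: the function name `select_pairs`, the dictionary layout and the toy pitch values are all assumptions, and the measure is treated as a distance (smaller means more similar).

```python
from itertools import combinations


def select_pairs(features, measure, fraction=0.01):
    """Rank all unordered speaker pairs by a distance-like measure.

    features: dict mapping speaker id -> feature value (hypothetical layout).
    measure:  function of two feature values; smaller means more similar.
    Returns (most_similar, least_similar), each holding roughly `fraction`
    of all pairs (at least one pair).
    """
    scored = sorted(
        (measure(features[a], features[b]), a, b)
        for a, b in combinations(sorted(features), 2)
    )
    k = max(1, round(fraction * len(scored)))
    most_similar = [(a, b) for _, a, b in scored[:k]]
    least_similar = [(a, b) for _, a, b in scored[-k:]]
    return most_similar, least_similar


# Toy example with made-up median pitch values in Hz.
median_pitch = {"spk1": 110.0, "spk2": 112.0, "spk3": 180.0, "spk4": 95.0}
similar, dissimilar = select_pairs(
    median_pitch, lambda x, y: abs(x - y), fraction=0.2
)
```

Trials would then be kept only for the selected pairs and system performance compared against the full trial set.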
0:05:14 | so the features |
---|
0:05:15 | we consider here uh first of all |
---|
0:05:17 | pitch |
---|
0:05:18 | statistics |
---|
0:05:19 | um mean median |
---|
0:05:21 | range and mean average slope |
---|
0:05:24 | much we |
---|
0:05:25 | you know |
---|
0:05:25 | okay |
---|
0:05:26 | um |
---|
0:05:27 | jitter and shimmer uh the relative |
---|
0:05:30 | average perturbation of jitter |
---|
0:05:32 | and a five point amplitude perturbation |
---|
0:05:34 | quotient version |
---|
0:05:39 | uh formant frequency statistics |
---|
0:05:41 | the mean and median of the first three formants |
---|
0:05:43 | um |
---|
0:05:44 | the data that we work with is |
---|
0:05:46 | eight kilohertz |
---|
0:05:47 | so um |
---|
0:05:48 | although |
---|
0:05:48 | higher formants |
---|
0:05:49 | might be useful we we didn't calculate them here |
---|
0:05:55 | uh mean and median energy |
---|
0:06:00 | uh long term average |
---|
0:06:02 | spectrum energy statistics |
---|
0:06:03 | uh including the mean standard deviation |
---|
0:06:06 | range |
---|
0:06:07 | slope |
---|
0:06:07 | and local peaks |
---|
0:06:10 | um |
---|
0:06:11 | and we did a fourteenth order lpc analysis |
---|
0:06:14 | and uh found the frequencies |
---|
0:06:17 | from |
---|
0:06:17 | the coefficient roots |
---|
0:06:19 | uh both with and without a minimum magnitude requirement which is essentially limiting the bandwidth |
---|
0:06:26 | and uh then we took all the frequencies and |
---|
0:06:28 | binned them into a |
---|
0:06:29 | histogram |
---|
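A sketch of the root-to-frequency step may help here. It assumes the roots of the LPC polynomial have already been computed for each frame (the LPC fit itself is not shown), uses the 8 kHz sampling rate mentioned in the talk, and picks a 100 Hz bin width purely for illustration. The function names are hypothetical. A root at radius r has a bandwidth of roughly -(fs/pi) ln r, which is why a minimum magnitude acts as a bandwidth limit.

```python
import cmath
import math


def pole_frequencies(roots, sample_rate=8000.0, min_magnitude=None):
    """Convert LPC polynomial roots to frequencies in Hz.

    Keeps one root from each complex-conjugate pair (positive imaginary
    part). If min_magnitude is given, roots closer to the origin, which
    correspond to wider bandwidths, are dropped.
    """
    freqs = []
    for z in roots:
        if z.imag <= 0:
            continue
        if min_magnitude is not None and abs(z) < min_magnitude:
            continue
        freqs.append(cmath.phase(z) * sample_rate / (2 * math.pi))
    return freqs


def histogram(freqs, bin_width=100.0, sample_rate=8000.0):
    """Count frequencies into fixed-width bins covering 0 Hz to Nyquist."""
    n_bins = int(sample_rate / 2 / bin_width)
    counts = [0] * n_bins
    for f in freqs:
        counts[min(int(f // bin_width), n_bins - 1)] += 1
    return counts
```

Pooling the frequencies from all frames of a speaker and calling `histogram` on them gives one histogram per speaker, which can then be compared across speakers.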
0:06:32 | and finally we have the mode and median |
---|
0:06:35 | spectral |
---|
0:06:41 | so we have |
---|
0:06:42 | all these features well what measures do we use |
---|
0:06:45 | um for the scalar features which |
---|
0:06:47 | almost all of them are |
---|
0:06:49 | uh we simply took the absolute or percent difference |
---|
0:06:54 | um also in addition to using the formant frequencies individually we looked at sums of the formant frequencies |
---|
0:07:02 | and we also looked at vectors of |
---|
0:07:04 | formant frequencies and |
---|
0:07:05 | the euclidean distance between the vectors |
---|
0:07:08 | and finally for the histograms of frequencies |
---|
0:07:11 | uh |
---|
0:07:12 | we calculated the correlation |
---|
0:07:15 | as a measure |
---|
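The measures mentioned above can be written down compactly. This is an illustrative sketch only: the talk does not say how the percent difference was normalised, so dividing by the mean of the two values is an assumption, and all function names are hypothetical.

```python
import math


def abs_difference(x, y):
    """Absolute difference between two scalar feature values."""
    return abs(x - y)


def percent_difference(x, y):
    """Absolute difference relative to the mean of the two values.

    The normalisation is an assumption; other conventions exist.
    """
    return 100.0 * abs(x - y) / ((abs(x) + abs(y)) / 2.0)


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation(h1, h2):
    """Pearson correlation of two histograms; higher means more similar.

    Undefined (division by zero) if either histogram is constant.
    """
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    return cov / math.sqrt(var1 * var2)
```

Note that the first three are distances while correlation is a similarity, so its sign has to be flipped before ranking pairs together with the others.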
0:07:16 | so there's there's |
---|
0:07:17 | two different ways you can compute a single measure for a speaker pair |
---|
0:07:21 | uh the first is |
---|
0:07:23 | to take for every speaker |
---|
0:07:24 | take all their uh feature values over the |
---|
0:07:27 | conversation sides that are available |
---|
0:07:29 | and just |
---|
0:07:30 | get an average feature value over the conversations |
---|
0:07:33 | and then compute the measure between these average values for each speaker |
---|
0:07:38 | um the other approach is to take |
---|
0:07:40 | on a conversation by conversation basis between two speaker pairs |
---|
0:07:44 | for two speakers |
---|
0:07:45 | uh compute the distance measure first and then average over the conversation pairs |
---|
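The two ways of getting a single number per speaker pair differ only in the order of averaging and measuring. A minimal sketch, with hypothetical function names and made-up values:

```python
from itertools import product
from statistics import mean


def measure_of_averages(values_a, values_b, measure):
    """Average each speaker's per-conversation values, then compare."""
    return measure(mean(values_a), mean(values_b))


def average_of_measures(values_a, values_b, measure):
    """Compare every conversation of A with every conversation of B,
    then average the resulting measures."""
    return mean(measure(x, y) for x, y in product(values_a, values_b))


# Made-up per-conversation values for two speakers.
a = [100, 120]
b = [110, 110]
first = measure_of_averages(a, b, lambda x, y: abs(x - y))
second = average_of_measures(a, b, lambda x, y: abs(x - y))
```

In this toy case the first method gives 0 and the second gives 10, which shows that the two can rank pairs quite differently when a speaker varies a lot between conversations.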
0:07:51 | and um |
---|
0:07:52 | and in the results i just present whichever method |
---|
0:07:56 | gave |
---|
0:07:56 | better |
---|
0:07:59 | larger differences |
---|
0:08:05 | so the data that we use |
---|
0:08:07 | um |
---|
0:08:08 | for the feature measure calculation and |
---|
0:08:11 | speaker pair selection |
---|
0:08:12 | uh we use the two thousand eight nist |
---|
0:08:15 | follow-up evaluation data |
---|
0:08:17 | um so this is all interview data |
---|
0:08:20 | um |
---|
0:08:21 | which is recorded on microphones we limit it to just |
---|
0:08:24 | the lavalier microphone |
---|
0:08:26 | for quality purposes |
---|
0:08:28 | and uh just uh as a side note um almost all the speakers have four conversations available |
---|
0:08:33 | um there's a handful with |
---|
0:08:35 | three or five |
---|
0:08:36 | but |
---|
0:08:37 | um |
---|
0:08:38 | because this is |
---|
0:08:39 | that's multiple conversation |
---|
0:08:41 | one |
---|
0:08:42 | and then once we have the speaker pairs selected um |
---|
0:08:46 | we evaluate performance using the uh data from the |
---|
0:08:50 | nist two thousand eight evaluation short two short three condition |
---|
0:08:54 | uh so this |
---|
0:08:55 | data varies from the |
---|
0:08:57 | uh other the follow-up data in |
---|
0:08:59 | a couple respects |
---|
0:09:01 | um |
---|
0:09:02 | in addition to |
---|
0:09:02 | possibly being an interview |
---|
0:09:04 | uh it can also be |
---|
0:09:06 | speech from a telephone conversation |
---|
0:09:09 | and in addition to having uh the lavalier microphone |
---|
0:09:12 | channel there are other microphones |
---|
0:09:14 | and as well the telephone |
---|
0:09:22 | so um |
---|
0:09:23 | available we had uh |
---|
0:09:26 | submissions that were shared by participating sites |
---|
0:09:29 | and so thank you to everyone who shared their submissions |
---|
0:09:32 | um |
---|
0:09:33 | the short two short three condition originally had i think maybe around ninety thousand trials or so um |
---|
0:09:39 | i had to remove the trials that |
---|
0:09:41 | correspond to speakers that weren't in the selection data |
---|
0:09:44 | and that leaves you with |
---|
0:09:46 | about fifty five thousand trials |
---|
0:09:49 | and then furthermore when you just |
---|
0:09:51 | sub select |
---|
0:09:51 | oh and only keep trials corresponding to some percentage of speaker pairs |
---|
0:09:56 | uh you get around four thousand or eleven thousand trials left |
---|
0:10:00 | i know |
---|
0:10:02 | and |
---|
0:10:03 | we only keep target trials for speakers who show up in one of the |
---|
0:10:11 | so how do we evaluate the system performance |
---|
0:10:14 | um |
---|
0:10:15 | there are various metrics you can use what we look at here are the uh |
---|
0:10:19 | minimum detection cost function |
---|
0:10:21 | and |
---|
0:10:21 | which of course is the |
---|
0:10:23 | a weighted |
---|
0:10:24 | sum of uh |
---|
0:10:26 | with relative weights |
---|
0:10:27 | for errors |
---|
0:10:29 | uh and this is all done with the um |
---|
0:10:31 | two thousand eight costs |
---|
0:10:32 | so it's not the low false alarm like the two thousand ten evaluation |
---|
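For readers unfamiliar with the metric, a minimal sketch of the minimum detection cost function follows. The default costs (miss cost 10, false alarm cost 1, target prior 0.01) are the parameters commonly cited for the 2008 evaluation and should be treated as an assumption here; the function names are hypothetical, and the brute-force threshold sweep is chosen for clarity rather than speed. The cost is left unnormalised.

```python
def min_dcf(target_scores, nontarget_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum detection cost over all decision thresholds.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))  # reject-everything operating point
    best = float("inf")
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, cost)
    return best


def relative_change_percent(dcf_subset, dcf_all):
    """Change in DCF on a subset of trials relative to all trials, in %."""
    return 100.0 * (dcf_subset - dcf_all) / dcf_all
```

Computing `relative_change_percent` for each system submission and averaging the results gives one number per feature measure and selection percentage.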
0:10:37 | and then since we're looking at |
---|
0:10:38 | impostor speaker pairs uh we look at |
---|
0:10:41 | the false alarm rate which of course is |
---|
0:10:43 | simply |
---|
0:10:44 | for a given decision threshold the number of false alarm errors |
---|
0:10:49 | that occur out of the |
---|
0:10:50 | total number of possible |
---|
0:10:52 | nontarget trials |
---|
0:10:56 | so |
---|
0:10:58 | for every other system submission that we have |
---|
0:11:01 | we first |
---|
0:11:01 | uh |
---|
0:11:03 | just looking at the trials for the most |
---|
0:11:05 | or least similar speaker pairs |
---|
0:11:07 | we compute the change in dcf relative to |
---|
0:11:10 | what it is for all speaker pairs |
---|
0:11:13 | and then uh |
---|
0:11:15 | take all these |
---|
0:11:16 | um |
---|
0:11:17 | system |
---|
0:11:18 | differences and average over the systems |
---|
0:11:20 | so the results are just |
---|
0:11:22 | to |
---|
0:11:23 | a typical overall try |
---|
0:11:24 | and |
---|
0:11:25 | from the system |
---|
0:11:26 | um |
---|
0:11:27 | and then with the false alarm rate we |
---|
0:11:29 | uh for all the speakers we look at a decision threshold that |
---|
0:11:33 | uh generates a false alarm rate of one |
---|
0:11:35 | percent |
---|
0:11:35 | and then at the same decision threshold um |
---|
0:11:39 | see what the false alarm rate is for the most and least |
---|
0:11:41 | similar speaker pairs |
---|
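The fixed-threshold comparison can be sketched like this. It is an illustration with hypothetical function names and made-up scores, and it assumes a trial is accepted when its score is strictly greater than the threshold.

```python
def threshold_at_fa(nontarget_scores, fa_rate=0.01):
    """Pick a threshold from the observed nontarget scores so that at most
    a fraction `fa_rate` of them lie strictly above it."""
    ranked = sorted(nontarget_scores, reverse=True)
    allowed = min(int(fa_rate * len(ranked)), len(ranked) - 1)
    return ranked[allowed]


def false_alarm_rate(nontarget_scores, threshold):
    """Fraction of nontarget trials accepted (score > threshold)."""
    return sum(s > threshold for s in nontarget_scores) / len(nontarget_scores)


# Made-up nontarget scores: all pairs, and a subset of "similar" pairs.
all_scores = [0, 1, 2, 3, 4, 5, 6, 7]
similar_scores = [6, 7, 1, 0]
thr = threshold_at_fa(all_scores, fa_rate=0.25)
fa_all = false_alarm_rate(all_scores, thr)
fa_similar = false_alarm_rate(similar_scores, thr)
```

With these toy numbers the threshold that gives a 25% false alarm rate overall yields 50% on the subset, which is the kind of increase one would hope to see for pairs that really are harder to distinguish.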
0:11:44 | and of course if we're actually taking uh if more similar speaker pairs |
---|
0:11:48 | actually corresponds to |
---|
0:11:50 | more difficult to distinguish speaker pairs |
---|
0:11:52 | then we expect these changes |
---|
0:11:54 | in the dcf |
---|
0:11:55 | and |
---|
0:11:56 | false alarm rate to be |
---|
0:12:02 | so |
---|
0:12:03 | here are the results uh when you look at |
---|
0:12:05 | one |
---|
0:12:05 | percent of speaker pairs |
---|
0:12:07 | uh in each case |
---|
0:12:08 | the |
---|
0:12:09 | top row |
---|
0:12:10 | corresponds |
---|
0:12:11 | to uh |
---|
0:12:12 | the |
---|
0:12:13 | least similar speaker pairs |
---|
0:12:15 | so |
---|
0:12:16 | performance is improving |
---|
0:12:18 | and this row is |
---|
0:12:19 | um |
---|
0:12:20 | corresponds to most |
---|
0:12:22 | similar speaker pairs so that |
---|
0:12:24 | the |
---|
0:12:24 | performance is getting worse |
---|
0:12:26 | um |
---|
0:12:27 | so we notice that we are able to find uh |
---|
0:12:30 | features and and measures that |
---|
0:12:33 | will select |
---|
0:12:34 | uh |
---|
0:12:35 | speaker pairs |
---|
0:12:36 | with the desired |
---|
0:12:37 | um |
---|
0:12:37 | tendency |
---|
0:12:39 | um |
---|
0:12:41 | if we then |
---|
0:12:42 | compare uh performance on one percent |
---|
0:12:45 | to performance on five percent you can see that it's less |
---|
0:12:49 | pronounced when you include |
---|
0:12:50 | more speaker pairs |
---|
0:12:52 | um |
---|
0:12:53 | in some cases you |
---|
0:12:55 | you have these negative |
---|
0:12:56 | or |
---|
0:12:57 | opposite trends from what you'd |
---|
0:12:58 | expect |
---|
0:13:04 | oh i i |
---|
0:13:05 | i pretty much mention all these points the only thing to note is that |
---|
0:13:08 | um |
---|
0:13:09 | changes in performance are not uniform |
---|
0:13:12 | across site submissions |
---|
0:13:13 | so that is |
---|
0:13:14 | uh |
---|
0:13:14 | one issue |
---|
0:13:20 | um okay so here's a det curve for one system um when we use |
---|
0:13:25 | uh the euclidean distance between |
---|
0:13:28 | uh vectors of the first three formants |
---|
0:13:31 | uh to select |
---|
0:13:32 | speaker pairs |
---|
0:13:33 | um |
---|
0:13:34 | the solid lines |
---|
0:13:36 | correspond to uh the most similar speaker pairs being used |
---|
0:13:40 | and the dashed lines are the least similar |
---|
0:13:43 | uh red is one percent and green is five percent |
---|
0:13:47 | and the black line is |
---|
0:13:48 | the |
---|
0:13:49 | uh case for all speaker pairs |
---|
0:13:52 | um |
---|
0:13:54 | what we notice is |
---|
0:13:57 | that |
---|
0:13:58 | uh in this particular instance |
---|
0:14:00 | um |
---|
0:14:00 | there's |
---|
0:14:02 | a bigger difference |
---|
0:14:03 | uh when looking at |
---|
0:14:04 | the |
---|
0:14:05 | uh |
---|
0:14:06 | least |
---|
0:14:06 | similar |
---|
0:14:07 | speaker pairs |
---|
0:14:08 | and the most similar speaker pairs |
---|
0:14:11 | uh are much closer to |
---|
0:14:12 | performance over all speaker pairs |
---|
0:14:15 | um |
---|
0:14:16 | although that doesn't happen all of the time it is |
---|
0:14:19 | uh |
---|
0:14:20 | certainly the general tendency |
---|
0:14:22 | uh |
---|
0:14:22 | to have this larger |
---|
0:14:23 | larger gap in this direction |
---|
0:14:29 | and |
---|
0:14:30 | here's another |
---|
0:14:31 | example that |
---|
0:14:31 | shows |
---|
0:14:33 | that it doesn't always hold |
---|
0:14:34 | um |
---|
0:14:36 | and this is |
---|
0:14:36 | uh a different system and a different feature measure that's the percent difference of median energy in this case |
---|
0:14:43 | and um |
---|
0:14:48 | you you get better separation here |
---|
0:14:50 | uh it varies in how much separation there is |
---|
0:14:53 | and |
---|
0:14:56 | across |
---|
0:14:58 | so |
---|
0:14:58 | we've been able to do some stuff but we expect we could probably do even better |
---|
0:15:03 | if we use uh more knowledge of |
---|
0:15:05 | speaker systems |
---|
0:15:06 | so |
---|
0:15:08 | um |
---|
0:15:09 | we decided to just |
---|
0:15:10 | simply use gmms |
---|
0:15:12 | since uh they show up obviously in a lot of |
---|
0:15:15 | um systems |
---|
0:15:16 | um so we adapted uh |
---|
0:15:19 | speaker specific gmms |
---|
0:15:21 | um and then calculated the |
---|
0:15:24 | uh |
---|
0:15:25 | kl divergence |
---|
0:15:26 | between them as the measure of speaker similarity |
---|
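The KL divergence between two Gaussian mixtures has no closed form, and the talk does not say which approximation was used. One common choice, shown here purely as an assumption, applies when both speaker models are adapted from the same background model by shifting the means only, so that weights and diagonal variances are shared. In that case a weighted sum of per-component Gaussian divergences gives an upper bound that is symmetric in the two speakers. The function name and argument layout are hypothetical.

```python
def kl_upper_bound(means_a, means_b, weights, variances):
    """Upper bound on the KL divergence between two mean-adapted GMMs.

    means_a, means_b: per-component mean vectors for the two speakers.
    weights:          shared mixture weights (sum to 1).
    variances:        shared per-component diagonal variances.
    """
    total = 0.0
    for mu_a, mu_b, w, var in zip(means_a, means_b, weights, variances):
        total += w * sum((x - y) ** 2 / v for x, y, v in zip(mu_a, mu_b, var))
    return 0.5 * total
```

Larger values indicate less similar speaker models, so under this sketch the pairs with the smallest values would be the candidates for being hard to distinguish.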
0:15:30 | when we do that |
---|
0:15:31 | not surprisingly we get uh |
---|
0:15:33 | better results |
---|
0:15:35 | um |
---|
0:15:36 | the previous |
---|
0:15:37 | charts all just went from negative fifty percent to fifty percent |
---|
0:15:40 | so you can see already that |
---|
0:15:42 | there are larger differences |
---|
0:15:44 | which is as i said what we would expect |
---|
0:15:49 | um |
---|
0:15:50 | here's |
---|
0:15:51 | a det curve for a system using the kl divergence |
---|
0:15:55 | um |
---|
0:15:57 | again you can see that these are larger differences from the all performance |
---|
0:16:02 | and we again |
---|
0:16:04 | uh see this asymmetry where |
---|
0:16:07 | uh |
---|
0:16:09 | a bigger gap |
---|
0:16:10 | for the dissimilar pairs than for the similar |
---|
0:16:20 | so as i mentioned uh we |
---|
0:16:23 | tend to be more successful at selecting easy to distinguish speaker pairs |
---|
0:16:27 | uh |
---|
0:16:28 | and possibly because these pairs may be easier to |
---|
0:16:31 | find |
---|
0:16:32 | um |
---|
0:16:32 | one possible explanation would be that |
---|
0:16:35 | if you have a a speaker pair |
---|
0:16:37 | that is very dissimilar |
---|
0:16:39 | um in |
---|
0:16:40 | terms of pitch or formant frequencies |
---|
0:16:42 | that |
---|
0:16:43 | you know a big difference is probably going to mean that |
---|
0:16:46 | the system is not going to |
---|
0:16:47 | confuse them |
---|
0:16:48 | um but on the flip side if you're trying to figure out what makes the speaker pair |
---|
0:16:52 | difficult um just like you know any single feature may not be enough |
---|
0:16:56 | to capture |
---|
0:16:58 | um |
---|
0:16:59 | that information |
---|
0:17:06 | so using the kl divergence measure |
---|
0:17:10 | we took a closer look at |
---|
0:17:11 | uh the speaker pairs that are selected |
---|
0:17:14 | and |
---|
0:17:15 | as being the most or least similar so in addition to looking at the one percent and five |
---|
0:17:19 | percent groups |
---|
0:17:20 | uh also looked at |
---|
0:17:21 | three percent ten percent twenty percent |
---|
0:17:24 | and um in this data there were a hundred fifty speakers overall |
---|
0:17:29 | uh |
---|
0:17:29 | leading to uh uh eighteen hundred |
---|
0:17:32 | unique speaker pairs |
---|
0:17:34 | for same sex |
---|
0:17:35 | first |
---|
0:17:38 | and uh one thing we noted |
---|
0:17:40 | is that |
---|
0:17:41 | in |
---|
0:17:42 | in the groups of the |
---|
0:17:44 | uh least similar speaker pairs |
---|
0:17:47 | if you look at a group |
---|
0:17:49 | with the larger values |
---|
0:17:50 | of the divergence |
---|
0:17:52 | um |
---|
0:17:53 | which we would expect |
---|
0:17:54 | to be easier to distinguish |
---|
0:17:56 | the majority of them are male |
---|
0:17:57 | um |
---|
0:17:58 | but |
---|
0:17:59 | if you look at any one group about seventy five percent of |
---|
0:18:01 | speaker pairs in the group will be |
---|
0:18:03 | uh male |
---|
0:18:04 | on average |
---|
0:18:10 | uh to a lesser extent we notice the opposite tendency when we look at |
---|
0:18:14 | uh |
---|
0:18:16 | more similar speaker pairs which |
---|
0:18:18 | uh somewhat tend to be more female |
---|
0:18:21 | um |
---|
0:18:21 | the |
---|
0:18:23 | one |
---|
0:18:24 | and |
---|
0:18:24 | three percent |
---|
0:18:25 | still have more male pairs |
---|
0:18:26 | but |
---|
0:18:28 | the other group |
---|
0:18:29 | have more female |
---|
0:18:34 | um |
---|
0:18:35 | this |
---|
0:18:36 | you know may be part of the reason why uh |
---|
0:18:38 | system performance |
---|
0:18:39 | is typically better |
---|
0:18:40 | on males |
---|
0:18:41 | um |
---|
0:18:42 | and it just in that you know males may may |
---|
0:18:45 | exhibit a greater range of differences |
---|
0:18:48 | between them |
---|
0:18:49 | so that |
---|
0:18:50 | there are likely to be more |
---|
0:18:51 | dissimilar |
---|
0:18:52 | uh male speakers |
---|
0:18:59 | and finally so looking at these groups |
---|
0:19:02 | um we notice that there is a tendency to find two types of |
---|
0:19:05 | speakers |
---|
0:19:06 | uh there are speakers who frequently appear as members of difficult to distinguish |
---|
0:19:11 | uh |
---|
0:19:11 | speaker pairs |
---|
0:19:12 | and speakers who occur frequently as members of |
---|
0:19:15 | easy to distinguish speakers |
---|
0:19:18 | um |
---|
0:19:20 | in fact there are fifteen speakers who never appear in the most |
---|
0:19:23 | similar group |
---|
0:19:25 | and twenty four speakers who never appear in the most |
---|
0:19:28 | dissimilar group |
---|
0:19:30 | um |
---|
0:19:32 | i forgot |
---|
0:19:32 | but i think this is |
---|
0:19:33 | these twenty four speakers there's ten male and fourteen female |
---|
0:19:40 | so this |
---|
0:19:41 | uh tends to support the idea that there are these wolves in there |
---|
0:19:45 | uh |
---|
0:19:45 | speakers who are |
---|
0:19:47 | are more difficult |
---|
0:19:48 | um |
---|
0:19:48 | or more similar to other speakers |
---|
0:19:55 | so just a summary of what i mentioned |
---|
0:19:57 | uh first of all it is possible to |
---|
0:19:59 | predict |
---|
0:20:00 | uh what speaker pairs will be difficult for a |
---|
0:20:03 | typical speaker recognition system |
---|
0:20:04 | to distinguish |
---|
0:20:06 | um |
---|
0:20:07 | for the features |
---|
0:20:08 | that we considered here like pitch |
---|
0:20:10 | formant frequencies uh the best one seemed to be uh the the euclidean distance |
---|
0:20:14 | between the first |
---|
0:20:15 | uh three formant frequencies |
---|
0:20:18 | um but the best measure overall was the more uh complex |
---|
0:20:21 | uh |
---|
0:20:22 | kl divergence measure between |
---|
0:20:24 | uh |
---|
0:20:25 | speaker |
---|
0:20:25 | specific gmms |
---|
0:20:27 | um i mentioned of course that we're typically more successful at identifying dissimilar speaker pairs |
---|
0:20:33 | and that in addition to |
---|
0:20:35 | to being able to um |
---|
0:20:37 | you know finding speaker pairs |
---|
0:20:39 | uh |
---|
0:20:40 | using these measures can provide potentially useful information |
---|
0:20:43 | about a speaker's tendency to be |
---|
0:20:45 | similar or dissimilar to others |
---|
0:20:52 | so future work |
---|
0:20:53 | um |
---|
0:20:55 | one thing to try is testing combinations of multiple feature measures |
---|
0:21:00 | as the method for selecting similar speaker pairs |
---|
0:21:03 | um i did a little bit of work on this |
---|
0:21:06 | uh where i just |
---|
0:21:07 | basically |
---|
0:21:08 | assigned a rank |
---|
0:21:09 | according to each |
---|
0:21:10 | uh feature measure and then summed the ranks |
---|
0:21:12 | over the speaker pairs and and |
---|
0:21:14 | did selection that way |
---|
0:21:16 | and and that improved |
---|
0:21:18 | results |
---|
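The rank-based combination can be sketched as below. The function name and data layout are hypothetical, every measure is assumed to be a distance (smaller means more similar), and ties are simply broken by sort order rather than given averaged ranks.

```python
def rank_sum(pair_scores_by_measure):
    """Combine several feature measures by summing per-measure ranks.

    pair_scores_by_measure: list of dicts, one per feature measure, each
    mapping a speaker pair to its distance under that measure.
    Returns a dict mapping each pair to its summed rank (lower = more
    similar overall).
    """
    totals = {}
    for scores in pair_scores_by_measure:
        ordered = sorted(scores, key=scores.get)
        for rank, pair in enumerate(ordered, start=1):
            totals[pair] = totals.get(pair, 0) + rank
    return totals


# Made-up distances for three speaker pairs under two measures.
formant_dist = {("a", "b"): 1.0, ("a", "c"): 5.0, ("b", "c"): 3.0}
pitch_diff = {("a", "b"): 0.05, ("a", "c"): 0.9, ("b", "c"): 0.1}
combined = rank_sum([formant_dist, pitch_diff])
```

Selecting the pairs with the lowest summed rank then plays the same role as selecting on a single measure, without having to put the measures on a common scale.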
0:21:19 | uh another extension is to um instead of focusing on impostor speaker pairs |
---|
0:21:24 | see if you can find uh figure out what |
---|
0:21:27 | target speakers will be difficult |
---|
0:21:28 | uh for the system to correctly recognise |
---|
0:21:32 | and one thing that um certainly needs to be investigated is |
---|
0:21:36 | uh |
---|
0:21:37 | the uh lack |
---|
0:21:38 | of consistent uh |
---|
0:21:39 | behaviour for the same set of speakers across |
---|
0:21:42 | um |
---|
0:21:43 | different |
---|
0:21:44 | uh |
---|
0:21:45 | systems |
---|
0:21:46 | um |
---|
0:21:46 | we may be able to find potential trends |
---|
0:21:49 | in behaviour across classes or types of systems |
---|
0:21:52 | of course with |
---|
0:21:53 | uh |
---|
0:21:54 | the |
---|
0:21:55 | site submissions that we used here uh |
---|
0:21:59 | almost all of the submissions are in fact |
---|
0:22:01 | fusion of multiple systems |
---|
0:22:03 | so we might need to do more of a breakdown |
---|
0:22:05 | uh to |
---|
0:22:07 | um |
---|
0:22:08 | really get out |
---|
0:22:09 | that |
---|
0:22:10 | sure |
---|
0:22:13 | okay |
---|
0:22:14 | that's all i have thank you |
---|
0:22:16 | hmmm |
---|
0:22:24 | sh |
---|
0:22:38 | thank you laura |
---|
0:22:39 | for the presentation |
---|
0:22:41 | i have a question about your |
---|
0:22:43 | formant extraction |
---|
0:22:44 | uh |
---|
0:22:45 | do you have |
---|
0:22:46 | don |
---|
0:22:47 | nation |
---|
0:22:48 | for all the vowels |
---|
0:22:50 | or |
---|
0:22:51 | did you constrain your extraction |
---|
0:22:53 | uh |
---|
0:22:54 | to the east |
---|
0:22:55 | you you you |
---|
0:22:56 | this vowel |
---|
0:22:57 | or |
---|
0:22:58 | four |
---|
0:22:59 | different |
---|
0:22:59 | the |
---|
0:23:00 | you do |
---|
0:23:01 | one type of vowel |
---|
0:23:03 | because you know |
---|
0:23:04 | it's |
---|
0:23:04 | my question is |
---|
0:23:06 | it is uh |
---|
0:23:07 | this |
---|
0:23:07 | use the vowel |
---|
0:23:09 | or you you do |
---|
0:23:11 | um extraction according to the vowel |
---|
0:23:14 | no so uh we didn't it was just over the entire file so it's definitely you could probably |
---|
0:23:19 | get much better estimates |
---|
0:23:21 | than what we actually did |
---|
0:23:23 | because |
---|
0:23:24 | the the the problem is that |
---|
0:23:26 | uh |
---|
0:23:27 | uh you have a lot of disturbance according to the vowel for now |
---|
0:23:30 | so |
---|
0:23:31 | uh but i think that |
---|
0:23:33 | uh |
---|
0:23:34 | i think that it is more the sample |
---|
0:23:36 | and |
---|
0:23:36 | no |
---|
0:23:37 | phonological information |
---|
0:23:39 | that's value |
---|
0:23:40 | yeah |
---|
0:23:40 | the speaker information |
---|
0:23:42 | so of course uh yeah of course this is |
---|
0:23:44 | convolving the the phonetic the |
---|
0:23:46 | phonetic |
---|
0:23:47 | with |
---|
0:23:47 | with the speaker |
---|
0:23:48 | okay thank you |
---|
0:24:21 | pairs of what |
---|
0:24:22 | oh |
---|
0:24:23 | extracts or just |
---|
0:24:25 | oh |
---|
0:24:26 | uh_huh |
---|
0:24:28 | oh |
---|
0:24:29 | oh |
---|
0:24:31 | oh |
---|
0:24:32 | oh |
---|
0:24:33 | sure |
---|
0:24:54 | sure |
---|
0:24:56 | sure |
---|
0:24:59 | thanks |
---|
0:25:04 | hmmm |
---|
0:25:06 | no |
---|
0:25:08 | we will |
---|
0:25:08 | we should |
---|
0:25:09 | uh |
---|
0:25:11 | four |
---|
0:25:12 | the |
---|
0:25:12 | the |
---|
0:25:13 | the |
---|
0:25:14 | yeah |
---|
0:25:16 | uh |
---|
0:25:17 | you |
---|
0:25:19 | uh |
---|
0:25:19 | uh |
---|
0:25:22 | uh |
---|
0:25:24 | right |
---|
0:25:25 | uh |
---|
0:25:25 | good |
---|
0:25:27 | oh |
---|
0:25:28 | i'm sorry |
---|
0:25:30 | are you talking about |
---|
0:25:32 | i |
---|
0:25:33 | figure |
---|
0:25:33 | paper |
---|
0:25:34 | and i think you mean |
---|
0:25:36 | uh |
---|
0:25:38 | oh |
---|
0:25:40 | cool |
---|
0:25:42 | yeah |
---|
0:25:44 | yeah i know but it was definitely the case but it was |
---|
0:25:48 | oh |
---|
0:25:49 | right |
---|
0:25:52 | hmmm |
---|
0:25:53 | hmmm |
---|
0:25:58 | oh |
---|