0:00:14 | hello everyone my name is [name unclear] |
---|
0:00:18 | i work at the nfi and i will talk to you about |
---|
0:00:22 | automatic speaker recognition in forensic voice comparison |
---|
0:00:27 | as such i'm a user of automatic speaker recognition technology and not a developer |
---|
0:00:32 | which will give me a unique perspective which i hope will be insightful for you |
---|
0:00:37 | and this is a study into representativeness which is a concept that's really important |
---|
0:00:43 | in doing actual cases in forensic voice comparison |
---|
0:00:47 | my two co-authors are |
---|
0:00:49 | the heroes who actually develop |
---|
0:00:51 | automatic speaker recognition systems they work at oxford wave research and their |
---|
0:00:56 | system was used for this study |
---|
0:01:01 | but i'm just the humble user and i will talk about automatic speaker recognition from |
---|
0:01:05 | that perspective |
---|
0:01:07 | so in forensic voice comparison you will typically have an offender recording from the police and |
---|
0:01:11 | somebody did something bad in this recording |
---|
0:01:14 | and the identity of the speaker is unknown and |
---|
0:01:18 | there will be a suspect of whom the police think okay this guy must be |
---|
0:01:21 | the same as the offender so there is a suspect recording and |
---|
0:01:25 | the recordings can come from anywhere |
---|
0:01:28 | the important part is we get two recordings |
---|
0:01:31 | one of them has a contested speaker identity and the other one that's just the |
---|
0:01:35 | suspect nobody disputes that |
---|
0:01:38 | and the question is always hey are these guys |
---|
0:01:40 | the same person are these people the same person |
---|
0:01:43 | of course we translate this into hypotheses so |
---|
0:01:48 | we work in a bayesian framework |
---|
0:01:53 | but what it all boils down to is is it the same guy or not and when |
---|
0:01:57 | you use automatic speaker recognition you |
---|
0:01:59 | chuck the recordings into your system you give it some user-submitted |
---|
0:02:04 | data and a reference normalization cohort |
---|
0:02:06 | and it will give you a score |
---|
0:02:09 | and this score |
---|
0:02:11 | sort of exists in the void |
---|
0:02:13 | there's no way of |
---|
0:02:15 | telling what a score means it could be seventeen and nobody knows how to judge that so |
---|
0:02:19 | you need a relevant population so you look at your |
---|
0:02:23 | potential relevant population recordings of known speaker identity |
---|
0:02:30 | and you check my case recordings are of blue guys so my relevant |
---|
0:02:35 | population is blue people and i compare those blue people |
---|
0:02:40 | in the same manner as i did in the case |
---|
0:02:43 | and that gives same-speaker scores and different-speaker scores which |
---|
0:02:47 | can be made into distributions and then i can bring back the case score |
---|
0:02:52 | and here in this example i can see i have an lr of about four |
---|
0:02:57 | because the intersections with the green line and the orange line have a ratio of four |
---|
0:03:03 | and this four that's a likelihood ratio and |
---|
0:03:07 | now we don't have a meaningless score anymore we have a meaningful number a likelihood |
---|
0:03:12 | ratio this is |
---|
0:03:13 | an expression of the weight of evidence |
---|
0:03:15 | it can actually be used in casework or in court |
---|
0:03:19 | the judge can |
---|
0:03:20 | weigh this in his or her decision |
---|
0:03:24 | about the case as a whole |
---|
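The step from score to likelihood ratio described above can be sketched in a few lines. This is only an illustration, assuming Gaussian fits to the same-speaker and different-speaker score distributions; the talk does not say how the distributions on the slide were modelled, and the score values and the function name `likelihood_ratio` are made up for the example.

```python
from statistics import NormalDist


def likelihood_ratio(case_score, same_speaker_scores, different_speaker_scores):
    """Ratio of the two fitted score densities evaluated at the case score."""
    # Assumption: each score distribution is modelled as a single Gaussian.
    same_speaker_model = NormalDist.from_samples(same_speaker_scores)
    different_speaker_model = NormalDist.from_samples(different_speaker_scores)
    return same_speaker_model.pdf(case_score) / different_speaker_model.pdf(case_score)


# Hypothetical scores from comparisons within the relevant population.
same_speaker_scores = [18.0, 20.0, 22.0, 24.0, 26.0]
different_speaker_scores = [8.0, 10.0, 12.0, 14.0, 16.0]

# A case score of 17 means nothing on its own; the LR gives it a meaning.
lr = likelihood_ratio(17.0, same_speaker_scores, different_speaker_scores)
print(f"LR = {lr:.2f}")
```

With these toy numbers a case score of 17 lies exactly halfway between two equally wide distributions, so the LR comes out at 1 (up to floating-point rounding); a score closer to the same-speaker distribution gives an LR above 1.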
0:03:26 | okay let's backtrack |
---|
0:03:28 | there was this choice of potential relevant population and i said okay let's look |
---|
0:03:33 | at the colour of the guys |
---|
0:03:36 | but reality is a bit more complex than just colours maybe i should |
---|
0:03:40 | have checked whether they were wearing sunglasses and then i would get |
---|
0:03:44 | another relevant population or maybe i should have checked |
---|
0:03:48 | whether they had hats on or maybe the combination of these two |
---|
0:03:52 | and |
---|
0:03:53 | there's the earlier |
---|
0:03:55 | results i got |
---|
0:03:57 | but when checking for hats it might be that the distributions are shifted |
---|
0:04:01 | and the actual resulting lr would be way lower than i had before or |
---|
0:04:05 | had i checked for sunglasses it might be just the other way |
---|
0:04:10 | and okay on the other hand if i would've |
---|
0:04:13 | checked for every single piece of metadata i could think of |
---|
0:04:17 | colour hats and sunglasses i would probably not have sufficient data to even do |
---|
0:04:21 | this |
---|
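The data-sparsity point can be made concrete with a toy selection. This sketch assumes a hypothetical list of recordings with three metadata fields (colour, hat, sunglasses, following the analogy in the talk); the helper `select_population` is invented for the illustration.

```python
# Hypothetical pool of recordings with known speaker identity and metadata.
recordings = [
    {"colour": "blue", "hat": True, "sunglasses": False},
    {"colour": "blue", "hat": False, "sunglasses": False},
    {"colour": "blue", "hat": True, "sunglasses": True},
    {"colour": "red", "hat": True, "sunglasses": False},
    {"colour": "blue", "hat": False, "sunglasses": True},
    {"colour": "red", "hat": False, "sunglasses": False},
]


def select_population(pool, **conditions):
    """Keep only the recordings that match every requested metadata value."""
    return [r for r in pool if all(r[key] == value for key, value in conditions.items())]


# Each extra condition shrinks the candidate relevant population.
print(len(select_population(recordings, colour="blue")))
print(len(select_population(recordings, colour="blue", hat=True)))
print(len(select_population(recordings, colour="blue", hat=True, sunglasses=True)))
```

Every added condition makes the selection more specific to the case but leaves fewer recordings to estimate the score distributions from, which is the trade-off described above.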
0:04:23 | so you can see this has a major impact on the result of the case |
---|
0:04:26 | and |
---|
0:04:29 | this is a |
---|
0:04:30 | a real problem in forensic voice comparison because when i was talking about hats when |
---|
0:04:34 | i was talking about |
---|
0:04:37 | sunglasses i actually meant of course conditions |
---|
0:04:41 | case recording conditions and that's an enormous list and i've just given you some of them |
---|
0:04:46 | the number of these when you think of it |
---|
0:04:48 | could be close to infinity |
---|
0:04:52 | so that's a real problem you don't really know what to select for and even |
---|
0:04:56 | within |
---|
0:04:58 | this list |
---|
0:04:59 | look at broad recording type |
---|
0:05:01 | in there there's multiple categories and within those categories there's |
---|
0:05:06 | even smaller |
---|
0:05:07 | elements to look for and it's just not clear |
---|
0:05:11 | could some of these things safely be ignored because they have |
---|
0:05:15 | no impact on the lrs |
---|
0:05:17 | at all |
---|
0:05:18 | or they may be really crucial and then it's really important you don't wanna |
---|
0:05:22 | forget it because then you get this wrong |
---|
0:05:25 | likelihood ratio that could |
---|
0:05:27 | potentially lead to a miscarriage of justice |
---|
0:05:31 | so in order to do research into this relevant population problem we collected a database it's |
---|
0:05:36 | called nfi-frida |
---|
0:05:38 | forensically realistic inter-device audio it's got two hundred and fifty male speakers |
---|
0:05:43 | and the other characteristics here are just those of the target population of forensic voice comparison in |
---|
0:05:48 | the netherlands basically and their speech was recorded on multiple devices simultaneously so every utterance |
---|
0:05:55 | of speech is recorded in different ways |
---|
0:05:58 | and i have an example of this |
---|
0:06:03 | and the guy there is sitting and he's talking on the phone with someone who is |
---|
0:06:07 | not a participant |
---|
0:06:08 | and |
---|
0:06:09 | he is wearing a headset |
---|
0:06:12 | [unintelligible] |
---|
0:06:18 | and there was a |
---|
0:06:19 | [unintelligible] |
---|
0:06:26 | and i guess to |
---|
0:06:34 | and there's the microphone on the other side of the room |
---|
0:06:43 | and the police kindly provided actual intercepts of the telephone |
---|
0:06:50 | and however |
---|
0:06:53 | and this is a still of a video by iphone which is |
---|
0:06:58 | the sixth recording |
---|
0:07:00 | so this is a list of the recording devices and |
---|
0:07:04 | it says there inside only for two or three microphones |
---|
0:07:09 | and i will explain this right now |
---|
0:07:11 | so |
---|
0:07:12 | every participant had two days of recording and every day had eight recording sessions |
---|
0:07:17 | four of them are inside four of them are outside |
---|
0:07:21 | both inside and outside were divided into silent background and noisy |
---|
0:07:25 | background and for inside that meant just no sound or |
---|
0:07:29 | a white noise radio |
---|
0:07:31 | making noise for the noisy background and outside |
---|
0:07:34 | the actual location differed so there was a sort of silent place |
---|
0:07:39 | and there is a busy place right in central [place unclear] as you can see |
---|
0:07:45 | and then there was the other variation where the actual telephone they used was an android |
---|
0:07:49 | or an iphone and this made eight conversations per day |
---|
0:07:53 | and there's two days of those and |
---|
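The session design described here is a small full-factorial grid, which can be enumerated as a sanity check. The labels below are paraphrased from the talk (reading the garbled phone names as android and iphone is an assumption), and the variable names are invented for the sketch.

```python
from itertools import product

# Factors as described in the talk; labels are paraphrased.
locations = ["inside", "outside"]
backgrounds = ["silent", "noisy"]
phones = ["android", "iphone"]
days = [1, 2]

# One conversation per combination of location, background and phone.
sessions_per_day = list(product(locations, backgrounds, phones))
all_sessions = [(day, *session) for day in days for session in sessions_per_day]

minutes_per_conversation = 5
total_minutes = len(all_sessions) * minutes_per_conversation

print(len(sessions_per_day), "sessions per day")
print(len(all_sessions), "sessions per speaker")
print(total_minutes, "minutes of conversation per speaker")
```

Two locations times two backgrounds times two phones gives the eight sessions per day, and sixteen five-minute conversations add up to 80 minutes, which is consistent with the roughly one hour twenty minutes per speaker mentioned in the talk.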
0:07:57 | the conversations are five minutes of spontaneous telephone speech and |
---|
0:08:01 | we actually transcribed half of it the iphone recordings which helped us edit the recordings |
---|
0:08:07 | because there is speech nonspeech information available |
---|
0:08:09 | and if you look at the numbers |
---|
0:08:12 | you can see every speaker has about one hour twenty minutes of speech duration and the |
---|
0:08:16 | total recording duration is longer because for every |
---|
0:08:20 | speech utterance there is more than one recording of course |
---|
0:08:23 | so what's good about it is of course that it's forensically relevant in the speaker demographics |
---|
0:08:28 | like i said but the really cool part is the simultaneous recordings and this |
---|
0:08:35 | makes studying the influence of recording device possible more specifically the relevance of data |
---|
0:08:43 | that's recorded by different recording devices |
---|
0:08:46 | the system we used for this study is vocalise by oxford wave research |
---|
0:08:51 | it's an x-vector system and |
---|
0:08:54 | it has a really cool feature where you can visualise the speakers you can see in |
---|
0:08:58 | the bottom right that's the x-vectors brought down to three dimensions |
---|
0:09:04 | you also have the option to use earlier generations i-vector gmm you have the option to |
---|
0:09:10 | not use mfccs but use auto-phonetic features |
---|
0:09:16 | and they have a speciality other than the normal stuff so there's reference normalization which is |
---|
0:09:21 | very standard but you can also submit data |
---|
0:09:25 | for the plda to adapt it better to the case conditions |
---|
0:09:31 | so we took one hundred and thirty five speakers from frida and all |
---|
0:09:35 | the recordings were edited to net speech and cut to forty seconds |
---|
0:09:40 | and they were divided into two groups there's test data and there's the reference normalization |
---|
0:09:46 | cohort and we also did experiments without reference normalization cohort |
---|
0:09:51 | and i should say for every speaker every day there's five recordings for the |
---|
0:09:55 | first five devices the smartphone video was not included but the other five are there |
---|
0:10:01 | so |
---|
0:10:02 | you chuck that into vocalise and this is what you get |
---|
0:10:07 | and these are convex hull equal error rates and as you |
---|
0:10:13 | can see when you do a matching experiment you get quite good numbers this does |
---|
0:10:17 | not compare to i-vector performance at all |
---|
0:10:20 | it's really better you can actually see that the first three devices the high quality |
---|
0:10:26 | close microphones perform |
---|
0:10:28 | pretty well against each other even when mismatched |
---|
0:10:31 | this is not |
---|
0:10:33 | quite true for the other two devices if you mismatch good |
---|
0:10:36 | recordings with the far microphone or the telephone intercept you really start to notice |
---|
0:10:41 | and of course if you take the two poorer recording types and compare those |
---|
0:10:46 | in a mismatch that's gonna be the |
---|
0:10:49 | one that's performing the worst |
---|
0:10:51 | note if i do this with an i-vector system |
---|
0:10:54 | this four five ish equal error rate is actually what you get |
---|
0:11:00 | all over the place and the one that's highlighted now would be probably ten percent |
---|
0:11:04 | equal error rate |
---|
0:11:05 | also note that reference normalization actually helps |
---|
0:11:08 | so the equal error rate with reference normalization is |
---|
0:11:11 | almost everywhere a bit lower than the one without |
---|
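For readers less familiar with the metric, an equal error rate can be approximated with a simple threshold sweep. This is a sketch of the plain empirical EER, not the convex hull variant mentioned in the talk, and the scores and the function name `equal_error_rate` are made up for the illustration.

```python
def equal_error_rate(same_speaker_scores, different_speaker_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(same_speaker_scores) | set(different_speaker_scores))
    best_gap, eer = None, None
    for threshold in thresholds:
        # Miss: a same-speaker comparison scoring below the threshold.
        miss_rate = sum(s < threshold for s in same_speaker_scores) / len(same_speaker_scores)
        # False alarm: a different-speaker comparison scoring at or above it.
        false_alarm_rate = sum(s >= threshold for s in different_speaker_scores) / len(
            different_speaker_scores
        )
        gap = abs(miss_rate - false_alarm_rate)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss_rate + false_alarm_rate) / 2
    return eer


# Hypothetical scores with some overlap between the two classes.
same_speaker_scores = [3, 5, 6, 7, 8]
different_speaker_scores = [1, 2, 3, 4, 6]
print(equal_error_rate(same_speaker_scores, different_speaker_scores))
```

With these toy scores the miss rate and the false alarm rate are both 0.2 at a threshold of 5, so the sketch reports an EER of 20 percent. The matched and mismatched device conditions in the talk each get their own EER in this way.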
0:11:15 | so back to the original question can we do research that finds out whether something |
---|
0:11:20 | is crucial or whether things can be ignored when selecting relevant population data |
---|
0:11:25 | so we took the one hundred thirty five frida speakers and we divided them |
---|
0:11:30 | into mock cases so |
---|
0:11:31 | those would be the actual case recordings |
---|
0:11:34 | and background data we did this in three ways and the results are pooled over |
---|
0:11:38 | the runs |
---|
0:11:40 | when you compare device one for those mock cases it would make sense to use |
---|
0:11:45 | device one as background data because then you have matching background data matching |
---|
0:11:49 | relevant population data |
---|
0:11:51 | you could also use the other devices and then you would have mismatch in background |
---|
0:11:55 | data |
---|
0:11:56 | so we did this for every device type in the mock cases and every device |
---|
0:12:00 | type in the background data the relevant population data that gave you twenty five |
---|
0:12:05 | sets of lrs which are represented as cllrs and if you look at |
---|
0:12:11 | this table the diagonal means that we |
---|
0:12:14 | used the right relevant population data so if the mock case type was device one so |
---|
0:12:19 | the headset we used the headset recordings |
---|
0:12:22 | for the relevant population and so on |
---|
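Each of the twenty five cells in the table summarises one set of LRs as a single Cllr value. A minimal sketch of the standard log-likelihood-ratio cost is given below; the LR values are made up, and the function name `cllr` is chosen for the example rather than taken from any tool used in the study.

```python
import math


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost: lower is better, 1.0 means uninformative."""
    # Same-speaker comparisons are penalised for low LRs.
    same_speaker_cost = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(
        same_speaker_lrs
    )
    # Different-speaker comparisons are penalised for high LRs.
    different_speaker_cost = sum(math.log2(1 + lr) for lr in different_speaker_lrs) / len(
        different_speaker_lrs
    )
    return 0.5 * (same_speaker_cost + different_speaker_cost)


# A system that always answers LR = 1 carries no information.
print(cllr([1.0, 1.0], [1.0, 1.0]))

# Hypothetical well-behaved LRs: high for same speaker, low for different speaker.
print(cllr([100.0, 50.0], [0.01, 0.02]))
```

A Cllr of 1 corresponds to a system that always reports an LR of 1, so the cells of interest are the ones well below that. Comparing a diagonal cell (matching relevant population data) with the other cells in its row shows how much a mismatched relevant population costs.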
0:12:26 | if you look at the first row here it's basically an evaluation of what you should do when |
---|
0:12:30 | you have a case |
---|
0:12:32 | in device one |
---|
0:12:33 | it may make sense to use device one but you can also see that |
---|
0:12:36 | devices two and three are just as good |
---|
0:12:39 | device four is a bit worse that's the far one with a little reverberation in it and |
---|
0:12:44 | the telephone is definitely bad |
---|
0:12:47 | that gives you a huge penalty in performance |
---|
0:12:50 | and the same holds for the other two close microphones |
---|
0:12:56 | except maybe device three but they seem to be quite interchangeable and device four is |
---|
0:13:01 | in between and device five is just out of the question that gives you a |
---|
0:13:05 | penalty for performance |
---|
0:13:07 | if you look at |
---|
0:13:09 | the telephone recordings again |
---|
0:13:11 | they better be represented by themselves as you can see in the numbers the lowest |
---|
0:13:16 | cllr is for the matching background data and there's nothing that really comes close |
---|
0:13:21 | so graphically represented |
---|
0:13:23 | that means the three high quality microphones can sort of represent each other |
---|
0:13:29 | as can be seen by the arrows and |
---|
0:13:32 | the far microphone which is still a direct microphone but it's far away and the |
---|
0:13:37 | recording is reverberant |
---|
0:13:38 | that's |
---|
0:13:39 | an intermediate one and the telephone intercept definitely don't |
---|
0:13:42 | use it |
---|
0:13:43 | to represent |
---|
0:13:45 | the microphones or vice versa |
---|
0:13:49 | so that's an answer to the question |
---|
0:13:52 | for broad recording type you cannot gloss over the difference between a telephone and a direct |
---|
0:13:57 | microphone that's definitely a crucial difference there but within direct microphones the brand or type |
---|
0:14:03 | of microphone is really not so important that's what these results seem |
---|
0:14:09 | to suggest |
---|
0:14:11 | so these results were not very surprising to me |
---|
0:14:16 | but it shows you the type of research that is of |
---|
0:14:18 | big interest to the users of automatic speaker recognition because it gives you a |
---|
0:14:24 | guideline |
---|
0:14:25 | on how to choose relevant population data and it gives you basically a guideline how |
---|
0:14:30 | to use asr properly |
---|
0:14:33 | making it available for and the |
---|