0:00:14 Hello everyone, my name is David van der Vloed.
0:00:18 I work at the NFI, and I will talk to you about
0:00:22 automatic speaker recognition in forensic voice comparison.
0:00:27 As such, I'm a user of automatic speaker recognition technology, not a developer,
0:00:32 which will give me a unique perspective, which I hope will be insightful for you.
0:00:37 And this is a study into representativeness, which is a concept that's really important
0:00:43 in doing actual cases in forensic voice comparison.
0:00:47 My two co-authors here,
0:00:49 they are the heroes who actually develop
0:00:51 automatic speaker recognition systems. They work at Oxford Wave Research, and they have their
0:00:56 system that was used for this study.
0:01:01 But I'm just the humble user, and I will talk about automatic speaker recognition from
0:01:05 that perspective.
0:01:07 So, in forensic voice comparison you will typically have an offender recording from the police, and
0:01:11 somebody did something bad in this recording,
0:01:14 and the identity of the speaker is unknown. And
0:01:18 there will be a suspect, of whom they think: okay, this guy must be
0:01:21 the same as the offender. So there is the suspect recording, and
0:01:25 the recordings can come from anywhere.
0:01:28 The important part is that we get two recordings:
0:01:31 one of them has a contested speaker identity, and the other one, that's just the
0:01:35 suspect, nobody disputes that.
0:01:38 And the question is always: hey, are these guys
0:01:40 the same person? Are these people the same person?
0:01:43 Of course we translate this into hypotheses, so
0:01:48 we work in a Bayesian framework,
0:01:53 but what it all boils down to is: is it the same guy or not? And when
0:01:57 you use automatic speaker recognition, well, you
0:01:59 chuck the recordings into your system, you give it some user-submitted
0:02:04 data and a reference normalization cohort,
0:02:06 [unclear] and it gives you a score.
0:02:09 And this score
0:02:11 sort of exists in a void.
0:02:13 There's no way of
0:02:15 telling what a score means. It could be seventeen, and nobody knows how to interpret that. So
0:02:19 you need a relevant population. So you look at your
0:02:23 potential relevant population: recordings of known speaker identity.
0:02:30 And you check: my case recordings are the blue guys, so my relevant
0:02:35 population is blue people, and I compare those blue people
0:02:40 in the same manner as I did in the case.
0:02:43 And that yields same-speaker scores and different-speaker scores, which
0:02:47 can be made into distributions, and then I can bring back my case score.
0:02:52 And here in this example I can see I have an LR of about four,
0:02:57 because the intersections with the green line and the orange line have a ratio of four.
0:03:03 And this four, that's a likelihood ratio, and
0:03:07 now we don't have a meaningless score anymore, we have a meaningful number, a likelihood
0:03:12 ratio. This is
0:03:13 an expression of the weight of evidence.
0:03:15 It can actually be used in casework or in court.
0:03:19 The judge can
0:03:20 weigh this in his decision, or her decision,
0:03:24 about the case as a whole.
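The score-to-likelihood-ratio step described here can be sketched roughly as follows. This is only an illustration under stated assumptions, not the calibration method of the system used in the talk: the function name `score_to_lr` is hypothetical, the scores are made up, and each score distribution is modelled as a single Gaussian, which real casework would not necessarily do.

```python
from statistics import NormalDist


def score_to_lr(score, same_speaker_scores, different_speaker_scores):
    """Turn a raw comparison score into a likelihood ratio.

    Assumption: each set of relevant-population scores is modelled
    as one Gaussian. The LR is the ratio of the two density heights
    at the case score.
    """
    same = NormalDist.from_samples(same_speaker_scores)
    diff = NormalDist.from_samples(different_speaker_scores)
    return same.pdf(score) / diff.pdf(score)


# Made-up relevant-population scores, for illustration only.
same_scores = [18.0, 20.0, 22.0]
diff_scores = [8.0, 10.0, 12.0]

# A case score of 17 means nothing by itself; against these
# distributions it becomes a likelihood ratio.
print(score_to_lr(17.0, same_scores, diff_scores))
```

Swapping in a different relevant population changes both fitted distributions, and therefore the likelihood ratio for the very same case score.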
0:03:26 Okay, let's backtrack.
0:03:28 There was this choice of potential relevant population, and I said, okay, let's look
0:03:33 at the colour of the guys.
0:03:36 But reality is a bit more complex than just colours. Maybe I should
0:03:40 have checked whether they were wearing sunglasses, and you would get
0:03:44 another relevant population. Or maybe I should have checked
0:03:48 whether they had hats on, or maybe the combination of these two.
0:03:52 And
0:03:53 there's the earlier
0:03:55 result I got,
0:03:57 but when checking for hats, it might be that the distributions are shifted
0:04:01 and the actual resulting LR will be way lower than what I had before. Or if I
0:04:05 had checked for sunglasses, it might have shifted the other way.
0:04:10 And, okay, another option: if I would have
0:04:13 checked for every single piece of metadata I could think of,
0:04:17 colour, hats and sunglasses, I would probably not have sufficient data to even do
0:04:21 this.
0:04:23 So you can see this has a major impact on the result of the case.
0:04:26 And
0:04:29 this is a
0:04:30 real problem in forensic voice comparison, because when I was talking about hats, when
0:04:34 I was talking about
0:04:37 sunglasses, I actually meant, of course, conditions:
0:04:41 case recording conditions. And that's an enormous list, and I'm just giving you some of them.
0:04:46 The number of them, when you think of it,
0:04:48 could be close to infinity.
0:04:52 So that's a real problem: you don't really know what to select for. And even
0:04:56 within
0:04:58 this list,
0:04:59 look at broad recording type there:
0:05:01 in there, there are multiple categories, and within those categories there are
0:05:06 even smaller
0:05:07 elements to look for. And it's just not clear:
0:05:11 can some of these things be safely ignored because they have
0:05:15 no impact on the LR
0:05:17 at all,
0:05:18 or are they maybe really crucial? And then it's really important that you don't
0:05:22 forget it, because then you get this wrong
0:05:25 likelihood ratio that could
0:05:27 potentially lead to a miscarriage of justice.
0:05:31 So in order to do research into this relevant population problem, we collected a database. It's
0:05:36 called NFI-FRIDA:
0:05:38 Forensically Realistic Inter-Device Audio. It's got two hundred and fifty male speakers,
0:05:43 and the other characteristics here are just the target audience of forensic voice comparison in
0:05:48 the Netherlands, basically. And their speech was recorded on multiple devices simultaneously, so every utterance
0:05:55 of speech is recorded in different ways.
0:05:58 And I have an example of this.
0:06:03 And the guy there is sitting, and he's talking on the phone with somebody who is
0:06:07 not a participant,
0:06:08 and
0:06:09 he's wearing a headset.
0:06:12 a text-dependent i for the subset of the testing for
0:06:18 and there was a
0:06:19 no improvement due to stupidity i financed data suggesting for
0:06:26 and i guess to
0:06:34 And there's the microphone on the other side of the room.
0:06:43 And the police kindly provided actual intercepts of the telephone.
0:06:50 and however
0:06:53 And this is a still of a video by iPhone, which is
0:06:58 the sixth recording.
0:07:00 So this is a list of the recording devices, and
0:07:04 it says there "inside only" for the two [unclear] microphones,
0:07:09 and I will explain this right now.
0:07:11 So,
0:07:12 every participant had two days of recording, and every day had eight recording sessions:
0:07:17 four of them are inside, four of them are outside.
0:07:21 All those inside and outside sessions were divided into silent background and noisy
0:07:25 background. And for inside that meant just no sound, or
0:07:29 a white-noise radio
0:07:31 making noise. For the noisy background outside,
0:07:34 the actual location differed, so there was a sort of silent place
0:07:39 and there was a busy place, [unclear], as you can see.
0:07:45 And then there was the other variation, where the actual telephone used was [unclear]
0:07:49 or an iPhone, and this made eight conversations per day,
0:07:53 and there are two days of those. And
0:07:57 the conversations are five minutes of spontaneous telephone speech, and
0:08:01 we actually transcribed half of it, the iPhone recordings, which helped us edit the recordings,
0:08:07 because then the speech/non-speech information is available.
0:08:09 And look at the numbers:
0:08:12 you can see that per speaker there is about one hour twenty minutes of speech duration.
0:08:16 The total recording duration is longer, because every
0:08:20 speech utterance is recorded multiple times, of course.
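The session design described above can be enumerated in a few lines. This is a sketch of the arithmetic only: the labels are illustrative, and `"other phone"` is a placeholder because only the iPhone is named clearly in the recording.

```python
from itertools import product

# Factors of the recording design as described in the talk.
DAYS = (1, 2)
LOCATIONS = ("inside", "outside")
BACKGROUNDS = ("silent", "noisy")
PHONES = ("other phone", "iPhone")  # first label is a placeholder
MINUTES_PER_CONVERSATION = 5

# Every combination of the four factors is one conversation.
sessions = list(product(DAYS, LOCATIONS, BACKGROUNDS, PHONES))
sessions_per_day = len(sessions) // len(DAYS)
total_minutes = len(sessions) * MINUTES_PER_CONVERSATION

print(sessions_per_day, len(sessions), total_minutes)
```

Two locations, two backgrounds and two phones give eight conversations per day. Sixteen conversations of five minutes come to eighty minutes per speaker, which seems in line with the "about one hour twenty minutes" mentioned.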
0:08:23 So why FRIDA? It is, of course, forensically relevant in its speaker demographics,
0:08:28 like I said, but the really cool part is the simultaneous recordings, and this
0:08:35 makes studying the influence of recording device possible, more specifically the relevance of data
0:08:43 that's recorded by different recording devices.
0:08:46 The system we used for this study is VOCALISE by Oxford Wave Research.
0:08:51 It's an x-vector system, and
0:08:54 a really cool feature is that you can visualise the speakers: you can see in
0:08:58 the bottom right the x-vectors brought down to three dimensions.
0:09:04 You also have the option to use earlier generations, i-vector or GMM-UBM, and you have the option to
0:09:10 not use MFCCs but use auto-phonetic features.
0:09:16 And they have some specialities other than the normal stuff. So there's reference normalization, which is
0:09:21 very standard, but you can also submit data
0:09:25 to adapt the PLDA better to the case conditions.
0:09:31 So we took one hundred and thirty-five speakers from FRIDA. All
0:09:35 the recordings were edited to net speech and cut to forty seconds,
0:09:40 and they were divided into two groups: there's test data and there's the reference normalization
0:09:46 cohort. And we also did experiments without a reference normalization cohort.
0:09:51 And I should say, for every speaker, for every day, there are five recordings, for the
0:09:55 first five devices. The smartphone video was not included, but the other five are there.
0:10:01 So
0:10:02 you throw that into VOCALISE, and this is what you get.
0:10:07 And these are convex hull equal error rates, and as you
0:10:13 can see, when you do a matching experiment you get quite good numbers. This does
0:10:17 not compare to i-vector performance at all;
0:10:20 it's really better. You can actually see that the first three devices, the high-quality
0:10:26 close microphones, perform
0:10:28 pretty well against each other, even when mismatched.
0:10:31 This is not
0:10:33 quite true for the other two devices. If you mismatch good
0:10:36 recordings with the far microphone or the telephone intercept, you really start to notice.
0:10:41 And of course if you take the two poorer recording types and compare those
0:10:46 in a mismatch, that's going to be the
0:10:49 one that's performing the worst.
0:10:51 Note: if I did this with an i-vector system,
0:10:54 this four-or-five-ish equal error rate is actually what you would get
0:11:00 all over the place, and the one that's highlighted now would probably be ten percent
0:11:04 equal error rate.
0:11:05 Also note that reference normalization actually helps,
0:11:08 so the lower equal error rate is
0:11:11 almost everywhere a bit lower than the upper one.
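For readers less familiar with the metric, an equal error rate can be sketched as below. Note the assumption: this is the plain empirical EER found by sweeping a threshold over the observed scores, not the convex hull variant reported in the talk. The function name and the scores are illustrative.

```python
def equal_error_rate(same_speaker_scores, different_speaker_scores):
    """Plain empirical EER (not the convex hull variant).

    Sweeps a decision threshold over all observed scores and returns
    the error rate where miss rate and false-alarm rate are closest.
    """
    thresholds = sorted(set(same_speaker_scores) | set(different_speaker_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a same-speaker pair scoring below the threshold.
        miss = sum(s < t for s in same_speaker_scores) / len(same_speaker_scores)
        # False alarm: a different-speaker pair scoring at or above it.
        false_alarm = sum(d >= t for d in different_speaker_scores) / len(
            different_speaker_scores
        )
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Made-up scores with some overlap between the two groups.
print(equal_error_rate([1, 3, 4, 5], [0, 1.5, 2, 3.5]))
```

One such number would be computed per cell of the table, once for each pairing of device types.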
0:11:15 So back to the original question: can we do research that finds out whether something
0:11:20 is crucial or whether some things can be ignored when selecting relevant population data?
0:11:25 So we took the one hundred thirty-five FRIDA speakers and we divided them
0:11:30 into mock cases,
0:11:31 those were the actual case recordings,
0:11:34 and background data. We did this in three ways, and the results are pooled over
0:11:38 the groups.
0:11:40 When you compare device one in those mock cases, it would make sense to use
0:11:45 device one as background data, because then you have matching background data, matching
0:11:49 relevant population data.
0:11:51 You could also use the other devices, and then you would have mismatching background
0:11:55 data.
0:11:56 So we did this for every device type in the mock cases and every device
0:12:00 type in the background data, the relevant population data. That made twenty-five
0:12:05 sets of LRs, which are represented as Cllr. And if you look at
0:12:11 this table, the diagonal means that we
0:12:14 used the right relevant population data: so the mock case type was device one,
0:12:19 the headset, so we took the headset recordings
0:12:22 for the relevant population data, and so on.
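Each of the twenty-five sets of LRs is summarised as a single Cllr value. A minimal sketch of the standard log-likelihood-ratio cost is given below; the function name and the example LRs are illustrative, and nothing here is taken from the system used in the talk.

```python
import math


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost. Lower is better.

    Same-speaker comparisons are penalised for low LRs,
    different-speaker comparisons for high LRs.
    """
    ss_cost = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(
        same_speaker_lrs
    )
    ds_cost = sum(math.log2(1 + lr) for lr in different_speaker_lrs) / len(
        different_speaker_lrs
    )
    return 0.5 * (ss_cost + ds_cost)


# Made-up LRs: a well-behaved set, and an uninformative set of all ones.
print(cllr([100.0, 50.0], [0.01, 0.02]))
print(cllr([1.0, 1.0], [1.0, 1.0]))
```

A system that always reports an LR of one scores exactly 1, so values well below 1 indicate informative, well-calibrated LRs, and values above 1 indicate LRs that mislead more than they help.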
0:12:26 If you look at the first row, it basically tells you what you should do when
0:12:30 you have a case
0:12:32 in device one.
0:12:33 It makes sense to use device one, but you can also see that
0:12:36 devices two and three are just as good.
0:12:39 Device four is a bit worse, that's the far one with a little reverberation in it, and
0:12:44 the telephone is definitely bad:
0:12:47 that gives you a huge penalty in performance.
0:12:50 And the same holds for the other two close microphones,
0:12:56 except maybe device three, but they seem to be quite interchangeable. And device four is
0:13:01 in between, and device five is just out of the question: that gives you a
0:13:05 penalty in performance.
0:13:07 If we look at the two
0:13:09 poorer recordings again,
0:13:11 they had better be represented by themselves. As you can see in the numbers, the lowest
0:13:16 Cllr is for the matching background data, and there's nothing that really comes close.
0:13:21 So, graphically represented,
0:13:23 that means the three high-quality microphones can sort of represent each other,
0:13:29 as can be seen from the arrows, and
0:13:32 the far microphone, which is still a direct microphone but is far away in the
0:13:37 recording, [unclear],
0:13:38 that's
0:13:39 an intermediate one. And the telephone intercept: definitely don't
0:13:42 use it
0:13:43 to represent
0:13:45 the microphones, or vice versa.
0:13:49 So that's an answer to the question:
0:13:52 within broad recording type, you cannot gloss over a telephone and a direct
0:13:57 microphone, there's definitely a crucial difference there. But within direct microphone, the brand or type,
0:14:03 the make or type of microphone, is really not so important. That's what these results seem
0:14:09 to suggest.
0:14:11 So these results were not very surprising to me,
0:14:16 but it shows you the type of research that is of
0:14:18 big interest to the users of automatic speaker recognition, because it gives you a
0:14:24 guideline
0:14:25 on how to choose relevant population data, and it gives you basically a guideline on how
0:14:30 to use ASR properly.
0:14:33 making it available for and the