The next presentation is given by two people from the NFI, both in the same room, working on more or less the same problem, but with different approaches.

So, we're going to talk about a database which may be relevant to forensic work. First, the basic paradigm.

In forensic speaker recognition — or speaker comparison, you might say — the basic paradigm has, I think, been put forward by other speakers earlier, but I summarize it here in a formula, for people that like formulas.

So basically, what the judge or the jury wants, or should want — we may have to tell them that this is what they want — is the posterior odds for the claim that the defendant is guilty or not. And that can actually be factorized into two factors: the likelihood ratio, which is the first factor on the right-hand side, and the prior odds.

The idea is that the prior odds somehow have to be determined — we say it's the court's job to do that — but they will be influenced by lots of other things and circumstances, which might include other evidence, evidence which is not related to the speech.

So this is just to give you an idea of what the framework is.
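(As a worked version of the formula being described here — the notation is my assumption, since the slide itself is not in the transcript: H_p and H_d are the prosecution and defence hypotheses, E is the speech evidence, and I is the background information, including the other, non-speech evidence.)

```latex
\underbrace{\frac{P(H_p \mid E, I)}{P(H_d \mid E, I)}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{P(E \mid H_p, I)}{P(E \mid H_d, I)}}_{\text{likelihood ratio}}
\;\times\;
\underbrace{\frac{P(H_p \mid I)}{P(H_d \mid I)}}_{\text{prior odds}}
```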

There's a connection to what we do in NIST evaluations, which most people are maybe more familiar with — the NIST questions — and I've summarized that here.

Namely, in the forensic case you might say the judge or the jury will want to decide that the defendant is guilty if those posterior odds are higher than some reasonable doubt; and whatever the reasonable doubt should be, it's related to the cost function, you might say.

In NIST evaluations it is quite similar, except that there we work with the likelihood ratio itself, which is just the ratio of the posterior and the prior odds, and that should be bigger than some threshold. The threshold only depends on the cost function — or, if you want to include the priors on the right-hand side, that is also possible.

And if you have well-calibrated likelihood ratios, then your threshold will be ideal and you'll be at the point we know as minimum DCF. So that's the relation between likelihood ratios in forensic cases and likelihood ratios in NIST evaluations.
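(A minimal sketch of that threshold relation, assuming the usual NIST-style cost parameters; the parameter values and the function name below are illustrative placeholders, not taken from the talk.)

```python
def bayes_threshold(c_miss, c_fa, p_target):
    """Likelihood-ratio threshold implied by a NIST-style cost function.

    Decide "same speaker" when LR > threshold; with well-calibrated LRs
    this Bayes decision lands at the minimum-DCF operating point.
    """
    prior_odds = p_target / (1.0 - p_target)
    return c_fa / (c_miss * prior_odds)

# Example with classic (illustrative) SRE-style parameters.
thr = bayes_threshold(c_miss=10.0, c_fa=1.0, p_target=0.01)
print(f"decide 'same speaker' when LR > {thr:.1f}")   # LR > 9.9
```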

But, we would say, the story is more about the circumstances, because everything is dependent on the information, and in a forensic case that really is the case: you have these weird samples that Joe showed us, and you don't have to give an expression, or an error rate estimate, or a validation criterion for a general case or an average over many comparisons — no, you have to do that for this particular case.

So our approach would be: we need data which is similar to the case. What we've been doing — our approach to dealing with a specific case — is to make a database with lots of annotation, so that we can more or less select a sub-database which is as similar to the case as we can make it.

And the database will also allow us to test whether circumstances, for instance language differences, actually matter, if we have all that information.
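(A hedged illustration of that sub-database selection idea; the metadata field names and values below are hypothetical placeholders, not the actual annotation scheme.)

```python
# Hypothetical metadata records; "language", "noise", "gender" are illustrative fields.
recordings = [
    {"file": "rec001_A.wav", "speaker": "spk017", "gender": "m",
     "language": "Turkish", "noise": "a little"},
    {"file": "rec002_B.wav", "speaker": "spk023", "gender": "m",
     "language": "Dutch",   "noise": "quite a lot"},
    # ... the rest of the annotated recordings
]

def select_subdatabase(recordings, **conditions):
    """Keep only recordings whose metadata matches the case circumstances."""
    return [r for r in recordings
            if all(r.get(k) == v for k, v in conditions.items())]

case_like = select_subdatabase(recordings, gender="m", language="Turkish")
print(len(case_like), "recordings similar to the case")
```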

So, this is where we move on to the next speaker — and it's going to be tricky, for this part. Right.

Yes, so I'm David van der Vloed, also known as David the Second — or the First, depending on your perspective, of course.

I'll be talking to you about the database itself: how we created it, which metadata are included, so that you get a sense of which restrictions there are. Using real data, some of the metadata is simply uncontrollable, and some of it is controllable. There you go.

This is just a short overview.

The thing to note here is that it's similar in setup to other databases that use real data — such as the NFI database of about ten years ago — and perhaps there are others that I don't know of because they're kept secret, or that I don't know of because I simply didn't find them. But this time it's six hundred speakers, so I hope it can be a contribution to the field.

What we want to do with it is mainly validation. We want to use automatic speaker recognition in casework; obviously we don't do that yet, and we need validation research. There are people using ASR in casework, and I feel — perhaps being a bit conservative — that we need realistic data for calibration. Otherwise there will be no real improvement, because the improvement in using ASR over, or next to, a human approach would, to me, be that you can actually measure reliability, and for that you really need realistic data.

So: it's not our own data; we're not really the formal owner. The owner is the prosecution, and they gave us permission to collect the data from police intercepts, and this comes with some restrictions. I'm sure the first question after this presentation will be a question regarding availability. I'm happy to cooperate, I think, but it's not entirely in our hands, and we only got permission under strict conditions. So we had to anonymize the data: we have listened through all the data and blanked out names and such.

So what did we do? We received a lot of data. It's all intercepted telephone conversations, so it's stereo: generally it's true that one speaker is on one channel and the other speaker on the other channel, though that's not always the case, because there's just so much data and a lot of cables there. We split the stereo files in half and uploaded them into the database. This is the raw material, and it's some hundreds of thousands of audio files made this way.
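(A hedged sketch of that stereo-split step; the soundfile package and the file names are my assumptions — any audio I/O library would do.)

```python
import soundfile as sf

def split_stereo(path, out_a, out_b):
    """Write each channel of a two-channel intercept recording to its own file,
    so that each half roughly contains one side of the conversation."""
    data, sr = sf.read(path)            # data has shape (n_samples, 2)
    sf.write(out_a, data[:, 0], sr)     # channel A: usually one speaker
    sf.write(out_b, data[:, 1], sr)     # channel B: usually the other speaker

split_stereo("intercept_000123.wav",
             "intercept_000123_A.wav",
             "intercept_000123_B.wav")
```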

And we had some metadata to go with it, just some general things. I made a little cartoon, because it's really realistic data — it has actually been intercepted — just to stress that point. Which also means that a lot of the speakers in the database don't know they were recorded, which, as you can imagine, is a major point in the privacy considerations and in the permissions that we got to compile the database and to distribute it or not. Okay.

So it took some processing, which took about two years. We had the chance to hire people to blank out the personal information, like names and addresses, and to actually isolate speakers, which is the most important part of the job, because we just got a big pile of audio files. They could use the telephone number to listen into the files, and then they had to decide for themselves: okay, this is John, I think, and this is John's uncle — okay, I get to know the people revolving around a telephone number. This is how a speaker ID was created: just through listening, through the telephone number, through the content of the audio. And they added the metadata.

So these people — native speakers, we call them — isolated these speakers, and whenever doubts arose, the recording was excluded. But still, this is not a hundred percent: there could be a twin brother somewhere using the phone of his twin brother, and then there is some confusion. So we like to call this truth by proxy: you can be quite sure, but never a hundred percent sure, about speaker ID.

And another thing: they were to choose at first the best ten recordings per speaker; we then lowered that to five, because we were concerned that the number of speakers would be too low in the end. They were instructed to take as diverse recordings as possible: if you take five recordings all from the same day, talking to the same person, that's a little less interesting than three recordings of one type and two recordings with some whispering, or a car, or anything. And despite those aims of five and ten — perhaps it's my management capabilities, I don't know — it just varies a lot. The mode is still five, so most speakers have five recordings, but there's even one with one hundred and thirty-three recordings, which is kind of interesting in itself, but it just varies a lot.

So this is the anonymization. It just means listening through it and assessing, deciding, what information could be traced back to a real person, and that is simply blanked out. So there is a whole set of samples in there with blanked passages, which isn't labeled; sometimes you just have to guess whether somebody didn't say anything or somebody said something that was blanked out. And when in doubt, it's just "leave it out": whenever there were doubts about whether something is really personal information, just leave it out.
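(As a hypothetical aside on how such blanking could be implemented — the interval format, file names, and the soundfile package are my assumptions, not details from the talk.)

```python
import soundfile as sf

def blank_segments(in_path, out_path, segments):
    """Replace annotated personal-information intervals (start, end in seconds)
    with silence, leaving the rest of the recording untouched."""
    data, sr = sf.read(in_path)
    for start, end in segments:
        data[int(start * sr):int(end * sr)] = 0.0   # e.g. a spoken name or address
    sf.write(out_path, data, sr)

blank_segments("rec001_A.wav", "rec001_A_anon.wav", [(12.3, 13.1), (45.0, 46.2)])
```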

Okay, and then these people added their own metadata. The single most important attribute is of course the speaker ID, and then all these other things — but these are all perceptual metadata, that is, based on listening. So sometimes there are some subjective measures there, like the amount of noise: they could choose between none, a little, and quite a lot. But of course there was more than one native speaker doing this job, so it depends a bit on the person what they consider quite a lot of noise. We tried to regulate it, but it's never perfect — it's one of the quirks of listening and then subjectively judging this kind of metadata.

Okay, this was the end of the job for them, and as post-processing we anonymized the metadata, of course. The next step is something I pretend here that the database is all finished with, but actually we're still working on it, which is to make a clean version. Because for it to be comparable to forensic casework, you will want to leave out all the background stuff — background speakers, music — like you would do with real case recordings. So I'm labeling all the parts where there are those kinds of background noises, so that we have a dirty version and a clean version of the same database in the end.

Here we go.

So this is the database in numbers, just for you to see. As you can see, there's a two-to-one ratio of male to female: we let the native speakers prioritise males in the database, because males are way more frequent in casework than females — I'm sorry to say, or perhaps not. And just some statistics.

This is interesting: the Dutch language landscape is not strictly monolingual; there are still quite sizeable minorities in Holland, mainly of Moroccan and Turkish descent, which means we have some multilingual speakers, and they come in different flavours. There are speakers that speak a mix of Turkish, for instance, and Dutch in the same conversation, and there are speakers that will use Dutch in some conversations and Turkish in another conversation. So we have quite some possibility to do cross-language research with this, and the first experiment that will be presented is of that type. There are also some English recordings, but don't get your hopes up: there are only six speakers, most of them are not native, and the English they speak is like the English I'm talking now, but with a Dutch accent.

So, the number of recordings: up to the one hundred and thirty-three recordings for the largest speaker, who is a big criminal in Holland — of course I'm not allowed to tell you who. And here it is in terms of trials, same-source trials and different-source trials, but I must admit the different-source trials are also cross-gender and cross-everything, so the actual usable number of different-source trials is probably a little lower.
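(A hypothetical sketch of how such trial counts can be derived from the metadata, with cross-gender pairs excluded from the different-source trials; the record structure is assumed, not the actual database schema.)

```python
from itertools import combinations

def count_trials(recordings, same_gender_only=True):
    """Count same-source and different-source trials over all recording pairs."""
    same, diff = 0, 0
    for a, b in combinations(recordings, 2):
        if a["speaker"] == b["speaker"]:
            same += 1
        elif not same_gender_only or a["gender"] == b["gender"]:
            diff += 1
    return same, diff

recs = [{"speaker": "spk1", "gender": "m"},
        {"speaker": "spk1", "gender": "m"},
        {"speaker": "spk2", "gender": "m"},
        {"speaker": "spk3", "gender": "f"}]
print(count_trials(recs))  # (1, 2): cross-gender pairs are not counted as different-source trials
```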

This is the duration, to give you a sense of the durations. The pink bars are the gross durations and the blue bars are after speech activity detection, and there are some unexplainable things in the pink distribution for which I don't have an explanation. You can see that the minimum duration for telephone conversations to actually make it into the database is thirty seconds, because below that there's just a lot of call tones without an answer and other rubbish. The maximum that I told the native speakers they could use was a conversation of ten minutes, because otherwise it would be too much work for just one recording. However, my management capabilities being what they are, there are still some recordings over six hundred seconds in there — they should have been excluded, but they're still there.

Okay.

Well, like I said, we intend to use it for validation research, and that goes into two different types. There's general validation: just choosing which algorithm is best for us, for our casework, and which calibration method — I'm happy to see that Niko is going to talk about different calibration types, which will hopefully be applicable soon.

And there's also, and that's more relevant, case-specific validation. All the variation that Mr Campbell talked about is real: there really is no case without something special to it — perhaps not as extreme as the examples we heard, but there's always something — which to me means that you need case-specific validation. So for every case you will have to define which data is representative for my known sample and which data is representative for my reference sample.

And this means we need data, basically, and this is why we did it. I hope that the database will reflect a lot of cases. Of course, this is only intercept data, and the real monkey business, with screaming and yelling and running, is probably not in there. So that will restrict which cases you can do.

So there are two solutions to broaden the type of cases you can do: one, find more data; and two, wait for you guys to have made an algorithm and find out that some conditions don't matter any more, so that you can then choose the evaluation data a little less strictly.

Of course, I'm talking faster than the slides.

So with the database, I'm thinking of masses of trials: you always try to find a lot of same-source trials and different-source trials, so that those two score distributions will be well populated.

I still have five minutes left, and David still has a part, so I'll hand over to him.

Please contact me about availability, but it will be hard, because it's not our own data and it's very sensitive, being six hundred intercepted speakers. You can find my email address on the presentation, so please contact me.

Okay.

Ah, this was kind of expected: we're running a bit late, I don't know how. Right — the selection. So... I can't see exactly what's on the screen from here.

We did an experiment: we take ten percent of the data, a subset of the whole database. This is pretty preliminary, mind you; the idea is: can we do some speaker recognition experiments and see which influences are important?

This is some more motivation; I'll skip ahead and tell you what we did. So we looked at the Turkish speakers: these are either speaking Turkish, or Dutch with a Turkish accent, or a mix of Dutch and Turkish.

And here's a slide about a remaining problem: the skewness of the availability, the number of segments per speaker, and how to deal with that.

I had a nice paraphrasing joke about the well-known quote, which I paraphrase as: some speakers are more equal than other speakers. So, how to deal with the different amounts of trials?

George actually had a solution — an old solution — which indicated that you make a DET curve per speaker pair, basically. We implemented that; you can see the influence of that, and you can read more about it in the paper.
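(What follows is only my paraphrase of that per-speaker-pair idea, as a hedged sketch — weighting each trial so that every speaker pair contributes equally to the score distributions — and not the exact method from the paper.)

```python
from collections import Counter

def pair_weights(trials):
    """trials: list of (spk_a, spk_b, score, is_target) tuples.
    Returns a weight per trial so that each speaker pair contributes equally."""
    pair_counts = Counter(frozenset((a, b)) for a, b, _, _ in trials)
    return [1.0 / pair_counts[frozenset((a, b))] for a, b, _, _ in trials]

trials = [("spk1", "spk1", 4.2, True),
          ("spk1", "spk2", -1.3, False),
          ("spk1", "spk2", -0.7, False)]
print(pair_weights(trials))  # [1.0, 0.5, 0.5]
```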

Let me quickly go to the effect of, let's say, the speaker population for a commercial speaker recognition system. So here we used a commercial speaker recognition system that can do some calibration, and what you see here is that if you give the recognition system some additional material — those were forty-five speakers outside the test database used here — you can go from very badly calibrated, the top line, Cllr values above one, so it's just useless, to the lower line, where you see that the Cllr, which is a measure of calibration and discrimination, is actually quite close to the minimum attainable Cllr. So this shows that the system that we used — you can read more about the system in the paper — did work, and the DET curves are more or less the same, so the reference population really only matters for calibration.
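(A minimal sketch of the Cllr measure mentioned here, using its standard definition; computing the minimum attainable Cllr additionally requires the PAV algorithm, which is omitted.)

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood-ratio calibration, in bits: 0 is perfect,
    values around or above 1 indicate a useless or badly calibrated system."""
    c_tar = sum(math.log2(1 + math.exp(-llr)) for llr in target_llrs)
    c_non = sum(math.log2(1 + math.exp(llr)) for llr in nontarget_llrs)
    return 0.5 * (c_tar / len(target_llrs) + c_non / len(nontarget_llrs))

# Reasonably well-separated natural-log LLRs give a Cllr well below 1.
print(cllr([2.0, 3.5, 1.0], [-2.5, -4.0, 0.5]))
```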

But I was going to tell you a bit more — wait a minute. This is just an answer to one of the reviewers: well, what about the score distributions? These are the score distributions; they are also in the paper.

This is my final slide already, so I'm working towards finishing in time. You see a number of figures showing different tests that you can do with the database. Out of this ten percent we took only the Turkish speakers. We first looked at: what if train and test are both Turkish? And you see several performance measures and some statistics about the number of trials. The next thing you can do is: what if both train and test — or the questioned and reference samples, the trace and reference you might say — are both Dutch-speaking, but with a Turkish accent? And the numbers actually vary a bit, so the equal error rate goes up, although whether that's significant I don't know; there are not too many speakers in our subset.

But it might be more interesting to look at the fourth line, where what is indicated is that we train with speakers talking Turkish and test with speakers speaking Dutch with a Turkish accent, or the other way around — those two cases. And then the most interesting thing there, I think, is that nothing happens, or not so much happens, with the equal error rate: it stays more or less in the same ballpark, fifteen point eight percent,

the same as in the first two lines. But the calibration suffers — although it doesn't really suffer that much; it is actually comparable to what happens with the Turkish speakers speaking Dutch. So from this data it looks like calibration suffers when speakers speak their second language, but the cross-language effect is modest in making things worse. I'd like to show you the figures for you to look at yourselves.

I think the general conclusion from the figures is that there is quite a lot of variability, so things depend a lot on how you set up the experiments; at least this data allows you to do those kinds of experiments.

But I would like to conclude with the more general idea of this work of collecting the database, which is that, first of all, I hope to have shown that this kind of data is necessary for answering questions like: what are the error rates of the methods for this particular case, under these particular conditions. But you could also use the data — this is not shown in this work — to actually make a case-specific calibration: once you know which factors have an influence and which do not, then you can make a selection according to those factors and use that for case-specific calibration. And I have very briefly shown an experiment there with the language data.

Okay, that's it.

Thank you for the talks, the two Davids. I would like to ask you about the level of precision of your tagging in the metadata, and the reason I'm asking is that in Australia we have a very similar multilingual context, with waves of immigrants from certain countries when there are troubles; for example, whenever there's a war in Lebanon, we have waves of immigrants speaking English with an Arabic accent, and Arabic. Now, firstly, that level of accentedness varies, as we all know, and secondly, after twenty-five years we have a second generation of native speakers who don't speak English with an accent but speak their own idiolect of English, often as native speakers. Do you account for that kind of difference?

Yes, we have not only a language field but also a nativeness field, so it will be annotated whether this is a native speaker, or quite a good non-native speaker, or a bad non-native speaker of the language spoken. And this idiolect — sociolect would, I guess, be the linguistic term — would count as native if it's second or third generation.

Was each speaker confined to one handset, or did you have speakers that went across handsets?

The majority of speakers — and I know this by telephone number — is on the same number, but there are also some speakers that are using one phone in one recording and, for instance, a landline in another recording. I don't have the exact numbers.

In the experiments that I showed you, the non-target conditions are always the same as the target conditions. Actually, I would make the argument that it doesn't make any sense to differ there, because the conditions are sort of the conditioning in the likelihood ratio, so you shouldn't have the numerator with different conditions than the denominator; I don't believe that makes too much sense.

Just a follow-up for David.

My feeling is that what the calibration shows is exactly the limit of speaker recognition as we know it: we are not able, with our systems, to detect whether we need new data to do recalibration. What we see with calibration is just the limit of the robustness of the system.

I'm not sure I entirely understood the question, but I agree that we can test whether particular conditions — noise or whatever — make a difference or not; we should be able to do that with this data. But if there is a new condition — and there can always be a new condition — and we don't know whether it's of influence, we can't say whether we actually have matching data. I agree that that's just further work.

We should work on automatic detection of whether we are in a known condition. If we are not able to describe a condition, factor by factor, and to decide and to give to the user the probability of being compliant with the training set, how can we use our systems in forensic conditions? Because, as Joe said, there is a huge amount — and we know that — a huge amount of conditions in forensics.

So the way I would approach it is not by doing everything completely automatically, but by actually having the forensic expert listen to the data, and that's no problem, since it is a limited amount of data. The forensic expert can say something sensible like: well, this is very much like what I've heard before, or: well, there is an enormous buzz, or there is an enormous something-or-other that I haven't seen before. So we should not do everything automatically.

David, I agree completely: we need a human, a human expert, for that, at least to begin with. But we need to feed the human expert with information from the system. And at the same time, as normal, since we don't know exactly which information our system is using, we can't explain to the human expert how to define what is a known condition for the system or not. It's not enough: you have some very interesting labels here, but, as Michael said in the previous question, we always have some question about language — what exactly is the definition of the language, of the idiolect, what is the definition of the conditions, of the distance? Is it in the conditions? And we should work with our automatic systems to determine the sensitivity of our system to each of these factors,

knowing that the human expert can use the human brain to give a probability. And I hope we will do that, but...

Just a quick digression: the underlying assumption, I think, that most people make is that when it's intercepted, it's a telephone call, but I think in a lot of forensic applications you have a confidential informant wearing a body-worn microphone, and in those cases you have lots of issues with, for instance, clothing covering the microphone. Could you just say whether the audio that you have is all telephone calls or not, and do you have any plans to explore the body-worn type of scenarios? Because that actually is a real challenge as well.

This is only telephone speech, and we are planning to expand our data collection to — yes — more of that kind of mismatch. But in Holland, intercepted telephone speech is really the majority of the data, so we're covering quite a lot already. But for those kinds of circumstances, a parked car or anything, we need data — that's true.