Hello everyone, my name is David van der Vloed.

I work at the NFI, and I will talk to you about

automatic speaker recognition in forensic voice comparison.

I'm a user of automatic speaker recognition technology, not a developer,

which gives me a unique perspective that I hope will be insightful for you.

This is a study into representativeness, a concept that's really important

in doing actual cases in forensic voice comparison.

My two co-authors

are the heroes who actually develop

automatic speaker recognition systems. They work at Oxford Wave Research, and they have their

system, which was used for this study.

But I'm just the humble user, and I will talk about automatic speaker recognition from

that perspective.

So, in forensic voice comparison you will typically have an offender recording from the police:

somebody did something bad in this recording,

and the identity of the speaker is unknown. And

there will be a suspect, of whom they think: okay, this guy must be

the same as the offender. So there is the suspect recording, and

the recordings can come from anywhere.

The important part is that we get two recordings:

one of them has a contested speaker identity, and the other one, that's just the

suspect; nobody disputes that.

And the question is always: hey, are these guys

the same person? Are these people the same person?

Of course, we translate this into hypotheses, so

we work in a Bayesian framework,

but what it all boils down to is: is it the same guy or not? And when

you use automatic speaker recognition, you

chuck the recordings into your system, you give it some user-submitted

data and a reference normalization cohort as well,

and it gives you a score.

And this score

sort of exists in a void.

There's no way of

telling what a score means. It could be seventeen and nobody knows how to judge that. So

you need a relevant population. So you look at your

potential relevant population, recordings of known speaker identity,

and you check: my case recordings are of the blue type, blue guys, so my relevant

population is blue people. And I compare those blue people

in the same manner as I did in the case,

and that gives same-speaker scores and different-speaker scores, which are used to

make two distributions. And then I can bring back my case score,

and here in this example I can see I have an LR of about four,

because the intersections with the green line and the orange line have a ratio of four.

And this four, that's a likelihood ratio, and

now we don't have a meaningless score anymore; we have a meaningful number, a likelihood

ratio. This is

an expression of the weight of evidence.

It can actually be used in casework or in court:

the judge can

weigh this in his or her decision

about the case as a whole.
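As an illustration of the score-to-likelihood-ratio step described above, here is a minimal Python sketch. It is not the method of any particular system: it assumes both score distributions are modeled as fitted Gaussians (the talk does not say which model was used), and every score value and name in it is made up for the example.

```python
from statistics import NormalDist


def likelihood_ratio(case_score, same_speaker_scores, different_speaker_scores):
    """Return p(score | same speaker) / p(score | different speakers).

    Assumption: each set of relevant-population scores is modeled
    by a Gaussian fitted to the samples.
    """
    same_model = NormalDist.from_samples(same_speaker_scores)
    diff_model = NormalDist.from_samples(different_speaker_scores)
    return same_model.pdf(case_score) / diff_model.pdf(case_score)


# Made-up scores from a hypothetical relevant population.
same_scores = [14.0, 16.0, 18.0, 20.0, 22.0]
diff_scores = [6.0, 8.0, 10.0, 12.0, 14.0]

lr = likelihood_ratio(17.0, same_scores, diff_scores)
print(f"LR for a case score of 17: {lr:.1f}")
```

Swapping in scores from a different relevant population changes both fitted distributions, and with them the likelihood ratio for the same case score.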

Okay, let's backtrack.

There was this choice of a potential relevant population, and I said: okay, let's look

at the colour of the guys.

But reality is a bit more complex than just colours. Maybe I should

have checked whether they were wearing sunglasses, and you would get

another relevant population. Or maybe I should have checked

whether they had hats on, or maybe the combination of these two.

And

there's the earlier

result I got,

but when checking for hats, it might be that the distributions are shifted

and the actual resulting LR will be way lower than what I had before. Or, if I

had checked for sunglasses, it might have shifted just the other way.

And, okay, the other option: if I had

checked for every single piece of metadata I could think of,

colour, hats and glasses, I would probably not have sufficient data to even do

this.

So you can see this has a major impact on the result of the case,

and

this is

a real problem in forensic voice comparison, because when I was talking about hats, when

I was talking about

sunglasses, I actually meant, of course, case conditions,

case recording conditions. And that's an enormous list. I just give you some of them,

but the list, when you think of it,

could be close to infinity.

So that's a real problem: you don't really know what to select for. And even

within

this list,

look at broad recording type there:

in there, there are multiple categories, and within those categories there's

even cellular.

So there are many elements to look for, and it's just not clear:

could some of these things safely be ignored because they have

no impact on the LRs

at all,

or might they be really crucial? And then it's really important: you don't wanna

forget it, because then you get this wrong

likelihood ratio that could

potentially lead to a miscarriage of justice.

So, in order to do research into this relevant population problem, we collected a database. It's

called NFI-FRIDA:

Forensically Realistic Inter-Device Audio. It's got two hundred and fifty male speakers,

and the other characteristics here are just the target audience of forensic voice comparison in

the Netherlands, basically. Their speech was recorded on multiple devices simultaneously, so every utterance

of speech is recorded in different ways.

And I have an example of this.

And the guy there is sitting, and he's talking on the phone with

another participant.

and

she scheme or headset

a text-dependent i for the subset of the testing for

and there was a

no improvement due to stupidity i financed data suggesting for

and i guess to

And there's the microphone on the other side of the room.

And the police kindly provided actual intercepts of the telephone.

and however

And this is a still of a video made with an iPhone, which is

the sixth recording.

So this is the list of the recording devices, and

it says "inside only" for some of the microphones,

and I will explain this right now.

So,

every participant had two days of recording, and every day had eight recording sessions:

four of them inside, four of them outside.

All those inside and outside sessions were divided into silent background and noisy

background. Inside, that meant either just no sound or

a white-noise radio

making noise for the noisy background. Outside,

the actual location varied, so there was a sort of silent place

and there was a busy place right in the centre, as you can see.

And then there was the other variation, which was the actual telephone used: one of two

phones, one of them an iPhone. This made eight conversations per day,

and there are two days of those.

The conversations are five minutes of spontaneous telephone speech, and

we actually transcribed half of it, the iPhone recordings, which helped us edit the recordings,

considering the speech/non-speech information available.
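The session design described above (two days, inside or outside, silent or noisy background, one of two phones) can be enumerated with a short Python sketch. The labels are illustrative only: the talk names an iPhone, and "other phone" is a placeholder for the second telephone.

```python
from itertools import product

DAYS = ("day 1", "day 2")
LOCATIONS = ("inside", "outside")
BACKGROUNDS = ("silent", "noisy")
# Placeholder labels: "other phone" stands in for the second telephone.
PHONES = ("iPhone", "other phone")

# One conversation per combination of the four factors.
sessions = [
    {"day": day, "location": loc, "background": bg, "phone": phone}
    for day, loc, bg, phone in product(DAYS, LOCATIONS, BACKGROUNDS, PHONES)
]

per_day = sum(1 for s in sessions if s["day"] == "day 1")
print(per_day, len(sessions))  # prints "8 16"
```

This gives eight conversations per day and sixteen per participant, matching the counts given in the talk.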

And look at the numbers:

you can see every speaker has about one hour twenty minutes of speech duration; the

worked of the recording the duration of longer because for every

speech utterance does not of course

So why this database? It is, of course, forensically relevant in its speaker demographics,

like I said, but the really cool part is the simultaneous recordings, and this

makes it possible to study the influence of the recording device, more specifically the relevance of data

that's recorded by different recording devices.

The system we used for this study is VOCALISE by Oxford Wave Research.

It's an x-vector system, and

a really cool feature is that you can visualise the speakers: you can see in

the bottom right the speaker vectors brought down to three dimensions.

You also have the option to use earlier generations, i-vector and GMM-UBM, and you have the option to

not use MFCCs but use auto-phonetic features.

And they have specialities other than the normal stuff: there's reference normalization, which is

very standard, but you can also submit data

for the PLDA to adapt it better to the case conditions.

So from FRIDA we took one hundred and thirty-five speakers. All

the recordings were edited to net speech and cut to forty seconds,

and they were divided into two groups: there's test data and there's the reference normalization

cohort. We also did experiments without the reference normalization cohort.

And I should say, for every speaker, for every day, there are five recordings, for the

first five devices. The smartphone video was not included, but the other five are there.

So

you chuck it all into VOCALISE and this is what you get.

These are convex hull equal error rates, and as you

can see, when you do a matching experiment you get quite good numbers. This does

not compare to i-vector performance at all;

it's really better. You can actually see that the first three devices, the high-quality

close microphones, perform

pretty well against each other, even when mismatched.

This is not

quite true for the other two devices: if you mismatch good

recordings with the far microphone or the telephone intercept, you really start to notice.

And of course, if you take the two poorer recording types and compare those

in a mismatch, that's gonna be the

one that's performing the worst.

Note: if I do this with an i-vector system,

this four-to-five-ish equal error rate is actually what you get

all over the place, and the one that's highlighted now would probably be ten percent

equal error rate.

Also note that reference normalization actually helps:

the equal error rate with it is

almost everywhere a bit lower than the one without.
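For readers unfamiliar with the metric, an equal error rate can be approximated as in the Python sketch below. Note the assumption: this is a plain threshold sweep, not the convex hull variant reported in the talk, and the scores are made up.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Returns the mean of the miss rate and the false-alarm rate at the
    threshold where the two rates are closest to each other.
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Made-up same-speaker (target) and different-speaker (non-target) scores.
print(equal_error_rate([2, 3, 4, 5], [0, 1, 2, 3]))
```

A lower value means the same-speaker and different-speaker scores overlap less.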

So, back to the original question: can we do research that finds out whether something

is crucial, or whether some things can be ignored when selecting relevant population data?

So we took the one hundred thirty-five FRIDA speakers and we divided them

into mock cases,

those would be the actual case recordings,

and background data. We did this in three ways, and the results are pooled over

those.

When you take device one for those mock cases, it would make sense to use

device one as background data, because then you have matching background data, matching

relevant population data.

You could also use the other devices, and then you would have mismatched background

data.

So we did this for every device type in the mock cases and every device

type in the background data, the relevant population data. That meant twenty-five

sets of LRs, which are represented as Cllrs. And if you look at

this table, the diagonal means that we

used the right relevant population data: so if the mock case type was device one,

the headset, we took the headset recordings

for the relevant population, and so on.
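The Cllr (log-likelihood-ratio cost) used to summarise each set of LRs can be computed as in this minimal Python sketch, using the standard definition of the metric. The LR values below are made up and do not come from the study.

```python
import math


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost: lower is better, 1.0 means uninformative.

    Same-speaker LRs are penalised for being small,
    different-speaker LRs for being large.
    """
    ss_cost = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(same_speaker_lrs)
    ds_cost = sum(math.log2(1 + lr) for lr in different_speaker_lrs) / len(different_speaker_lrs)
    return 0.5 * (ss_cost + ds_cost)


# Made-up LRs: a system that always answers LR = 1 carries no information.
print(cllr([1.0, 1.0], [1.0, 1.0]))
# Made-up LRs from a well-behaved system: large for same-speaker pairs,
# small for different-speaker pairs.
print(cllr([100.0, 50.0], [0.01, 0.02]))
```

In the experiment described above, one such value would be computed for each of the twenty-five combinations of mock-case device and background-data device.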

If you look at the first row, here's basically what you should do when

you have a case

in device one.

It may make sense to use device one, but you can also see that

devices two and three are just as good.

Device four is a bit worse; that's the far one, with a little reverberation in it. And

the telephone is definitely bad:

that gives you a huge penalty in performance.

And the same holds for the other two close microphones,

except maybe device three, but they seem to be quite interchangeable. Device four is

in between, and device five is just out of the question: that gives you a

penalty in performance.

If you look at the other two

recordings again,

they'd better be represented by themselves, as you can see in the numbers: the lowest

Cllr is for the matching background data, and there's nothing that really comes close.

So, graphically represented,

that means the three high-quality microphones can sort of represent each other,

as can be seen by the arrows, and

the far microphone, which is still a direct microphone but is far away and

reverberant,

that's

an intermediate one. And the telephone intercept: definitely don't

use it

to represent

the microphones, or vice versa.

So that's an answer to the question:

within broad recording type, you cannot gloss over telephone versus direct

microphone; that's definitely a crucial difference there. But within direct microphone, the brand or

type of microphone is really not so important. That's what these results seem

to suggest.

So these results were not very surprising to me,

but it shows you the type of research that is of

big interest to the users of automatic speaker recognition, because it gives you a

guideline

for how to choose relevant population data, and it basically gives you a guideline for how

to use ASR properly.

making it available for and the