Hello everyone, my name is David van der Vloed.
I work at the Netherlands Forensic Institute, and I will talk to you about automatic speaker recognition in forensic voice comparison.
I am a user of automatic speaker recognition technology, not a developer, which gives me a different perspective, one that I hope will be insightful for you. This study is about representativeness, which is a concern that really matters when doing actual cases in forensic voice comparison.
My co-authors are the heroes here: they actually develop automatic speaker recognition systems, they also do great research, and it is their system that was used for this study. I am just the humble user, and I will talk about automatic speaker recognition from that perspective.
So, forensic voice comparison. You will typically have an offender recording from the police: somebody did something bad in this recording, and the identity of the speaker is unknown. And there will be a suspect, of whom we think: okay, this person could be the same as the offender, so there is also a suspect recording. These recordings can come from anywhere. The important part is that we get two recordings, one with a contested speaker identity and one that is simply the suspect, whose identity nobody disputes. And the question is always: are these two people the same person?
Of course we translate this into hypotheses, so we work in a Bayesian framework, but what it all boils down to is: is it the same person or not? When you use automatic speaker recognition, you put the two recordings into your system, you give it some user-submitted data as a reference normalization cohort, and what it gives back to you is a score. And that score exists in a void: there is no way of telling what a score means. It could be seventeen, and nobody knows how good that is.
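Written out, the Bayesian framework mentioned here is simply the odds form of Bayes' rule; the automatic system can at best supply the middle term, the likelihood ratio:

$$
\frac{P(H_\text{same}\mid E)}{P(H_\text{diff}\mid E)}
= \frac{P(E\mid H_\text{same})}{P(E\mid H_\text{diff})}
\times \frac{P(H_\text{same})}{P(H_\text{diff})}
$$

where $E$ is the evidence (here, the comparison score) and $H_\text{same}$, $H_\text{diff}$ are the same-speaker and different-speaker hypotheses; the prior and posterior odds are for the court, not for the forensic scientist.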
So you need a relevant population. You look at potential relevant-population recordings, of known speaker identity, and check them against the case: in my cartoon example the case recordings are the blue guys, so my relevant population is blue people. I then compare those blue people with each other in the same manner as I compared the recordings in the case. That gives me same-speaker scores and different-speaker scores, which can be turned into two distributions, and then I can bring the case score back in.
In this example I can see I have an LR of about four, because at the case score the heights of the green curve and the orange curve have a ratio of about four. That four is a likelihood ratio, and now we no longer have a meaningless score: we have a meaningful number, a likelihood ratio, which is an expression of the weight of the evidence. It can actually be used in casework and in court; the judge can weigh it in his or her decision about the case as a whole.
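To make that score-to-likelihood-ratio step concrete, here is a minimal sketch of the general idea in Python; the scores are invented, and this is not the calibration method of any particular system:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical scores obtained by comparing relevant-population recordings
# with the same system, in the same way as the case recordings were compared.
same_speaker_scores = np.array([3.1, 2.7, 3.5, 2.9, 3.3, 2.5, 3.0])
diff_speaker_scores = np.array([-1.2, 0.3, -0.5, 0.1, -0.8, -0.2, 0.4])

# The two score distributions (the two curves on the slide).
same_density = gaussian_kde(same_speaker_scores)
diff_density = gaussian_kde(diff_speaker_scores)

def likelihood_ratio(case_score: float) -> float:
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return same_density(case_score)[0] / diff_density(case_score)[0]

# An LR of about 4 would mean the case score is four times more probable
# under the same-speaker hypothesis than under the different-speaker one.
print(likelihood_ratio(1.8))
```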
Okay, let's backtrack. There was this choice of a relevant population, and I said: okay, let's look at the colour of the guys. But reality is a bit more complex than just colours. Maybe I should have checked whether they were wearing sunglasses, and then I would have got another relevant population. Or maybe I should have checked whether they had hats, or maybe the combination of those two. Compared with the earlier result I got, it might be that, when checking for hats, the distributions shift and the resulting LR ends up way lower than what I had before; had I checked for sunglasses, it might go just the other way. And if I had checked for every single piece of metadata I could think of (colour, hats and sunglasses), I would probably not have had enough data to even do this. So you can see this has a major impact on the result of the case.
And this is a real problem in forensic voice comparison, because when I was talking about hats and sunglasses, what I actually meant was of course recording conditions, the conditions of the case recordings. The list of possible conditions is enormous; I just gave you some of them, and when you think about it, it could be close to infinite. So that is a real problem: you don't really know what to select for. And even within this list, look at "broad recording type": there are multiple categories in there, and within those categories there are subcategories, such as cellular. So there are many elements to look at, and it is just not clear. Could some of these things safely be ignored, because they have no impact on the result at all? Or are they maybe really crucial? In that case it is really important not to forget them, because otherwise you get a wrong likelihood ratio that could potentially lead to a miscarriage of justice.
So, in order to do research into this relevant-population problem, we collected a database. It is called NFI-FRIDA: Forensically Realistic Inter-Device Audio. It contains 250 male speakers, and the other speaker characteristics you see here basically match the target population of forensic voice comparison in the Netherlands. Their speech was recorded on multiple devices simultaneously, so every utterance of speech is captured in several different ways. And I have an example of this.
In the example, the participant is sitting at a table and talking on the phone with an interlocutor who is not a participant. He is wearing a headset microphone, there are further close microphones near him, and there is a microphone on the other side of the room. The police kindly provided an actual intercept of the telephone call. And all of this is also being filmed by a smartphone, which gives the sixth recording.
So this is the list of the recording devices. You can see it says "inside only" for some of the microphones, and I will explain that right now.
Every participant came for two days of recording, and every day had eight recording sessions: four of them inside and four of them outside. Both the inside and the outside sessions were split into a silent background and a noisy background. For the inside sessions that meant either no sound at all, or white noise, a radio and people making noise for the noisy background. Outside, it was the actual location that varied: there was a sort of silent place and there was a busy place right in the centre, as you can see. And then there was one more variation, namely which telephone was actually used. That made eight conversations per day, and there are two days of those. The conversations are five minutes of spontaneous telephone speech, and we transcribed half of them, which helped us edit the recordings, so there is speech/non-speech information available.
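As a small sketch of how those eight sessions per day multiply out (the condition labels are my own shorthand, not the official FRIDA metadata):

```python
from itertools import product

locations   = ["inside", "outside"]    # four sessions inside, four outside per day
backgrounds = ["silent", "noisy"]      # silence vs. radio / crowd noise
telephones  = ["phone A", "phone B"]   # the telephone used was varied as well

# 2 x 2 x 2 = 8 five-minute telephone conversations per day, repeated on 2 days
sessions = list(product(locations, backgrounds, telephones))
for day in (1, 2):
    for location, background, phone in sessions:
        print(f"day {day}: {location}, {background} background, {phone}")
```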
And if you look at the numbers, you can see that each speaker has about one hour and twenty minutes of speech; the gross duration of the recordings is longer, because of course not every part of a recording is speech.
So why did we do it this way? The speaker demographics are of course forensically relevant, and, like I said, the really cool part is the simultaneous recordings: they make research into the influence of the recording device possible, more specifically into the relevance of data that was recorded by a different recording device.
The system we used for this study is VOCALISE by Oxford Wave Research. It is an x-vector system. A really cool feature is that you can visualize the speakers: in the bottom right you can see the speaker vectors projected down to three dimensions. You also have the option to use the earlier generations, i-vector and GMM, and the option to not use MFCCs but phonetic features instead. And there are some specialities beyond the standard functionality: there is reference normalization, which is fairly standard, but you can also submit data to adapt the PLDA to the case conditions.
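Reference normalization, in general, works along the lines of the sketch below: each case recording is also scored against a cohort of reference recordings, and the raw score is re-expressed relative to those cohort scores. This is a generic symmetric-normalization sketch, not VOCALISE's actual implementation:

```python
import numpy as np

def s_norm(raw_score, suspect_vs_cohort, offender_vs_cohort):
    """Symmetric score normalization against a reference cohort.

    raw_score:          score of the suspect recording vs. the offender recording
    suspect_vs_cohort:  scores of the suspect recording vs. each cohort recording
    offender_vs_cohort: scores of the offender recording vs. each cohort recording
    """
    z_suspect  = (raw_score - np.mean(suspect_vs_cohort)) / np.std(suspect_vs_cohort)
    z_offender = (raw_score - np.mean(offender_vs_cohort)) / np.std(offender_vs_cohort)
    return 0.5 * (z_suspect + z_offender)
```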
So, for the experiments the recordings were all edited down to net speech and cut to forty seconds, and the speakers were divided into two groups: test data (135 speakers) and a reference normalization cohort. We also did the experiments without the reference normalization cohort. I should say that for every speaker, on every day, there are five recordings, one for each of the first five devices: the smartphone video was not included, but the other five are all there.
So, you put all of this into VOCALISE and this is what you get. These are equal error rates for every combination of recording devices, and as you can see, when you do a matched experiment you get quite good numbers. This does not compare to i-vector performance at all; it is really better. You can also see that the first three devices, the high-quality close microphones, perform pretty well against each other, even when mismatched. That is not quite true for the other two devices: if you mismatch recordings with the far microphone or the telephone intercept, you really start to notice it. And of course, if you take the two least clean recording types and compare those in a mismatch, that is going to perform the worst. Note that if I did this with an i-vector system, these four-to-five-percent equal error rates are not what you would get: it would be all over the place, and the cell that is highlighted now would probably be around ten percent equal error rate. Also note that reference normalization actually helps: the equal error rate with reference normalization is almost everywhere a bit lower than without it.
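For reference, the equal error rate can be computed from a set of same-speaker and different-speaker scores roughly as in this sketch (a generic implementation, not the evaluation code behind these tables):

```python
import numpy as np

def equal_error_rate(same_scores, diff_scores):
    """EER: the point where the miss rate equals the false-alarm rate."""
    same_scores = np.asarray(same_scores)
    diff_scores = np.asarray(diff_scores)
    thresholds = np.sort(np.concatenate([same_scores, diff_scores]))
    miss_rate = np.array([np.mean(same_scores < t) for t in thresholds])
    fa_rate = np.array([np.mean(diff_scores >= t) for t in thresholds])
    idx = np.argmin(np.abs(miss_rate - fa_rate))
    return 0.5 * (miss_rate[idx] + fa_rate[idx])
```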
So, back to the original question: can we do research that finds out whether something in the recording conditions is crucial, or can safely be ignored, when selecting relevant population data? We took the 135 test speakers and divided them into mock cases, which supply the actual case recordings, and background data; we did this split in three different ways and pooled the results. When your mock case recordings come from device one, it would make sense to use device one as background data, because then you have matching background data, matching relevant population data. You could also use the other devices, and then you would have a mismatch in the background data. So we did this for every device type in the mock cases and every device type in the background data, the relevant population data. That gives you twenty-five sets of LRs, which are summarized as Cllr values. If you look at this table, the diagonal means that we used the "right" relevant population data: the mock case was recorded on device one, the headset, so we also used the headset recordings for the relevant population, and so on.
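Each of those twenty-five cells is summarized with Cllr, the log-likelihood-ratio cost. For readers unfamiliar with it, this is the standard formula in code form (assuming you already have the LRs for same-speaker and different-speaker trials):

```python
import numpy as np

def cllr(lrs_same, lrs_diff):
    """Log-likelihood-ratio cost: lower is better, 0 is a perfect system.

    lrs_same: likelihood ratios from same-speaker trials
    lrs_diff: likelihood ratios from different-speaker trials
    """
    penalty_same = np.mean(np.log2(1.0 + 1.0 / np.asarray(lrs_same)))
    penalty_diff = np.mean(np.log2(1.0 + np.asarray(lrs_diff)))
    return 0.5 * (penalty_same + penalty_diff)
```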
If you look at the first row, that is basically an evaluation of what you should do when you have a case recorded on device one. It makes sense to use device one, but you can also see that devices two and three are just as good. Device four is a bit worse, that is the far microphone with a little reverberation in it, and the telephone is definitely bad: it gives you a real penalty in performance. The same holds for the other two close microphones, except maybe for device three, but overall they seem to be quite interchangeable; device four is somewhere in between, and device five is simply out of the question, it gives you a penalty in performance.
Now let's look at the telephone intercept recordings again. They are best represented by themselves: as you can see in the numbers, the lowest Cllr is obtained with matching background data, and nothing else really comes close. Graphically represented, this means that the three high-quality close microphones can more or less represent each other, as shown by the arrows; the far microphone, which is still a direct microphone but far away in the recording room, is an intermediate case; and the telephone intercept stands alone: definitely do not use it to represent the microphones, or vice versa.
So that is an answer to the question, at least for the broad recording type: you cannot interchange a telephone intercept and a direct microphone, that is definitely a crucial difference. But within the direct microphones, the brand or type of microphone is really not that important; that is what these results seem to suggest. These results were not very surprising to me, but they show the type of research that is of great interest to us users of automatic speaker recognition, because it gives you a guideline for how to choose relevant population data, and basically a guideline for how to use these systems properly. Thank you.