the next presentation is given by two people from the NFI, both called david, and both in the same room, working on more or less the same problem but with different approaches
so we're going to talk about a database which may be relevant to forensic work
the basic paradigm in forensic speaker recognition, or speaker comparison you might say, has i think been put forward by one and then another speaker earlier, but i summarize it here in a formula, for people that like formulas
so basically what the judge or the jury wants, or should want, and we have to tell them that this is what they want, is posterior odds about the claim whether the defendant is guilty or not
and that can be factorized into two factors
the likelihood ratio, which is the first factor on the right-hand side, and the prior odds
and the idea is that the prior odds somehow have to be determined; we say it's the court's job to do that, but they will be influenced by lots of other things and circumstances, which might include other evidence, evidence which is not related to the speech
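(the formula being referred to is the standard bayesian factorization; as a sketch, writing $E$ for the speech evidence and $H_p$, $H_d$ for the prosecution and defence hypotheses:)

```latex
\frac{P(H_p \mid E)}{P(H_d \mid E)}
\;=\;
\underbrace{\frac{P(E \mid H_p)}{P(E \mid H_d)}}_{\text{likelihood ratio}}
\;\times\;
\underbrace{\frac{P(H_p)}{P(H_d)}}_{\text{prior odds}}
```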
so this is just to give you an idea of what the framework is
there is a connection to the stuff we do in nist evaluations, which most people are maybe more familiar with, and i've summarized it here
namely, in the forensic case you might say the judge or the jury would want to decide that the defendant is guilty if those posterior odds are higher than some reasonable-doubt threshold
whatever the reasonable doubt should be, it's related to the cost function you might say
and in nist it is quite similar, except that there we work with the likelihood ratio itself, which is just the ratio of the posterior and the prior odds
and that should be bigger than some threshold, and the threshold only depends on the cost function, or if you want to include the priors on the right-hand side that is also possible
and if you have well-calibrated likelihood ratios then your threshold will be ideal and you'll be at the point we know as minimum dcf
so that's the relation between likelihood ratios in forensic cases and likelihood ratios in the nist case
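(as a sketch, under the usual nist-style assumptions, with $C_{\text{miss}}$, $C_{\text{fa}}$ the costs and $P_{\text{tar}}$ the target prior, the decision rule and threshold look like this; evaluating well-calibrated likelihood ratios at this bayes threshold is what puts you at the minimum-dcf operating point:)

```latex
\text{decide ``same speaker'' if}\quad
LR \;>\; \theta
\;=\;
\frac{C_{\text{fa}}\,(1 - P_{\text{tar}})}{C_{\text{miss}}\,P_{\text{tar}}}
```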
but there is more to say about the circumstances, because everything is dependent on the information, and in a forensic case you have these weird samples that joe showed us, and you don't have to give an expression or an error rate estimate and validation criteria for a general case or an average over many comparisons; no, you have to do it for this particular case
so our approach would be: we need data which is similar to the case
so what we've been doing, our approach to dealing with a specific case, is to make a database with lots of annotation, so that we can more or less select a sub-database which is as similar to the case as we can get
and the database will also allow us to test whether those circumstances, say noise or language differences, actually matter, if we have all that information
so this is where we move on to the next speaker
and it's gonna be tricky for this to work
right
yes, so i'm david van der vloed, also known as david the second, or the first, depending on your perspective of course
i'll be talking to you about the database itself: how we created it, which metadata is included
so you get a sense of which restrictions there are; using real data, some of the metadata is just uncontrollable and some of it is
there you go
it's just a short overview
the thing to note here is that it's similar in set-up to other databases that use real data
so there was the database the NFI made, as you may know, some ten years ago, and perhaps there are others that i don't know of because they're all secretive, or that i just didn't find
but this one is six hundred speakers, so i hope it can be a contribution to the field
yes, what we want to do with it is mainly validation. we want to use automatic speaker recognition in casework; obviously we don't do that yet, and we need validation research, and i know there are people using asr in casework
and i feel, perhaps a bit conservatively, that i need realistic data for calibration
otherwise there will be no real improvement, because the improvement in using asr over, or next to, a human approach would to me be that you can actually measure reliability, and for that you really need realistic data
so it's not our own data; we're not really the formal owner, the owner is the prosecution, and they gave us permission to collect data from the police intercept data, and this has some restrictions. i'm sure the first question after this presentation will be a question regarding availability
i'm happy to cooperate, i think, but it's not entirely in our hands
and we only got permission under strict conditions, so we had to anonymize the data: we have listened through all the data and knocked out names and stuff
so
what did we do? we received a lot of data
it's all intercepted telephone conversations, so they are stereo, and generally it's true that one speaker is on one channel and the other speaker on the other channel
that's not always the case, because there's just so much data and a lot of cables there
we split the stereo files in half and we uploaded them into the database, and this is the raw material: some hundred thousand audio files made this way
and
we had some metadata to go with it, just some general things
and i made a little cartoon, because it's really realistic data, it has really been intercepted, to stress that point
which also means that a lot of the speakers in the database don't know they were recorded
which, as you can imagine, is a major point in the privacy and the permissions that we got to compile the database and to distribute it or not
okay
so it took some processing, which took about two years. we had the chance to hire people to knock out the personal information, like names and addresses
and to actually isolate speakers, which is the most important part of the job, because we just got a whole big pile of audio files
and they could use the telephone number to listen through the files, and then they had to decide for themselves: okay, this is john, i think, and this is john's uncle, and okay, i get to know the people revolving around a telephone number
and this is how a speaker id was created, just through listening, through the telephone number, through the content of the audio
and they added the metadata
so these people, native speakers we call them, isolated these speakers
and whenever there arose doubts, the recording was excluded
but still, this is not a hundred percent: there could be a brother somewhere using the phone of his twin brother, and then there is some confusion
so we like to call this ground truth by proxy; i mean, you can be quite sure, but never a hundred percent, about speaker id
and another thing: they chose at first the best ten recordings per speaker, later we lowered it to five, because we were concerned that the number of speakers would be too low in the end
and they were instructed to take as diverse recordings as possible: if you take five recordings all from the same day, talking to the same person, that's a little less interesting than three recordings of one type and two recordings with some whispering or a car or anything
and despite those aims of five and ten, perhaps it's my management capabilities, i don't know, it just varies a lot. the mode is still five, so most speakers have five recordings, but there's even one with one hundred and thirty-three recordings
which is kind of interesting in itself, but it just varies a lot
so this is the anonymization, which just means listening through it and assessing, deciding, what information could be traced back to a real person, and that is just knocked out; so there are whole stretches of samples there that are just knocked out, and that isn't labeled
so sometimes you just have to guess whether somebody didn't say anything or somebody said something that has been knocked out
and when in doubt, it was 'leave it out': so when there were doubts whether something was really personal information, just leave it out
okay, and these people added their own metadata. the single most important attribute is of course the speaker id
and then all these other things, but these are all perceptual metadata, that is, assigned on the basis of listening
so sometimes there are some subjective measures there, like the amount of noise: they could choose between none, a little and quite a lot
but of course there was more than one native speaker doing this job, so it depends a bit on the person what they find quite a lot of noise, and we tried to regulate it, but it's never perfect; it's one of the quirks of listening and then subjectively judging this kind of metadata
okay, this was the end of the job for them
and as post-processing we anonymized the metadata, of course
and the next step is something... i pretend here that the database is all finished, but actually i'm still working on this second step, which is to make a clean version, because for it to be comparable to forensic casework you will want to leave out all the background stuff, background speakers, music, like you would do with real case recordings. so i'm labeling all the parts where there are those kinds of background noises, so that we have a dirty version and a clean version of the same database in the end
here we go
so this is the database in numbers. as you can see there's a two-to-one ratio for male and female
we let the native speakers prioritize males in the database, because males are way more frequent in casework than females
i'm sorry to say
or perhaps not
and just some statistics
this is interesting: the dutch language landscape is not strictly monolingual, there are still quite sizeable minorities in holland, mainly of moroccan and turkish descent
which means we have some multilingual speakers, and they come in different flavours: there are speakers that speak a mix of turkish, for instance, and dutch in the same conversation, and there are speakers that will use dutch in some conversations and turkish in other conversations
so we have quite some possibility to do cross-language research with this, and the first experiment that will be presented is of that type
and there are also some english recordings, but don't get your hopes up: there are only six speakers, and most of them are not native; their english is more like the english i'm speaking now, with an accent
so, the number of recordings
up to one hundred and thirty-three recordings for the largest speaker, who is a big criminal in holland
of course i'm not allowed to tell you who it is
and
and i don't the in terms of trials same source trials and different source trials
but i must admit the different source trials or also cross gender and cross anything
so
the actual usable number of different source trials is probably a little lower
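(as an illustration of that point, a minimal sketch of how trial lists could be built from such a database, keeping only same-gender different-source trials; the field names `speaker_id`, `gender` and `file` are hypothetical, not the actual database schema:)

```python
from itertools import combinations

def make_trials(recordings):
    """Enumerate same-source and different-source trials from a list of
    recordings; each recording is a dict with hypothetical keys
    'file', 'speaker_id' and 'gender'.  Different-source trials are kept
    only when both sides have the same gender, since cross-gender trials
    are of little use for casework-oriented validation."""
    same_source, diff_source = [], []
    for a, b in combinations(recordings, 2):
        if a["speaker_id"] == b["speaker_id"]:
            same_source.append((a["file"], b["file"]))
        elif a["gender"] == b["gender"]:
            diff_source.append((a["file"], b["file"]))
    return same_source, diff_source
```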
this is the duration, to give you a sense of duration
the pink bars are the gross durations and the blue bars are after speech activity detection
and there are some unexplainable things in the pink distribution, for which i don't have an explanation
you can see that the minimum duration for telephone conversations to actually make it into the database is thirty seconds, because below that there's just a lot of call tones without an answer and other rubbish
and the maximum that i told the native speakers they could use was a conversation of ten minutes, because otherwise it would be too much work for just one recording
however, my management capabilities being what they are, there are still some recordings over six hundred seconds that should have been excluded but are still there
okay
well, like i said, we intend to use it for validation research, and that comes in two different types. there's general validation: just choosing which algorithm is best for us, for our casework, and which calibration method; i'm happy to see that niko is going to talk about different calibration types, which will hopefully be applicable soon
and there's also, and that's more relevant, case-specific validation. that's all the variation that mister campbell talked about; that's real, there is really no case without something special to it, perhaps not as extreme as the examples we heard, but there's always something, which to me means that you need case-specific validation. so for every case you will have to define which data is representative for my known sample and which data is representative for my reference sample
and this means we need data, basically
and this is why we did it
i hope that the database will reflect a lot of cases
of course this is only intercept data
and the real monkey business, with screaming and yelling and running, is probably not in there
so that will restrict which cases you can do
so there are two solutions to broaden the type of cases you can do: one, find more data, and two, wait for you guys to have made an algorithm and find out that some conditions don't matter any more
so you can then choose the evaluation data a little less strictly
of course i'm talking faster than the slides
so with the database i'm thinking of masses of trials, always trying to find a lot of same-source trials and different-source trials, so those two score distributions will be
i still have five minutes left and david still has a part, so
one last remark: please contact me about availability, but it will be hard, because it's not our own data and it's very sensitive, being six hundred intercepted speakers
but you'll find my email address on the presentation, so please contact me
okay
this was kind of expected, we're running a bit late
right, the selection
so, about exactly what's on the screen, can you see anything from there?
so we did an experiment: we split off, we take ten percent of the data of the whole database
so this is pretty preliminary; the idea is, can we do some experiments, speaker recognition experiments, and see which influences are important
this is some more motivation
i'll go on and tell what we did. so we looked at turkish speakers, who were either speaking turkish, or dutch with a turkish accent, or a mix of dutch and turkish
and here's a slide about a still-standing problem: the skewness of the available number of segments per speaker, and how to deal with that
and i had a nice paraphrasing joke about a well-known publication, which i paraphrase as: some speakers are more equal than other speakers. how to deal with the different amounts of trials?
george actually has an open solution, an old solution, which indicates that you make a det curve per speaker pair, basically
we implemented that; you can see the influence of that, and you can read more about it in the paper
i'll quickly go to the effect of, say, the speaker population for a commercial speaker recognition system. so here we use a commercial speaker recognition system that can do some calibration, and what you see here is that if you give the recognition system some additional material, and those were forty-five speakers outside the test database used here, you can go
used here you can go
problem very badly calibrated at the top line sc alarms above one it's just be
useless to the lower lying where you see that the c llr which is a
measure of calibration and discrimination as actually quite close to the minimum attainable c llr
so this shows that the system that we used in you can be about paper
more about the system
did work and the data curves are more or less same so this is really
the reference population only matter it's towards calibration
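(the cllr mentioned here is the usual log-likelihood-ratio cost; as a minimal sketch, assuming scores are already expressed as natural-log likelihood ratios, it could be computed like this:)

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost: approaches 0 for well-calibrated,
    well-discriminating LRs, is 1 for an uninformative system, and goes
    above 1 for badly calibrated LRs.  Inputs are natural-log LRs."""
    lr_tar = np.exp(np.asarray(target_llrs, dtype=float))
    lr_non = np.exp(np.asarray(nontarget_llrs, dtype=float))
    c_tar = np.mean(np.log2(1.0 + 1.0 / lr_tar))  # cost on same-source trials
    c_non = np.mean(np.log2(1.0 + lr_non))        # cost on different-source trials
    return 0.5 * (c_tar + c_non)
```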
but i was going to tell you a bit more, wait a minute
this is just an answer to one reviewer: well, what about the distribution? well, here is the distribution; it's also in the paper
this is my final slide already, so i'm working towards finishing in time
you see a number of figures showing different tests that you can do with the database. so out of this ten percent we took only the turkish speakers
we first looked at: what if train and test are both turkish, and you see several performance measures and some statistics about the number of trials
and the next thing you can do is: what if both train and test, or questioned sample and reference, or trace and reference you might say, are both dutch-speaking but with a turkish accent
and the numbers actually vary a bit: the equal error rate goes up, whether that's significant i don't know, there are not too many speakers in our subset
but it might be more interesting to look at the fourth line, where what is indicated is that we train with speakers talking turkish and test with speakers speaking dutch but with a turkish accent, or the other way around, so these two cases
and then the most interesting thing there, i think, is that nothing happens, or not so much happens, with the equal error rate: it stays more or less in the same ballpark, fifteen point eight percent, as in the first two lines. but the calibration suffers, although it doesn't really suffer that much; it is actually comparable to what happens for the turkish speakers speaking dutch
so from this data it looks like calibration is suffering from speakers speaking a second language
but the cross-language effect is modest in making things worse. i'd like to show you the figures so you can look at them yourselves
i think the general conclusion from the figures is that there is quite a lot of variability, so things depend a lot on how you set up the experiments; at least this data allows you to do those kinds of experiments
but i would like to conclude with the general idea of this work of collecting the database, which is that, first of all, i hope to have shown that this kind of data is necessary, both for answering questions like what are the error rates of the methods for this particular case, so under these specific conditions
but you could also use the data, and this is not shown in this work, to actually make a case-specific calibration: once you know which factors are influencing and which are not, then you can make a selection according to those factors and use that for case-specific calibration
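(as a minimal sketch of what such a selection plus calibration could look like, using a plain logistic-regression score-to-log-LR mapping; the metadata field names and this particular calibration method are illustrative assumptions, not the procedure used in this work:)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def case_specific_calibration(trials, case_conditions):
    """Select calibration trials whose metadata matches the case at hand,
    then fit an affine score-to-log-LR mapping on that subset.
    'trials' is a list of dicts with hypothetical keys 'score',
    'is_same_source', plus metadata fields such as 'language' or 'noise'."""
    subset = [t for t in trials
              if all(t.get(k) == v for k, v in case_conditions.items())]
    scores = np.array([[t["score"]] for t in subset])
    labels = np.array([int(t["is_same_source"]) for t in subset])
    model = LogisticRegression().fit(scores, labels)
    # the model outputs posterior log-odds; subtract the training prior
    # log-odds to turn them into calibrated log-likelihood ratios
    prior_log_odds = np.log(labels.mean() / (1.0 - labels.mean()))

    def score_to_log_lr(score):
        return model.decision_function([[score]])[0] - prior_log_odds

    return score_to_log_lr

# hypothetical usage:
# calibrate = case_specific_calibration(trials,
#                                       {"language": "Turkish", "noise": "a little"})
# log_lr = calibrate(case_score)
```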
and i have very briefly shown an experiment there with the language data
okay, that was it
thank you for the talk, to the two davids. i would like to ask you about the level of precision of your tagging in metadata
and the reason i'm asking is because in australia we have a very similar multilingual context, with waves of immigrants from certain countries when there are troubles; for example, whenever there's a war in lebanon we have waves of immigrants speaking english with an arabic accent
now firstly, that level of accentedness varies, as we all know
and secondly, after twenty-five years we have a second generation who are native speakers, who don't speak english with an accent but speak their own idiolect of english, often as native speakers. do you account for that kind of difference?
yes, we have not only a language field but also a nativeness field, so it will be annotated whether this is a native speaker, or quite a good non-native speaker, or a bad non-native speaker of the language spoken
and this idiolect, this sociolect to use a linguistic term, would count as native if it's second or third generation
definitely
was each speaker confined to one handset, or did you have speakers that went across handsets?
well, the majority of speakers, i only know it by telephone number, so the majority is from the same number, but there are also some speakers that are using one phone in one recording and, for instance, a landline in another recording
i don't have the exact numbers
in the experiments that i showed you, the non-target conditions are always the same as the target condition
and actually i would make the argument that it doesn't make any sense to differ there, because i see these conditions as sort of the conditioning in the likelihood ratio, so you shouldn't have the numerator with different conditions than the denominator; i don't believe that makes too much sense
i would like to make a comment, just a comment for david
my feeling is that what the calibration result shows is exactly the limit of speaker recognition as we know it: we are not able, with our system, to detect whether we need new data to redo the calibration
when we are wrong with calibration, it is just the limit of the robustness of the system
i'm not sure if i entirely understood the question, but i agree that we can test whether particular conditions, noise or whatever, make a difference or not; we should be able to do that with this data
but if there is a new condition, there can always be a new condition, and if we don't know whether it's of influence, we can't say whether we actually have matching data. i agree that that just demands more work
we should work on automatic detection of the case where we are in an unknown condition
if we are not able to describe a condition factor by factor, and to decide, and to give to the user the probability of being compliant with the training set, how can we choose our system in forensic conditions? because, as joe said, there is a huge amount, and we know there is a huge amount of conditions in forensics
so the way i would approach it is not by doing everything completely automatically, but actually having the forensic expert listen to the data, and that's no problem since it is a limited amount of data
and the forensic expert can say something sensible like: well, this is very much like what i've heard before, or: well, there is an enormous buzz, or there is something there that i haven't come across before. so i... we should not
david, i agree completely, we need a human, a human expert, for that, at least in the beginning, but we need to feed the human expert with information from the system
and at the same time, as long as we don't know exactly which information our system is using, we can't explain to the human expert how to define what is a known condition for the system or not; it's not enough. you have some very interesting labels here, but as michael was saying about the previous questions, we always have some question about language: what exactly is the definition of the language, of the idiolect, what is the definition of the conditions, of the distance, is it in the conditions?
and we should work with our automatic system to determine the sensitivity of our system to each of these factors
knowing that the human expert could use the human brain to give a probability
and i hope we will, we will do that, but
just briefly: the underlying assumption, i think, that most people are thinking about, is that when it's intercepted it's a telephone call, but i think in a lot of forensic applications you have a confidential informant wearing a body-worn microphone
and in those cases you have lots of issues with clothing covering the microphone and so on. could you just say whether the audio that you have is all telephone calls
or not
and do you have any plans to explore the body-worn type scenarios? because that is actually a real challenge as well
this is only telephone speech, and we are planning to expand our data collection to
yes, more extended mismatch, but in holland intercepted telephone speech is really the majority of the data, so we're covering quite a lot already
but for those kinds of circumstances, a bugged car or anything
we need data, that's true