Good morning, everybody. Well, my contribution here will be mostly focused on giving an overview of some guidelines that have been developed in the last two years concerning, directly or indirectly, automatic speaker recognition systems, or semiautomatic speaker recognition with human intervention, mainly in the feature extraction. The message is that, before doing something with speaker recognition in court, in Europe at least, we should read these guidelines, because they have been generated after a process of consensus among the community, so I think they are relevant. That's the message if you want to do something here. And if you're not from Europe, I think they at least deserve a read, to know what's going on in the European environment.
Well, the first one is the so-called ENFSI Guideline for Evaluative Reporting in Forensic Science, which most of you probably already know. It was released in two thousand fifteen; I'll talk about it later. The second one is... is this working? Something's wrong, I don't know what's going on. The second one is a guideline that we have developed in collaboration with the NFI, with consensus from many institutions, on validation of likelihood ratio methods for forensic evidence evaluation. And the third guideline is one that has been released... something's wrong with the computer... right, so much for the Windows system... okay... the third one is a set of recent methodological guidelines for best practice in forensic semiautomatic and automatic speaker recognition, also developed within ENFSI, the European Network of Forensic Science Institutes, in particular by the forensic speech analysis working group, which is what concerns us here. All three of them are available: the second one is already published in Forensic Science International, and the first and the third are in the ENFSI document repository.
Some critical recommendations of this first guideline are about expressing conclusions in court in general — not only in speaker recognition but in forensic science in general; they are recommendations for all forensic science fields. There are some recommendations in the guideline that I want to especially stress. The first one is that the expression of conclusions must be probabilistic. Moreover, it is recommended in the guideline to transform the probabilistic statement into the form of a likelihood ratio, in terms of formal equivalence. And what is absolutely stressed is that absolute statements should be avoided — identification, exclusion, categorical statements.
The second one is that when one defines the hypotheses in the case — same guy or different guy; this voice comes from this guy, or this speech segment comes from another person with these characteristics — one has to consider at least one alternative hypothesis. There can be many of them, but at least one. And a clear definition of the hypotheses is also mandatory, because that definition determines what data we have to handle in order to compute the weight of evidence.
Then the findings must be evaluated given each of the hypotheses, and that leads us to, somehow, a likelihood for each hypothesis; in the two-hypothesis case we arrive at a likelihood ratio. The guideline also says that conclusions must be expressed in terms of support for the hypotheses, instead of probability of the hypotheses. Putting it this way is quite an easy way to avoid some fallacies in reasoning, and it forces us to express the weight of the evidence in terms of a likelihood ratio rather than a posterior probability ratio. So "support" is an important word if you want to avoid this kind of fallacy.
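For reference, this is the standard formulation behind that wording — added here as a gloss, not a quote from the talk:

```latex
% Likelihood ratio and the odds form of Bayes' theorem
LR = \frac{p(E \mid H_1)}{p(E \mid H_2)}, \qquad
\underbrace{\frac{p(H_1 \mid E)}{p(H_2 \mid E)}}_{\text{posterior odds}}
  = LR \times \underbrace{\frac{p(H_1)}{p(H_2)}}_{\text{prior odds}}
```

The findings E "support" H1 over H2 when LR > 1; the prior and posterior odds belong to the court, which is exactly why the guideline reserves the word "support" for the findings.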
The last one is that data-driven approaches should be the final goal. But in the meantime there are many people who cannot move straight away to data-driven approaches, so the guideline also considers the use of subjective judgement, subjective probabilities and so on. But it is recommended that data-driven evaluation be the long-term goal.
There's also an example in speaker recognition. It is not an example of what speaker recognition should be — it generated some controversy within the ENFSI forensic speech analysis group, because there are many ways of doing speaker recognition. It is an example of an automatic case; it was generated by people who used automatic speaker recognition for doing this, but that's not exclusive. It's just an example of how to do this in a given particular scenario, with a given particular way of expressing conclusions about the speaker.
Well, the second one is the guideline on validation that we have been developing with people from the NFI. This guideline is aimed at recommending that everybody in forensic science who is using likelihood ratios move towards objective validation procedures, which is typically not the case in many forensic science fields. Here in speaker recognition we definitely use them — in this conference everybody uses an experimental environment to validate their methods. But there are two questions here. First, if you're not used to that, how do you do it: which performance measures are relevant, and how should those performance measures be interpreted? And the second one is: okay, I have validated my system with a performance measure, so how do I put that into play in order to make the technique able to go to court — some recommendations regarding laboratory accreditation, laboratory procedures and so on.
The guideline is quite concrete, but I'm not going to go into many details; the point is just determining whether an implemented likelihood ratio method is able to be used in court. And everything should be documented. We are also in the process of bringing this guideline into standardisation activities for forensic science and biometrics, and there are some people here collaborating on that, from universities and laboratories related to ISO.
We propose in a table some relevant performance characteristics. The table is not intended to be read here, but you can see Cllr, EER — things that we are used to here — so we contributed these into forensic science in general. But the performance measures are not limited to these ones; it's just a proposal, and the guideline is supposed to be open in that sense, so everybody can contribute more performance measures. These are the minimum requirements that we understand the validation process should contain regarding performance measures.
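As an illustration of the kind of performance measure in that table, here is a minimal sketch of Cllr computed from validation likelihood ratios; the LR values below are synthetic stand-ins, not data from any of the guidelines:

```python
# Minimal sketch of the log-likelihood-ratio cost (Cllr).
# Cllr is 0 for a perfect system and 1 for a system that always outputs
# LR = 1; it penalises both poor discrimination and miscalibration.
import numpy as np

def cllr(lr_same: np.ndarray, lr_diff: np.ndarray) -> float:
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / lr_same))
                  + np.mean(np.log2(1.0 + lr_diff)))

rng = np.random.default_rng(0)
# synthetic validation LRs for same-speaker and different-speaker trials
lr_ss = 10 ** rng.normal(1.0, 1.0, 500)
lr_ds = 10 ** rng.normal(-1.0, 1.0, 5000)
print(round(cllr(lr_ss, lr_ds), 3))
```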
There is also strong emphasis — and some of my colleagues will talk about it — on the use of relevant forensic data rather than laboratory data. It's okay to use a NIST evaluation, that's nice, but it should be followed by a critical look at the performance measured in forensic casework conditions, which is an extremely tricky issue, and my colleagues will talk about it later.
Finally, there is the ENFSI guideline for forensic semiautomatic and automatic speaker recognition, which was led by Andrzej Drygajlo within the forensic speech and audio analysis working group. This guideline is compatible with the ENFSI guideline for reporting, and also compatible with the validation guideline that we have been talking about. It also addresses many other issues: the most used technologies and methods, which state-of-the-art methods are reliable, the most used features — the features typically used here and in different approaches, and which are more reliable — audio preprocessing, and what the role of the human being in the process is. So it's a best-practice guideline, and it has been developed within the forensic speech working group; many of us here have been contributing to it, so it's a guideline that presents a high degree of agreement today. Okay, that was my contribution.
Thank you, Daniel. We have one minute for a small question, if there is one; in any case we'll have more time after the last talk. Any question for Daniel about the guidelines? No? Then we continue. Yes — Jonas is going to continue with his presentation. Okay... how do you go full screen?
Good morning, everybody. I'm Jonas Lindh, from Sweden. I work for a company, and also at the University of Gothenburg currently. I'm going to talk a little bit about accreditation for a small forensic speaker comparison business, which is what we are. As a company we have performed casework for around eleven years, in Sweden, Norway and the US — approximately fourteen cases, almost all of them Swedish cases. There are three people employed, almost all part time, all employed by the university as well, and we are the subcontractor of the Swedish National Forensic Centre, so basically we handle more or less all the cases in Sweden. A small area.
I'll just give you a short overview: quickly talk about the applied methods and mention them, then talk about the evaluations for accreditation, where Daniel's material comes in, very briefly show what a forensic conclusion in Sweden looks like, and then put up quite a few questions. So, before explaining the three parts: there is of course a screening process first. NFC screening means — and that's developed over the years, of course — that these days basically around fifty percent of the cases are screened out, and that happens at NFC. Before, a lot more of the screening was done in-house by us, but now NFC does it, among other reasons because it's cheaper for them. And then there's always a second screening done at our place as well, and we always keep that option open, so that we can turn down samples during the analysis too, even if we have taken on the analysis job.
The first part of the analysis is the linguistic-phonetic perceptual analysis. These days, in some cases, it can also begin with blind listening, depending on how many people are involved. In the linguistic part you go through different steps of perceptual evaluation. We try to keep it in some kind of Bayesian manner — so how do we treat cognitive bias as well? To keep it very brief: you go through it once and you bias yourself towards the one hypothesis, and then you go through it again and you bias yourself towards the other hypothesis. Two people always do this, and in most cases a third person does a blind test. The three of us then more or less together arrive at a level for the case. It is also a matter of cost, and of how much work you can put into a case.
The second part is the acoustic measurements that we still do as part of the standard protocol. One is articulation rate, basically syllables per second; then fundamental frequency measures, a few of them, graphed; and then the long-term formant analysis, which nowadays is handled more or less automatically and is also put into an i-vector system.
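As a small illustration of the F0 part of that protocol — a sketch only, assuming the praat-parselmouth library, with a placeholder file name rather than anything from their casework:

```python
# Sketch: F0 summary measures of the kind listed above, via Praat's
# pitch tracker exposed through the praat-parselmouth library.
import numpy as np
import parselmouth

snd = parselmouth.Sound("case_recording.wav")  # placeholder file name
pitch = snd.to_pitch()                         # Praat's default pitch analysis
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                # drop unvoiced frames (F0 = 0)

print(f"mean F0   : {np.mean(f0):.1f} Hz")
print(f"median F0 : {np.median(f0):.1f} Hz")
print(f"F0 std    : {np.std(f0):.1f} Hz")
```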
And the third part is the automatic systems. Currently there are two systems active, we're evaluating one system, and researching one — four systems altogether.
Guidelines, when it comes to the evaluation for accreditation: we've been fiddling around in the dark, basically, not knowing what to do exactly, and we very much appreciate the work that's been done by Daniel and the others — maybe especially since we're on a tight schedule; our next deadline for accreditation is very close. When I was at a meeting five months ago with Daniel, Didier and Rudolf about this work, it became clear that the guidelines are really important for us, for how to treat the validation of automatic systems. They don't solve everything, of course — you can discuss that at length — but at least there are some guidelines now that we can follow, and we know what to do, basically, for the accreditation at least, and then people can keep discussing.
These are just examples of some of the plots that are suggested in the guidelines, so that it all looks fine, and you can get the figures for each of those plots. And these are some examples of the problems you can start running into; this is from Rudolf's work at the NFI. He created heat maps of the results — in this case it's LR means, but also equal error rate and so on — for different tests done on a huge telephone database: more or less seeing what happens when there is more than one training sample, and when the test sample gets shorter and shorter; what happens in the evaluation process.
But if you consider all those plots and all those figures you would need in an accreditation process, you realise it's going to be quite many pages. And consider how many validations that is: very quickly, during these eleven years we've done over a hundred evaluations, and if you consider all the different conditions — different durations, microphone, distant microphone, mobile recordings, with and without face covering, inside and outside a car, indoor, outdoor, different languages, different compression — we've done more or less all of those, with different datasets and some simulations. You can imagine what a large document that would be, documenting all those evaluations for the accreditation process.
The perceptual phonetic analysis also has to be evaluated. This has been a difficulty for us, because the two of us have been doing it for a long time and we know each other pretty well, to some extent at least, so we've been trying to evaluate each other back and forth over the years. Now we have a third person, and she goes through basically training and then testing — because even if you have a PhD, in speech pathology in her case, and you have a great ear, you still have to evaluate everything, and you're not really used to doing forensic analysis on telephone material. She had to go through a training phase, a testing phase, and then blind evaluations. We started that at a small scale, of course, because it's extremely time consuming: the last one, almost twenty-three speakers, took her some three days to perform the analyses.
Just quickly showing you what the National Forensic Centre verbal scale looks like: a nine-point ordinal scale of conclusions, with two hypotheses, from level plus four to level minus four with zero in the middle. Level plus four is something like: the results are extremely much more probable given the main hypothesis compared to the alternative. A minus two is: the results are more probable given the alternative hypothesis compared to the main one. Behind each level there is a standard likelihood ratio.
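To make the idea concrete, here is a purely illustrative mapping from a log10 likelihood ratio to a nine-point ordinal level; the thresholds are invented for the example and are NOT the NFC's actual values, which the talk does not give:

```python
# Hypothetical mapping from log10 LR to an NFC-style nine-point ordinal scale.
import bisect

BOUNDS = [-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0]  # invented cut points
LEVELS = [-4, -3, -2, -1, 0, 1, 2, 3, 4]                # ordinal conclusion levels

def ordinal_level(log10_lr: float) -> int:
    # bisect_right counts how many cut points lie at or below the value,
    # which indexes the band the value falls into
    return LEVELS[bisect.bisect_right(BOUNDS, log10_lr)]

print(ordinal_level(2.5))   # -> 2 under these invented thresholds
print(ordinal_level(-0.3))  # -> 0, the neutral middle of the scale
```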
Important to remember: even if you do all these evaluations and you put this probably thousand-page document in for accreditation, every case is unique. How much can you actually infer from all the evaluations you've done to each and every case? That's not easy at all, even when it looks really good in the evaluation. So there is a lot of stuff to think about still: even when you go through the accreditation process and you get the stamp of approval, it's not like the evaluation stops. And just a general point to put out there as well: we need to have a transparent report. We still don't quite know what that means; it's something we need to discuss much more. And who has to be able to understand this report? Is it the jury or the judge, or actually another expert? Probably another expert, which is basically how it works. I think that's it — pretty quick. Thank you.
Excellent — I think we have time for a couple of questions. Niko?

Why did you show only the plus and minus two levels of the scale?

It's just two examples, because if I'd put all of them up, the slide would look crazy. So I just took plus and minus two to give you an example; I could have taken the minus four.
I suppose this is probably more of a comment — we'll get to it later — but based on the preview so far, with the first two talks: one concern I have — nothing wrong with the guidelines — is that the big issue of course is all the data, right? There's a lot of talk about guidelines and accreditation, and everybody keeps saying the data is the problem, but it keeps kind of getting pushed off to the side. With guidelines and accreditation now, it's going to look like it's all more official, and that's disconcerting when the data issue is not really addressed. It's not really a question — more a discussion point about how we're actually ever going to get our hands around the data issue. Either of you can answer.
Well, what I can tell you is that there is a lot of data, but of course we can't cover all these conditions with that amount of data. To me it's also about the sensitivity of the data: I can tell you there's a lot of data, but I can't really tell you how it was collected or what data it is, because it's all kept behind secrecy, to too large an extent. That's also a problem, especially in Sweden, when you publish things: a lot of the evaluations we've done over the years we couldn't publish. I hope that is actually going to change now, but it's a huge problem. If I publish something, I have to be able to give the data to another researcher if he asks for it, or at least to his institution, so they can actually use the data — for the falsifiability thing. If you can't do that, you can't really publish anything, and that's been a difficulty. But now we can probably do it anyway, because the organization has changed — we'll see.
Thank you. Let's go to our next speaker.

I'm going to talk a little bit about some aspects of our work at the BKA. We have been doing speaker recognition since the seventies. In the early days it was done automatically, but the technology wasn't really ready, so the method used from the eighties onwards was the auditory and acoustic-phonetic method; since about two thousand five we have used both the auditory-acoustic-phonetic method and, combined with that, automatic speaker recognition.
Just a few slides. You heard from Daniel about these guidelines for semiautomatic and automatic speaker recognition. Just repeating two of the aspects: one is that the outcome of an automatic or semiautomatic method is the likelihood ratio — it's all about systems that output likelihood ratios. Another important aspect, which Daniel mentioned as well, is that validation of a likelihood ratio method has to be performed with speech samples that are typical and representative of the speech material forensic practitioners are confronted with in their everyday work — so it has to be forensically relevant information.
These guidelines are accessible; here on the conference website you might have noticed there is a link, and it gets you to the ENFSI website, where those documents are available.
since we have you those guidelines are we have to sort of a
practice what we preach so we have two
get busy
collecting the forensic data forensically relevant data and we've been
starting doing this a while ago one of those activities have been published and the
odyssey two thousand twelve
in our activity and ongoing
and another q this is our collaboration with the end of high
on
they have a
not really is they have good
compiles vienna five fruits corpus that was document and all those in two thousand fourteen
and we have a special license to work with them was off work to look
at this going many restrictions and so forth
In terms of what kind of data we have, the best coverage is for matched conditions involving telephone intercept data. What's more difficult is other conditions, especially mismatched conditions. One type of condition we frequently have is comparing terrorist videos — people making announcements to the public, disguising their faces, encouraging people to come to their training camps and so on — against telephone intercept recordings: these guys call home, and the telephone session is lawfully intercepted. So this is a mismatch in terms of technology, but also in speech style: someone reading or reciting something prepared, making an announcement, is different from a natural telephone conversation. We do have some of this material, but it's more difficult to collect that kind of data.
Another challenge is language: we have casework in several languages and we want to cover them. We do collect data from different languages, but there is a limit to what we can do, so as a parallel strategy we also investigate the effects — both the size and the type of the effect — when there is mismatch in terms of the data we have. One type of situation: we have a testing corpus, but we don't have the right reference population for it, so we have to use a reference population from another language. What is the effect, in terms of shifting the likelihood ratios, if we use the incorrect reference population? It turns out not to be a big effect, so these kinds of effects are to some extent predictable. That's what we also try to capture, because language is a big issue: we can't just work with one language; there are several languages we want to cover.
Those were the more practical problems; now to the more conceptual problems and issues. One is combining different kinds of evidence. There is quantifiable evidence — likelihood ratios coming from automatic or semiautomatic systems; that's what the guideline is about. But there's also qualitative evidence, coming from the auditory-phonetic and acoustic-phonetic method, and we use both kinds of evidence. Some practitioners only work with quantifiable evidence; others work with both kinds, and the question is how to combine the two. Since not everything is quantifiable, if we use both methods we eventually have to make strength-of-evidence statements that are not entirely quantitative. In the end, one component is qualitative, and then the overall result has to be qualitative too, because you cannot calculate your way through it; there is some qualitative aspect. So that's a standing problem — not unsolvable, of course — for all those institutions that use both likelihood-ratio-producing methods
and qualitative methods. The other one — probably the most painful problem — is this one here: the interfacing with the courts. You can do your audio work well or not so well, but in many cases we have to go to court and interface with people from the court, and they have a different mindset and different expectations. The situation that we have in Germany is that the courts still expect posterior statements. They expect things like what you see in this table: identity or non-identity cannot be assessed, or is probable, or highly probable, or can be assumed with near certainty — this is the sort of thing they are used to, and still expect. Of course there are discussions about this, but there is a sort of psychological inertia against switching to a Bayesian framework.
The ideal of the Bayesian framework is this: the speech expert supplies likelihood ratios, other forensic experts do the same, and then the court applies the priors and calculates the posteriors from the priors and all the likelihood ratios coming from the experts. That would be the ideal scenario.
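Numerically, the ideal scenario he describes is just this — a toy illustration with invented numbers, where multiplying the experts' likelihood ratios assumes the pieces of evidence are independent given the hypotheses:

```python
# Toy illustration: the court combines its prior odds with the experts' LRs.
import math

prior_odds = 1 / 1000          # court's prior odds (hypothetical)
expert_lrs = [200.0, 15.0]     # e.g. voice evidence plus another trace (invented)

posterior_odds = prior_odds * math.prod(expert_lrs)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior odds = {posterior_odds:.2f}, "
      f"posterior probability = {posterior_prob:.2%}")
```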
There's still resistance against implementing it. In the Netherlands and in Sweden you are perhaps further along than we are in Germany — I don't know if you can say exactly what the state of play is on that — but this is a topic for discussion. These are just the interfaces, and these are the expectations coming from the court about the sort of things they want, and so forth. That's basically it — that's my system model.
Good, thank you very much. We have time for a couple of questions.

Could you just say something about how you actually, at the moment, go about combining the quantitative and the qualitative data? Is there some explicit statement about how you do that, and how you integrate any kind of relationship between those two types of evidence?

What we tend to do with the automatic system, for example — this is a plot coming from the guidelines — is that we have...
The resistance against the Bayesian paradigm — could vocabulary contribute to that at all? The German words for things like likelihood ratio and prior odds?

I think even using German you have to explain the concepts and everything, so no, I don't think it's a language issue; it's more a psychological resistance process in the courts.
The reason I'm asking is that in my home language, Afrikaans, we don't really have the words for this. We have a word for probability, but the distinction between likelihood and probability is hard to make — there isn't a separate word — so "likelihood ratio" is difficult to express, and I've got no idea how to say "a posteriori probability". I don't know how to express it, so for us the barrier is the vocabulary.
My comment: there are two words that have come up again and again which contribute, at least partially, to this problem of interfacing with the legal profession. One of them is "support", and the other one is the use of "speaker recognition". If you keep on talking about speaker recognition, it's not surprising that the courts think you're doing speaker recognition. But this isn't speaker recognition: you're giving them a likelihood ratio, and speaker recognition comes with a posterior. It's okay for us — we understand the distinction — but for the legal profession, if you keep on talking about forensic speaker recognition, then it's not surprising that they want the answer in that sense.

And secondly — one of the things that really gets my back up is this word "support", as in "the likelihood ratio supports the hypothesis". It doesn't. The meaning of the likelihood ratio is in how the hypothesis emerges with the posterior, when you take the two parts into account. It can be reversed: a likelihood ratio of a thousand can be outweighed by the prior. By itself it has no meaning. That's the problem I see with talking about the likelihood ratio as support for the prosecution hypothesis or the defence hypothesis — the trouble is with the "support" language. I know that's what people use; I think it's a very bad choice.
When you say "support", you're trying to say something about the posterior in the absence of the prior. There are plenty of other words, but this one seems to have established itself as the standard expression.

We can discuss it later, but I think the guideline somehow implicitly states that it is not a consideration of all the information, and not support for a previous opinion: you use it in the sense that the findings are more likely given one hypothesis than given the other.

I understand the sentiment about the whole thing, but if you say "my likelihood ratio gives support to the prosecution hypothesis over the defence hypothesis"... I mean, how could it happen that this is the wording that's been adopted? I understand the problem.

What I would like to stress is that it is not the likelihood ratio that supports: it is the findings that support one hypothesis over the other, and the weight of evidence is what quantifies that. The findings, not the likelihood ratio.
Okay, so next we have Anil.
having
good morning
the title for like till today
he's opening the black box
for forensic automatic speaker recognition and this talk was
a prepared by financially and myself
we're from also wave research
which is e audio not speech rd company based out of oxford and are all
experiences feel is that we develop systems for automatic speaker recognition speaker diarization and audio
fingerprinting
and we've been what in this field
for quite awhile a products all used by law enforcement u k and other agencies
in the u k u is your the middle east
and include them at least you came only the n if i and seventy k
the
topic i'd like to dress
coast with some of the common set of in that come up already
and
it is the fact that
automatic speaker recognition
ease eight black box and this is a comment that what about colleagues
one of our conferences set and it stuck with me
and
i think a lot of this work needs to be attracted to address the fact
that automatic speaker recognition methodology is a black box
well the last few days we being treated to a variety of new algorithms you
techniques in might have i mean variations and modifications of different algorithms
it isn't
any surprise
that these mathematically complex methods
all black box
to the laypeople the juries judges and voice
to a certain extent even to the forensic experts
where using these
Now, as we've seen, recent advances have come from systems with a large number of variables — and there was Doug's comment earlier about it all being about the data: the training and evaluation data, the feature modelling and the parameter choices. In an evaluation you have fifteen systems with variations of all orders, where the arguments have been placed one way or the other and the parameters have been tuned and tested, and the focus has been on getting incremental improvements on these large databases. And I'd like to note that the variability in these databases has been designed, or controlled. Now, how does this sit within the context of opening up this black box? If you've got real forensic casework, with recordings of unknown provenance, how do you use these systems, and how do you address the courts?
Let's look at the ENFSI guidelines for some support. The ENFSI guidelines talk about any expert method addressing balance, transparency, robustness and logic. Some of these we have already addressed, so I won't go into all of them. The things that stick out: balance, for example — that you have competing hypotheses or propositions, and the evidence is considered with respect to these hypotheses and propositions, given of course the prior background. Then there is logic, and the fact that you don't want to transpose the conditional — evaluating the hypothesis when you should be evaluating the evidence instead. And robustness — which is slightly different from the sort of speaker-engineering robustness we usually talk about: here it is how well the method holds up to scrutiny, how well it would weather cross-examination — whether the actual techniques used would survive that probing. And, quite importantly — and this is something you don't get in any black box — transparency: how well would the forensic expert be able to explain the methods, and explain the data that goes into the system that they're using?
Now let's take a very simple, straightforward example — it's expected at Odyssey that you use "i-vector" in every other sentence — so, a straightforward automatic pipeline. You're training the UBM: you've got a whole lot of data that you can put into training the UBM. You choose another set of data for training the total variability space. And then, if you're using LDA and PLDA, you can use yet another set of speakers — and that, I know, is used a lot as well. And this is all before you get to testing, training and validation, equal error rates and so on. So before you've even got started, you've got data decisions — multiple data decisions — about the UBM training, about the TV matrix, about the LDA and the PLDA. And this is before considering things like what the relevant population is, the likelihood ratio method and so on that is embedded within the system.
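As a sketch of how many separate data decisions that is, here is the shape of the thing in code — the dataset names are hypothetical placeholders, not anything from a real system:

```python
# Illustrative only: the data decisions embedded in a typical i-vector
# pipeline, as discussed above.
from dataclasses import dataclass

@dataclass
class IVectorDataDecisions:
    ubm_data: str          # speech used to train the UBM
    tv_matrix_data: str    # speech used to train the total-variability (T) matrix
    lda_plda_data: str     # labelled speakers used for LDA / PLDA
    calibration_data: str  # scores used for the score-to-LR mapping

pipeline = IVectorDataDecisions(
    ubm_data="generic_telephone_corpus",
    tv_matrix_data="generic_telephone_corpus",
    lda_plda_data="labelled_multi_session_speakers",
    calibration_data="case_relevant_same_and_different_speaker_pairs",
)
# Each choice is made before any validation is run, and each can shift the
# resulting likelihood ratios; transparency means documenting all of them.
print(pipeline)
```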
And going back to Doug's comment about data: the systems that are developed with these kinds of background data have to be explicit about the effects on the likelihood ratio — at the very least there needs to be transparency about those effects and about how these systems are calibrated. That's one part of the problem — the automatic black box, if you will — that somebody could help with.
Now, in the UK, most of the forensic speaker recognition casework is performed by forensic phoneticians, and they have a lot of experience and knowledge: they understand the material and the language, they understand the idiosyncrasies of the speech, and they understand the legal requirements of their jurisdiction. They want to include these automatic methods — but are the automatic systems giving them what they need? And how do you then connect this automatic score that you've got with the knowledge that you have — the fact that this speaker says something that is very particular to a region or a place? How do you put these things together? And assuming you even wanted to make your analysis more objective, using likelihood ratios and evaluating your system performance beforehand, how do you continue to do this?

What has generally happened is that two camps were pitted against each other, sort of: you had the traditional forensic-phonetics-based approach, looking at formants and voice quality and linguistic characteristics, and then you had the automatic space, which looked at the spectrum and treated it as a signal processing problem. And they were set against each other; sometimes we don't even sit together at conferences. So what needs to happen is to come to this common logical platform that is beginning to be accepted, which is the Bayesian likelihood ratio framework — and it's nice, because you can have these multiple methods and approaches, and they can pull together in the same direction.
I've been working on this problem for quite some years, together with a lot of colleagues who work in forensic casework, and I really think the black box issue is quite an important problem. It creates a situation where the forensic expert has, say, four systems that somebody else has elaborately calibrated, and the expert isn't able to look inside the automatic system at all. Coming back to the earlier point about every case being unique: the expert should be able to set the system parameters and to use new data at every step of the speaker recognition process. And in some sense this doesn't just go for commercial systems: the expert should not be limited to prepackaged, preprocessed, manufacturer-provided models and configurations; they should be able to train the system specifically for the problem domain.
It was in this context that we looked, at one point — and this is by no means the only way of doing things — at putting together an automatic system that was built with an open-box architecture, if you will. It gives you flexibility in the features that you put in: you could use automatic spectral features like MFCCs, but, importantly, you could also use traditional forensic parameters like formants, and — debatable, but possible — you can use user-provided features, while still keeping the strength of the mathematical modelling techniques like i-vector PLDA, GMM and GMM-UBM, and using those within the context of these flexible features. In doing this, you are able to introduce data at all stages of the i-vector pipeline or the GMM-UBM pipeline and, to a certain extent, tailor the system to the conditions of the case.

Now, the last thing: does this make the big black box transparent? No, it doesn't — it is as complicated as it is. What it tries to do is open it up — what goes into it, and what data goes into it — and it allows for validation that is more meaningful in the context of the case.
Thanks, Anil. There is time for only one quick question, because there is another speaker after this. Anyone? Very quick, then — and then the question itself.
I'm biased here, so I'm sorry for that, but this is a very interesting topic, the black box thing. My opinion is of course that transparency is needed, yes, because when a forensic expert goes to court, if anything happens he needs to understand what's going on — what kind of algorithms they're using, and whether that is suited to the specific case. That's obvious; that's the main thing in forensics: every case is different, and you need some flexibility. But be careful with that, because if you create a system where you can tune everything, then you make unsolvable the problem we talked about before. If you want a system that is validated, and at the same time you can change everything every time, you're going to have a problem, because then you're going to need to validate the system for every single case. That, for me, creates a big problem. And there is a problem with the data as well, because you need to change the data, and sometimes it's not easy to change data in the form of audio files and so on. If every single case needs different parameters and different setups, it also makes it more difficult to standardise. So I think we need to find a place where we balance both things: transparency and openness of the system, but also some limits on the case-specific adjustments to the system, just to keep the validation of the system feasible.
Okay, thank you, Anil. In any case we can continue this interesting discussion afterwards; our next speaker, Geoff, actually addresses some of these points, so you can also continue with him.
Okay, I'm going to tell you about — let me introduce to you — a multi-laboratory evaluation of forensic voice comparison that is being organised by myself and my former PhD student, Ewald Enzinger. I think we've already talked about the need for evaluation of forensic evidence — this goes across all branches of forensic science — and there have been calls since the nineteen-sixties for forensic voice comparison to be evaluated under realistic casework conditions, but, judging by what everybody here has said, this still goes widely unheeded. So our contribution is to run this forensic evaluation, which we're calling forensic_eval_01. It's designed to be open to operational forensic laboratories — we especially want them to take part — and it's also open to research groups. We're providing training and testing data representing the conditions of one forensic case: the data are based on a relevant population for the case, on the speaking styles in this particular case, and on the particular recording conditions of the case. And we're going to have the papers reporting on the evaluation of each system published in a virtual special issue of Speech Communication.
the
information if you wanna get information that still that's already available you can find it
by going to my website
and you can get started if you wanna start
so there's an introductory paper which is already available dropped of at least is already
available and it includes a description of the data and it includes the rules for
the evaluation
each paper that's evaluating system needs to describe the system in sufficient detail that it
could potentially be replicated
and we're thinking about the level of it could be replicated by forensic practitioners who
have the requisite skills and knowledge and facilities
we're not prototypes deadline on this people working in operational forensic laboratories are very busy
there
their priorities to actually do case work so where giving a two year time period
within which people can evaluate systems and submit
A disclaimer: casework conditions vary substantially from case to case. Basically, I'm of the opinion that at this stage you essentially do have to evaluate your system on a case-by-case basis, because the conditions are so variable from case to case. What that means is that, whatever results one gets out of taking part in this evaluation, one should not assume that they are generalisable to other cases — unless one can make a case that yes, this other case is very similar to the conditions in the forensic_eval_01 case.
A little bit about the data. It's based on a real case. The offender recording is of a telephone call made to a financial institution's call centre — this picture is just something I stole off the internet. It's a landline recording at the call centre, it has babble and typing background noise, and it's saved in a compressed format, because of course they want to reduce the amount of storage they need. It's forty-six seconds long, and it is clearly an adult male Australian-English speaker. The suspect recording — we should be able to get a nice high-quality suspect recording, yes? No. Do I have a pointer? Right: this is the actual room where the suspect recording was made. As you can see, it has nice hard walls, and I think the person taking the picture is in the opposite corner of the room — imagine what the reverberation is like. And you see this here? That's a fan, and the microphone is in this box. So there are problems with the suspect recording as well, but that's pretty typical of the sorts of problems that we experience in real forensic casework.
The data we're providing come from a database we collected — the whole database is actually available, but this set is extracted from it. We've got male Australian-English speakers; we have multiple non-contemporaneous recordings of each speaker, and multiple speaking tasks per recording session. We've got high-quality audio: we actually had to record speakers from the relevant population, and we had to record the relevant speaking styles, but then what we've done is taken the high-quality audio and simulated the technical recording conditions that I just mentioned — and there are pretty pictures describing the simulated transmission conditions. We have training data from a hundred and five speakers. If you're used to the NIST SREs, that sounds ridiculously low, but availability of relevant data is a major problem in forensic voice comparison, and that's actually quite a lot of data compared to what people can usually manage to get. And the test data come from a total of sixty-one speakers.
I think I have time to show you some preliminary results based on the data from forensic_eval_01. These are results from work that Ewald and I did ourselves — so this is not part of the special issue in Speech Communication; it's something that we did previously, which has already been submitted, but it's on almost exactly the same data. In this example we're looking at an i-vector system: MFCCs, UBM, T matrix, LDA, PLDA, and then a score-to-likelihood-ratio conversion at the end using logistic regression. And we trained two different versions of this system. One uses generic data: the first training stage does not use the data I just talked about — it uses a whole bunch of NIST SRE data, about an order of magnitude more speakers and two orders of magnitude more recordings. We used the generic data for everything up to the score — training all the models that get to the score — and then we used the case-specific data for training the model that goes from the score to the likelihood ratio, the logistic regression model at the end. That's a fairly typical way of doing things, because you do all the hard training up front. Then we built another version where we used case-specific data all the way through: we trained the models that get to the scores using case-specific data, and then we trained the score-to-likelihood-ratio model using case-specific data.
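A minimal sketch of that final score-to-likelihood-ratio step, using logistic regression as described; the training scores below are synthetic, standing in for the i-vector/PLDA comparison scores a real system would supply:

```python
# Sketch: logistic-regression calibration of scores to log10 likelihood ratios.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
ss = rng.normal(2.0, 1.0, 200)    # same-speaker training scores (synthetic)
ds = rng.normal(-1.0, 1.0, 2000)  # different-speaker training scores (synthetic)

X = np.concatenate([ss, ds]).reshape(-1, 1)
y = np.concatenate([np.ones_like(ss), np.zeros_like(ds)])
cal = LogisticRegression().fit(X, y)

def log10_lr(score: float) -> float:
    # decision_function gives log posterior odds under the training
    # proportions; subtracting the training log prior odds yields a log LR
    log_post_odds = cal.decision_function([[score]])[0]
    log_prior_odds = np.log(len(ss) / len(ds))
    return (log_post_odds - log_prior_odds) / np.log(10)

print(round(log10_lr(1.5), 2))
```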
Here are some results in terms of Cllr, which, if you don't know it, is a measure of the accuracy of likelihood ratios. The version using case-specific data all the way through performed much better than the one using generic data to get to the score and then case-specific data for the score-to-likelihood-ratio conversion. And if you like Tippett plots, here are Tippett plots: there's the generic-data system, and there's the case-specific one — and if you understand Tippett plots, that's a huge difference.
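For readers who don't know Tippett plots, a rough sketch of how one is drawn — cumulative proportions of same-speaker and different-speaker log10 LRs; the LR arrays here are synthetic, not the forensic_eval_01 results, and conventions for the two curves vary slightly between authors:

```python
# Sketch of a Tippett plot from validation log10 likelihood ratios.
import numpy as np
import matplotlib.pyplot as plt

def tippett(ss_llr, ds_llr):
    x = np.linspace(min(ss_llr.min(), ds_llr.min()),
                    max(ss_llr.max(), ds_llr.max()), 500)
    # proportion of same-speaker LLRs at or below each threshold,
    # and of different-speaker LLRs at or above it
    p_ss = [(ss_llr <= t).mean() for t in x]
    p_ds = [(ds_llr >= t).mean() for t in x]
    plt.plot(x, p_ss, label="same speaker")
    plt.plot(x, p_ds, label="different speaker")
    plt.axvline(0.0, linestyle=":")
    plt.xlabel("log10 likelihood ratio")
    plt.ylabel("cumulative proportion")
    plt.legend()
    plt.show()

rng = np.random.default_rng(1)
tippett(rng.normal(1.5, 1.0, 300), rng.normal(-1.5, 1.0, 300))
```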
David van der Vloed has already been mentioned — he's doing very well in this presentation for not having been here. He has already started doing the evaluation, and we've got some results from him which he has kindly allowed us to show here. He was testing BatVox with different user options. In BatVox, one user option is the reference population: you can put in data from all hundred and five speakers, or you can let BatVox select a subset of thirty. And he tried using no impostor data, and using impostor data from all hundred and five training speakers. Here are the results summarised: if you use data from all the speakers, instead of letting BatVox select a subset, you get better performance; and if you use impostors versus no impostors, using impostors gives you better performance — so the combination of the two gets you the best performance. And if you like Tippett plots, there's a Tippett plot: one thing that's clear to notice is that when you only use the thirty speakers selected by BatVox there's a clear bias here; there may be a bias there as well, but it's less clear.
Okay, that's it — thank you. So we have just time for one question before we move into the final phase of open questions on all the presentations. Remember, the session ends at nine forty-five, so there's less than ten minutes. If we could begin with some questions for Geoff, that would be great.
If the data was totally appropriate, would it be viable to do a comparison of the two systems that you put up, based on your evaluation?

I was prepared for that question. Here's the comparison of the best of each: the red one is the best of the BatVox configurations, and the blue one is the best of the i-vector systems we built. The blue one is better in terms of Cllr, and there's the difference in terms of the Tippett plots as well.
And I think, going back to the versions of our systems, the big difference is where we used case-relevant data all the way through, versus using a lot of generic data to get to the score level. I think BatVox works better than our system that used generic data at the beginning, but I think the one we built works better than BatVox because we used case-relevant data all the way through.
What's the difference in the likelihood ratios for the data? That's the crucial thing.

Sorry?

What was the outcome? You've compared two systems, but I would like to know the difference in the likelihood ratios that the systems gave you — the actual comparison, for the actual case.

Ah — we haven't tested that. When we did the actual case, we chose one system and we used that one system. For doing the casework we chose one system, we validated the performance of that one system, and we didn't go out and try a whole bunch of other systems on the actual case — because doing casework is not a research activity; we're not trying to choose the best one. And also, the problem that comes up is: say we had chosen three or four different systems and then picked the one that worked the best — we would then be overtraining. We'd be overtraining on the test set: we'd have optimised to the test set rather than to the previously unseen actual suspect and offender recordings. And then there's also the problem of: well, okay, you've presented three different systems — which one should we believe?
Precisely — that's what I would ask as the defence counsel, yes? Not that I would have expected you to have done it, but suppose one of the systems gives you a log LR of minus five and the other one gives you a log LR of twenty.

Right. So what we do in our practice is: we pick the system we're going to use, we optimise it to the conditions of the case, and we then freeze the system. We then test the system using test data, and after that we don't go back and change the system again — that's it, that's how well the system works — and the last thing we do is run the actual suspect and offender recordings. So we don't go: "Gee, I got a relatively low likelihood ratio. Who's paying me? The prosecution — they want a high one — so I'll go back and fiddle with the system and get a better answer." We keep a strict chronological order to avoid anything like that.

Nobody suggested that you would be doing anything like that.
Yes, I understand that, but we're talking about different systems. I know all about the freezing of the system, but at the moment we are comparing systems — that's what this was about.

Well, the results there were comparing systems across a whole bunch of test trials, so it's averaged over a whole bunch of trials. The comparison of the two different systems is based on this. You might decide you wanted to use the best-performing system; in a future case you would maybe decide to choose one of those systems, but if the conditions of that future case are different, I would then test the performance of the system under the conditions of that new case. I might choose a system on the basis of this evaluation, but I'm not taking this evaluation as the validation for a case whose casework conditions are very different.
My question goes to Michael and Geoff. As you go through your casework — most judges are not experts in speech or speaker verification — if you're working, for example, with Tippett plots, do you present them in court proceedings, and if so, how do your defence and prosecuting attorneys actually react? I'm asking about how you present results: do you always show plots?
Yes. In one case in recent years we did include the Tippett plot, together with the case-specific validation that I described before. We do explain everything and try to make it easy and so forth; we are not shielding the court from those results — we're giving them the results and then trying to explain, as best as possible, how they should be used.
Yes, it's all stuff that we put in our reports, of course; we see it as the validation of the system. And typically I sit with the lawyer: they call me and start asking questions — what does this mean, what does that mean — and I have a system now: okay, I'll come to your office, we'll spend a day together, I will go through the basics with you so that you've got a sufficient level of understanding, and then the next day you can ask specific questions about this particular case and this particular report. So we start with the very basic "what's a likelihood ratio", and sometime in the mid-afternoon we get to the testing level and to explaining what something like a Tippett plot means. And then you get to court — and the court seems designed to prevent this transfer of information from the expert to the trier of fact. Because, you know, if you were going to train somebody, what would you do? You might send them something to read beforehand, you'd give them a little lecture, you'd get them to ask questions, you'd ask them confirmation questions to see that they understand. But in court, the lawyer asks you the questions, you answer only those questions, and the jury isn't allowed to ask questions. Getting the trier of fact to understand this is a serious problem, and my research doesn't solve it either; we don't have good solutions to that one.

Thank you.
I have a suggestion, which is to stop seeing the likelihood ratio as a single number. The likelihood ratio is not just a number: it's a ratio, and it's very important to be able to present the two parts of the ratio, the similarity and the typicality. That's really important, because when you are changing the reference population, it could be very interesting for the court to make the link between the similarity and typicality parts and your decisions about the reference population.
And some of the software — perhaps also BatVox — will give you a view where you can actually see how the likelihood ratio is calculated: where the evidence intersects with the suspect distribution and with the distribution coming from the reference population. So you see the two distributions and the point for the case, and you can see how the likelihood ratio is calculated. The question is then — and this is important, because the court can request that it be added to the report or not — that at least an insider can see how it is calculated.
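That "two distributions and a point" view is easy to state in code — a toy illustration with made-up numbers, where the numerator density plays the role of similarity and the denominator typicality:

```python
# Toy illustration: the LR as the ratio of two densities at the evidence score.
from scipy.stats import norm

evidence = 1.2                 # observed comparison score (hypothetical)
suspect = norm(1.0, 0.5)       # within-suspect score distribution (similarity)
population = norm(-1.0, 1.0)   # reference-population distribution (typicality)

lr = suspect.pdf(evidence) / population.pdf(evidence)
print(f"LR = {lr:.1f}")  # numerator: similarity; denominator: typicality
```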
I guess there are two pieces going on here. One thing — what Geoff actually ended up presenting was about the underlying accuracy of the system, the performance of the system. And then we have the whole thing about the likelihood ratio, that number that comes out, which you would present to the trier of fact, and which we all think is — or seems to be — the going way to do it. One issue I have with the likelihood ratio, when we talk about it being a number, is that there is no real ground-truth likelihood ratio, right? In reality, the only ground-truth likelihood ratios that we can even calibrate ourselves to are infinity and zero: it's either true or not true. Between those two poles, we start saying that we have actually evaluated the likelihood ratio to be six point three — but we never actually estimate the likelihood ratio relative to any ground-truth likelihood ratio, because the ground-truth likelihood ratio lives at the poles. We only ever evaluate it through the posteriors, through decisions. So I guess my question to the people who go to court is: what do you say the ground truth is? How do you say what it means to be between the two poles? What is the ground-truth likelihood ratio — what is that?
Thank you — that might be one for me. This is a personal opinion: for me, the answer is the calibration of the likelihood ratio. You're definitely right that the only ground truth is the final label — which proposition is true. So what we have tried to do in this validation guideline — and the work behind it has been done precisely here, in speaker recognition — is this: a likelihood ratio will be better if it supports the right decision, and the decision has a ground truth, the final label. And then there's the issue of calibration. Calibration helps you to make better decisions, because if your likelihood ratios are calibrated, then when you apply Bayes' rule, whatever the prior, the expected cost is reduced. That's one property of calibration. The other is that calibration gives you a kind of tuning mechanism that generates a heavier or lighter weight of evidence depending on your discriminating power: systems with very good discrimination generally give higher likelihood ratios in good conditions, while systems with poorer discrimination give likelihood ratios closer to one. Those are the two properties of calibration: on the one hand you improve your decisions, which is the final accuracy measure you're looking for, and on the other hand you have a kind of limiting entity that tells you: if you are not discriminating well, the likelihood ratios should be moderate. That is the rationale behind the performance measures that we have proposed.
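The measure being alluded to is Cllr, which can be written and decomposed as follows; the decomposition into discrimination and calibration loss (obtained via the pool-adjacent-violators transformation) is what provides the "limiting entity" he describes:

```latex
C_{llr} = \frac{1}{2}\left(
    \frac{1}{N_{ss}} \sum_{i=1}^{N_{ss}} \log_2\!\Big(1 + \tfrac{1}{LR_i}\Big)
  + \frac{1}{N_{ds}} \sum_{j=1}^{N_{ds}} \log_2\!\big(1 + LR_j\big)
\right),
\qquad
C_{llr} = C_{llr}^{\min} + C_{llr}^{\text{cal}}
```

Here the LR_i come from same-speaker trials and the LR_j from different-speaker trials; C_llr^min is the loss remaining after optimal calibration (pure discrimination), and C_llr^cal is the part that calibration can remove.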
yeah, i mean, i know this gets a little philosophical, but it also just seems that we say we're presenting this likelihood ratio, and we talk about scales, right, bands on it, but at the end of the day what we're really talking about is a decision, which has a prior. even when you calibrate, everything is still done through priors that are there; you may say you integrated them out, but the reality is, at the end of the day, you can't say in ground truth that my estimated likelihood ratio of six point three was really close to the true likelihood ratio, except at the poles. what gets you to the hard decision is the prior. i think that's what jeff said too.
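a minimal sketch of that point, that the hard decision depends on the prior and not on the llr alone; the six point three and the prior values here are illustrative:

```python
import numpy as np

def bayes_decision(llr, p_same, cost_false_alarm=1.0, cost_miss=1.0):
    """decide 'same speaker' iff posterior odds exceed the cost ratio:
       log posterior odds = llr + log prior odds"""
    prior_log_odds = np.log(p_same / (1.0 - p_same))
    threshold = np.log(cost_false_alarm / cost_miss)
    return (llr + prior_log_odds) > threshold

llr = np.log(6.3)                        # the "6.3" from the discussion
print(bayes_decision(llr, p_same=0.5))   # True: even prior, LR of 6.3 tips it
print(bayes_decision(llr, p_same=0.01))  # False: 1:99 prior odds overwhelm 6.3
```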
aren't you really just trying to tell people: over all the times we used this, this is how often it said same when it was truly the same, and this is how often it said same when it was not, you know, the quality and the similarity. i'm just wondering, are we making it more complicated when we go to the court and try to describe this issue? are we getting too complicated by overlaying so many issues, tying ourselves in knots trying to get away from any priors, versus just trying to give a simple answer? like at this forensic thing, one guy had set up a visual way of doing it: you put down the dots, here are all the dots from when i ran it and it was the same speaker, here are the dots from when it was not the same, and here is the dot from when i ran the case data through, and you can visually see where it sits relative to the two distributions. it's true that that's just the two distributions, but in some sense it's almost saying: here's what i got when i knew the truth, when they were the same and when they were different, and here's where this case sits. you can look at it and decide whether you think it's closer to the one or the other, without overlaying so many issues on boiling it down to a single number.
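a minimal sketch of that dot display, with hypothetical score sets; nothing here reproduces the actual visual the speaker is recalling:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
same = rng.normal(2.0, 1.0, 200)    # hypothetical known same-speaker runs
diff = rng.normal(-1.0, 1.2, 200)   # hypothetical known different-speaker runs
case_score = 1.4                    # hypothetical case comparison

fig, ax = plt.subplots(figsize=(7, 2.5))
ax.plot(diff, np.zeros_like(diff), "|", markersize=20, label="known different-speaker")
ax.plot(same, np.ones_like(same), "|", markersize=20, label="known same-speaker")
ax.axvline(case_score, color="red", label="case data")
ax.set_yticks([0, 1])
ax.set_yticklabels(["different", "same"])
ax.set_xlabel("system score")
ax.legend(loc="upper left", fontsize=8)
plt.tight_layout()
plt.show()
```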
but i mean, he's using the same equations underneath.
one of the things, in my opinion, is that the likelihood ratio is not only the number. likelihood ratios express a kind of support for the propositions, and there's another issue behind them, which is the issue of competences. the likelihood ratio is there mainly because of the division of competences: the final decision has to be made by someone else. that person, the fact finder, asks for some information. so how can the person who has information that the fact finder does not have communicate his opinion about that piece of information, so the fact finder can integrate it with the whole case? that's the main issue behind the likelihood ratio framework. the decision could in principle be made by anyone, but with this separation of competences the formalities matter, because leaving everything without a formal framework leads to things that are considered illogical, like decisions being made in the report. that's one issue.
on the issue of simplicity and complexity, i fully agree with you. i think things have to be made much simpler. i was talking with joe about this before the banquet yesterday: if you've got a chemical analysis, one expert expresses his opinion about a comparison of two pieces of glass using scanning electron microscopy with energy-dispersive x-rays and so on, and nobody asks him to display what's going on inside the microscope or which energies are present, right? what helps there is that there is agreement in the community, that there are standards regulating the use of the procedures, and that some kind of measured error rate comes along with the standard. in my opinion that's the way to go. giving a lot of information to judges can be counterproductive, by the way, so the balance between transparency and not biasing the communication is important. i think your argument is in some way about this issue, and it is a very important issue for me. keeping things simple is the starting point if you want to put a new method forward.
can i just add... i think there are lots of details we can talk about later, but we have to present something which we believe is logically correct first, and only then worry about how to communicate it. it's not appropriate to present something which we believe is logically incorrect but which is easy to present. and in the exact example you were giving, i think that's one where, if the jury looks at that, they will immediately jump to "it was him"; they will jump to an identification. so i think that's a problem.
okay, we have to move on, maybe this one. yes, jason.
yes, i just want to comment on the point that was just made, and i'm not sure you will all agree. i think we should be honest: when we experts are writing an evaluation report, it's not for the judge. it's only for some other expert, for example on the defence side, who will be able to examine your information and give some inputs to the lawyer. when we are in front of the court, the only important thing is how you present your opinion. it's only based on what you are saying there; it has nothing to do with the likelihood ratio. you could say my likelihood ratio is ten to whatever, you know, but what matters is how you come across. so we have to be clear: the report, on the scientific side, should give people enough information in order to criticize the work, and then there is the oral information given to the court by the expert.
i have discussed this with many forensic scientists, and we always agree that transparency is important, that everything has to be transparently reported, and so on. but talking about explanations in court is a different matter, and a balance has to be found. for example, in dna analysis, in the nineties they started to use probabilities for reporting, and it was a huge mess for ten or twenty years; errors in reasoning and interpretation fallacies were common. that experience tells us there has to be a balance in reporting. transparency is important, but when someone comes to court to explain what they reported, it is probably better to keep things simple rather than going into complicated detail. for me, for example, a performance graph with a lot of details can be fine for us, but when you show it to a lawyer, the information he takes from that graph is probably not what you are trying to express. the problem is that a level of detail which is transparent, probably too much detail, gives the person listening a different message than the one the person speaking intends. so the balance has to be there. i'm not saying how to do it, but the balance has to be there. and i fully agree with you that we have to be transparent; transparency and the level of detail are both things that have to be considered.
can i add something on that? just one minute. okay.
i think what you're saying is really important: you can never sort of leave behind the responsibility for what you're actually expressing to the court. it's somewhat subjective whatever you do, you know, even jeff's weight of evidence. there's a danger there. i read something in the theory of science about something called physics envy, and it very much appears when you move into a different paradigm. the justice system actually is a completely different paradigm, where argumentation is the thing they are doing. it's not engineering or science in the way we are used to it: you do a lot of analyses, but when you end up in court it's a lot of argumentation, and you express an opinion on all the analyses you made. that is the big point with physics envy: you don't leave the responsibility to just the numbers, "i have this system, this is the score, do whatever you want with it", because the communication is equally important, and the uncertainties need noticing.
in our system, for example, with the nine-point scale, there are likelihood ratio bands behind it. it's not really that important, it's also largely historical, and everyone is used to this kind of system. and of course dna is much stronger as evidence: they much more often land at a plus four, while in our casework we're almost never above plus two, for example. you have to express the kind of strength that you can actually get to.
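a minimal sketch of how such a banded conclusion scale can sit on top of the likelihood ratio; the band edges below are purely illustrative assumptions, not the boundaries of the actual nine-point scale mentioned:

```python
import bisect
import math

# purely illustrative log10-LR band edges; the real scale defines its own
# boundaries, which are not reproduced here
EDGES = [-3, -2, -1, -0.3, 0.3, 1, 2, 3]
LABELS = ["-4", "-3", "-2", "-1", "0", "+1", "+2", "+3", "+4"]

def band(lr):
    """map a likelihood ratio onto an illustrative nine-point conclusion scale"""
    return LABELS[bisect.bisect_right(EDGES, math.log10(lr))]

print(band(6.3))   # '+1' on these illustrative boundaries
print(band(1e5))   # '+4': the kind of strength dna-style evidence can reach
```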
and there is a subjective part in all of it: no matter whether you use an automatic system or base it on phonetic analysis, there is going to be some subjective part. even the things you produce with an automatic system, you know, you have chosen the data; there is some subjectiveness to all of it. so i think it's really important to remember, and i think someone has written a really good article on this in the theory of science, on physics envy, that if you show all these numbers and all these graphs, nobody in court will understand, i promise you.
the defence lawyer will say something like: okay, so you actually adjusted your system to the case, did you do that? and then, probably, in the end, when you've been in court for twelve hours in the chair, he forces you to say "yes, i did that", and then he's going to say: okay, so it's subjective. and then you're done. so you have to really think about how you express things in court, try to stick to your opinion and what you based it on. but remember the physics envy, i think it's really important: they see numbers and a score and they just go, oh, he's really smart, this guy.
so, okay, good, thank you so much, and i think we want a round of applause for all the panelists.