0:00:15 The next presentation is given by two people from the NFI, who both work in the same room, on more or less the same problem, but with different approaches.
0:00:27 So we are going to talk about a database which may be relevant to forensic work. The basic paradigm in forensic speaker recognition — or speaker comparison, you might say — has, I think, been put forward by others earlier, but I summarize it here in a formula, for people that like formulas.
0:00:52 So basically, what the judge or the jury — or whoever wants to know — what they want is the posterior odds about the claim that the defendant is guilty or not. And that can be factorized into two factors: the likelihood ratio, which is the first factor on the right-hand side, and the prior odds. The idea is that the prior odds somehow have to be determined — we say it is the court's job to do that — and they will be influenced by lots of other things, the circumstances, which might include other evidence: evidence which is not related to the speech.
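For reference, the factorization described here is the odds form of Bayes' rule; the notation below (prosecution hypothesis H_p, defence hypothesis H_d, speech evidence E, background information I) is added for illustration and is not taken from the slides:

$$
\underbrace{\frac{P(H_p \mid E, I)}{P(H_d \mid E, I)}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{P(E \mid H_p, I)}{P(E \mid H_d, I)}}_{\text{likelihood ratio}}
\;\times\;
\underbrace{\frac{P(H_p \mid I)}{P(H_d \mid I)}}_{\text{prior odds}}
$$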
0:01:40 So this is just to give you an idea of what the framework is. There is a connection to the setup we know from NIST — most people are maybe more familiar with the NIST question — and I have summarized that here. Namely, in the forensic case you might say the judge or the jury will want to decide that the defendant is guilty if those posterior odds are higher than some reasonable threshold; whatever "beyond reasonable doubt" should be, it is related to the cost function, you might say.
0:02:13 In NIST it is quite similar, except that there we work with the likelihood ratio itself, which is just the ratio of the posterior odds and the prior odds, and that should be bigger than some threshold; the threshold only depends on the cost function — or, if you want to include the priors on the right-hand side, that is also possible.
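Concretely, for the standard detection cost function with miss cost C_miss, false-alarm cost C_fa and target prior P_tar (symbols added here for illustration), the Bayes decision is to accept the same-speaker hypothesis when

$$
\mathrm{LR} \;>\; \theta \;=\; \frac{C_\mathrm{fa}\,(1 - P_\mathrm{tar})}{C_\mathrm{miss}\,P_\mathrm{tar}},
$$

which is the sense in which the threshold depends only on the costs and the prior.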
0:02:32 And if you have well-calibrated likelihood ratios, then that threshold will be ideal and you will be at the point we know as the minimum DCF. So that is the relation between likelihood ratios in forensic cases and likelihood ratios in the NIST setting.
0:02:51 What we would say, though, is that the story is more about the circumstances, because everything is conditioned on the available information. And in a forensic case that really matters: you have these weird samples, like the ones Joe showed us, and you do not have to give an error-rate estimate and decision criteria for a general case, or an average over many comparisons — no, you have to do it for this particular case.
0:03:19 So our approach would be: we need data which is similar to the case. What we have been doing — our approach to dealing with a specific case — is to make a database with lots of annotation, so that we can more or less select a sub-database which is as similar to the case as we can get it. And the database will also allow us to test whether the circumstances — recording conditions or language differences, for instance — actually matter, if we have all that information.
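As a minimal sketch of what such a metadata-driven selection could look like — the column names and condition values here are illustrative assumptions, not the actual database schema:

```python
import pandas as pd

# Hypothetical metadata table; field names and values are illustrative only.
meta = pd.read_csv("recordings_metadata.csv")

# Select a sub-database that mirrors a (fictitious) case profile:
# male speaker, Turkish-accented Dutch, some noise, at least 30 s of net speech.
case_like = meta[
    (meta["sex"] == "male")
    & (meta["language"] == "Dutch")
    & (meta["accent"] == "Turkish")
    & (meta["noise_level"].isin(["little", "quite_a_lot"]))
    & (meta["net_speech_seconds"] >= 30)
]

print(f"{case_like['speaker_id'].nunique()} speakers, {len(case_like)} recordings")
```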
0:03:57 So this is where we move on to the next speaker — and it is going to be tricky for this work. Right.
0:04:12 Yes, so I am David van der Vloed, also known as David the Second — or the First, depending on your perspective, of course. I will be talking to you about the database itself: how we created it and which metadata are included, so that you get a sense of which restrictions there are. Using real data, some of the metadata is simply uncontrollable, and some of it is — well, there you go, this is just a short overview.
0:04:54 The thing to note here is that it is similar in setup to other databases that use real data — such as the NFI data of about ten years ago — and there may well be others that I do not know of, because they are all kept secret, or that I do not know of because I simply did not find them. But this one has six hundred speakers, so I hope it can be a contribution to the field.
0:05:23 What we want to do with it is mainly validation. We want to use automatic speaker recognition in casework; obviously we do not do that yet, and we need validation research. There are people using automatic speaker recognition in casework already, and I feel — perhaps a bit conservatively — that I need realistic data for calibration. Otherwise there will be no real improvement, because the improvement in using automatic speaker recognition over, or next to, the human approach would, to me, be that you can actually measure reliability, and for that you really need realistic data.
0:06:05 It is not our own data: we are not really the formal owner — the owner is the prosecution — and they gave us permission to collect data from the police intercept data. This comes with some restrictions; I am sure the first question after this presentation will be a question regarding availability. I am happy to cooperate, I think, but it is not entirely in our hands, and we only got permission under strict conditions. So we had to anonymize the data: we have listened through all of it and blanked out names and so on.
0:06:51 So what did we do? We received a lot of data. It is all telephone conversations, so it is stereo: generally one speaker is on one channel and the other speaker on the other channel — not always, because there is just so much data and a lot of cables involved. We split the stereo files in half and uploaded them into the database. This is the raw material: some hundreds of thousands of audio files made this way.
0:07:30 We had some metadata to go with it — just some general things. And I made a point of this because it is really realistic data: it has really been intercepted, I want to stress that point, in the course of real investigations. Which also means that a lot of the speakers in the database do not know they were recorded, which, as you can imagine, is a major point in the privacy discussion and in the permissions that we got to compile the database and to distribute it, or not. Okay.
0:08:15 So it took some processing, which took about two years. We had the chance to hire people to blank out the personal information, like names and addresses, and to actually isolate speakers, which is the most important part of the job, because we had just got a whole — a big pile of audio files. They could use the telephone number to listen to the files, and then they had to decide for themselves: okay, this is John, I think, and this is John's uncle — and so they got to know the people revolving around a telephone number. This is how a speaker ID was created: just through listening, through the telephone number, through the content of the audio. And they added the metadata.
0:09:17 So these people — native speakers, we call them — isolated these speakers, and whenever doubts arose, the recording was excluded. But still, this is not a hundred percent: it could be that one twin brother was using the phone of his twin brother and there is some confusion. So we like to call it ground truth by proxy: you can be quite sure, but never a hundred percent sure, about the speaker ID.
0:09:50 Another thing: they chose at first the best ten recordings per speaker; then we lowered it to five, because we were concerned that the number of speakers would otherwise be too low in the end. And they were instructed to take recordings as diverse as possible: if you take five recordings all from the same day, talking to the same person, that is a little less interesting than three recordings of that type and two recordings with some whispering, or a car, or anything like that. And as for those aims of five and ten — perhaps it is my management capabilities, I do not know — it just varies a lot. The mode is still five, so most speakers have five recordings, but there is even one with one hundred and thirty-three recordings, which is kind of interesting in itself, but it just varies a lot.
0:10:49 So this is the anonymization. It just means listening through it and assessing, deciding which information could be traced back to a real person, and that is simply blanked out. So there are whole stretches of samples in there that are just blanked out, and the blanking is not labeled. Sometimes you therefore have to guess whether somebody did not say anything, or whether somebody said something that has just been blanked out. And when in doubt, it was "leave it out": whenever they were in doubt whether something was really personal information, they just left it out.
0:11:31 Okay, and these people then added the metadata. The single most important attribute is of course the speaker ID, and then there are all these other things — but these are all perceptual metadata, that is, assigned on the basis of listening. So there are some subjective measures in there, like the amount of noise, where they could choose between none, a little, and quite a lot. But of course there was more than one native speaker doing this job, so it depends a bit on the person what counts as quite a lot of noise. We tried to regularize it, but it is one of the quirks of listening and then subjectively judging this kind of metadata.
0:12:23 Okay, this was the end of the job for them, and as post-processing we anonymized the metadata, of course. The next step is something — I pretend here that the database is all finished, but actually we are still working on the second part, which is to make a clean version. For it to be comparable to forensic casework, you will want to leave out all the background stuff — background speakers, music — like you would do with real case recordings. So I am labeling all the parts where there are those kinds of background noises, so that in the end we have a dirty version and a clean version of the same database. There we go.
0:13:14 So this is the database in numbers. As you can see, there is a two-to-one ratio of male to female: the native speakers prioritised males in the database, because males are way more frequent in casework than females — I am sorry to say, or perhaps not. And here are some statistics.
0:13:41 This is interesting: the Dutch language landscape is not strictly monolingual; there are still quite sizeable minorities in Holland, mainly of Moroccan and Turkish descent, which means we have some multilingual speakers. And they come in different flavours: there are speakers that speak a mix of, for instance, Turkish and Dutch within the same conversation, and there are speakers that will use Dutch in some conversations and Turkish in other conversations. So we have quite some possibility to do cross-language research with this, and the first experiment that will be presented is of that type. And there are some English recordings, but do not get your hopes up: there are only six speakers, most of them are not native, and their English is like the English I am speaking now, with a Dutch accent.
0:14:45 So, the number of recordings: up to the one hundred and thirty-three recordings for the largest speaker, who is a big criminal in Holland — of course I am not allowed to tell you who. And I have counted the numbers in terms of trials, same-source trials and different-source trials, but I must admit the different-source trials are also cross-gender and cross-everything, so the actual usable number of different-source trials is probably a little lower.
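A minimal sketch of how such trial lists could be generated, restricting different-source (non-target) trials to same-gender pairs as suggested here; the record fields are assumed for illustration and are not the actual database schema:

```python
from itertools import combinations

def make_trials(recordings):
    """recordings: list of dicts with 'file', 'speaker_id' and 'sex' keys
    (illustrative field names). Returns same-source and different-source
    trial lists, keeping only same-gender different-source pairs."""
    same_source, diff_source = [], []
    for a, b in combinations(recordings, 2):
        if a["speaker_id"] == b["speaker_id"]:
            same_source.append((a["file"], b["file"]))
        elif a["sex"] == b["sex"]:  # drop cross-gender non-target trials
            diff_source.append((a["file"], b["file"]))
    return same_source, diff_source
```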
0:15:20 This is the duration, to give you a sense of the durations. The pink bars are the gross recording lengths and the blue bars are the durations after speech activity detection, and there are some strange things in the pink distribution for which I do not have an explanation. You can see that the minimum duration for a telephone conversation to make it into the database is thirty seconds, because below that there are just a lot of call tones without an answer, and other rubbish. And the maximum that I told the native speakers they could use was a conversation of ten minutes, because otherwise it would be too much work for just one recording. However — my management capabilities again — there are still some recordings over six hundred seconds in there; they should have been excluded, but they are still there.
0:16:23 Okay. Well, like I said, we intend to use it for validation research, and that comes in two different types. There is general validation: just choosing which algorithm is best for us, for our casework, and which calibration method — I am happy to see that Niko is going to talk about different calibration types, which will hopefully be applicable soon.
0:16:53 And there is also — and that is more relevant — case-specific validation. That is all the variation that Mr Campbell talked about: that is real, there really is no case without something special to it. Perhaps not as extreme as the examples we heard, but there is always something, which to me means that you need case-specific validation. So for every case you will have to define which data is representative for my known sample and which data is representative for my reference sample. And this means we need data, basically, and this is why we did it. I hope that our database will reflect a lot of cases. Of course, this is only intercept data, and the real monkey business, with screaming and yelling and running, is probably not in there.
0:17:55 So that will restrict which cases you can do. There are two ways to broaden the type of cases you can do: one, find more data; and two, wait for you guys to have made an algorithm and to find out that some conditions do not matter any more, so that you can make the selection of the validation data a little less strict.
0:18:28 Of course, I am talking faster than the slides. So for the database, I think in terms of trials: we always try to find a lot of same-source trials and different-source trials, so that those two score distributions will be — well, I still have five minutes left and David still has his part, so I will leave the last point and move on.
0:18:51 Please contact me about availability, but it will be hard, because it is not our own data and it is very sensitive, being six hundred intercepted speakers. But you will find my email address on the presentation, so please contact me. Okay.
0:19:10 This was kind of expected — we are running a bit late, so now I have to make the right selection. I have been thinking about exactly what to skip; I cannot see anything from up here. So we did an experiment: we split off — we take ten percent of the data set of the whole database. This is pretty preliminary, mind you; the idea is: can we do some experiments, speaker recognition experiments, and see which influences are important?
0:19:50 This is some more motivation — I will skip it and tell you what we did. We looked at the Turkish speakers: they are either speaking Turkish, or Dutch with a Turkish accent, or a mix of Dutch and Turkish. Then here is a slide about the still-present problem of the skewness in the available number of segments per speaker, and how we have to deal with that.
0:20:19 And I had a nice paraphrasing joke about the well-known quote, which I paraphrase as: some speakers are more equal than other speakers. So how do you deal with the different amounts of trials? George actually has proposed a solution, an old solution, which indicates that you basically make a DET curve per speaker pair. We implemented that; you can see its influence, and you can read more about it in the paper.
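One way to operationalize the idea that every speaker pair should count equally — my interpretation for illustration, not necessarily the exact procedure of the paper — is to weight each trial by the inverse of the number of trials its speaker pair contributes:

```python
from collections import Counter

def speaker_pair_weights(trials):
    """trials: list of (enrol_speaker, test_speaker, score, is_target) tuples
    (illustrative layout). Returns one weight per trial such that each
    speaker pair contributes the same total weight, so a speaker with 133
    recordings does not dominate the error statistics."""
    pair_counts = Counter(frozenset((a, b)) for a, b, _, _ in trials)
    return [1.0 / pair_counts[frozenset((a, b))] for a, b, _, _ in trials]
```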
0:20:55 Let me quickly go to the effect of, say, the speaker population for a commercial speaker recognition system. Here we used a commercial speaker recognition system that can do some calibration, and what you see is that if you give the recognition system some additional material — those were forty-five speakers outside the test database used here — you can go from very badly calibrated (the top line, Cllr values above one, which is just useless) to the lower line, where you see that the Cllr, which is a measure of calibration and discrimination, is actually quite close to the minimum attainable Cllr. So this shows that the system that we used — you can read more about the system in the paper — did work, and the DET curves are more or less the same: so the reference population really only matters for calibration.
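For reference, the Cllr measure mentioned here can be computed from target and non-target log likelihood ratios as in the sketch below (minimum Cllr would additionally require the PAV-optimal recalibration of the scores, which is not shown):

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Log-LR cost: the average of the logarithmic scoring penalties on target
    and non-target trials; inputs are natural-log likelihood ratios.
    A well-calibrated system has Cllr close to its minimum attainable value."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.logaddexp(0.0, -tar)) / np.log(2)  # penalty for low LRs on targets
    c_non = np.mean(np.logaddexp(0.0, non)) / np.log(2)   # penalty for high LRs on non-targets
    return 0.5 * (c_tar + c_non)
```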
0:22:02 I was going to tell you a bit more, at a more detailed level, about this, but this slide is just an answer to one reviewer — "what about the score distributions?" — well, here are your distributions; it is also in the paper.
0:22:15 This is my final slide already, so I am working towards finishing in time. You see a number of figures showing different tests that you can do with the database. Out of this ten percent we took only the Turkish speakers. We first looked at the case where train and test are both Turkish, and you see several performance measures and some statistics about the number of trials. The next thing you can do is: what if both train and test — or sample and question, or trace and reference, you might say — are both Dutch-speaking, but with a Turkish accent?
0:22:59 And the numbers actually vary a bit: the equal error rate goes up, although whether that is significant I do not know — there may be too few speakers in our subset. But it might be more interesting to look at the fourth line, where what is indicated is that we train with speakers talking Turkish and test with speakers speaking Dutch with a Turkish accent, or the other way around — those two cases. And the most interesting thing there, I think, is that nothing happens — or not much happens — to the equal error rate: it stays more or less in the same ballpark, fifteen point eight percent, same as the first two lines. But the calibration suffers — although it does not really suffer that much: it is actually comparable to what happens for the Turkish speakers speaking Dutch. So from this data it looks like calibration suffers when speakers speak their second language, but the cross-language effect itself is modest in making things worse. I would like to show you the figures for you to look at yourselves; I think the general conclusion from the figures is that there is quite a lot of variability, so things depend a lot on how you set up the experiments — and at least this data allows you to do those kinds of experiments.
0:24:40 But I would like to conclude with the more general idea of this work, of collecting the database. First of all, I hope to have shown that this kind of data is necessary for answering questions like "what are the error rates of the method for this particular case", under conditions as close to the case as possible. But you could also use the data — this is not shown in this work — to actually make a case-specific calibration: once you know which factors have an influence and which do not, you can make a selection according to those factors and use that for case-specific calibration.
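A minimal sketch of what such a case-specific calibration could look like, assuming a simple affine (logistic-regression) score-to-log-LR mapping trained on condition-matched trials; the function and field names are illustrative, not the actual NFI pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def case_specific_llr(cal_scores, cal_labels, case_score):
    """Fit an affine score-to-log-LR calibration on trials selected to match
    the case conditions, then apply it to the score of the case comparison.

    cal_scores : raw comparison scores of the matched calibration trials
    cal_labels : 1 for same-source, 0 for different-source trials
    case_score : raw score of the actual case comparison
    """
    cal = LogisticRegression()  # default regularization kept for brevity
    cal.fit(np.asarray(cal_scores).reshape(-1, 1), np.asarray(cal_labels))
    w, b = cal.coef_[0, 0], cal.intercept_[0]
    # w*s + b approximates the log posterior odds at the empirical prior of
    # the calibration set; subtracting the log prior odds yields a log LR.
    prior = np.mean(cal_labels)
    return w * case_score + b - np.log(prior / (1.0 - prior))
```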
0:25:24 And I have, very briefly, shown an experiment there with cross-language data. Okay, this is it.
0:25:48 Thank you for the talks — thank you to both Davids. I would like to ask you about the level of precision of your tagging in the metadata. The reason I am asking is that in Australia we have a very similar multilingual context, with waves of immigrants from certain countries whenever there are troubles: for example, whenever there is a war in Lebanon we get waves of immigrants speaking English with an Arabic accent, and Arabic. Now, firstly, that level of accentedness varies, as we all know; and secondly, after twenty-five years we have a second generation of native speakers who do not speak English with an accent but speak their own idiolect of English, often as native speakers. Do you account for that kind of difference?
0:26:37 Yes, we have not only a language field but also a nativeness field, so it will be annotated whether this is a native speaker, quite a good non-native speaker, or a poor non-native speaker of the language spoken. And this idiolect — this ethnolect, as the sociolinguistic term would be — would count as native if it is second or third generation, definitely.
0:27:07 Was each speaker confined to one handset, or did you have speakers that went across handsets?
0:27:21 For the majority of speakers — I only know this by the telephone number — it is the same number, but there are also some speakers that use one phone in one recording and, for instance, a landline in another recording. I do not have the exact numbers.
0:28:15 In the experiments that you showed, are the non-target conditions always the same as the target condition?
Absolutely — and I would make the argument that it does not make any sense to do it differently there, because the conditions are, sort of, the conditioning information in the likelihood ratio, so you should not have the numerator under different conditions than the denominator. I do not believe that makes too much sense.
0:28:49 First, thanks to the two speakers — a nice pair of Davids. My feeling is that what the calibration result shows is exactly the limit of speaker recognition as we know it: we are not able, with our systems, to detect whether we need new data to do a recalibration. What we win with calibration just shows the limits of the robustness of the system.
0:29:23 I am not sure if I entirely understood the question, but I agree that we can test whether particular conditions — noise or whatever — make a difference or not; we should be able to do that with this data. But if there is a new condition — and there can always be a new condition — and we do not know whether it is of influence, then we cannot say whether we actually have matching data. I agree that that still remains to be worked on.
0:29:51 We should work on automatic detection of whether we are in a known condition. If we are not able to describe a condition, factor by factor, and to decide and give the user the probability of being compliant with the training set, we cannot choose to use our system in forensic conditions, because, as was said earlier, there is a huge amount — we know there is a huge number of conditions in forensics.
0:30:22 The way I would approach it is not by doing everything completely automatically, but by actually having the forensic expert listen to the data — and that is no problem, as it is a limited amount of data — and the forensic expert can say something sensible like "well, this is very much like what I have heard before", or "well, there is an enormous buzz", or "there is something here that I have not seen before". So I do not think we should leave this entirely to the machine.
0:30:50 David, I agree completely: we need a human, a human expert, for that, at least in the beginning. But we need to feed that human expert with information from the system, and at the moment, as usual, we do not know exactly which information our system is using, so we cannot explain to the human expert how to define what is a known condition for the system and what is not. It is not enough — you have some very interesting labels here, but, as Michael was saying about the previous questions, we always have some question about language: what exactly is the definition of the language, of the idiolect, what is the definition of the conditions, of the distance — is it in the conditions? We should work with the automatic system to determine the sensitivity of our system to each of these factors, knowing that the human expert can use the human brain to give a probability.
0:31:54 And I hope we will — we will do that.
0:32:03 Just a quick digression: the underlying assumption, I think, that most people are making is that when it is intercepted, it is a telephone call. But I think in a lot of forensic applications you have a confidential informant wearing a body-worn microphone, and in those cases you have lots of issues with, for example, clothing covering the microphone. So could you just say: is the audio that you have all telephone calls, or not? And do you have any plans to explore the body-worn type of scenarios? Because that actually is a real challenge as well.
0:32:40 This is only telephone speech, and we are planning to expand our data collection to more mismatched conditions. But in Holland, intercepted telephone speech is really the majority of the data, so we are covering quite a lot already. But for those kinds of circumstances — recordings in a parked car, or anything like that — we need data, that is true.