the title of my presentation is what are we missing with i-vectors
a perceptual analysis of i-vector based falsely accepted trials and it was done in collaboration with people
from the phonetics lab of the csic the research
institution in spain
so let's start by talking about
i-vectors
yes we all know i-vectors
and
those i-vectors give us a compact and elegant solution where every utterance can be represented
in a fixed dimension vector
they have also given us great and efficient performance over a wide range
of conditions and they allow us to
apply state-of-the-art pattern recognition techniques
and more the recently we are able to perform speaker recognition without point it is
a really great
we can avoid a lot of problems
and especially and i think that is the point
we can produce calibrated likelihood ratios for forensic speaker recognition when we have lots of
i think that in accumulating a for this we have seen a nice paper we
wanted from that if i own do that's not that's
in this paper some but what if you feel you just a and have
wally score be in this paper has gone given a step farther when they have
not only over a being able to calculate an icon regularly richer when they have
i have recordings from the of these channel intercept assistant but they also have obtain
the day i select aggregation for to collapse the all that was to do so
they have an assessed not just a little bit about all the pros to sell
this is a great starting point but we have to look a little more in detail
at i-vectors
and they explicitly ignore high-level and source level information
so the speaker information
is reduced to the short term spectrum and this has a lot of
advantages for features to for conditional for real points
some users and imitate also so
but these are still spectral only detection decisions
that probably will be uncorrelated with human perception this morning joe highlighted this issue
of a possible loss of credibility of the system if the end users
perceive great disagreements between what the system is doing and what
they can see that humans are perceiving
moreover we have almost total ignorance on the origin of those
detection errors
and when we have also you know system we are simply trying to restore system
probabilistic but we don't do not fit the specific with them
and we can have transparent estimates which is very good but finally
if we have an error we cannot explain at all what is the reason of
the error
and it is very important to be able to provide explanations of
why the system is working in a specific way
and just a final reminder we usually assess systems on average error rates
but from the user's perspective
they perceive performance case by case so if there is a
large error even in a single trial the system will be affected as a
whole
so what we did for the paper was to select a set of i-vector
based false acceptance trials from
sre ten and from hasr in sre ten
and we gave them to a team of phoneticians who are not native english speakers
and
the objective was to explore and better understand what they could do with that
data
that is hasr and sre data
we wanted to know with that kind of data what
types of differences they could find and also the number of different
types of differences that they could find in a single trial
and first of all a disclaimer this is not a paper on
speaker recognition by humans
both of them know in advance that the speakers in every trial are
different
so
all that we are asking them is to highlight the differences that they find
between the two utterances but without making any decision just to
see what they can find
in trials where the i-vector system has provided a
likelihood ratio greater than one
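As a rough illustration of what a falsely accepted trial means here: a different-speaker (non-target) trial counts as a false acceptance when its likelihood ratio is greater than one, because the score then supports the wrong hypothesis. The sketch below flags such trials. The `Trial` structure, the trial identifiers and the scores are all made up for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    trial_id: str            # hypothetical identifier
    same_speaker: bool       # ground truth for the pair of utterances
    likelihood_ratio: float  # score given by the automatic system


def is_false_acceptance(trial: Trial) -> bool:
    """A different-speaker trial whose LR supports the same-speaker hypothesis."""
    return (not trial.same_speaker) and trial.likelihood_ratio > 1.0


# Invented scores, only to exercise the function.
trials = [
    Trial("t1", same_speaker=True, likelihood_ratio=250.0),
    Trial("t2", same_speaker=False, likelihood_ratio=35.0),
    Trial("t3", same_speaker=False, likelihood_ratio=0.02),
    Trial("t4", same_speaker=False, likelihood_ratio=1.5),
]

false_acceptances = [t.trial_id for t in trials if is_false_acceptance(t)]
print(false_acceptances)
```

Only the non-target trials with a likelihood ratio above one are kept; target trials are ignored regardless of their score.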
as this is a very time-consuming analysis we had to select a subset
of trials
so we used the scores from our submission to nist two thousand
and ten
and what we did was the following selection
first of all we took the sixteen false acceptances that we actually
had
with the hasr set
but as those trials were specifically selected to be especially difficult for
humans just in case that was a bias for the
analysis we also selected fifty different false acceptance trials from sre ten
and in that case we had thousands of different
trials so the condition was to select just those with log likelihood ratios in the range
from three to five which translates into likelihood ratios between twenty and
one hundred and fifty so those are values
that we usually
obtain when we use our i-vector systems
with real data with a lot of variability
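The selection rule just described can be sketched in a few lines. This assumes the log likelihood ratios are natural logarithms, which is consistent with the quoted range of three to five mapping to likelihood ratios of roughly twenty to one hundred and fifty. The trial names and scores below are hypothetical.

```python
import math

LLR_MIN, LLR_MAX = 3.0, 5.0  # range quoted in the talk (natural log assumed)


def lr_bounds(llr_min: float, llr_max: float) -> tuple[float, float]:
    """Convert a log-likelihood-ratio range into a likelihood-ratio range."""
    return math.exp(llr_min), math.exp(llr_max)


def select_by_llr(scores: dict[str, float],
                  llr_min: float = LLR_MIN,
                  llr_max: float = LLR_MAX) -> list[str]:
    """Keep the trials whose log likelihood ratio falls inside the range."""
    return [tid for tid, llr in scores.items() if llr_min <= llr <= llr_max]


# Invented non-target trial scores (log likelihood ratios).
scores = {"nt_001": 0.4, "nt_002": 3.2, "nt_003": 4.9, "nt_004": 6.1, "nt_005": -2.0}

low, high = lr_bounds(LLR_MIN, LLR_MAX)
selected = select_by_llr(scores)
print(f"LR range: {low:.1f} to {high:.1f}")
print(selected)
```

With these bounds the likelihood ratio range comes out at about 20.1 to 148.4, matching the rounded figures given in the talk.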
and after that we ended up with sixty six trials and
after a short hearing of those trials the phoneticians selected
for the detailed work eighteen trials nine male and nine female four of them from
hasr and fourteen from sre ten
this is the final list which is in the paper i just want to show it
because we will refer to every trial using
the number of the trial and the id
of each one of the speakers
second disclaimer i am not a phonetician and i even have problems with english
so i will be talking about a lot of things that my colleagues the phoneticians
have described so
my apologies to the phoneticians if i say something that is not right
and this is the range of features that they will explore they will be
related to phonation type temporal characteristics segmental characteristics degree
of nasality or something like that all types of non-linguistic features or
overall impressions
so that is just
what they will explore
what they do with the selected trials is to perform a detailed hearing of both sides
about one hour for each one of the trials and we focus on the features
which are present all along the conversation
i will show some samples
but
the features the differences that they are finding are present along
the whole conversation
and those comparison will be maybe linguistically k compare compatible segment example select you think
that set consisting of motown and finally some of the observation would be confidence through
acoustically or estimate a and then
by seasonal i used in mentioning that might expect so you don't seem a spectrogram
so the last part of my presentation will be simply showing some of the
audio files
in every case i will show the number of the trial where
the audio comes from and also the likelihood ratio that is the value of the
degree of support that the i-vector system has given to
the same speaker hypothesis although we know in advance that they are different
the i-vector system says
that they are the same speaker and we will see it for
every trial
and the that the that fault
degree of support of that are that can easily and english
all possible this is a case without a very high misleading value on the three
just and the operator what we use an obtain even for targets
and in that case for example what they found is that this for speech a
lot of the whole conversation is
and not different
no but we do you wanna go well
the it's for the blue line
for the right one
i really but i four
a sound like different by the that are over a regular or you are well
i really i four
another set of features that they used
is about the loudness variability
in declarative sentences people usually tend to decrease the energy at the end
that is at least what happens with the first speaker in that case
while the second speaker in that trial is
keeping the same stress and making a special effort to keep that loudness
and this is consistently repeated during the whole conversation
in this case and which has which had a celebration of at a smaller value
obviously value and there's a
a dysphonic voice in only one of the sides of the conversation
they have no idea what are okay
they have no idea what like are okay
is that is for the one
well there are no but neural network grammar
well there are no but you'll never bigger
for example you that are compared to the one light both phase right
but
and this is the spectral analysis of the of that powering latt uses a
without hi everyone would ratio on you know we have
much lower
another type of situation that was found is the presence of creaky voice
for example this is not very usual to find in a speaker but the second one
here speaks during all the conversation with creaky voice
i
i normal rate and this
second one no you know
no you know
this is not very frequent but it is present in this case and it is very
clear but what is quite usual is the realisation of creaky voice at
the end of the phrase like here
we will play a sample here in that case with a misleading likelihood
ratio of about fifty two
one segment well
well
well
well
we also found issues about the voice where the speaker has difficulty
to hold the pitch this is a similar segment with that type of speech and you
can see that the
tendency of the mean value is quite similar however in the second one we have
oscillation problems to maintain it
i together i
i get a very i
second one
we will be known
no
also a feature that was found was about the speech rate
for example in that case there are two different speakers who talk at different
levels of activity
what about how would be better marketing
moreover
it was bigger really
we were able to leave
there are also issues known as hyperarticulation for example
the phase
really different see if you're selling you know
one the other one i for like you know
well
almost basis some
also this can be found in other cases with the
without using any of a key and where the formant a three of on here
it's much more the about more standard for speaker
your
second
huh
the second formant is much lower than the
same one for the first speaker
also there may be found differences in specific patterns of realisation of some
sounds for example finding differences in the type of s that the
speaker realises
for example in that case the s of the first speaker starts above five
hundred hertz while the s of the second speaker
starts above
three thousand hertz which is a more standard s
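One way to put a number on where the energy of a sound starts in frequency is to look for the lowest spectral bin that reaches some fraction of the spectral peak. The sketch below is only a toy: it is not the procedure the phoneticians used, real frication is noise rather than a sum of tones, and the sample rate, frame length, threshold and test signals are all assumptions chosen so that the example runs without any external library.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz, assumed for the toy signals
N = 256              # samples per frame, so each DFT bin is 62.5 Hz wide


def tone_mix(freqs_hz):
    """Synthetic frame: sum of unit-amplitude sinusoids (a stand-in for real frication)."""
    return [sum(math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f in freqs_hz)
            for n in range(N)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but dependency-free)."""
    size = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / size)
                    for n, x in enumerate(frame)))
            for k in range(size // 2 + 1)]


def lowest_energy_frequency(frame, rel_threshold=0.1):
    """Lowest frequency whose magnitude reaches rel_threshold times the spectral peak."""
    mags = magnitude_spectrum(frame)
    floor = rel_threshold * max(mags)
    for k, m in enumerate(mags):
        if m >= floor:
            return k * SAMPLE_RATE / len(frame)
    return None


# Two invented frames: one with energy from 500 Hz upwards, one from 3000 Hz upwards.
low_s = lowest_energy_frequency(tone_mix([500, 2000, 4000, 6000]))
high_s = lowest_energy_frequency(tone_mix([3000, 4500, 6000]))
print(low_s, high_s)
```

The tone frequencies are chosen to fall exactly on DFT bins so that the toy result is not blurred by spectral leakage; on real recordings a windowed, averaged spectrum and a carefully chosen threshold would be needed.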
i
also there are cases where there are differences in the degree of nasalisation
sample here i
this is like that
you don't want together
that kind of nasal voice and in this case the other
one is hypernasal as if he had a cold or something
also that uses about impaired melodic voices
so regular
no we in
what is the one i know you know
in some cases they find extralinguistic issues for example the noisy breathing of
the first speaker
you can hear
that you are construction some parties
for
for example
well as well
while in the second one there is no noisy breathing at all
they're also presents all squats or
strong not control of the o
g
e
or not the case of some of the presence of rectly voice
e and o
i go off all gonna
so finally this is the summary of
this work and the idea is that if you look at the rows of the
table you can find the number of times that one given feature is
found
but it is more important to look trial by trial at the columns of the table
and see that
for every trial
there is an average of about four different types of differences that are found
especially of interest at least if we want to build automatic processes to detect
some of those features are the features related to phonation type like
creaky voice and those
related to specific patterns of realisation of specific sounds
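The two ways of reading the summary table can be sketched as a small tally: counting how often each type of difference appears across trials, and counting how many types appear within each trial. The trial names and the findings below are invented for illustration and are not the values from the paper.

```python
from collections import Counter

# Hypothetical findings: trial id -> set of difference types noted for that trial.
findings = {
    "trial_a": {"creaky voice", "speech rate", "nasalisation", "s realisation"},
    "trial_b": {"creaky voice", "loudness", "noisy breathing"},
    "trial_c": {"speech rate", "s realisation", "nasalisation", "loudness",
                "hyperarticulation"},
}


def feature_counts(findings):
    """Per feature: in how many trials was this type of difference found."""
    return Counter(feature for feats in findings.values() for feature in feats)


def types_per_trial(findings):
    """Per trial: how many different types of difference were found."""
    return {trial_id: len(feats) for trial_id, feats in findings.items()}


def mean_types_per_trial(findings):
    """Average number of difference types per trial."""
    per_trial = types_per_trial(findings)
    return sum(per_trial.values()) / len(per_trial)


counts = feature_counts(findings)
per_trial = types_per_trial(findings)
average = mean_types_per_trial(findings)
print(counts.most_common())
print(per_trial, average)
```

The per-feature counts show which differences recur most often, while the per-trial counts show how many independent clues a single falsely accepted trial can offer.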
so to summarise
we have shown that perceptual analyses show no correlation with the i-vector
false acceptances
and
there is detectable and useful information in those trials that is far from
what i-vector based recognition uses
furthermore there's like
a relational
and specifically the patterns of realisation of specific sounds
but also at would that those could provide an
we try to reach no signals transcription of the whole utterances and they could be
used to provide some kind of soft information or
and
this what specific highlight the inter some provide an objective measurements about this for you
not the spectral features especially for speaker
thank you
just listening to the
second creaky one it sounded like there was actually clipping happening in the first
creaky voice
sample wasn't there
perhaps the reason the system sees them as the same is because of the audio clipping
was there any analysis when people were listening to these false acceptances taking into account
that part of the audio acquisition and what the quality was as well
no there was no
special analysis or preprocessing of the data we just selected the data
as it was and it was given to them and what the phoneticians found
is just what they reported
how can you tell them so
what's the variance from the sets consist of experts on
to ten
that's good
the agreement was very high actually the second one was a student of the
first one
so maybe they come from the same school of listening
and the degree of agreement was
impressive even though they were working completely separately
i we have to say that there were no this is chosen there were no
scoring sorry what does i found difference on i five difference on but the degree
of
but i can say that it was almost exactly the same maybe there was one
of the differences and that one of the informant the of the on
i was wondering
since you only used
non-target trials
if you had conducted the same experiment with same speaker trials instead of non-target trials
how many of those differences would they also find especially the prosodic differences
of course they would find a lot of them what
we are trying to do is to look for clues we do this analysis to know where to
look
for different kinds of information and of course prosodic information
is very easily modified and it can depend a
lot on the type of conversation
that is why i stress the idea of the issues of
voice production and specific patterns of realisation which can be much more dependent upon
the speaker but
of course this is part of the work that could be done and of course they
would because when
i suppose like then participate in that kind of a humans just this evaluation they
also did not
as a result of this analysis can you suggest what kind of
features we could use
in systems in the future
you mentioned prosody and duration what do you suggest
that we look at for improving the system
i am not suggesting anything special i am just giving the information of what they found but what
i am saying is that for example
those voice quality features and
specific patterns of realisation of some sounds have a good degree of
parameters that can be
properly detected
let's see if they can improve the overall system