Greetings, and thanks for tuning in for my second presentation in the session. Together with my colleagues listed here on this slide, I'll be presenting an overview of the 2019 NIST audio-visual speaker recognition evaluation, which was organized in the fall of 2019.
Before I start my presentation, if you haven't done so already, I'd like to invite you to watch my first presentation in the session, which was an overview of the 2019 NIST SRE CTS challenge. In addition, I'd like to invite each of you to sign up and participate in the NIST CTS challenge, which is currently ongoing.
So, here is the outline of my presentation. I'll start by describing the highlights of the 2019 audio-visual SRE. I'll then define the task, give a summary of the data sets and performance metric for this evaluation, and share some participation statistics, followed by results and system performance analyses. Lastly, I'd like to provide a quick summary of the audio-visual SRE19 and share the main observations.
This slide presents the main highlights of the 2019 audio-visual SRE, which included video data for audio-visual person recognition and an open training condition, as well as a redesigned and more flexible evaluation web platform. A recently introduced highlight, audio from VAST, meaning audio recordings that were extracted from amateur online videos, was also included.
So, the primary task for the 2019 audio-visual SRE was person detection, meaning that, given enrollment video data from the target person and test video data from an unknown person, a system must automatically determine whether the target person is present in the test video.
This person detection problem can be posed as a two-class hypothesis testing problem, where the null hypothesis is that the test video S belongs to the target person, and the alternative hypothesis is that the test video S does not belong to the target person. The system output for this task is then a statistic computed on the test video, known as the log-likelihood ratio, defined on this slide.
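For reference, the standard form of this statistic, consistent with the hypotheses just described (the slide's exact notation may differ), is:

```latex
\mathrm{LLR}(S) = \log \frac{p(S \mid H_0)}{p(S \mid H_1)}
```

where S is the test video, H_0 is the target (null) hypothesis, and H_1 is the non-target (alternative) hypothesis.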
In terms of evaluation conditions, the audio-visual SRE19 offered an open training condition that allowed the use of unlimited data for system training, to demonstrate possible performance gains. For enrollment, the systems were given video segments with variable speech content, ranging from ten seconds to six hundred seconds. In addition, the systems were provided with diarization marks, as well as face bounding boxes for the frames containing the target individual's face. Lastly, the test involved video segments of variable durations in the ten to six hundred second range.
The development and evaluation data for the audio-visual SRE19 were extracted from the JANUS Multimedia and the VAST corpora. The JANUS Multimedia dataset was extracted from the IARPA Janus Benchmark, and it consists of two subsets, namely Core and Full, each of which comes with dev and test splits. For this evaluation, we only used the Core subset, because it better reflects the data conditions in SRE19. The VAST corpus, on the other hand, was collected by the LDC and contains amateur online videos, such as video blogs, or vlogs, spoken in English. The videos have extremely diverse audio and visual conditions: background environments, different codecs, different illuminations, and poses. In addition, there tend to be multiple individuals appearing in each video.
This slide shows speech duration histograms for the enrollment and test segments in the audio-visual SRE19 dev and test sets, which are shown on the left and right plots, respectively. The enrollment segment speech durations were calculated after applying diarization, while no diarization was applied to the test segments. Nevertheless, the enrollment and test histograms all appear to follow log-normal distributions, and overall they are consistent across the dev and test sets.
This table shows the data statistics for the Core subset of the JANUS Multimedia dataset, as well as the audio-visual SRE19 dev and test sets, which were extracted from the VAST corpus. Notice that, overall, the size of the JANUS data is larger than the size of the SRE19 audio-visual dev and test sets, which makes it a good candidate for system training and development purposes.
For performance measurement, we used the metric known as the detection cost, or C_Det for short, which is a weighted average of the false reject and false alarm probabilities, with the weights defined in the table on this slide. To improve the interpretability of the C_Det, it is commonly normalized by the default cost, defined on this slide. This results in a simplified notation for the C_Det, which is parameterized by beta, and for this evaluation the detection threshold is the log of beta, where beta is also defined on this slide.
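For reference, the standard NIST SRE cost formulation matching this description is shown below; the specific weight values (C_FR, C_FA, P_Target) come from the table on the slide and are not reproduced here:

```latex
\begin{aligned}
C_{\mathrm{Det}} &= C_{\mathrm{FR}}\, P_{\mathrm{Target}}\, P_{\mathrm{FR}}
                  + C_{\mathrm{FA}}\, \left(1 - P_{\mathrm{Target}}\right) P_{\mathrm{FA}} \\
C_{\mathrm{Norm}} &= \frac{C_{\mathrm{Det}}}{C_{\mathrm{Default}}}, \qquad
C_{\mathrm{Default}} = \min\!\left(C_{\mathrm{FR}}\, P_{\mathrm{Target}},\;
                                   C_{\mathrm{FA}}\, \left(1 - P_{\mathrm{Target}}\right)\right) \\
\beta &= \frac{C_{\mathrm{FA}}\, \left(1 - P_{\mathrm{Target}}\right)}{C_{\mathrm{FR}}\, P_{\mathrm{Target}}},
\qquad \text{detection threshold} = \log \beta
\end{aligned}
```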
This slide presents the participation statistics for the SRE19 audio-visual evaluation. Overall, we received submissions from fourteen teams, which were formed by twenty-six sites, eight of which were from industry, and the remaining eighteen were from academia. Also shown on this slide is a map of the world, which shows where the participating teams were coming from.
This slide shows the number of submissions received per track, in total and for the audio, visual, and audio-visual tracks, for the 2019 audio-visual speaker recognition evaluation. We can see that the majority of the teams participated in all three tracks, one or two teams only participated in the audio and audio-visual tracks, and one team participated in the audio-only track. In total, we received one hundred and two submissions, which were made by the fourteen teams, as mentioned.
This slide shows the block diagram of the baseline speaker recognition system developed for the audio-visual SRE19, using the NIST speaker and language recognition evaluation toolkit as well as Kaldi. The x-vector extractor was trained using the Kaldi VoxCeleb version 2 recipe, and to develop this system we didn't use any hyper-parameter tuning or score calibration.
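As a rough illustration of the scoring backend for such an embedding-based system (a minimal sketch, not the NIST toolkit's actual code), the trial score can be computed as a simple cosine similarity between enrollment and test embeddings; the 512-dimensional vectors below are hypothetical placeholders for extracted x-vectors:

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrollment and a test embedding."""
    enroll_emb = enroll_emb / np.linalg.norm(enroll_emb)
    test_emb = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(enroll_emb, test_emb))

# Hypothetical 512-dimensional embeddings; in practice these would come
# from an x-vector extractor such as the Kaldi VoxCeleb v2 recipe.
rng = np.random.default_rng(0)
enroll, test = rng.normal(size=512), rng.normal(size=512)
print(f"trial score = {cosine_score(enroll, test):.3f}")
```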
This slide shows a block diagram of the baseline face recognition system developed for the audio-visual SRE19, and to develop this we used FaceNet as well as the NIST ancillary toolkit. We used a pre-trained multi-task cascaded convolutional neural network (MTCNN) model for face detection, and for embedding extraction we used a ResNet model that was trained on the VGGFace2 dataset. In order to tune the hyper-parameters, we used the JANUS Multimedia dataset, and, similar to what we had for the baseline speaker recognition system, no score calibration was used for the face recognition system.
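One common open-source realization of this detect-then-embed pipeline is the facenet-pytorch package, used here purely for illustration (it is an assumption that it matches the baseline; its embedding network is an Inception-ResNet rather than the exact ResNet mentioned on the slide):

```python
from typing import Optional

import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

# MTCNN for face detection/alignment, plus an embedding network
# pre-trained on VGGFace2.
mtcnn = MTCNN(image_size=160)
embedder = InceptionResnetV1(pretrained="vggface2").eval()

def face_embedding(image_path: str) -> Optional[torch.Tensor]:
    """Detect the most prominent face in a frame and return its embedding."""
    face = mtcnn(Image.open(image_path).convert("RGB"))  # aligned crop, or None
    if face is None:
        return None
    with torch.no_grad():
        return embedder(face.unsqueeze(0)).squeeze(0)  # 512-d face embedding
```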
This slide shows the performance of the primary submissions, per team and per track, as well as the performance of the baseline systems, in terms of the actual and minimum costs on the test set. The blue bars and red bars show the minimum and actual costs, respectively. The y-axis denotes the C_Primary, and its upper limit is set to 2.5 to facilitate cross-system comparisons in the lower cost regions.
We can make some noteworthy observations from this figure. First, compared to the most recent SRE, which was SRE18 at the time, there seems to be a notable improvement in audio-only speaker recognition performance. These improvements are largely attributed to the use of extended and more complex end-to-end neural network architectures, such as the ResNet architectures, along with soft margin loss functions, such as the angular softmax, for speaker embedding extraction. Given the size of these models, they can effectively exploit the vast amounts of training data that are made available through data augmentation.
The second observation is that the performance trends for the top four teams are generally similar, and we can see that the actual costs for the audio-only submissions are larger than those for the visual-only submissions, and that the audio-visual fusion, meaning the combination of the speaker and face recognition systems, results in substantial gains in person recognition performance. For example, we can see a greater than eighty-five percent relative improvement in terms of the minimum detection cost for the leading system, compared to either the speaker or the face recognition system alone.
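As an illustration of what such score-level fusion can look like (a minimal sketch with a hypothetical weight, not any participant's actual system), the calibrated speaker and face scores for each trial can simply be combined linearly:

```python
import numpy as np

def fuse_scores(audio_scores: np.ndarray,
                face_scores: np.ndarray,
                w_audio: float = 0.5) -> np.ndarray:
    """Weighted linear score-level fusion of the two modalities.

    Assumes both score sets are on comparable (e.g., calibrated
    log-likelihood-ratio) scales; the weight would normally be tuned
    on a development set.
    """
    return w_audio * audio_scores + (1.0 - w_audio) * face_scores

# Made-up trial scores for three trials.
audio = np.array([2.1, -0.5, 0.8])
face = np.array([3.0, -1.2, 1.5])
print(fuse_scores(audio, face))
```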
Thirdly, more than half of the submissions outperformed the baseline audio-visual system, with the leading system achieving a larger than ninety percent improvement over the baseline.
The fourth observation is that, in terms of calibration performance, we can see mixed results across the teams. For example, for the top two teams, the calibration errors for the speaker recognition systems are larger than those for the face recognition systems, while for some others the opposite is true. Finally, in terms of the minimum detection cost, the top-performing speaker and face recognition systems achieve comparable results, which is a very promising outcome of this evaluation for the speaker recognition community, given the results we had seen in prior studies, where face recognition systems were shown to outperform speaker recognition systems by a large margin.
It's also worth emphasizing here that the top-performing speaker and face recognition systems were each from team five, and they're both single systems, meaning that no system combination or fusion was applied in either case.
So, to gain further insight into the actual performance differences among the top-performing systems, we also computed bootstrap-based ninety-five percent confidence intervals for these point estimates of the performance. The plots on this slide show the performance confidence intervals around the actual detection cost for each team, for the audio track, which is shown at the top, the visual track, which is shown in the middle, and the audio-visual track, which is shown at the bottom. In general, the audio systems exhibit tighter confidence margins than their visual counterparts. This could be partly because most of the participants were from the speaker recognition community, using off-the-shelf face recognition systems along with pre-trained models, which were not necessarily optimized for the task at hand in the SRE19 audio-visual evaluation.
Also, notice on this slide that several leading systems perform almost comparably under different samplings of the trial space. Another interesting observation is that the audio-visual fusion seems to boost the decision-making confidence of the systems by a significant margin, to the point where the two leading systems outperformed the other systems in a statistically significant way. These observations further highlight the importance of statistical significance tests when reporting performance results, or in the model selection stage during system development, particularly when the number of trials is relatively small.
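As a minimal sketch of the kind of bootstrap procedure described here (the exact resampling scheme and cost parameters NIST used may differ; the p_target value below is an assumption), one can resample trials with replacement and recompute the normalized detection cost to obtain a ninety-five percent confidence interval:

```python
import numpy as np

def detection_cost(scores, labels, threshold, p_target=0.05):
    """Normalized detection cost at a fixed threshold, with unit
    miss/false-alarm costs (C_FR = C_FA = 1)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    p_fr = np.mean(scores[labels == 1] < threshold)   # misses on targets
    p_fa = np.mean(scores[labels == 0] >= threshold)  # false alarms on non-targets
    cost = p_target * p_fr + (1 - p_target) * p_fa
    return cost / min(p_target, 1 - p_target)

def bootstrap_ci(scores, labels, threshold, n_boot=1000, seed=0):
    """95% confidence interval via bootstrap resampling of the trials."""
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = len(scores)
    costs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample trials with replacement
        costs.append(detection_cost(scores[idx], labels[idx], threshold))
    return np.percentile(costs, [2.5, 97.5])
```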
This slide shows the DET performance curves, where DET stands for detection error tradeoff, for the top-performing systems in the audio, visual, and audio-visual tracks. The solid black curves in the figure represent equal-cost contours, which means that all points on a given contour correspond to the same detection cost value.
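For context, a DET curve plots the miss rate against the false alarm rate with both axes warped by the standard normal quantile function; here is a minimal sketch of computing the curve points from a set of trial scores (illustrative only, not NIST's plotting code):

```python
import numpy as np
from scipy.stats import norm

def det_curve(scores, labels):
    """Miss and false-alarm probabilities swept over all thresholds,
    returned in the probit (normal deviate) scale used for DET axes."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)
    labels = labels[order]  # 1 = target trial, 0 = non-target trial
    n_tar, n_non = labels.sum(), (1 - labels).sum()
    # Sweeping the threshold just above each sorted score: targets at or
    # below it are misses, non-targets above it are false alarms.
    p_miss = np.cumsum(labels) / n_tar
    p_fa = 1.0 - np.cumsum(1 - labels) / n_non
    return norm.ppf(p_miss), norm.ppf(p_fa)
```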
Here we can see that, consistent with our previous observations from the overall results a few slides back, the audio-visual fusion provides remarkable improvements in performance across all operating points, not just at a single operating point on the DET curve, which is expected given how complementary the two modalities, audio and visual, are. In addition, for a wide range of operating points, the speaker and face recognition systems provided comparable performance, which is very promising for the speaker recognition community and shows how far the technology has come.
This slide shows the normalized target and non-target score distributions for a top-performing system for all tracks, meaning the audio, visual, and audio-visual tracks. The red dashed line represents the detection threshold, which is related to the value of beta that we discussed when we were talking about the performance measurement. Here we can see that, for the audio-only and face-only systems, the target and non-target score distributions show some overlap around the threshold point. However, after the audio-visual fusion, the target and non-target classes are well separated, with minimal overlap at the threshold. We speculate that this is actually the reason we see such low errors, specifically such low false rejects, for systems that use audio-visual fusion.
So, in summary: we used the new and improved evaluation web platform for automated submission validation and scoring for the audio-visual SRE19, and through this web platform we released the software package for system output validation and scoring. We also released the baseline person recognition system description and results. In terms of data, for the first time we introduced video data for audio-visual person recognition. We released large labeled data sets, extracted from the JANUS Multimedia dataset as well as the VAST corpus, and these datasets closely matched the evaluation set. In terms of results, we saw substantial gains from the audio-visual fusion, and we also saw that the top-performing speaker and face recognition systems performed comparably. We saw major improvements that were attributed to the use of more extended and more complex neural network models, such as the ResNet model, along with angular margin losses. In addition, the improvements were attributed to the extensive use of data augmentation, and to the clustering of embeddings, which was done primarily for diarization purposes (see the sketch below). Effective use of the development set, as well as the choice of calibration set, were also very important, and they were key to performing well in this evaluation.
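As a minimal sketch of such embedding clustering for diarization (agglomerative clustering with a hypothetical cosine-distance stopping threshold, not the participants' actual recipes):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_embeddings(embeddings: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Agglomerative clustering of per-segment speaker embeddings,
    a common building block of diarization pipelines; returns a
    cluster (speaker) label for each segment."""
    z = linkage(embeddings, method="average", metric="cosine")
    return fcluster(z, t=threshold, criterion="distance")
```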
And finally, although fusion still seems to play a role, we saw that strong single systems can be as good as fusion systems. With that, I'd like to conclude this talk. I thank you for your attention, and stay well and safe.