So, let me now present a more complex challenge. The goal of this challenge is to find people in a multimodal context.
What we mean by multimodal context is that the participants can use both the speech and the image to recognize people.
It comes from a French collaboration: the corpus is provided by one partner, the evaluation is organized by another, and several research consortia participate in the challenge. This presentation is a presentation of the evaluation; it is not a presentation of the systems that competed in the challenge. If you want more details about the systems themselves, please go to Interspeech, where some of them will be presented.
About my presentation: I will first present the tasks, then the corpus, the metrics we used, and some results from the dry run campaign that was conducted; then some perspectives and some conclusions.
So, the main task is to answer the question: who is present in the videos? That means, who is visible or who is speaking in the videos.
Two conditions are proposed. The first is a supervised condition, which means that the participants can build a priori models for the face or for the voice of the different persons that might be in the videos. On the other side you have an unsupervised condition, where the participants are allowed to use only the test videos to find the people.
Beyond this main task, we also have subtasks that ask more precise questions: who is speaking, who is visible in the video, what names are cited in the speech, and what names are displayed on the screen.
To answer these questions there are again two conditions: a multimodal condition, where people can use all the modalities to answer the question, and a monomodal condition where, for example, for who is speaking they can only use the speech, and for who is visible they can only use the video, the image.
And we assume that to answer these questions there are some technologies that are necessary, so we also assess them: speaker diarization, speech transcription, head detection and segmentation, overlaid text detection and segmentation, and optical character recognition for the text on screen.
So, about the schedule: as I said, a dry run was conducted this year, and the first and second official campaigns will be in 2013 and 2014.
So, what are the shows? You have seven different shows that are annotated in the corpus, and there are different episodes of the same show, so some people, for example the presenters, are present in multiple episodes. We worked with different kinds of shows, like news and information shows, political debates, parliamentary questions-to-the-government sessions, and celebrity news shows.
We chose these kinds of shows because they are very different and variable; some of them are more difficult than others because of the kind of speech. For example, for the celebrity news show you have more spontaneous speech, while the parliamentary questions to the government are mostly read speech, so we mixed the speech conditions. All these shows come from two different channels.
At the end of the project there will be sixty hours of video in the database. I imagine that you don't know these shows, so I propose to show you a little sample, to give you an idea of the data.
So, the corpus was annotated as follows. For the visual annotations, from the image point of view, we annotated one image every ten seconds.
We delimit the heads that appear in the frame, and the heads are described: we say, for example, whether there is an occlusion of the head, or other indications of that kind, and the person is named.
The text objects are delimited in a rectangle and transcribed, and in all the detected text transcriptions the person names are annotated.
And, to be more accurate, the appearance intervals of all the heads and all the texts are also given, so as to know where each person first appears.
For the speech annotation we have a standard transcription of all the dialogue, with the speaker-turn segmentation and the music segmentation too, and a rich speech transcription that includes all the disfluencies and all the words and expressions that might be useful to recognize the people.
And we name all the persons that are speaking, and all the person names pronounced in the speech transcription are annotated between brackets; you have an example here, with the cited name at the beginning.
So, the main evaluation metric we use is the Estimated Global Error Rate, which is based on the misses and the false alarms; but we also wanted to reward the fact that a system has found the correct number of people present in the video. That is why we include a confusion error: if you find the right number of people but make an error on the name of one of them, it is a less important error than missing someone entirely. That is why we use this kind of metric for the main task and for the questions who is speaking, who is visible, and what names are displayed. For the question what names are cited, we use the slot error rate, which is a comparison of the hypothesis and the reference intervals for the names.
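To make the idea concrete, here is a minimal, hypothetical sketch of such a global error rate, assuming per-key-frame reference and hypothesis name sets; the function name and the equal weighting of misses, false alarms, and confusions are my own simplifications, not the official metric definition.

```python
def estimated_global_error_rate(reference, hypothesis):
    """Toy global error rate: reference and hypothesis map each key
    frame to a set of person names. Leftover reference/hypothesis
    names in the same frame are paired as confusions, which the text
    above treats as less severe than outright misses."""
    miss = fa = conf = total = 0
    for frame, ref_names in reference.items():
        hyp_names = hypothesis.get(frame, set())
        total += len(ref_names)
        wrong_ref = ref_names - hyp_names   # people not found by name
        wrong_hyp = hyp_names - ref_names   # names output in error
        n_conf = min(len(wrong_ref), len(wrong_hyp))
        conf += n_conf
        miss += len(wrong_ref) - n_conf
        fa += len(wrong_hyp) - n_conf
    return (miss + fa + conf) / total if total else 0.0

# Hypothetical example: one confusion (Bob/Carol) and one miss (Alice).
ref = {1: {"Alice", "Bob"}, 2: {"Alice"}}
hyp = {1: {"Alice", "Carol"}, 2: set()}
print(estimated_global_error_rate(ref, hyp))  # (1 miss + 1 confusion) / 3
```

In a real scorer the confusion term is typically down-weighted relative to misses, which is exactly the motivation given above.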
So, for the dry run: the dry run corpus is a very short corpus, and the goal was to see what we could do with these metrics and this kind of corpus. It is clear that it is not enough data for the participants to develop high-performance systems, but that was not the goal of the dry run.
What we saw is that the speech duration per speaker is very short: the majority of the speakers speak for less than twenty seconds. That makes sense given these kinds of shows, but you also have people who speak for more than two hundred and sixty seconds, so there is real diversity in the corpus. For the distribution of people according to the number of key frames in which they appear, you have the same thing: some of them do not appear much. But usually, when someone appears in many key frames, he also speaks a lot, so by combining the audio and the vision information you might find who is speaking and who is present in the video.
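As a toy illustration of that fusion idea, one could score each person by combining normalized speech time with key-frame appearances; the scoring formula, names, and numbers here are invented for illustration and are not part of the evaluation.

```python
# Toy fusion of the two clues discussed above: speech time and number of
# key frames in which a person appears. All values are hypothetical.
def presence_score(speech_sec, n_keyframes, total_sec, total_frames):
    # Equal-weight combination of the two normalized clues.
    return 0.5 * speech_sec / total_sec + 0.5 * n_keyframes / total_frames

people = {"Alice": (120, 30), "Bob": (5, 2)}  # (speech seconds, key frames)
scores = {name: presence_score(sec, kf, total_sec=300, total_frames=60)
          for name, (sec, kf) in people.items()}
print(scores)  # Alice scores far higher, suggesting she is present
```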
And if you align, over all the corpus, the moments where the name of a person is displayed, where the face is visible, and where the speaker is speaking, you can see that for eight percent of the cases the person who is speaking appears and has his name displayed in the video at the same time. But, for example, you have around seventeen percent of the people who just have their name displayed on the screen; for the main task you must not say that these people are present in the video, because they are neither speaking nor visible. And this distribution is very different according to the kind of show. For example, for one of the shows, as many as thirty-two percent of the displayed names are not useful to find the people, whereas for LCP it is the contrary: you are almost sure that if you find the name of a person, this person is present in the video. So the participants have to analyze this kind of thing; it might be useful to exploit this kind of information.
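This kind of per-show analysis could be sketched as follows; the records, show names, and field layout are hypothetical stand-ins for the real annotations.

```python
from collections import defaultdict

# Hypothetical flat annotation records: one tuple per (key frame, person)
# occurrence: (show, person, name_displayed, is_speaking, is_visible).
annotations = [
    ("LCP-debate", "Alice", True,  True,  True),
    ("LCP-debate", "Bob",   True,  True,  False),
    ("celebrity",  "Carol", True,  False, False),
    ("celebrity",  "Dan",   False, True,  True),
]

# For each show: among displayed names, the fraction where the person is
# actually speaking or visible, i.e. where the name is a useful clue.
stats = defaultdict(lambda: [0, 0])   # show -> [useful, displayed]
for show, person, displayed, speaking, visible in annotations:
    if displayed:
        stats[show][1] += 1
        if speaking or visible:
            stats[show][0] += 1

for show, (useful, displayed) in stats.items():
    print(f"{show}: {100 * useful / displayed:.0f}% of displayed names useful")
```

On this toy data, every displayed name in the debate show is a reliable clue, while the celebrity show displays a name for someone who never speaks or appears, mirroring the contrast described above.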
So, these are the annotations and the clues you can use to answer the questions.
In the dataset there are more than two hundred and sixty-seven people, with one hundred and seventy-one people in the test set. And as you can see, there are some anonymous persons: that means that the annotators were not able to tell who the person was just by watching the video. That is why we call them anonymous; the systems have to find that there is someone, but they do not have to name them.
For the first results, it is clear that it was a dry run, so the results are not so good. What we want to compare here are the system results for the main task against the subtasks who is speaking and who is visible. As you can see, the systems get better results saying who is speaking than saying who is visible in the videos, and for the main task the main problem is to say who is visible.
For who is speaking in particular, we analyzed the results by comparing the supervised multimodal condition and the supervised monomodal condition. As you can see, there is no significant difference in the results between the two conditions, which means that the information coming from the other modalities was not used by the systems to improve their results.
On this side you have the results by show: the center of each circle represents the mean performance, and the radius represents the standard deviation across the systems. As you can see, depending on the show, the systems are more or less robust. If we compare them, for some shows the results are very consistent, so those shows are correctly processed; but for the dark green one, even if the error is lower, the variation of the performance is more important. So that might be something the systems have to improve.
Regarding who is visible in the videos, doing the same kind of analysis, you can see that there is a significant difference between the supervised multimodal condition and the supervised monomodal condition. So here the speech is useful, and the systems have exploited this complementary information. And here again you have the representation of the results according to the show, and again the variation of the performance from show to show is important.
For who is cited, we focused on the kinds of mistakes made by the systems, and as you can see the deletions are the most important error, for all the systems that participated. These results suggest that the ability of the systems to detect when a name is cited has to be improved, because they currently miss a lot of names.
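A toy version of the slot comparison behind this error analysis, counting deletions, insertions, and substitutions over (segment, name) citation slots; the segment-keyed representation is a simplification I am assuming, not the exact official scoring.

```python
def slot_error_rate(ref_slots, hyp_slots):
    """Toy slot error rate over {segment_id: cited_name} slots:
    deletions    = reference slots with no hypothesis in that segment,
    insertions   = hypothesis slots with no reference counterpart,
    substitutions = same segment, wrong name."""
    subs = sum(1 for seg in ref_slots
               if seg in hyp_slots and ref_slots[seg] != hyp_slots[seg])
    dels = sum(1 for seg in ref_slots if seg not in hyp_slots)
    ins = sum(1 for seg in hyp_slots if seg not in ref_slots)
    return (subs + dels + ins) / len(ref_slots)

ref = {1: "Alice", 2: "Bob", 3: "Carol"}  # names cited in the reference
hyp = {1: "Alice", 3: "Dan"}              # system misses Bob, confuses Carol
print(slot_error_rate(ref, hyp))          # (1 substitution + 1 deletion) / 3
```

The deletion term is the one the dry-run systems suffered from most, per the analysis above.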
For what names are displayed, the performance again can be improved. We focused on the OCR and text segmentation results: the OCR results are not so bad, so the systems can extract some information, and the segmentation is quite good. So it is again the extraction of the name from the text that is the major problem.
So, in conclusion: it is a multimodal challenge, and the goal is to find people in a multimodal condition, in French language. The main question is who is present in the video, but you have subtasks and more elementary questions that can be helpful to answer the main question, and this challenge is now open to anyone who wishes to participate, so you can join. For the dry run, it is clear that fusing the information can improve the results, and we also observed an important variability of the performance according to the shows.
For the perspectives: for the metrics, we want to take into account the role of the person in the videos, because for some applications, in particular the indexing of videos, the importance of a person depends on his role in the video, so it is an important point to work on. We also want to weight the importance of the people according to the available modalities: if you miss someone who is both speaking and visible, you make a bigger error than if he is just speaking or just visible on the screen. For that, we want to improve the characterization of the differences between shows, using more linguistic and speech analysis, and also to study how the same speaker behaves in different shows: it is not exactly the same thing to speak in parliament and to be in a debate with other people, so this is the kind of analysis we hope for.
So, thank you, and if you have any questions...
All the data description that I have given here covers all the data, because after the dry run this data will become the training part for the official campaign; that is why we did the analysis on all the data. It is a choice, because for now we do not have enough data to make that split.
Yes, the continuous video is presented to the systems, so they can use the full videos; it is the annotation, for the evaluation, that is based on key frames. The evaluation uses the key frames, but the participants must process the whole videos. It is just that, as you can imagine, it is very expensive to do this kind of annotation; that is why we restrict the evaluation this way. And that is why we indicate the beginning and the end of the appearance of the people, so that the systems have something like a diarization, even if it is not exactly a diarization of the videos. It is always the problem of the expensive part of doing this kind of annotation.
For the speech, for the question who is speaking, they have to answer for the whole video; but for the visible part they have to focus on the key frames. It is clear that at the beginning they do not know where the key frames are; they only learn at test time where they are, and where the people are in the video.
For the transcription, they have to transcribe all the videos: at the beginning the systems just have access to the videos, and they have to use their own systems to transcribe them. For some of the subtasks we provide resources, but for the main task they just start from the videos and use whatever technologies they want. So some of them used their own transcription system, and others preferred to just do a diarization. And in the unsupervised condition they cannot use a priori face models or voice models.
No, they know the names of the shows, so for example they can learn a model of the presenter, because for the information show it is always the same presenter. They know all the shows, but they do not always know in advance which people will appear.