Speech transcript - SPEAKER DIARIZATION OF HETEROGENEOUS WEB VIDEO FILES: A PRELIMINARY STUDY

0:00:21	i'm waiting for the screen
0:00:38	yeah
0:00:41	hi i'm pierre clement from the university of avignon
0:00:44	and
0:00:45	i will talk to you about a preliminary study we did uh on speaker diarization of
0:00:50	heterogeneous web video files
0:00:54	i will start with an introduction then i will describe the lia speaker diarization system
0:01:00	uh describe the database we then used for this study
0:01:04	show you some results and uh
0:01:07	a conclusion
0:01:08	and some perspectives
0:01:10	as you all know
0:01:12	speaker diarization is the process of finding in an audio stream who spoke when with no a priori
0:01:17	information on the
0:01:19	identity of the speakers or their number
0:01:22	and it's important to note that
0:01:24	in the speaker diarization process
0:01:27	we don't do speaker identification
0:01:32	as you also know
0:01:34	there are uh two approaches
0:01:37	for speaker diarization systems
0:01:39	bottom-up and top-down
0:01:41	uh the top-down approach
0:01:43	is used by a system such as the lia system and the bottom-up approach is used
0:01:49	by a system such as the lium system
0:01:52	so uh in the top-down system we start with no speakers and we add them
0:01:57	one by one until the stopping criterion is reached
0:02:01	and in the bottom-up approach we start with a lot of speakers and we um
0:02:06	and we merge them until the stopping criterion is reached
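
The bottom-up idea described here can be illustrated with a minimal sketch. It is not either system from the talk: the one-dimensional segment features, the absolute-distance measure and the `stop_distance` threshold are all assumptions chosen to keep the example short. A real system would compare statistical speaker models instead.

```python
from statistics import mean


def bottom_up_clusters(segment_features, stop_distance):
    """Greedy agglomerative clustering of 1-D segment features (toy example).

    Starts with one cluster per segment ("a lot of speakers") and repeatedly
    merges the two closest clusters until the closest pair is farther apart
    than stop_distance (the stopping criterion).
    """
    clusters = [[x] for x in segment_features]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(mean(clusters[i]) - mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_distance:
            break  # stopping criterion reached
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical features for five short segments.
features = [0.1, 0.2, 5.0, 5.1, 9.9]
clusters = bottom_up_clusters(features, stop_distance=1.0)
print(len(clusters), clusters)
```

A top-down system runs in the opposite direction: it starts from a single model and splits off one new speaker at a time until its own stopping criterion is met.
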
0:02:11	the main idea of this study was
0:02:13	to test
0:02:14	uh our speaker diarization system and its behavior on a different uh on new
0:02:21	on new content
0:02:22	in a new context
0:02:24	which is web video
0:02:26	this system has been tested on uh
0:02:29	broadcast
0:02:30	news data in the french evaluation campaign
0:02:33	ester
0:02:34	and on meeting data
0:02:37	in the um
0:02:38	nist evaluation campaign rt
0:02:46	uh the
0:02:48	yeah this is the description of our system
0:02:53	there are three main steps
0:02:55	in the um in our process we start with the speech/non-speech segmentation also called
0:03:00	speech activity detection
0:03:03	then we have a segmentation step
0:03:05	and a re-segmentation step
0:03:07	a re-segmentation
0:03:08	the re-segmentation step which aims to refine
0:03:12	the um the results we have produced
0:03:15	so in the uh speech/non-speech detection we initialize an hmm from the given gmms
0:03:22	we apply a viterbi decoding and we have our segmented file
0:03:26	then uh this file is the basis for the next step which is the segmentation step
0:03:31	in the segmentation step we initialize
0:03:33	an hmm with one speaker
0:03:35	which will be the default speaker
0:03:38	we try to add a speaker
0:03:41	and then we are in the loop uh of training and decoding
0:03:46	uh we check if we can add a new speaker if
0:03:50	we can't
0:03:50	we have our segmented file
0:03:52	and if we can add the speaker we
0:03:55	we go back
0:03:56	to the beginning of the loop
0:04:00	then finally there is the re-segmentation step where uh we initialize
0:04:05	we generate an hmm
0:04:06	from the previous
0:04:07	segmented file
0:04:09	and also in a loop
0:04:11	of viterbi decoding and adaptation
0:04:14	and we have our final segmentation
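
Viterbi decoding appears in each step of the system described above, so a small sketch of it on a two-state speech/non-speech model may help. This is only an illustration under stated assumptions: the per-frame speech probabilities, the 0.8/0.2 transition values and the function name `viterbi` are invented here, whereas the talk's system derives its HMM from GMMs.

```python
import math


def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely state sequence of an HMM (log domain).

    log_emissions[t][s]: log-likelihood of frame t under state s
    log_trans[p][s]:     log-probability of moving from state p to state s
    log_init[s]:         log-probability of starting in state s
    """
    n_states = len(log_init)
    scores = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda p: scores[p] + log_trans[p][s])
            new_scores.append(scores[prev] + log_trans[prev][s] + frame[s])
            pointers.append(prev)
        scores = new_scores
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# State 0 = non-speech, state 1 = speech. Hypothetical per-frame speech probabilities.
speech_prob = [0.1, 0.2, 0.9, 0.4, 0.9, 0.1]
log_emissions = [[math.log(1 - p), math.log(p)] for p in speech_prob]
log_trans = [[math.log(0.8), math.log(0.2)],
             [math.log(0.2), math.log(0.8)]]
log_init = [math.log(0.5), math.log(0.5)]

path = viterbi(log_emissions, log_trans, log_init)
print(path)  # [0, 0, 1, 1, 1, 0]
```

Thresholding each frame independently at 0.5 would label the fourth frame as non-speech. The transition probabilities make the decoder keep it inside the surrounding speech segment instead.
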
0:04:16	i
0:04:19	uh
0:04:20	as i said in the introduction the main idea of this study was to test our system
0:04:24	in a new context which is web video files
0:04:28	the content of web video files is uncontrolled we have different kinds of video such as movie trailers
0:04:34	or broadcast news
0:04:36	and different recording tools for example uh you can have a video recorded in a studio or with a
0:04:42	cell phone
0:04:43	we decided to
0:04:45	build the database
0:04:47	which is divided
0:04:49	into seven categories
0:04:51	described just after
0:04:54	so the database
0:04:56	contains more than eight hundred videos in seven categories
0:04:59	documentary movie trailer cartoon commercial news
0:05:03	sport and music video
0:05:05	in this study we left out
0:05:07	uh
0:05:08	two categories
0:05:09	sport because we don't have
0:05:11	the video stream
0:05:13	and music video because it's very difficult and very particular data
0:05:20	we manually annotated
0:05:22	a part of this corpus
0:05:24	we annotated the audio
0:05:27	the audio files
0:05:29	of uh one hundred
0:05:31	and twenty-nine video files
0:05:33	uh
0:05:34	which represent around ten hours and a half
0:05:38	these numbers are about the annotated part
0:05:42	of the corpus
0:05:44	the main thing that we can deduce from this table is that we
0:05:49	um
0:05:50	we have the category which should be the best the news at the bottom of the table
0:05:56	and the one which should be the worst
0:05:58	the movie trailer
0:05:59	and these categories should be the best and the worst
0:06:02	because the um the length of the speaker turns
0:06:06	for the news is very high and for the movie trailer is very low
0:06:10	this
0:06:11	information could
0:06:13	be important because if you remember what i said just before
0:06:17	we train our models with data and if we don't have enough data to train our models
0:06:21	with
0:06:21	we shouldn't have
0:06:23	a good result
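
The statistic discussed here, the average speaker turn length per category, can be computed from manual annotations along the following lines. The `(start, end, speaker)` tuple format and the two tiny example files are invented for illustration and do not come from the corpus presented in the talk.

```python
def turn_statistics(segments):
    """Return (number of speakers, mean turn duration in seconds).

    segments: list of (start_seconds, end_seconds, speaker_label) tuples,
    one tuple per annotated speaker turn.
    """
    durations = [end - start for start, end, _ in segments]
    speakers = {speaker for _, _, speaker in segments}
    return len(speakers), sum(durations) / len(durations)


# Hypothetical annotations: long turns in news, short turns in a trailer.
news = [(0.0, 30.0, "anchor"), (30.0, 50.0, "reporter")]
trailer = [(0.0, 2.0, "a"), (2.0, 3.0, "b"), (3.0, 6.0, "a")]

news_stats = turn_statistics(news)
trailer_stats = turn_statistics(trailer)
print(news_stats)     # (2, 25.0)
print(trailer_stats)  # (2, 2.0)
```

With short turns each speaker model sees very little contiguous speech, which is the reason given in the talk for expecting movie trailers to be the hardest category.
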
0:06:28	so the results
0:06:30	then uh them set
0:06:33	in the
0:06:34	for this study we compared our system to the lium bottom-up system the lium
0:06:40	bottom-up
0:06:41	system works
0:06:42	uh
0:06:44	like our system
0:06:47	with the uh speech/non-speech segmentation the segmentation
0:06:52	uh based on the bic criterion and the re-segmentation
0:06:59	we tested
0:07:00	these systems on
0:07:03	different data sets
0:07:04	the rt 09
0:07:06	uh
0:07:07	data set from the nist
0:07:09	evaluation campaign
0:07:11	it's meeting data
0:07:13	and
0:07:14	on uh
0:07:15	ester two thousand eight
0:07:17	uh from the french evaluation campaign ester two it's broadcast news data
0:07:22	and on our uh annotated subset
0:07:26	of the web video database with manual and automatic
0:07:29	speech/non-speech segmentation
0:07:31	we will see after why
0:07:35	so these are our preliminary results
0:07:37	the first
0:07:39	thing that we can outline
0:07:40	is
0:07:41	is that uh we have
0:07:43	quite good results
0:07:45	if you remember what was said just before
0:07:48	but uh we are not so far from the state of the art
0:07:51	results
0:07:54	uh the second thing is that uh
0:07:57	we know that the lium system outperforms ours
0:08:00	uh
0:08:02	and you can see that on ester two thousand eight
0:08:05	uh they do two times better than us
0:08:10	than our system
0:08:11	but
0:08:12	on the uh on the web video
0:08:16	corpus
0:08:17	uh this
0:08:18	um
0:08:21	this
0:08:21	remark cannot be applied because they are not two times better
0:08:27	than our system
0:08:32	uh
0:08:33	then you can see that
0:08:35	the um a large part of the um
0:08:39	of the diarization error rate
0:08:41	is due to speech/non-speech segmentation error
0:08:44	so we tried to remove it to measure the influence of the segmentation the first
0:08:50	speech/non-speech
0:08:52	detection step
0:08:53	this is the reason why we applied our system
0:08:56	our system on the automatic speech/non-speech segmentation
0:09:00	and on the manual segmentation
0:09:03	so in the results uh there is nearly no error
0:09:06	for the
0:09:07	with the perfect
0:09:09	um
0:09:11	with the perfect
0:09:12	speech/non-speech segmentation
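
To make the split between speech/non-speech error and speaker error concrete, here is a simplified frame-level sketch of the diarization error rate. It makes several assumptions that the talk does not state: frames of equal length, `None` for non-speech, no overlapped speech, and hypothesis labels already mapped to the reference labels. Standard scoring tools work on time-weighted segments and search for the best speaker mapping themselves.

```python
def diarization_error(reference, hypothesis):
    """Frame-level diarization error rate, split into its components.

    reference, hypothesis: equal-length lists of speaker labels per frame,
    with None meaning non-speech. Rates are relative to reference speech frames.
    """
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # speech labelled as non-speech
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # non-speech labelled as speech
    return {
        "missed": missed / speech,
        "false_alarm": false_alarm / speech,
        "confusion": confusion / speech,
        "der": (missed + false_alarm + confusion) / speech,
    }


# Hypothetical ten-frame file with two speakers, A and B.
reference = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
with_automatic_sad = ["A", "A", "A", None, "A", None, "B", "B", "A", "B"]
with_manual_sad = ["A", "A", "A", "A", None, None, "B", "B", "A", "B"]

auto = diarization_error(reference, with_automatic_sad)
manual = diarization_error(reference, with_manual_sad)
print(auto["der"], manual["der"])  # 0.375 0.125
```

With the manual speech/non-speech segmentation the missed-speech and false-alarm terms drop to zero, so what remains is the speaker confusion produced by the diarization system itself.
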
0:09:16	we also tried to remove this error to measure the influence of the system
0:09:20	and of the data as well
0:09:22	as expected you can see that the best category is the news category
0:09:28	and the worst category for our system is
0:09:31	the movie trailer category as
0:09:33	expected
0:09:37	uh
0:09:38	you can see that um the lium system outperforms our system in nearly all
0:09:45	the categories
0:09:46	but the ranges of the um
0:09:49	of the scores are quite close
0:09:52	uh for example for news the minimum error rate is around zero percent for each system
0:09:58	and the maximum diarization error rate for cartoon is around
0:10:02	seventy-two percent
0:10:04	for both
0:10:08	another thing that we can uh deduce from this table is that
0:10:12	uh
0:10:14	this
0:10:14	is also something that we knew
0:10:16	that the lium system found more speakers than our system
0:10:22	but you can see
0:10:24	uh when you look at the scores that
0:10:26	the um
0:10:28	the speakers found by the lium system
0:10:31	are not more reliable than our
0:10:34	speakers found even if
0:10:35	the number of speakers found
0:10:37	is higher
0:10:43	um um
0:10:44	in conclusion this study outlines the difficulties encountered by both systems
0:10:50	by both systems
0:10:53	and uh and that there was a new was what done
0:10:56	it also outlines
0:10:58	that it's a very difficult database
0:11:00	with a lot of variability between categories a high interactivity variability of the number and the duration of
0:11:07	of the speaker turns
0:11:10	and there is a lot of noise all of these
0:11:12	should explain why we have bad results
0:11:15	and the
0:11:18	our perspective
0:11:19	is
0:11:20	uh first to deal only with the categories where we are the best
0:11:25	and uh in a second time
0:11:28	the main um
0:11:29	research axis will be
0:11:31	to use information from the video stream to help the decision
0:11:36	on the on the speaker
0:11:38	thank you for your attention
0:11:40	and if you have any
0:11:41	questions
0:11:42	i
0:11:48	we
0:11:50	i
0:11:52	oh
0:11:53	hmmm
0:11:57	so two questions uh the first one uh
0:12:00	did you score overlapped speech
0:12:02	no
0:12:02	no because our system cannot for now uh handle overlapped
0:12:06	speech okay and
0:12:08	you had the data sets marked manually and
0:12:12	number of speakers and average speaker turn
0:12:14	do you know the distribution another important factor in diarization is that even if you have
0:12:19	five speakers if it's dominated by two
0:12:23	then you can actually do
0:12:24	all right if
0:12:25	two speakers speak ninety percent of the time
0:12:27	did you look at how the talk in the different categories might have been distributed
0:12:31	we didn't really measure that
0:12:33	but uh
0:12:36	i recall the repartition is quite equivalent for all the speakers
0:12:40	uh
0:12:41	for some categories
0:12:44	there is no um
0:12:48	dominant speaker
0:12:50	yeah it's
0:12:51	uh
0:12:52	it depends on the categories
0:12:54	like in news and documentary there are dominant speakers
0:12:58	but for movie trailers cartoon and commercial it's not the same
0:13:06	i
0:13:07	i
0:13:09	oh
0:13:10	do you do anything special with music because i can imagine there is a lot of music for
0:13:14	example in movie trailers
0:13:16	or it can be like only music or music in the background
0:13:20	yeah we don't use music uh information
0:13:23	for now
0:13:25	it might be uh
0:13:26	something interesting to do
0:13:28	that's uh
0:13:29	and just to answer your question
0:13:31	we don't use the music information
0:13:34	with the music first
0:13:35	mission fun
0:13:36	which means that you don't you do not score
0:13:39	the parts where there is music
0:13:43	it depends on
0:13:44	how it's labeled by the speech/non-speech uh step if the music is recognized
0:13:51	as a speech um
0:13:54	as the non-speech label
0:13:55	it won't be scored but if it's
0:13:57	uh marked as speech uh
0:14:00	label it will be scored
0:14:05	i
0:14:08	i
0:14:11	but
0:14:17	but
0:14:18	oh
0:14:20	but
0:14:22	yeah
0:14:23	i
0:14:25	i
0:14:28	i
0:14:30	uh it's
0:14:32	here again it depends on the categories
0:14:34	movie trailers cartoons
0:14:36	are very noisy
0:14:39	that's uh a
0:14:40	mm use
0:14:41	quite X
0:14:46	i
0:14:48	i
0:14:55	um


Speaker Diarization

Presenter: Pierre Clement, Authors: Pierre Clement, Université d'Avignon, France; Thierry Bazillon, Aix Marseille Université, France; Corinne Fredouille, Université d'Avignon, France