0:00:13 | Hello everyone, and thank you for attending my presentation.
0:00:21 | I'm going to present our work on linguistically aided speaker diarization using speaker role information.
0:00:35 | First of all, let's see what our task is, that is, what speaker diarization is.
0:00:40 | So, in a generic setting, diarization wants to answer the question "who spoke when?"
0:00:47 | We are given as input a raw speech signal,
0:00:53 | and what we want is to partition the signal into speaker-homogeneous segments,
0:00:59 | without having any prior information about the speakers in the conversation.
0:01:07 | Conceptually, and traditionally, this task involves two steps.
0:01:14 | First, we want to segment the signal into speaker-homogeneous segments, and this can be done either in a uniform way or according to some speaker change detection.
0:01:28 | Then, having those speaker segments, we want to cluster them into distinct speaker groups.
0:01:36 | But there are specific problems connected to this step of clustering.
0:01:44 | In particular, if speakers within the conversation sound similar in terms of their acoustic characteristics, there is a risk of merging the corresponding clusters together.
0:02:07 | Also, if there is too much noise or silence within the speech signal, which perhaps has not been caught by voice activity detection, then we may construct clusters corresponding to those nuisances.
0:02:28 | As a result, the performance of the system can be affected, even if we knew in advance the number of speakers in the conversation.
0:02:44 | In this work we focus on scenarios where speakers carry specific roles.
0:02:52 | For example, we may think of a doctor-patient interaction, or a classroom interaction where we have the teacher and the students; in an interview we would have the interviewer and the interviewee, and so on.
0:03:08 | The interesting feature of those scenarios is that different roles are usually associated with distinct linguistic cues.
0:03:21 | For example, in an interview we expect that the interviewer will ask the questions and the interviewee will answer those questions.
0:03:29 | Or, in a medical conversation, we expect that the patient will describe their symptoms and that the doctor will use medical terminology instead.
0:03:42 | So the question now is: can we use language, and in particular those linguistic patterns, to assist speaker diarization?
0:03:54 | So, let us go back to the problem formulation. For diarization, in the traditional approach, what we do, given the audio signal, is first segment it, as we have already seen, and then cluster the resulting segments.
0:04:16 | Instead, here we propose to also process the textual information, which can be derived from an ASR system,
0:04:32 | and to use some external knowledge about the roles within the conversation, exploiting this knowledge to estimate role profiles; by profiles we mean the acoustic identities of the speakers in the conversation.
0:04:51 | Now, since we have those role profiles, we can convert the clustering problem into a classification one, and thus circumvent the potential problems connected with clustering that we mentioned earlier.
0:05:08 | In the next few slides I want to go into more detail on the components our system uses and how we have implemented them.
0:05:22 | Notice that in the first couple of steps of our system we only process the textual stream.
0:05:31 | Given the text, the first step is that we want to segment it, and after this segmentation step we want every segment to be uttered by a single speaker.
0:05:50 | Ideally, we would want a system that knows exactly where there is a speaker change in the conversation; instead, to make the problem tractable, we assume that there is a single speaker per sentence.
0:06:11 | So we will segment the text at the sentence level, and to this end we view this problem as a sequence labeling, or sequence tagging, problem.
0:06:23 | We construct the tagger as follows: initially we construct a character-level representation of each word, which we then concatenate with the word embedding of the corresponding word. This sequence of word representations is fed into a bidirectional LSTM, which predicts a sequence of labels.
0:07:02 | The labels here are two: one denotes that a word is at the beginning of a sentence, and the other denotes that a word is in the middle of a sentence, which essentially means every word which is not at the beginning.
0:07:17 | So a sentence here is each one of those sequences of words spanning from one beginning-of-sentence label until the next one.
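To make this tagging step concrete, here is a minimal sketch of such a character-plus-word BiLSTM tagger, assuming PyTorch; the layer sizes and the two-label scheme are illustrative assumptions rather than the exact configuration used in the talk.

```python
import torch
import torch.nn as nn

class SentenceTagger(nn.Module):
    """Tags each word as beginning-of-sentence or middle-of-sentence."""
    def __init__(self, vocab_size, char_vocab_size,
                 word_dim=100, char_dim=25, hidden_dim=128, num_labels=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        # Character-level representation of each word via a small BiLSTM.
        self.char_lstm = nn.LSTM(char_dim, char_dim,
                                 bidirectional=True, batch_first=True)
        # Word-level BiLSTM over [word embedding ; char representation].
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden_dim,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, word_ids, char_ids):
        # word_ids: (num_words,); char_ids: (num_words, max_word_len)
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        char_repr = torch.cat([h[0], h[1]], dim=-1)      # (num_words, 2*char_dim)
        feats = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        seq_out, _ = self.word_lstm(feats.unsqueeze(0))  # add batch dimension
        return self.out(seq_out.squeeze(0))              # per-word label scores
```

Decoding is then trivial: every word predicted as beginning-of-sentence opens a new segment, and the segment extends until the next such prediction.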
0:07:31 | Now, having those segments, we want to assign a role to each of them.
0:07:37 | Since we are working on a specific domain, we assume that we know a priori the roles appearing in this domain.
0:07:49 | So for each role we train a role-specific language model, and we additionally use a general language model; we interpolate those language models, and all the interpolation weights are optimized on a development set.
0:08:21 | Once we have the interpolated language models, we can simply assign to each text segment the role that minimizes the corresponding perplexity.
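This assignment rule is simple enough to sketch. The snippet below assumes KenLM-style n-gram models whose score() method returns a log10 sentence probability, and a hypothetical role_lms dictionary holding one interpolated model per role.

```python
def perplexity(lm, words):
    """Length-normalized perplexity of a word sequence under one model."""
    log10_prob = lm.score(" ".join(words))  # KenLM-style: log10 P(sentence)
    return 10 ** (-log10_prob / max(len(words), 1))

def assign_role(words, role_lms):
    """Pick the role whose language model gives the lowest perplexity."""
    return min(role_lms, key=lambda role: perplexity(role_lms[role], words))

# Usage, assuming e.g. role_lms = {"doctor": lm_doc, "patient": lm_pat}:
# role = assign_role("tell me where it hurts".split(), role_lms)
```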
0:08:35 | Now note that so far we have only worked with the text. For the next step, estimating the acoustic identities of the speakers appearing in the conversation, we also need the audio, so here we need to align the text and the audio.
0:08:56 | The textual information comes from an ASR system, which means that in a real-world application this alignment information is already available, practically at no extra cost.
0:09:10 | So, having those audio-aligned segments, we extract a speaker embedding, in our case with an x-vector extractor, for each segment assigned to a specific role.
0:09:26 | We can now define the role profile, that is the role's acoustic identity, as the average of all the speaker embeddings coming from that role.
0:09:41 | By doing so, however, we rely on the assumption that all the role-labeled segments are correct.
0:09:56 | However, we cannot be equally confident about all the role assignments, and the reason is that, since we are dealing with conversational interactions, after the segmentation we may have some very short segments, for example one-word utterances, which do not contain sufficient information for robust role recognition.
0:10:25 | So what we do instead is assign a confidence measure to each of those segments, and this confidence measure is the absolute difference between the best perplexity we get, from the most likely role, and the second-best one.
0:10:45 | We can then define a refined role profile as an average where we only take into account the segments for which the confidence is above some threshold,
0:11:09 | and this threshold is a tunable parameter of our system.
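Putting the last two ideas together, a minimal numpy sketch of this confidence-filtered profile estimation might look as follows; the array names and the fallback behavior are assumptions made for illustration.

```python
import numpy as np

def confidence(role_perplexities):
    """Gap between the two lowest per-role perplexities of one segment."""
    best, second = np.sort(role_perplexities)[:2]
    return second - best

def role_profile(xvectors, confidences, threshold):
    """Average the x-vectors of one role, keeping only confident segments."""
    keep = np.asarray(confidences) >= threshold
    if not keep.any():              # fall back to all segments if none qualifies
        keep = np.ones(len(confidences), dtype=bool)
    return np.asarray(xvectors)[keep].mean(axis=0)
```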
0:11:16 | So now that we have estimated the role profiles, we are ready to perform diarization, where instead of clustering we can follow a classification approach.
0:11:32 | We begin as in the traditional approach to diarization: first we segment the speech signal uniformly with a sliding window, and we extract a speaker embedding for each resulting segment.
0:11:49 | We then compute the cosine similarity of each segment with all the role profiles we have just estimated,
0:12:03 | and the role that is assigned to each segment is the one whose profile is most similar to the segment, in other words the one that maximizes this cosine similarity score.
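The classification step then reduces to a nearest-profile search under cosine similarity, roughly as in this sketch, which assumes the sliding-window embeddings and the role profiles are available as numpy arrays.

```python
import numpy as np

def classify_segments(segment_xvectors, profiles):
    """Assign every sliding-window embedding to its most similar role profile."""
    roles = list(profiles)                       # e.g. ["therapist", "patient"]
    P = np.stack([profiles[r] for r in roles])   # (n_roles, dim)
    X = segment_xvectors / np.linalg.norm(segment_xvectors, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    sims = X @ P.T                               # cosine similarity matrix
    return [roles[i] for i in sims.argmax(axis=1)]
```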
0:12:21 | So this is the system we are proposing, and we are going to evaluate it on dyadic psychotherapy interactions, where we have two roles, namely the therapist and the patient.
0:12:38 | We are also going to use a mix of corpora in order to train our sentence tagger and our language models; here you can see the datasets and the sizes of the corpora we are using.
0:12:57 | I'm not going to go into detail on the specific parameters that we used for the system and its several subsystems; I will just mention that the F1 score of our sentence tagger was around 0.8.
0:13:21 | The word error rate of the ASR system we are using was about 40% for the dataset we are testing on, which is high, but not surprising, since we are dealing with challenging spontaneous medical conversations.
0:13:40 | As for the baselines, we are using an audio-only and a language-only baseline.
0:13:46 | For the audio-only baseline, we compare against the traditional system I have already mentioned, where we have a uniform segmentation followed by PLDA-based clustering.
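For intuition, here is a toy rendition of the clustering stage of such an audio-only pipeline; an actual system would score segment pairs with PLDA, so the plain cosine distance below is a simplification.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_baseline(segment_xvectors, num_speakers):
    """Agglomerative clustering of per-window embeddings into speaker groups."""
    dists = pdist(segment_xvectors, metric="cosine")  # pairwise distances
    tree = linkage(dists, method="average")           # bottom-up merging
    return fcluster(tree, t=num_speakers, criterion="maxclust")
```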
0:14:01 | For the language-only baseline, we essentially run the first steps of our text-based system: we take the text, we segment it with our sentence tagger, and we assign each segment to a role.
0:14:22 | The only additional thing we need to do in order to evaluate diarization performance is to align the audio and the text; as I have already mentioned, when the text comes from an ASR system, the alignment information is already available.
0:14:44 | Here are our results on the data we are testing on, where we have used either the reference transcripts or the ASR transcripts, and either our sentence tagger or an oracle text segmentation; we also show our unimodal baselines next to the system that we have constructed.
0:15:10 | By looking at the numbers we can make some interesting observations and draw some interesting conclusions.
0:15:21 | First of all, if we compare the two baselines, we see that the results are better with the audio-only one. That is expected, since the acoustic stream contains more information for the task of speaker diarization, and this is why we propose using the lexical information only as a supplementary cue.
0:15:51 | What is interesting to note is that, for the language-only system, there is a big performance gap between the oracle segmentation and the tagger-based segmentation. The reason is that, as I also mentioned, with the tagger's oversegmentation we may get very short segments that do not contain sufficient linguistic information for role recognition.
0:16:23 | However, in our system we use this information only where it is expected to be useful: to average, over all the segments assigned to a role, the embeddings that give the acoustic identity of that role. So such errors tend to cancel out in our system after this averaging.
0:16:50 | A similar effect is observed when we compare the results using the reference versus the ASR transcripts. Because, as I said, we have a pretty high word error rate, there is a severe degradation in performance for the language-only system when using the ASR output.
0:17:15 | However, when the transcripts are only used for the profile estimation, as we are doing in our proposed system, then the performance degradation is substantially smaller.
0:17:29 | Finally, what we see here is that if we estimate the profiles using not all of the role-labeled segments, but only the segments that we are most confident about, then we get a further performance improvement.
0:17:50 | With the threshold parameter that we introduced earlier, here we are keeping, per session, the 80% of the segments that we are most confident about, and this percentage is a parameter optimized on the development set.
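Under the same assumptions as the earlier profile sketch, keeping a fixed fraction of the most confident segments per session could look like this:

```python
import numpy as np

def top_confident(xvectors, confidences, keep_fraction=0.8):
    """Keep the most confident fraction of a session's role-labeled segments."""
    cutoff = np.quantile(confidences, 1.0 - keep_fraction)  # e.g. 20th percentile
    return xvectors[confidences >= cutoff]
```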
0:18:11 | A first observation that can be made from this figure, where we have plotted the diarization error rate as a function of the number of segments per session that are kept for the profile estimation, is that, unless we use a very small number of segments per session, most of the time the performance is better than the audio-only baseline, which is illustrated by the dashed line as you see.
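For reference, the diarization error rate plotted on the vertical axis is the standard metric:

$$ \mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}{T_{\text{total speech}}} $$

i.e. the fraction of speech time that is wrongly attributed.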
0:18:48 | Also, if we compare the blue and the red lines, what we see is that, even though with the sequence tagger (the red line) we get slightly worse performance than with the oracle segmentation, if we choose wisely the number of segments to use, the tagger's performance approaches the oracle segmentation performance.
0:19:30 | To sum up my presentation: today we proposed a system for speaker diarization in scenarios where speakers carry specific roles, and we used the lexical information associated with those roles in order to estimate the speakers' acoustic identities. This in turn gives us the ability to follow a classification approach, instead of the clustering approaches that are commonly used for diarization.
0:20:01 | We evaluated our system on dyadic psychotherapy interactions, and we achieved a relative improvement of about 30% compared to the audio-only baseline.
0:20:14 | So, this was my presentation. Thank you very much for your attention.