0:00:13 | okay |
---|
0:00:15 | thank you for that introduction, good morning |
---|
0:00:19 | this work was done in collaboration with colleagues from the IBM Haifa Research Lab in Israel and the IBM T. J. Watson Research Center |
---|
0:00:31 | this work is about the European project named HERMES |
---|
0:00:38 | first I will introduce the project |
---|
0:00:41 | and then we will discuss our work on speech transcription, speaker tracking and spoken information retrieval in this project |
---|
0:00:55 | HERMES is a three-year-long research project in the area of ambient assisted living, partially funded by the European Union |
---|
0:01:08 | the goal of this project is to develop a personal assistant system for the elderly user |
---|
0:01:15 | to alleviate normal ageing-related cognitive decline by providing memory support and some cognitive training |
---|
0:01:26 | so the broad idea is to record, in audio and video, the personal experience of the user |
---|
0:01:35 | in part manually and in part automatically |
---|
0:01:43 | then to extract metadata from these recordings, and to offer the user a certain set of services |
---|
0:01:52 | when it comes to the audio recordings, which are the primary focus here |
---|
0:01:58 | the user is equipped with a mobile device, which I will call the PDA, a personal digital assistant |
---|
0:02:07 | and the user can record with it whole conversations of interest, at home or outside |
---|
0:02:18 | one of the central services, or applications, that the HERMES system offers to the user is called HERMES MyPast, which is a search over the past experiences of the user |
---|
0:02:30 | in our case recorded in audio; this is my primary focus, and specifically the speech-processing-related part of this application |
---|
0:02:42 | the idea is to let the user submit a query like, for example, "what did the doctor tell me yesterday about the diet?" |
---|
0:02:53 | if you look at this query, it is a composite query: it contains spoken words, like "diet" |
---|
0:03:03 | and contextual attributes, like the identities of the participants and the time of the conversation |
---|
0:03:11 | the system will retrieve and return to the user the relevant fragments of the relevant conversations that match this query |
---|
0:03:22 | the query is composed using a dedicated interface supporting the combination of such metadata and free-form content |
---|
0:03:32 | now, this is the control flow of the speech processing that supports this application |
---|
0:03:37 | basically, we have to extract the speaker identities from the conversations recorded with the PDA |
---|
0:03:49 | and to transcribe the speech to text |
---|
0:03:52 | then we have to index all this information, and be able to search over this indexed information not only fast, but also accurately |
---|
0:04:08 | such an application poses certain challenges to the speech processing; first of all, this is open-domain conversational speech, which is always a challenge |
---|
0:04:18 | furthermore, the recordings are made with a distantly placed PDA device: typically two people are talking to each other, and the PDA is placed on the table between them |
---|
0:04:33 | secondly, these are elderly voices, which are reported in the literature to pose challenges to ASR systems |
---|
0:04:44 | and last but not least, massive data collection for training cannot be afforded in such a project |
---|
0:04:52 | the target language for the HERMES prototype system was Castilian Spanish |
---|
0:04:57 | at the beginning of the project we performed a data collection, trying to collect as much audio as possible |
---|
0:05:06 | we collected data from forty-seven elderly and four young speakers |
---|
0:05:12 | the data was recorded simultaneously by the PDA, which is our target, and also by a headset microphone, for research and diagnostic purposes |
---|
0:05:20 | in total we collected about forty hours of data, distributed among dialogues, which are our target, freestyle monologues, and read-out text |
---|
0:05:33 | all this data underwent manual verbatim transcription and speaker labelling |
---|
0:05:41 | now I switch to the speech-to-text transcription part |
---|
0:05:45 | our work on this part was based on the Attila toolkit developed by IBM Research |
---|
0:05:55 | the systems that we used and built within this project are similar to each other in terms of their architecture |
---|
0:06:03 | they employ two-pass decoding, with feature-space speaker adaptation and discriminatively trained acoustic models in the second pass, and they employ a trigram statistical language model |
---|
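As a rough illustration of the kind of statistical language model mentioned above, here is a minimal count-based trigram model; this is not from the talk, and the training sentences and function names are invented (a real system would also use smoothing and backoff):

```python
from collections import defaultdict

def train_trigram(sentences):
    """Maximum-likelihood trigram model from raw counts (no smoothing)."""
    tri = defaultdict(int)   # counts of (w-2, w-1, w)
    bi = defaultdict(int)    # counts of (w-2, w-1)
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    def prob(w, h1, h2):
        # P(w | h1 h2) = count(h1 h2 w) / count(h1 h2)
        return tri[(h1, h2, w)] / bi[(h1, h2)] if bi[(h1, h2)] else 0.0
    return prob

p = train_trigram(["the doctor said diet", "the doctor said rest"])
# two continuations of "doctor said" were observed, once each
</imports>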
0:06:15 | the development here went through three phases: the baseline system, the intermediate system, and the advanced system |
---|
0:06:23 | as a baseline we adopted the Spanish system developed by IBM in the TC-STAR European project for the transcription of parliamentary speeches |
---|
0:06:34 | this is a huge system: its acoustic model contains about four thousand HMM states and about one hundred thousand Gaussians |
---|
0:06:43 | in the TC-STAR evaluations this system achieved an eight percent word error rate, which is very successful |
---|
0:06:49 | we evaluated this baseline system on the HERMES data, including read-out text and dialogues, recorded with the headset microphones and the PDA |
---|
0:07:00 | the word error rates from this evaluation are presented in this table |
---|
0:07:09 | this evaluation actually revealed the high degree of mismatch between the baseline training conditions, which are parliamentary speeches recorded with close-talking microphones |
---|
0:07:20 | and the target conditions, which are free dialogues recorded with a distantly placed PDA |
---|
0:07:26 | in this table you can see the influence of the linguistic aspect of this mismatch and of the acoustic aspect separately |
---|
0:07:35 | overall, due to both aspects, the word error rate jumps from twenty-four percent for the dialogues recorded with the headset microphone |
---|
0:07:45 | to sixty-eight percent for the dialogues recorded with the PDA |
---|
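For reference, the word error rate quoted throughout the talk is the standard edit-distance metric; a minimal sketch (not from the talk) of how it is computed over words:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / reference length,
    via the standard Levenshtein alignment over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(r)][len(h)] / len(r)
```

For example, one substitution plus one deletion against a four-word reference gives a WER of 0.5.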
0:07:50 | next, we built an intermediate system by adapting the baseline language model and acoustic model |
---|
0:07:57 | the language model adaptation included training a new language model on a subset of the HERMES conversation transcripts |
---|
0:08:07 | and interpolating between the baseline language model and the new language model |
---|
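The interpolation step described above can be sketched as a simple linear mixture of the two models; the probabilities and the weight below are invented purely for illustration:

```python
def interpolate(p_domain, p_base, lam):
    """Linear interpolation: P(w|ctx) = lam*P_domain + (1-lam)*P_base."""
    def p(w, ctx):
        return lam * p_domain(w, ctx) + (1.0 - lam) * p_base(w, ctx)
    return p

# Hypothetical unigram-style probabilities for illustration only:
# the in-domain model boosts conversational words like "diet".
base = lambda w, ctx: {"diet": 0.001, "the": 0.05}.get(w, 1e-6)
dom = lambda w, ctx: {"diet": 0.02, "the": 0.04}.get(w, 1e-6)
mix = interpolate(dom, base, lam=0.5)
```

In practice the weight would be tuned to minimise perplexity on held-out in-domain data.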
0:08:13 | the acoustic model adaptation was done via speaker enrollment, adapting the baseline acoustic model on the HERMES monologue data of the target speaker |
---|
0:08:25 | in this table you can see the evaluation of the intermediate system on the dialogues recorded by the PDA |
---|
0:08:36 | here you can see the contributions of the language model adaptation and of the acoustic model adaptation separately |
---|
0:08:45 | overall, the intermediate system reduces the word error rate from sixty-eight percent to fifty-four |
---|
0:08:55 | finally, we built the advanced system, trained completely on the HERMES PDA data |
---|
0:09:04 | bootstrapping the training process with initial alignments obtained with the baseline system |
---|
0:09:11 | this advanced system was trained on just eight hours of speech, produced by forty-nine speakers; it is a very limited data set |
---|
0:09:20 | and we kept the data of only two elderly speakers, a male and a female, for testing |
---|
0:09:26 | the model of this system is relatively small, about four times smaller than the baseline and the intermediate ones, and it does not require speaker enrollment, so in that sense it is a deployment-friendly system |
---|
0:09:40 | here you can see the evaluation of all three systems on the same data set, comprised of the conversational speech recorded by the PDA |
---|
0:09:50 | and you see that the advanced system achieved a thirty-nine point two percent word error rate, which is a dramatic improvement in accuracy relative to the baseline and intermediate systems |
---|
0:10:02 | now we switch to the speaker tracking |
---|
0:10:04 | as you know, the speaker tracking task aims to answer the question "who spoke when" on a single-channel audio |
---|
0:10:14 | it can be seen as the concatenation of two sub-tasks |
---|
0:10:16 | speaker diarisation, where we segment the audio into speaker turns and then cluster these segments according to speaker similarity |
---|
0:10:27 | and speaker recognition, where we assign speaker identity labels to these clusters |
---|
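The two sub-tasks can be sketched end to end on toy data; everything below (the 2-D "embeddings", the speaker names, the nearest-seed clustering) is invented for illustration and is not the technique developed in HERMES:

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def track_speakers(segments, enrolled):
    """Toy two-speaker tracking: 'diarise' the segments into two clusters
    (seeded by the first segment and the segment farthest from it), then
    'recognise' each cluster by the nearest enrolled speaker model."""
    seed = segments[0]
    far = max(segments, key=lambda s: dist(s, seed))
    labels = []
    for s in segments:
        centroid = seed if dist(s, seed) <= dist(s, far) else far
        name = min(enrolled, key=lambda n: dist(enrolled[n], centroid))
        labels.append(name)
    return labels

# Made-up segment embeddings from a two-speaker dialogue.
segs = [(0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9)]
who = track_speakers(segs, {"maria": (0.0, 0.0), "jose": (1.0, 1.0)})
```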
0:10:33 | in HERMES we deal with two-speaker conversations, typically dialogues of the target speaker with another person |
---|
0:10:42 | the speaker tracking in HERMES is used primarily for the search |
---|
0:10:48 | here we need only to know the identities of the speakers participating in the conversation; secondarily, it is used to enhance the readability of the transcribed speech while it is browsed by the user |
---|
0:11:05 | for the two-speaker diarisation problem, a very effective and simple technique has been developed |
---|
0:11:13 | it is described in detail in this paper |
---|
0:11:20 | this technique was evaluated on the NIST telephone evaluation data and achieved a two point eight percent equal error rate |
---|
0:11:29 | on the HERMES dialogues it achieved a twenty-four percent, excuse me, frame error rate, which means the percentage of incorrectly classified frames |
---|
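The frame error rate just defined is straightforward to compute; a minimal sketch (not from the talk), where each frame carries a speaker label:

```python
def frame_error_rate(ref_frames, hyp_frames):
    """Percentage of frames whose speaker label disagrees with the
    reference labelling, i.e. incorrectly classified frames."""
    wrong = sum(1 for r, h in zip(ref_frames, hyp_frames) if r != h)
    return 100.0 * wrong / len(ref_frames)
```

For example, one wrong frame out of four gives a 25 percent frame error rate.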
0:11:41 | the difference in performance is accounted for by the very challenging recording conditions in HERMES |
---|
0:11:51 | now, speaker recognition |
---|
0:11:55 | here, speaker recognition is applied to the segments provided by the speaker diarisation |
---|
0:12:01 | this facilitates speaker recognition, because speaker recognition on unsegmented multi-party audio is extremely challenging |
---|
0:12:16 | still, the problem persists, because the diarisation is not perfect, so the segments that we apply speaker recognition to typically contain frames from both speakers |
---|
0:12:27 | as far as we know, the state-of-the-art speaker recognition algorithms are not immune to the interfering speaker, so additional work is needed here |
---|
0:12:41 | to this end, a very effective approach was developed in the HERMES project |
---|
0:12:49 | that reduces the influence of the interfering speaker on the recognition algorithms |
---|
0:13:02 | this technique is described in detail in these two publications |
---|
0:13:07 | the equal error rate on the NIST telephone data is about four percent, and on the HERMES dialogues, segmented by the diarisation, it is about eleven percent; again, the difference is accounted for by the HERMES recording conditions |
---|
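The equal error rate quoted here is the operating point where the false-accept and false-reject rates coincide; a minimal threshold-sweep sketch (not from the talk, with made-up verification scores):

```python
def equal_error_rate(target_scores, impostor_scores):
    """Sweep the decision threshold over all observed scores and return
    the point where false-reject and false-accept rates are closest."""
    best_gap, result = 2.0, None
    for t in sorted(target_scores + impostor_scores):
        fr = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(fr - fa) < best_gap:
            best_gap, result = abs(fr - fa), (fr + fa) / 2.0
    return result

# Hypothetical scores: higher means "same speaker as claimed".
eer = equal_error_rate([0.9, 0.8, 0.7, 0.2], [0.1, 0.3, 0.6, 0.0])
```

Real evaluations interpolate on the DET curve rather than taking the closest discrete threshold, but the idea is the same.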
0:13:24 | and finally we move to the spoken information retrieval |
---|
0:13:30 | all the metadata extracted from the audio is indexed; so, what are we indexing? |
---|
0:13:35 | first, the word confusion networks provided by the ASR system |
---|
0:13:39 | which means that for each word we index its N-best alternatives, along with their confidence measures |
---|
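Indexing a word confusion network can be sketched as an inverted index where every alternative in every slot becomes a posting; the data layout below (slot tuples, field order) is an assumption for illustration, not the HERMES implementation:

```python
from collections import defaultdict

def index_confusion_networks(conversations):
    """Build an inverted index over confusion networks: each alternative
    word in each slot is indexed with its conversation, time stamps and
    confidence, so lower-ranked ASR hypotheses remain searchable."""
    index = defaultdict(list)
    for conv_id, slots in conversations.items():
        for start, end, alternatives in slots:
            for word, conf in alternatives:
                index[word].append((conv_id, start, end, conf))
    return index

# Hypothetical slots: (start_sec, end_sec, [(word, confidence), ...]).
convs = {"c1": [(0.0, 0.4, [("diet", 0.7), ("die", 0.2)]),
                (0.4, 0.9, [("plan", 0.6)])]}
idx = index_confusion_networks(convs)
```

This is why indexing the whole confusion network, rather than only the top-best path, can recover words the first-pass recogniser got wrong.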
0:13:50 | next, the word time stamps; and finally, the speaker identities associated with the conversation |
---|
0:13:57 | we defined a query language which enables combining spoken terms and speaker identities in the same query |
---|
0:14:07 | the search function returns a list of items, ordered by relevance; each item contains the ID of the conversation and the time stamps of the relevant fragment inside the conversation |
---|
0:14:21 | it also employs a spell checker |
---|
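A composite query combining spoken terms with a speaker identity, returning relevance-ordered items with time stamps, might look like the following toy sketch; the index layout, scoring by summed confidence, and all the data are invented, not the HERMES query language:

```python
def search(index, speakers, terms, speaker=None):
    """Rank conversations by summed confidence of the matched spoken
    terms, optionally restricted to conversations involving `speaker`.
    Returns (conversation id, score, hit time stamps) by relevance."""
    scores = {}
    for term in terms:
        for conv_id, start, end, conf in index.get(term, []):
            if speaker is not None and speaker not in speakers.get(conv_id, ()):
                continue  # speaker-identity filter from the query
            entry = scores.setdefault(conv_id, [0.0, []])
            entry[0] += conf
            entry[1].append((start, end))
    ranked = sorted(scores.items(), key=lambda kv: -kv[1][0])
    return [(cid, score, times) for cid, (score, times) in ranked]

# Hypothetical postings and per-conversation speaker labels.
idx = {"diet": [("c1", 0.0, 0.4, 0.7), ("c2", 1.0, 1.3, 0.4)]}
spk = {"c1": ("doctor", "maria"), "c2": ("jose",)}
hits = search(idx, spk, ["diet"], speaker="doctor")
```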
0:14:24 | eventually, we evaluated our end-to-end systems, including the ASR, indexing and retrieval |
---|
0:14:35 | we tested these systems on the task of conversation retrieval based on content terms |
---|
0:14:44 | which means that we did not use the timing information returned by the search function, and we did not include the speaker identity in the query |
---|
0:14:54 | for this evaluation we used the same twenty conversations, from the male and the female elderly speakers, that were used in the ASR evaluation |
---|
0:15:02 | fifty-five queries were composed manually against these twenty conversations |
---|
0:15:09 | which means a few queries, from one to four, per conversation |
---|
0:15:15 | they were composed by Spanish-speaking people |
---|
0:15:21 | the idea was to compare the speech search to the textual search, where the textual search is considered as the reference |
---|
0:15:31 | so for each query we found and marked the relevant conversations by searching with this query over the verbatim transcripts of all the twenty conversations |
---|
0:15:43 | in general, for each query there was more than one relevant conversation, because the conversations shared more or less the same topics |
---|
0:15:53 | then we applied the speech search and used the standard mean average precision measure to quantify the retrieval accuracy of the search |
---|
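The mean average precision measure used above can be sketched directly from its definition; this is a generic illustration (not from the talk), with invented conversation IDs:

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k over the ranks k at which a
    relevant item is retrieved, divided by the number of relevant items."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, 1):
        if item in relevant:
            hits += 1
            total += hits / k  # precision at this relevant hit
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """MAP: average of AP over all queries.
    `results` is a list of (ranked id list, relevant id set) pairs."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)
```

For example, retrieving the relevant items at ranks 1 and 3 out of two relevant items gives an AP of (1/1 + 2/3) / 2 = 5/6.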
0:16:09 | here you can see four evaluations; each evaluation is represented by two bars |
---|
0:16:18 | the blue bar is the ASR word error rate, and the red bar is the mean average precision, in percent |
---|
0:16:27 | the first evaluation was for the baseline ASR system, then the intermediate ASR, and then the advanced |
---|
0:16:36 | all these three evaluations were done with indexing only the first, top-best guess from the ASR |
---|
0:16:45 | and here you can see that with the advanced ASR system we achieve a seventy percent mean average precision of the search |
---|
0:16:53 | the final evaluation, for the advanced system, was done with indexing the entire word confusion network |
---|
0:17:01 | and it brings seventy-six percent mean average precision, which means that we are pretty close to the textual search |
---|
0:17:09 | and now, to wrap up |
---|
0:17:12 | first, it seems that the speech processing technologies are mature enough to meet the challenges of the ambient assisted living applications |
---|
0:17:22 | secondly, the availability of domain-specific data is very, extremely important |
---|
0:17:30 | on the other hand, many projects cannot afford a large-scale data collection, so there is a need for data collaboration and data sharing |
---|
0:17:41 | which would be very useful for the progress in this area |
---|
0:17:46 | next, the speaker tracking technology provides reasonable performance on two-speaker conversations recorded in such adverse conditions, by a distant mobile device |
---|
0:18:03 | and finally, advanced speech search technology can overcome a substantially high ASR error rate |
---|
0:18:13 | and allows us to approach the performance of the textual information retrieval |
---|
0:18:19 | thank you |
---|
0:18:24 | okay, questions? |
---|
0:18:30 | thank you for the talk; I have two questions |
---|
0:18:34 | the first one: if people use this over a long time, then you could expect them to enroll, or you could use some unsupervised adaptation |
---|
0:18:46 | so my first question is |
---|
0:18:49 | if you are right now at about thirty-nine percent word error rate, what is your prediction on how far you would get with unsupervised adaptation, or supervised adaptation? |
---|
0:19:03 | and the other question is: with that kind of population, you could have dramatic changes, like the person suffers a stroke, which would totally change the acoustics |
---|
0:19:14 | so do you have any idea of how you would deal with that? |
---|
0:19:21 | sorry, could you repeat the second question? |
---|
0:19:24 | the second question is: an elderly person can have a very dramatic change of their voice characteristics |
---|
0:19:32 | for instance because they develop some illness, or suffer a stroke, something like that, which totally changes the voice acoustics |
---|
0:19:42 | okay, so to the first question |
---|
0:19:46 | in general, supervised speaker enrollment can help to bring the error rate lower |
---|
0:20:00 | but it complicates the deployment and the installation of such a system |
---|
0:20:08 | and I am not quite sure that the gain in the ASR accuracy would pay off at the level of the speech search, because we are not going to read these transcripts; what we need is just to search them |
---|
0:20:27 | so I am not sure that this complication of the deployment would be paid off at the level of the search accuracy |
---|
0:20:35 | to your second question: absolutely, I agree with you, and this is an open research area |
---|
0:20:44 | actually, in this project we had to deal with a pilot, with real problems |
---|
0:20:52 | and this is the first time, um |
---|
0:21:04 | I would like to know the answer myself |
---|
0:21:14 | it suggests that maybe speaker enrollment is not so useful |
---|
0:21:21 | to some extent you can keep using the system as the voice characteristics degrade gradually |
---|
0:21:33 | and I do not know, maybe at some point the user will not be able to use such a system any more |
---|
0:21:41 | okay, thank you |
---|
0:21:43 | I think we need to thank the speaker |
---|