0:00:15 | so thank you for a nice introduction |
---|
0:00:18 | my name is petr schwarz and |
---|
0:00:21 | at the beginning i was a researcher at the brno university of technology |
---|
0:00:27 | i worked in many different fields, starting with a phoneme recognizer |
---|
0:00:33 | then |
---|
0:00:36 | we really try to do it should speaker identification don't know asr |
---|
0:00:40 | are you |
---|
0:00:41 | roll the speaker is no particular isn't enough |
---|
0:00:45 | that many of you are still using |
---|
0:00:48 | but and down the here |
---|
0:00:51 | two thousand five |
---|
0:00:53 | something strange happened to us: basically we were approached |
---|
0:00:58 | by one company and |
---|
0:01:00 | the company |
---|
0:01:01 | said we will give you some money but we would like to have a different license |
---|
0:01:07 | for the recognizer that was publicly available |
---|
0:01:10 | so we said okay, this is fine, we will do this |
---|
0:01:13 | it will help us to |
---|
0:01:15 | finance the research, but it turned out to be |
---|
0:01:19 | quite a long experience; it took us |
---|
0:01:24 | nine months just to |
---|
0:01:27 | negotiate the license with the university |
---|
0:01:30 | and we realised two things: that there is interest from the commercial market, which can |
---|
0:01:36 | bring additional money |
---|
0:01:38 | and |
---|
0:01:41 | that we need to do it in a better way |
---|
0:01:44 | so then we started a company called phonexia |
---|
0:01:49 | so i would like to talk about two main topics |
---|
0:01:54 | today |
---|
0:01:55 | how to transfer speech technology from research to the |
---|
0:02:00 | market, and then i would like to |
---|
0:02:05 | show some challenges that we see from the user point of view |
---|
0:02:12 | so at the beginning i will talk a few words about the company |
---|
0:02:16 | then about the typical use cases |
---|
0:02:19 | the technologies that are behind our products |
---|
0:02:23 | and also |
---|
0:02:25 | how we deliver the technology to the users, and then i will indicate some |
---|
0:02:31 | grand challenges |
---|
0:02:34 | people usually don't know what is in speech |
---|
0:02:39 | but if you look at this |
---|
0:02:44 | at the |
---|
0:02:47 | at this slide, you can see |
---|
0:02:52 | that there is information about the speaker: it can be |
---|
0:02:56 | gender, it can be age, it can be speaker |
---|
0:03:00 | identity, you can detect for example emotional state |
---|
0:03:05 | meant the speaker speaks and so on there is the goal that the |
---|
0:03:10 | you can detect the language, you can detect dialects, keywords, phrases |
---|
0:03:15 | you can do the whole speech transcription |
---|
0:03:18 | maybe the topic is interesting |
---|
0:03:22 | you can |
---|
0:03:23 | do something domain incapable then the |
---|
0:03:26 | but there are other modalities you can have some information about the |
---|
0:03:33 | environment where the speaker is |
---|
0:03:35 | to whom the speaker speaks; you can have other sounds |
---|
0:03:39 | what is up of user you go |
---|
0:03:41 | very close animals |
---|
0:03:43 | or you have a lot of information about equipment that was used |
---|
0:03:49 | of the to get a relief what we on voice it can be the device |
---|
0:03:53 | the |
---|
0:03:54 | for example we also for ticket be transcription of huge it can be according to |
---|
0:04:01 | you may be interested in speech quality |
---|
0:04:04 | this |
---|
0:04:05 | is important for user and the users can benefit from this information |
---|
0:04:12 | about phonexia |
---|
0:04:13 | it was established in two thousand six as a |
---|
0:04:16 | startup from brno university of technology |
---|
0:04:20 | it has its seat |
---|
0:04:21 | in the czech republic in brno, just a five minute walk from the university |
---|
0:04:26 | campus |
---|
0:04:28 | if we speak about the users, we currently have users in more than twenty |
---|
0:04:32 | countries |
---|
0:04:34 | these are government agencies, call centers, banks |
---|
0:04:37 | telecom operators, broadband service providers and others |
---|
0:04:43 | the company is profitable and |
---|
0:04:47 | there was no need for any external funding so far |
---|
0:04:53 | if we speak about the process of how to transfer technology from research |
---|
0:05:00 | to products and the market |
---|
0:05:04 | there are several steps |
---|
0:05:07 | the first one: if we speak about the research |
---|
0:05:11 | it is usually done at universities or inside companies |
---|
0:05:16 | and the goal is to get the |
---|
0:05:20 | best technology, but there is usually less interest in the |
---|
0:05:28 | quality of the code: the stability or the speed of |
---|
0:05:34 | the code is not of |
---|
0:05:37 | main importance |
---|
0:05:39 | and what was important for us, and it is probably also for |
---|
0:05:44 | you |
---|
0:05:45 | at this stage we want to be as unlimited as possible, so it means |
---|
0:05:50 | quotative measurement |
---|
0:05:52 | to |
---|
0:05:52 | to the saba |
---|
0:05:54 | okay so |
---|
0:05:56 | open-source toolkit and so on |
---|
0:05:59 | but then you need to get the technology to |
---|
0:06:05 | the user somehow |
---|
0:06:06 | so for this you need to do |
---|
0:06:10 | the next step |
---|
0:06:11 | you need to build a |
---|
0:06:13 | code base that is robust, that is stable |
---|
0:06:17 | that is fast, that has a well-defined api, has documentation and |
---|
0:06:22 | proper licensing |
---|
0:06:24 | us assume so the D V D this is what is what exactly |
---|
0:06:30 | then there is another step |
---|
0:06:32 | a better you know |
---|
0:06:35 | you need to build a product for customers: you can have nice technology, you |
---|
0:06:39 | can have nice interfaces, but if you don't have a product you won't |
---|
0:06:42 | be able to sell |
---|
0:06:44 | so here the focus is on |
---|
0:06:48 | functionality, and this is done either by phonexia or by other companies |
---|
0:06:53 | we never be |
---|
0:06:58 | now i will mention |
---|
0:07:03 | three main use cases |
---|
0:07:06 | or three main customers; there are others but i selected these three |
---|
0:07:13 | the first are call centers; in call centers there are two areas where we are |
---|
0:07:18 | active: the first is quality control, how to ensure |
---|
0:07:26 | quality of services in |
---|
0:07:29 | the call center, and then there is data mining from |
---|
0:07:34 | voice traffic |
---|
0:07:38 | the quality control, what is it about |
---|
0:07:42 | in a call center |
---|
0:07:44 | you usually have |
---|
0:07:46 | some team leader or supervisor |
---|
0:07:49 | that's a pairwise well the |
---|
0:07:52 | then a but i there are so i just this better than do |
---|
0:07:57 | but i think of course |
---|
0:07:59 | evaluation of operators, also |
---|
0:08:01 | analysis of the results |
---|
0:08:03 | for the team and some reporting |
---|
0:08:06 | if there is no speech technology |
---|
0:08:09 | usually only about three percent of the recordings are |
---|
0:08:15 | inspected by the supervisors, but |
---|
0:08:20 | if you deploy speech technology you are able to control |
---|
0:08:24 | a hundred percent of the traffic, and you are able to get better statistics |
---|
0:08:28 | the |
---|
0:08:29 | and the this everything is about the |
---|
0:08:32 | moreover |
---|
0:08:33 | the cost of staff |
---|
0:08:37 | we are able to reduce the number of supervisors, to lower |
---|
0:08:41 | the operating costs |
---|
0:08:44 | and it is very important |
---|
0:08:47 | to shorten the calls |
---|
0:08:50 | if it you are able to the |
---|
0:08:54 | you have |
---|
0:08:55 | find problem so you really errors |
---|
0:08:58 | some or but i just are not the us up despite the or a remote |
---|
0:09:03 | well train the |
---|
0:09:05 | i hear its is possibility to said that what is look training can that |
---|
0:09:11 | usually the |
---|
0:09:14 | fluctuation of people in such positions is |
---|
0:09:18 | in the tens of |
---|
0:09:21 | percent, and we are able to reduce |
---|
0:09:28 | this, and again save some costs |
---|
0:09:36 | and about this approach is a huge the for quality control the main technology is |
---|
0:09:42 | the |
---|
0:09:43 | the at and doing some |
---|
0:09:47 | this does so i'm not easy so on |
---|
0:09:50 | topologies |
---|
0:09:52 | so you are able to get important statistics |
---|
0:09:55 | like |
---|
0:09:58 | the dialogue statistics, the number of speaker |
---|
0:10:01 | turns |
---|
0:10:03 | speech reaction times |
---|
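The call statistics mentioned above (speaker turns, talk time, timing between turns) can be sketched from a list of speaker-labelled segments. This is only an illustration: the `(speaker, start, end)` segment format and the function name `call_statistics` are assumptions, and the timing statistic is read here as the silent gap before a speaker change, which the talk does not define.

```python
from collections import defaultdict

def call_statistics(segments):
    """Compute simple dialogue statistics.

    segments: list of (speaker, start_sec, end_sec), sorted by start time.
    This input format is hypothetical; a real system would get it from
    diarization or from separate recording channels.
    """
    talk_time = defaultdict(float)   # total speech time per speaker
    turns = 0                        # number of speaker changes
    gaps = []                        # silence before each speaker change
    prev_speaker, prev_end = None, None
    for speaker, start, end in segments:
        talk_time[speaker] += end - start
        if prev_speaker is not None and speaker != prev_speaker:
            turns += 1
            gaps.append(max(0.0, start - prev_end))
        prev_speaker, prev_end = speaker, end
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "talk_time": dict(talk_time),
        "speaker_turns": turns,
        "mean_reaction_time": mean_gap,
    }
```

Statistics like these could then be aggregated per operator for the reporting described earlier.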
0:10:05 | often the call centres have |
---|
0:10:09 | old equipment, so if they have channels where the speakers of the conversation are |
---|
0:10:15 | not in separate |
---|
0:10:17 | channels, then we need to do diarization |
---|
0:10:22 | then it is possible to deploy keyword or phrase detection, for example when |
---|
0:10:28 | you have some obligatory phrases, or you don't want the people to |
---|
0:10:33 | speak vulgar words |
---|
0:10:36 | or you would like to have some |
---|
0:10:38 | call script compliance: the people should |
---|
0:10:41 | follow the call script |
---|
0:10:43 | and it is possible to add speech transcription, mainly |
---|
0:10:48 | for us |
---|
0:10:49 | set a should |
---|
0:10:52 | in this task |
---|
0:10:55 | now about the data mining: this |
---|
0:11:01 | is the other large topic for call centers |
---|
0:11:03 | here again we have like two subtasks |
---|
0:11:07 | one is related to |
---|
0:11:10 | call center overloading |
---|
0:11:13 | you may imagine that you have |
---|
0:11:15 | a call center of a few hundred people |
---|
0:11:18 | and there is a large |
---|
0:11:21 | outage, and thousands of people start calling |
---|
0:11:26 | the call center of the service, so you need to react really quickly |
---|
0:11:30 | you need to explore what is wrong and |
---|
0:11:34 | maybe add some information to the initial ivr |
---|
0:11:39 | stage |
---|
0:11:40 | and typically this is solved by some big |
---|
0:11:45 | screen |
---|
0:11:46 | in the call center showing the topics that are just being discussed and their |
---|
0:11:51 | percentages |
---|
0:11:52 | then |
---|
0:11:53 | i other important the |
---|
0:11:56 | but you use the like the |
---|
0:11:59 | i did value speech technologies for business of eight basically now a moan about so |
---|
0:12:06 | but |
---|
0:12:07 | indeed i don't know if you have for example how |
---|
0:12:12 | may be done also |
---|
0:12:13 | it's looking for places that too but i'm not |
---|
0:12:17 | in new |
---|
0:12:20 | new fast foods |
---|
0:12:21 | the approach is that they go to a telephone operator and ask |
---|
0:12:28 | please could you |
---|
0:12:29 | give us statistics of where the people that visit our fast foods are |
---|
0:12:35 | during the day, and |
---|
0:12:38 | the place where there is the highest concentration is a good place |
---|
0:12:42 | to start a new fast food |
---|
0:12:45 | with speech technologies it is the same |
---|
0:12:48 | if you have more information, for example who is on the phone line, is it a |
---|
0:12:52 | male or female, or whether there are more people on the line, or the |
---|
0:12:59 | pairs and was interest in the in some regrets the in boston |
---|
0:13:04 | it helps you to go |
---|
0:13:06 | the whole business certain to push to some more |
---|
0:13:10 | for this we usually use speech transcription |
---|
0:13:14 | and then |
---|
0:13:15 | some data mining tools on top of it |
---|
0:13:19 | of course it is possible to at the |
---|
0:13:22 | so i changing if you want to the session a narcotics |
---|
0:13:28 | then |
---|
0:13:30 | the other big group are banks |
---|
0:13:33 | banks of course have call centers, so what i mentioned on the past |
---|
0:13:38 | two slides |
---|
0:13:39 | is important also here |
---|
0:13:41 | but there are two other large tasks |
---|
0:13:45 | for banks: the banks need to ensure security, but on the |
---|
0:13:54 | other side |
---|
0:13:55 | they need something that is |
---|
0:13:58 | pleasant |
---|
0:13:59 | for the user, something |
---|
0:14:01 | that doesn't |
---|
0:14:03 | bring |
---|
0:14:05 | much complication |
---|
0:14:06 | so here voice biometry is very interesting: it can be voice biometry |
---|
0:14:13 | using a key phrase, or it can be voice biometry that is done |
---|
0:14:19 | in the background using a text independent speaker identification system |
---|
0:14:25 | and then the other |
---|
0:14:28 | task is fraud protection: imagine there are people calling the banks |
---|
0:14:35 | for example a hour a day |
---|
0:14:37 | always with fake identities, and they are |
---|
0:14:42 | requesting loans |
---|
0:14:45 | it is hard to detect this if you don't have technology |
---|
0:14:49 | but if you have technology like speaker identification it is really simple |
---|
0:14:56 | now about the intelligence agencies |
---|
0:15:00 | for the intelligence agencies the situation is usually that these agencies have |
---|
0:15:07 | a really huge amount of data |
---|
0:15:11 | the amount is such that |
---|
0:15:13 | they are not able to process it |
---|
0:15:17 | manually; this can come from |
---|
0:15:19 | telecommunication networks, from communication on the internet and so on |
---|
0:15:25 | they are really looking for a needle |
---|
0:15:27 | in a haystack |
---|
0:15:29 | and for this it is possible to use a combination of technologies |
---|
0:15:35 | so we are using a combination of technologies |
---|
0:15:39 | language identification, gender identification, speaker diarization, keyword spotting, speech transcription |
---|
0:15:46 | data mining tools |
---|
0:15:48 | and also a little fun a correlation with some other metadata for example from this |
---|
0:15:54 | text is used |
---|
0:15:56 | and so of course the sequences are very interesting you know operation and forensic speaker |
---|
0:16:01 | identification |
---|
0:16:04 | now i will go a bit deeper into the technologies |
---|
0:16:09 | and tell you what is important for practical deployments |
---|
0:16:16 | here are some of the technologies; i won't speak about all of them, you |
---|
0:16:21 | can come and ask if you have a question |
---|
0:16:26 | about the voice activity detection |
---|
0:16:29 | i would say that this is the most important part for practical deployment |
---|
0:16:37 | you can ask |
---|
0:16:41 | why this is the most important part |
---|
0:16:44 | you can have |
---|
0:16:46 | very nice results for example on nist databases |
---|
0:16:50 | but |
---|
0:16:51 | you what do you will explore a target so |
---|
0:16:55 | the users are working with such channels that a huge quantity of the traffic |
---|
0:17:00 | it can be |
---|
0:17:02 | tens of percent, is not speech at all |
---|
0:17:05 | it is some technical signal like dialling tones |
---|
0:17:10 | and |
---|
0:17:12 | if you don't have |
---|
0:17:14 | this |
---|
0:17:16 | built in |
---|
0:17:18 | it is really hard to work with such channels |
---|
0:17:22 | so we are using a chain of vads: an energy based |
---|
0:17:26 | vad at the beginning, to remove a very large portion of the silences |
---|
0:17:32 | then technical signal removal, like tone detection and removal |
---|
0:17:39 | back to the spread like that are in mobile station i |
---|
0:17:43 | signals and so on |
---|
0:17:44 | and then we have a vad based on F zero tracking |
---|
0:17:49 | because speech has a specific characteristic, that is the F |
---|
0:17:54 | zero |
---|
0:17:55 | and the |
---|
0:17:56 | specific behaviour of this |
---|
0:17:59 | F zero, and then we have a neural network based vad |
---|
0:18:06 | to get a very precise segmentation |
---|
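The first stage of such a chain, an energy based detector that discards obvious silence, can be sketched as follows. This is a minimal illustration and not the speaker's implementation: the frame length, the threshold of -40 dB and the function name `energy_vad` are all assumed values chosen for the example, and the later stages (tone removal, F zero tracking, neural network) are not shown.

```python
import math

def energy_vad(samples, frame_len=160, threshold_db=-40.0):
    """Return one boolean per frame: True if the frame is louder than the threshold.

    samples: list of floats in the range [-1.0, 1.0].
    frame_len: samples per frame (160 = 20 ms at 8 kHz, an assumed telephone setup).
    threshold_db: energy threshold relative to full scale (assumed value).
    """
    decisions = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Small constant avoids log(0) on digital silence.
        level_db = 10.0 * math.log10(energy + 1e-12)
        decisions.append(level_db > threshold_db)
    return decisions
```

A detector like this passes dialling tones and music as "active", which is why the talk follows it with dedicated technical signal removal and pitch based stages.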
0:18:12 | but as i said, it is a very important technology and there are still many |
---|
0:18:16 | challenges |
---|
0:18:19 | so it is important |
---|
0:18:21 | the accuracy of vad |
---|
0:18:24 | directly affects the accuracy of the other technologies |
---|
0:18:27 | and some of the challenges: you can have music |
---|
0:18:31 | of |
---|
0:18:32 | that they're other speakers sounds of like people tend to |
---|
0:18:37 | well a four or something like this |
---|
0:18:41 | you have a an alignment silence |
---|
0:18:45 | use a different technical signals |
---|
0:18:47 | what is a challenge is vad on variable or low snrs |
---|
0:18:54 | vad on distorted speech signals |
---|
0:18:57 | well what is also through you that important i think it is unknown parameter to |
---|
0:19:02 | a |
---|
0:19:03 | non automatic way |
---|
0:19:04 | of training vad, because we know that we can tune it to the |
---|
0:19:08 | specific channel |
---|
0:19:10 | by training |
---|
0:19:12 | some good classifiers |
---|
0:19:14 | but how to generalize this is still difficult, and of course distant |
---|
0:19:18 | mikes |
---|
0:19:22 | now the language identification |
---|
0:19:26 | currently we are able to recognize about fifty languages |
---|
0:19:31 | and |
---|
0:19:32 | what is even more important |
---|
0:19:34 | is that the users can add a new language |
---|
0:19:40 | themselves |
---|
0:19:42 | this is important especially for the intelligence community, because they will never |
---|
0:19:48 | tell you which languages are of interest, and |
---|
0:19:53 | also you won't be able to collect the data |
---|
0:19:56 | they have much easier access |
---|
0:19:59 | to such data |
---|
0:20:02 | we are using i-vector based technology |
---|
0:20:06 | and it is commanded training okay and |
---|
0:20:09 | we have first of all men which means that |
---|
0:20:12 | the language print is a less than |
---|
0:20:15 | well |
---|
0:20:16 | "'kay" a record |
---|
0:20:18 | bear |
---|
0:20:21 | a better file |
---|
0:20:26 | this is the technology behind it; there are several stages |
---|
0:20:34 | here we have feature extraction |
---|
0:20:37 | collection of statistics using the ubm |
---|
0:20:40 | usually |
---|
0:20:42 | we use a gmm, and it is |
---|
0:20:46 | constrained by some subspace; the subspace is estimated on a large quantity of |
---|
0:20:52 | data to model the |
---|
0:20:55 | total |
---|
0:20:57 | variability in the speech |
---|
0:21:01 | so in the end we get a |
---|
0:21:04 | point estimate of where we are in the subspace |
---|
0:21:09 | so this part is prepared by phonexia |
---|
0:21:13 | but then there is the other part |
---|
0:21:16 | that is the classifier of languages; we use multi-class logistic regression |
---|
0:21:22 | here |
---|
0:21:23 | and this is done by users |
---|
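The final stage described above, a multi-class logistic regression on top of a fixed-length vector, can be sketched as a softmax over one linear score per language. This is a hedged illustration only: the dictionary layout of `weights` and `biases`, the tiny two-dimensional vectors and the function name `language_posteriors` are assumptions for the example, and training of the weights is not shown.

```python
import math

def language_posteriors(ivector, weights, biases):
    """Score one i-vector against every language model.

    ivector: list of floats (the utterance representation).
    weights: dict language -> list of floats, same length as ivector (hypothetical layout).
    biases:  dict language -> float.
    Returns a dict language -> posterior probability (softmax of linear scores).
    """
    scores = {
        lang: sum(w * x for w, x in zip(weights[lang], ivector)) + biases[lang]
        for lang in weights
    }
    top = max(scores.values())                      # subtract max for numerical stability
    exps = {lang: math.exp(s - top) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}
```

In this formulation, adding a new language on the user side amounts to estimating one more weight vector and bias, while the feature extraction and subspace stay untouched.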
0:21:31 | about speaker recognition |
---|
0:21:34 | there are many tasks like speaker verification, speaker |
---|
0:21:38 | search, speaker spotting, link analysis |
---|
0:21:42 | for |
---|
0:21:42 | after normalization some house |
---|
0:21:44 | sometimes social network analysis |
---|
0:21:47 | we can work in text independent or text dependent mode |
---|
0:21:51 | i-vector based approach |
---|
0:21:55 | we use diarization |
---|
0:21:57 | and i think what is important here: we have |
---|
0:22:01 | user based system training and calibration |
---|
0:22:04 | that again helps |
---|
0:22:05 | people a lot |
---|
0:22:07 | it is here |
---|
0:22:11 | so the scheme is the same as in the case of |
---|
0:22:15 | language identification |
---|
0:22:18 | but here we remove the variability other than the speaker variability |
---|
0:22:25 | then we have some normalisation of voiceprints, simply by |
---|
0:22:29 | mean subtraction, that can be done on the user side |
---|
0:22:33 | and |
---|
0:22:34 | then |
---|
0:22:36 | we have |
---|
0:22:39 | scoring |
---|
0:22:43 | where we compare voiceprints |
---|
0:22:46 | this is |
---|
0:22:48 | pretrained by phonexia and delivered, but |
---|
0:22:53 | we allow our users to |
---|
0:22:56 | retrain or adapt this classifier |
---|
0:22:59 | this is very important because |
---|
0:23:03 | it is hard to get any recordings from clients |
---|
0:23:07 | but |
---|
0:23:08 | if you deliver such a system to clients |
---|
0:23:11 | and they are able to adapt the system, the amount of data can be |
---|
0:23:15 | really small, it can be |
---|
0:23:17 | for example fifty speakers with just a few recordings each |
---|
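The user-side normalisation by mean subtraction, followed by a comparison of two voiceprints, can be sketched as below. Note the assumptions: the talk does not say which scoring function is used, so cosine similarity is swapped in here purely as a simple stand-in, and the helper names are hypothetical. The mean would be estimated from the small adaptation set recorded in the target environment.

```python
import math

def mean_vector(vectors):
    """Per-dimension mean of a list of equally long vectors (the adaptation set)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def normalise(voiceprint, mean):
    """Subtract the adaptation mean, then scale to unit length."""
    centred = [x - m for x, m in zip(voiceprint, mean)]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred] if norm > 0 else centred

def cosine_score(a, b, mean):
    """Similarity of two voiceprints after mean subtraction (assumed scoring, not from the talk)."""
    na, nb = normalise(a, mean), normalise(b, mean)
    return sum(x * y for x, y in zip(na, nb))
```

The appeal of this step is that it needs no labels: a few dozen recordings from the new channel are enough to re-estimate the mean.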
0:23:21 | well i'm |
---|
0:23:22 | we saw that |
---|
0:23:23 | on normal telephone channels we are able to get about forty percent improvement |
---|
0:23:30 | for |
---|
0:23:31 | new deployments |
---|
0:23:34 | and if it is about some |
---|
0:23:36 | special channels |
---|
0:23:39 | for example many directions |
---|
0:23:40 | we saw a hundred percent improvement just |
---|
0:23:44 | with this |
---|
0:23:45 | simple trick |
---|
0:23:47 | and of course what is also really |
---|
0:23:51 | important is calibration |
---|
0:23:54 | of |
---|
0:23:56 | in our case we are doing duration dependent calibration, because this |
---|
0:24:00 | is also not seen too much in |
---|
0:24:03 | nist, because there the recordings are |
---|
0:24:07 | about two and a half minutes long, but if you have |
---|
0:24:10 | a huge |
---|
0:24:11 | but variability in do lying to you need to do anything three this L C |
---|
0:24:15 | the shore recording studio |
---|
0:24:19 | solve it |
---|
0:24:20 | by do some up for a times |
---|
0:24:25 | what are the challenges in language identification and speaker identification |
---|
0:24:33 | i think the main challenges are very short recordings; it can be |
---|
0:24:39 | less than three seconds |
---|
0:24:41 | and also |
---|
0:24:43 | very important for us is |
---|
0:24:46 | keeping the training on the user side |
---|
0:24:49 | why less than three seconds: if you have a deployment of speaker identification |
---|
0:24:56 | and you would like to deploy it in a bank, the people don't want to speak |
---|
0:25:02 | they would like to have |
---|
0:25:04 | the decision even before they start speaking |
---|
0:25:07 | so i would say that ten seconds is |
---|
0:25:12 | the maximum for |
---|
0:25:14 | enrollment that is accepted |
---|
0:25:17 | and really three seconds for verification |
---|
0:25:21 | you can do it with text dependent systems |
---|
0:25:25 | it is harder to do it with text independent systems, but in the case of text |
---|
0:25:29 | independent systems |
---|
0:25:32 | these two steps are usually done in the background while the |
---|
0:25:36 | user is talking to |
---|
0:25:38 | the operator |
---|
0:25:40 | of |
---|
0:25:44 | another question is how to ensure |
---|
0:25:47 | accuracy over a large number of acoustic channels and languages |
---|
0:25:52 | the technologies are more and more channel |
---|
0:25:55 | independent, but there is still some |
---|
0:26:00 | dependence |
---|
0:26:02 | what is also important are graphical tools that help |
---|
0:26:06 | the user to visualize the information, to do the calibration |
---|
0:26:10 | because |
---|
0:26:12 | if you don't do this for the user, the user will never do it |
---|
0:26:17 | himself |
---|
0:26:19 | what we also see as very challenging is |
---|
0:26:21 | language identification and speaker identification on voice over ip networks, because |
---|
0:26:27 | there you have packets, you have packet loss |
---|
0:26:31 | and if you have this loss |
---|
0:26:34 | usually the codecs are doing something: they either put zeros there or |
---|
0:26:39 | they are synthesizing speech |
---|
0:26:41 | but this is not the speech of the speaker, this is something that |
---|
0:26:45 | was |
---|
0:26:45 | generated by the decoder |
---|
0:26:47 | so this is also a very important topic |
---|
0:26:50 | and of course the distant mikes |
---|
0:26:54 | now i would |
---|
0:26:56 | say a few words about diarisation, because this is a very important technology |
---|
0:27:01 | useful for example for the call centres, but also |
---|
0:27:06 | for any |
---|
0:27:10 | other users |
---|
0:27:12 | we are using two approaches; one approach is really fast but not so |
---|
0:27:17 | accurate |
---|
0:27:19 | this approach is based on |
---|
0:27:22 | clustering of i-vectors, so we basically split the audio into small chunks and do |
---|
0:27:27 | clustering of i-vectors |
---|
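The chunk-and-cluster idea can be sketched with a small agglomerative clustering over per-chunk vectors. This is an assumption-laden illustration: the talk does not name the clustering algorithm or the similarity measure, so average-linkage clustering with cosine similarity is used here as a plain stand-in, the number of speakers is given in advance, and all vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster_chunks(vectors, num_speakers):
    """Merge the most similar clusters until num_speakers remain.

    vectors: one vector per audio chunk (e.g. a short-segment i-vector).
    Returns one integer speaker label per chunk.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_speakers:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average similarity over all cross-cluster pairs (average linkage).
                sim = sum(
                    cosine(vectors[a], vectors[b])
                    for a in clusters[i] for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels
```

Each merge here is a hard decision that is never revisited, which is the contrast with the probabilistic approach discussed next in the talk.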
0:27:29 | and |
---|
0:27:30 | then the other one is the |
---|
0:27:33 | fully bayesian approach, initially investigated by |
---|
0:27:37 | fabio valente |
---|
0:27:39 | patrick kenny worked with this too |
---|
0:27:42 | on the reset assures |
---|
0:27:44 | quite with the text and it it'll be the |
---|
0:27:47 | D P this is an approach where you don't do a hard decision |
---|
0:27:55 | during the process; you keep everything |
---|
0:28:03 | probabilistic and you do the decision not at the beginning |
---|
0:28:06 | but at the end |
---|
0:28:10 | this approach is i would say |
---|
0:28:14 | but if you're at the but you want see |
---|
0:28:17 | well my next slide it's not |
---|
0:28:19 | fully to |
---|
0:28:20 | but |
---|
0:28:21 | memory cons i'm mean and the quite small |
---|
0:28:26 | so what are the challenges |
---|
0:28:28 | in diarisation |
---|
0:28:31 | from my point of view diarization is |
---|
0:28:34 | still a technology that needs quite a lot of research |
---|
0:28:39 | we see that it is very sensitive to initialization |
---|
0:28:44 | it is very sensitive to |
---|
0:28:47 | non-speech sounds |
---|
0:28:49 | do you usually it is about the wall so you got more gaussian before |
---|
0:28:55 | for example if there are you sure that the |
---|
0:29:00 | new sounds that you haven't seen in your training data needs to be sorted for |
---|
0:29:04 | example we asked |
---|
0:29:06 | the system |
---|
0:29:06 | to give two speakers |
---|
0:29:08 | but the output was that |
---|
0:29:11 | we got the two speakers in |
---|
0:29:14 | one |
---|
0:29:16 | under one label, and under the second label |
---|
0:29:21 | there were segments |
---|
0:29:23 | with other non- |
---|
0:29:26 | speech sounds |
---|
0:29:27 | i think of a lot in this case the |
---|
0:29:31 | so what is important the it's a very |
---|
0:29:35 | would the duration of your vad if you have |
---|
0:29:39 | i just sounds the that the speech |
---|
0:29:43 | it can hardly due to the adaptation |
---|
0:29:47 | a so it's a it is but very sensitive to two so that speech and |
---|
0:29:51 | also and then which is |
---|
0:29:53 | what we see is that with these systems you can |
---|
0:30:00 | easily reach a diarisation error rate |
---|
0:30:02 | close to one percent on nist data |
---|
0:30:07 | but |
---|
0:30:08 | well what we also saw |
---|
0:30:10 | and the |
---|
0:30:13 | you the |
---|
0:30:15 | first it's us |
---|
0:30:16 | we could be that the rest of the one percent the |
---|
0:30:20 | is that the there won't be pro by means segmentation about it's fails so |
---|
0:30:26 | forty four for this recording good did this is the |
---|
0:30:30 | usually it is two speakers with very similar voices, but this happens |
---|
0:30:38 | okay so i think there is a shana |
---|
0:30:41 | that was a lot done during past two years but the data |
---|
0:30:45 | the challenge is quite the |
---|
0:30:47 | and of course you can speak about |
---|
0:30:50 | text and distant mikes for |
---|
0:30:53 | the of |
---|
0:30:55 | processing cove of |
---|
0:30:57 | the or like for example of deviance |
---|
0:31:02 | a bit about keyword spotting |
---|
0:31:05 | so we are |
---|
0:31:10 | we are using two approaches; one approach is something that probably all of |
---|
0:31:15 | you |
---|
0:31:16 | know |
---|
0:31:18 | probably from project |
---|
0:31:19 | it is the lvcsr based keyword spotting |
---|
0:31:22 | this is very good |
---|
0:31:25 | but |
---|
0:31:26 | slow |
---|
0:31:28 | and it is expensive for development |
---|
0:31:30 | the other keyword spotting that we are using is acoustic based |
---|
0:31:37 | here the difference is that |
---|
0:31:39 | here we usually use a larger acoustic model |
---|
0:31:42 | and here it is a simple neural network based acoustic model |
---|
0:31:50 | there is no language model or a very simple language model, but here it is much |
---|
0:31:56 | cheaper for development |
---|
0:31:58 | so in the case of |
---|
0:32:00 | lvcsr we are speaking about hundreds of hours of training data |
---|
0:32:04 | in the case of acoustic keyword spotting |
---|
0:32:07 | we are speaking about tens of hours of acoustic data or even less |
---|
0:32:14 | the speech transcription, what we are using |
---|
0:32:18 | this is probably not important, all of you are working in this |
---|
0:32:22 | field |
---|
0:32:23 | we are using a system based on |
---|
0:32:26 | bottleneck features in combination with other features, hlda, vtln |
---|
0:32:33 | gmm based system or neural network based system, i am not explaining it, okay |
---|
0:32:38 | speaker adaptation |
---|
0:32:39 | n-gram language model, and we generate |
---|
0:32:42 | usually confusion networks |
---|
0:32:47 | what are the challenges |
---|
0:32:49 | from the deployment point of view here |
---|
0:32:54 | well of course the accuracy is |
---|
0:32:57 | still important |
---|
0:32:59 | but i would say that it is not the most important challenge |
---|
0:33:04 | the challenges are speed |
---|
0:33:07 | lower memory consumption |
---|
0:33:10 | how to train new systems fully automatically, because we would like to do it |
---|
0:33:14 | for |
---|
0:33:16 | also how to run |
---|
0:33:17 | hundreds of recognizers in parallel |
---|
0:33:21 | for efficient use of computational |
---|
0:33:25 | resources |
---|
0:33:26 | and also |
---|
0:33:27 | how to |
---|
0:33:28 | do the normalization and |
---|
0:33:31 | speaker |
---|
0:33:32 | adaptation for any length of |
---|
0:33:35 | speech utterance |
---|
0:33:37 | because for example if we transcribe |
---|
0:33:40 | long sources, lectures or whatever |
---|
0:33:43 | we try to put in |
---|
0:33:45 | as much adaptation as possible, but if you are working with very short |
---|
0:33:50 | segments, like |
---|
0:33:51 | three seconds or less |
---|
0:33:54 | the adaptation |
---|
0:33:56 | is hard to do and usually you will see worse results |
---|
0:34:01 | but |
---|
0:34:02 | the system |
---|
0:34:03 | so one solution is to remove this adaptation |
---|
0:34:08 | but |
---|
0:34:09 | the system will then be less robust |
---|
0:34:18 | now how to sell speech transcription |
---|
0:34:25 | what we found is that if you have just speech transcription |
---|
0:34:29 | and you want to sell this technology it is quite hard |
---|
0:34:34 | you need to have |
---|
0:34:36 | something that is on top of this technology and that will present the information to |
---|
0:34:42 | users |
---|
0:34:43 | this is the |
---|
0:34:45 | because |
---|
0:34:47 | there is too much text |
---|
0:34:49 | and |
---|
0:34:52 | there are still some errors |
---|
0:34:56 | our experience is that |
---|
0:34:59 | the user |
---|
0:35:01 | will never be happy about the accuracy of |
---|
0:35:04 | the speech recognition system if there are errors in words the users will mention |
---|
0:35:10 | these errors |
---|
0:35:11 | if the words are correct they start to complain about prepositions and suffixes if |
---|
0:35:16 | this is correct |
---|
0:35:17 | they start to complain about punctuation marks or grammar |
---|
0:35:21 | this is but |
---|
0:35:24 | if you use some |
---|
0:35:26 | software for summarization |
---|
0:35:27 | and presentation of how to look at the data |
---|
0:35:31 | it will help you |
---|
0:35:32 | to sell the technology and we are doing it in such a way that we do |
---|
0:35:37 | integration |
---|
0:35:38 | with existing text based data mining tools |
---|
0:35:42 | the integration is done |
---|
0:35:44 | usually on the level |
---|
0:35:47 | all of |
---|
0:35:49 | of confusion networks also we have also the other a captive audience |
---|
0:35:57 | this is one of the tools we use |
---|
0:36:00 | this was developed |
---|
0:36:01 | by our partner company |
---|
0:36:04 | so you have a search engine |
---|
0:36:06 | here you can have a very complex query here are |
---|
0:36:11 | the documents that were found |
---|
0:36:13 | the document |
---|
0:36:15 | but you need to |
---|
0:36:19 | write somehow the query the query can be very complex |
---|
0:36:23 | so is here is |
---|
0:36:25 | gladiator |
---|
0:36:27 | but so it's one possibility but if you want to |
---|
0:36:33 | we have described topic so you can use this but it there |
---|
0:36:37 | or you can go from update time you can result i still they are you |
---|
0:36:41 | can look at what is the correlation among words |
---|
0:36:44 | and the |
---|
0:36:45 | you can you can |
---|
0:36:46 | take this could happen automatically two classes |
---|
0:36:50 | well i mean you have these so you can |
---|
0:36:52 | here are need the correct someday time |
---|
0:36:55 | then train |
---|
0:36:56 | statistical based approaches |
---|
0:36:58 | or if you can deploy stuff for example to see you what is the |
---|
0:37:05 | how well |
---|
0:37:08 | the topics |
---|
0:37:08 | evolve in time |
---|
0:37:14 | of the input |
---|
0:37:16 | now i have |
---|
0:37:19 | two slides on how we transfer the code |
---|
0:37:22 | what is the time please |
---|
0:37:27 | all cases so it each |
---|
0:37:30 | okay so it's we just quickly |
---|
0:37:33 | this is how we |
---|
0:37:37 | transfer the code i think in |
---|
0:37:41 | two thousand seven we decided to write our own speech core |
---|
0:37:46 | well the reason was that we wanted to have something very |
---|
0:37:51 | stable very fast |
---|
0:37:53 | and with proper interfaces |
---|
0:37:58 | the speech core has more than two thousand five hundred objects covering all the |
---|
0:38:04 | areas of speech processing |
---|
0:38:06 | it is more than |
---|
0:38:10 | minimal first lines of source code and it ceases steely |
---|
0:38:14 | still use it might enable |
---|
0:38:18 | how |
---|
0:38:21 | we approach the transfer from research |
---|
0:38:24 | the research is usually done using standard tools like |
---|
0:38:28 | stk tnet kaldi python scripts |
---|
0:38:32 | i think all of you know these toolkits this is |
---|
0:38:38 | for |
---|
0:38:41 | hmm recognition this is for neural network training and kaldi is made |
---|
0:38:47 | mainly by dan povey |
---|
0:38:49 | and so on |
---|
0:38:52 | but then we can use our code base |
---|
0:38:58 | and we can implement a new system into our speech core quickly in |
---|
0:39:05 | just two days |
---|
0:39:06 | while not a single line of c++ code is written |
---|
0:39:12 | everything is done |
---|
0:39:14 | through a configuration file this configuration file |
---|
0:39:18 | can |
---|
0:39:19 | look like this |
---|
0:39:20 | you have some objects here |
---|
0:39:23 | and this description is mapped |
---|
0:39:26 | to |
---|
0:39:29 | a c++ interface using set functions |
---|
0:39:33 | and then we have some framework for how to connect these objects together |
---|
0:39:39 | so some |
---|
0:39:40 | usually we have all the algorithms needed |
---|
0:39:43 | but if we need an algorithm we just go and write one |
---|
0:39:49 | or a few simple objects |
---|
0:39:52 | about interfaces |
---|
0:39:54 | what |
---|
0:39:55 | the customers are used to |
---|
0:39:58 | are specific interfaces |
---|
0:40:02 | and |
---|
0:40:03 | they don't want to change their habits so we |
---|
0:40:07 | had to develop a |
---|
0:40:09 | large |
---|
0:40:10 | set of interfaces c++ java c# and the mrcp protocol used for |
---|
0:40:18 | ivr there is a nice open source project |
---|
0:40:21 | a rest interface to build web based services |
---|
0:40:26 | and so on |
---|
0:40:30 | this is a common framework |
---|
0:40:32 | for |
---|
0:40:33 | web based |
---|
0:40:35 | server solutions |
---|
0:40:37 | with a speech server an application server a database server and some clients |
---|
0:40:45 | this is just an example of |
---|
0:40:48 | our testing client |
---|
0:40:51 | okay so now i will just summarise |
---|
0:40:55 | three slides about |
---|
0:40:57 | some ongoing challenges that i see now |
---|
0:41:02 | a very important challenge is |
---|
0:41:05 | data |
---|
0:41:07 | training data as a small company it is difficult for us |
---|
0:41:12 | to get the data it is expensive |
---|
0:41:17 | and the this out that the a common approach of allows us to at just |
---|
0:41:23 | two and we just by here |
---|
0:41:25 | so we are looking |
---|
0:41:28 | for cheaper ways how to do this |
---|
0:41:32 | i think a great inspiration is google |
---|
0:41:36 | but we did something similar in language identification |
---|
0:41:42 | the question is whether we are able to get data that |
---|
0:41:46 | we can use for training of |
---|
0:41:52 | speech recognition systems that can be deployed on both |
---|
0:41:57 | conversational telephone speech and |
---|
0:42:00 | broadcast |
---|
0:42:03 | so one possibility that we explored was to use broadcast |
---|
0:42:10 | for this |
---|
0:42:11 | but not the whole content but |
---|
0:42:15 | to automatically detect the |
---|
0:42:17 | phone calls in the broadcast this ensures |
---|
0:42:20 | a high variability of speakers dialects and speaking styles |
---|
0:42:28 | the language can be verified using automatic language identification so |
---|
0:42:33 | we need |
---|
0:42:34 | a small amount of data to bootstrap this approach but then it is possible |
---|
0:42:41 | the variability of the speakers can be ensured by current |
---|
0:42:47 | speaker id technology |
---|
0:42:50 | then we would like to |
---|
0:42:52 | transcribe at least some of the speech using crowdsourcing |
---|
0:42:57 | and then use heavily unsupervised training for adaptation |
---|
0:43:03 | of |
---|
0:43:05 | currently we are discussing this with |
---|
0:43:10 | several companies and we would like to form a consortium for this |
---|
0:43:14 | we have some experience from when we did this |
---|
0:43:19 | project for language identification with ldc and nist |
---|
0:43:24 | it turned out to be very successful and nowadays it is like |
---|
0:43:29 | mainstream in language identification |
---|
0:43:31 | we have one line |
---|
0:43:32 | up or go for adaptation |
---|
0:43:36 | be backed by our |
---|
0:43:38 | partner companies the spin-off from BUT and we believe that we could |
---|
0:43:44 | reduce the cost for development of a new recognizer |
---|
0:43:47 | to variable and models |
---|
0:43:49 | so if you are interested and you would like to know more just |
---|
0:43:53 | send me an email and |
---|
0:43:55 | we can discuss this |
---|
0:43:58 | then another challenge we see is that |
---|
0:44:03 | we have quite robust technology but still the deployment |
---|
0:44:09 | is hardly bring some of somebody six of each customer list |
---|
0:44:13 | but if the specified |
---|
0:44:16 | if we have deployments in many countries |
---|
0:44:19 | we never know what will be the final |
---|
0:44:23 | accuracy of |
---|
0:44:25 | our technology and often we need to do adaptation again |
---|
0:44:30 | project so that i mention on the previous slide so we have |
---|
0:44:35 | to word this |
---|
0:44:37 | usually if we speak about the technologies |
---|
0:44:40 | we claim to |
---|
0:44:43 | the customers that the technology |
---|
0:44:46 | is |
---|
0:44:48 | language-independent |
---|
0:44:50 | channel-independent but there is always some fluctuation |
---|
0:44:55 | the only possible way i see to reduce this risk |
---|
0:44:58 | is to be able to evaluate these technologies |
---|
0:45:03 | on many languages and to know the results in advance before the technology |
---|
0:45:08 | is sold |
---|
0:45:10 | so for this again the data collection project can help |
---|
0:45:15 | and we are thinking about whether we want |
---|
0:45:17 | to extends some approaches to something like to work through much of spoken languages |
---|
0:45:24 | because for language identification we have collection of about fifty languages and |
---|
0:45:29 | good all the rapidly |
---|
0:45:32 | and |
---|
0:45:33 | a final remark |
---|
0:45:36 | what we see is that the research is |
---|
0:45:41 | focused mainly on accuracy most of the research articles |
---|
0:45:46 | are describing some improvement in accuracy but if we speak about the commercial market |
---|
0:45:53 | i think that any improvement in speed |
---|
0:45:57 | or |
---|
0:45:59 | something that can |
---|
0:46:00 | reduce |
---|
0:46:02 | the cost of hardware |
---|
0:46:04 | can help and can |
---|
0:46:07 | help you to |
---|
0:46:10 | have a successful technology |
---|
0:46:13 | so we saw that in some projects the hardware cost is really large it can |
---|
0:46:19 | be |
---|
0:46:20 | fifty percent of the cost of the project and so on |
---|
0:46:24 | so this is everything from me and thank you for your attention if you have any |
---|
0:46:30 | questions please ask |
---|
0:46:39 | any questions |
---|
0:46:46 | so how did you do that didn't you have to go to venture capital |
---|
0:46:51 | or something like that |
---|
0:46:53 | we are considering this approach too but |
---|
0:46:57 | you know at the beginning it's harder to get |
---|
0:47:01 | money from investors |
---|
0:47:02 | so we started in the way that we went to customers and asked |
---|
0:47:08 | or negotiated some contracts |
---|
0:47:12 | and we just started to work on contracts so basically |
---|
0:47:17 | custom development and |
---|
0:47:19 | we earned some money on this custom development and then we continued developing the technology |
---|
0:47:24 | and we had some good technology and then |
---|
0:47:28 | moved to products and so on |
---|
0:47:32 | i have a question |
---|
0:47:35 | are your solutions on site or are they based on cloud services |
---|
0:47:43 | most of the solutions |
---|
0:47:46 | i would say actually both is possible because |
---|
0:47:50 | we can use the technology on site but we also have web based |
---|
0:47:55 | interfaces for example there is the rest interface that can be used for |
---|
0:48:00 | cloud deployment |
---|
0:48:03 | but do you have a lot of cloud deployments or not |
---|
0:48:07 | no most of our current deployments are |
---|
0:48:10 | local |
---|
0:48:11 | on site deployments |
---|
0:48:15 | but we have |
---|
0:48:19 | the spin-off of the brno university of technology this is replaywell |
---|
0:48:22 | that is for example recording the lectures here |
---|
0:48:25 | this is already cloud based |
---|
0:48:30 | i don't at the of lectures |
---|
0:48:35 | questions |
---|
0:48:44 | so you started off connected with |
---|
0:48:47 | with university do you still do you have now it's to say that you projects |
---|
0:48:51 | that at the next cnns are an issue with that in terms of their |
---|
0:48:56 | we only with the company a with the government |
---|
0:49:00 | we are doing this see in different races we didn't have students |
---|
0:49:05 | it's for next cer we some or twelve some people at the but |
---|
0:49:11 | some contacts we have joint project |
---|
0:49:14 | the sort out differently so |
---|
0:49:21 | alright |
---|
0:49:22 | that's one thank you thank you |
---|