0:00:18 | Alright, welcome to the second session on acoustics. We will follow this immediately with the sponsor session and then be back for dinner. |
---|
0:00:30 | Our speaker is Oleg Akhtiamov. |
---|
0:00:35 | thank you |
---|
0:00:49 | Okay, is it on? Okay. Okay, sorry. |
---|
0:00:53 | Hello everybody, and welcome to my talk. My name is Oleg Akhtiamov. |
---|
0:01:00 | Is it better now? Sound check. Okay, that's good. Thanks. |
---|
0:01:09 | Well, welcome, welcome to my talk. |
---|
0:01:14 | Today I'd like to present a study that I conducted together with my colleagues Ingo Siegert, Alexey Karpov, and Wolfgang Minker. First of all, I'd like to thank them: without them it would have been impossible to conduct this research. |
---|
0:01:27 | So, the addressee problem: as you can probably guess, this topic is related to the big problem introduced by Dan Bohus at the beginning of our conference today. |
---|
0:01:43 | It is also about situated interaction and multi-party interaction. |
---|
0:01:49 | so |
---|
0:01:51 | The title is "Cross-Corpus Data Augmentation for Acoustic Addressee Detection". |
---|
0:01:56 | First of all, I'd like to clarify what addressee detection actually is. |
---|
0:02:01 | It is a common trend that modern spoken dialogue systems are getting more adaptive and human-like, and they are now able to interact with multiple users under realistic conditions in the real physical world. |
---|
0:02:18 | And... sorry. So, it may happen that not a single user but a whole group of users interacts with the system, |
---|
0:02:33 | and this is exactly where addressee detection comes in: the phenomenon arises in conversations between a technical system and a group of users. |
---|
0:02:45 | We are going to call this kind of interaction a human-machine conversation, and here we have a realistic example from our data. |
---|
0:02:56 | so |
---|
0:02:58 | In such a mixed kind of interaction, the SDS is supposed to distinguish between human-directed and computer-directed utterances, |
---|
0:03:07 | which means solving a binary classification problem in order to maintain efficient conversations in a realistic manner. |
---|
0:03:15 | It is important that the system does not give a direct answer to human-directed utterances, because otherwise it would interrupt the dialogue flow between the two human participants. |
---|
0:03:34 | well |
---|
0:03:35 | A similar problem arises in conversations between several adults and a child, and by analogy with the first case we call this problem adult-child addressee detection. |
---|
0:03:47 | And here we again have a realistic example of how not to educate your children with smartphones. |
---|
0:03:59 | Yes, and again, in this case the SDS is supposed to distinguish between adult-directed and child-directed utterances produced by adults, and this also means a binary classification problem. |
---|
0:04:12 | This functionality may be useful for a system performing children's development monitoring. |
---|
0:04:21 | Namely, we assume that the less distinguishable the child- and adult-directed acoustic patterns are, the bigger the progress the child has made in maintaining social interactions and, in particular, spoken conversations. |
---|
0:04:39 | so |
---|
0:04:41 | now |
---|
0:04:43 | Let's find out whether these two addressee detection problems have anything in common. |
---|
0:04:51 | First of all, we need to answer the question of how we address other people in real life. |
---|
0:04:56 | The simplest way to do this is just by name, or with a wake word like "OK Google" or "Alexa", or something like this. |
---|
0:05:06 | then |
---|
0:05:08 | we can do the same thing implicitly by using, for example, gaze: I am looking at the person I am talking to. |
---|
0:05:15 | Then there are some contextual markers, like specific topics or specific vocabulary. |
---|
0:05:21 | and |
---|
0:05:23 | the last option is a modified acoustic speaking style and prosody. |
---|
0:05:29 | The present study is focused exactly on this last way, on the latter way of addressing subjects in a conversation. |
---|
0:05:44 | The idea behind acoustic addressee detection is that people tend to change their manner of speech depending on whom they are talking to. |
---|
0:05:53 | For example, we may face some special addressees, such as hard-of-hearing people, elderly people, children, or spoken dialogue systems, that in our opinion might have some communication difficulties. |
---|
0:06:07 | Talking to such addressees, we intentionally modify our manner of speech, making it louder, more articulated, and generally more understandable, since we do not perceive them as adequate conversational agents. |
---|
0:06:23 | The main assumption that we make here is that human-directed speech is supposed to be similar to adult-directed speech, |
---|
0:06:36 | well |
---|
0:06:43 | and in the same way, machine-directed speech is supposed to be quite similar to child-directed speech. |
---|
0:06:54 | In our experiments we use a relatively simple and yet efficient data augmentation approach called mixup. Mixup encourages a model to behave linearly in the space between seen data points, |
---|
0:07:08 | and it already has quite a few applications in ASR, image recognition, and many other popular fields. |
---|
0:07:18 | Basically, mixup generates artificial examples as linear combinations of two random real feature and label vectors taken with the coefficient lambda. |
---|
0:07:31 | This coefficient is a real number randomly drawn from a beta distribution specified by its only parameter alpha. Technically, alpha lies within the interval from zero to infinity, |
---|
0:07:47 | but according to our experiments, alpha values higher than one already lead to underfitting, and in our opinion the most reasonable interval in which to vary this parameter is from zero to one. |
---|
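A minimal sketch of the mixup operation described above (my own illustration, not the authors' implementation; numpy-based, with variable names invented here):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Blend two real (feature vector, one-hot label) pairs into one synthetic example.

    lambda is drawn from Beta(alpha, alpha); for alpha in (0, 1] the draws
    concentrate near 0 and 1, so synthetic points stay close to the real ones,
    matching the talk's observation that alpha > 1 starts to hurt.
    """
    lam = np.random.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix
```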
0:08:07 | so |
---|
0:08:07 | the next question is how many examples to generate. Let's imagine that we just merge the C different datasets without applying any data augmentation, simply putting them together. |
---|
0:08:21 | We generate one batch from each dataset, which means that we can increase the initial amount of training data on the target corpus C times. |
---|
0:08:33 | But if we additionally apply mixup, then along with these C batches we also generate k artificial examples from each real example, increasing the amount of training data by a factor of C times (k + 1). |
---|
0:08:59 | It is important to note that the artificial examples are generated batch-wise on the fly, without any significant delays in the training process; we just do it on the go. |
---|
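A rough sketch of the batching scheme just described, one batch per corpus with k mixup companions generated on the fly for every real example; the helper names and the batch size are assumptions made for illustration:

```python
import numpy as np

def augmented_batches(corpora, k=2, alpha=0.4, batch_size=32):
    """Yield one batch per corpus, each enlarged (k + 1)-fold by mixup.

    `corpora` is a list of (X, Y) pairs (features, one-hot labels), one pair
    per dataset, so the amount of training data grows C * (k + 1) times.
    """
    for X, Y in corpora:
        idx = np.random.choice(len(X), batch_size, replace=False)
        xs, ys = [X[idx]], [Y[idx]]
        for _ in range(k):
            partner = np.random.permutation(idx)          # random mixing partners
            lam = np.random.beta(alpha, alpha, size=(batch_size, 1))
            xs.append(lam * X[idx] + (1 - lam) * X[partner])
            ys.append(lam * Y[idx] + (1 - lam) * Y[partner])
        yield np.concatenate(xs), np.concatenate(ys)
```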
0:09:11 | Here you can see the models that we used to solve our problem; they are arranged according to their complexity, from left to right. |
---|
0:09:29 | The first model is a simple linear SVM using the ComParE functionals as input. This is a pretty popular feature set in the area of emotion recognition; it was introduced at Interspeech 2013, I guess. These features are extracted from the whole utterance. |
---|
0:09:52 | Next we apply the LLD model, which includes a recurrent neural network with long short-term memory. It receives the low-level descriptors that were also used to compute the ComParE functionals for the first model. |
---|
0:10:12 | In contrast to the functionals, the LLDs have a time-continuous nature; they form a time-continuous signal. |
---|
0:10:22 | The last model is an end-to-end network performing raw signal processing. It receives just the raw audio utterance, which passes through a stack of convolutional input layers, and after the convolutional component there is the same recurrent network with long short-term memory that was introduced in the previous model. |
---|
0:10:47 | As the reference point for the convolutional component we took the five-layer SoundNet architecture and slightly modified it for our needs: namely, we reduced its dimensionality by reducing the number of filters in each layer according to the amount of data we have at our disposal, and we also reduced the kernel sizes according to the dimensionality of the signal that we have. |
---|
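To make the three model descriptions a bit more concrete, here is a purely illustrative Keras sketch of an end-to-end network of the kind described: a small stack of 1-D convolutions over the raw waveform followed by an LSTM. The filter counts, kernel sizes, and layer widths are placeholders, not the reduced-SoundNet configuration actually used in the study:

```python
from tensorflow.keras import layers, models

def build_end_to_end_model(num_classes=2):
    """Raw waveform -> stacked Conv1D layers -> LSTM -> softmax addressee label."""
    inputs = layers.Input(shape=(None, 1))               # raw audio samples
    x = inputs
    for filters, kernel in [(16, 64), (32, 32), (64, 16), (64, 8), (128, 4)]:
        x = layers.Conv1D(filters, kernel, strides=2, padding="same",
                          activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2, padding="same")(x)
    x = layers.LSTM(64)(x)                                # recurrent component
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```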
0:11:20 | well |
---|
0:11:21 | here you can see the data that we have at our disposal. We have two datasets for modelling human-machine addressee detection, namely the Smart Video Corpus, which contains interactions between a user, a confederate, and a mobile SDS. |
---|
0:11:35 | By the way, this is the only corpus that was simulated, played out in a Wizard-of-Oz setting. |
---|
0:11:46 | The next corpus is VACC, the Voice Assistant Conversation Corpus, which, similarly to the SVC, contains interactions between a user, a confederate, and an Amazon Alexa; this data is real, without any Wizard-of-Oz simulation. |
---|
0:12:03 | and |
---|
0:12:04 | the third corpus is HomeBank, which includes conversations between an adult, another adult, and a child. |
---|
0:12:12 | We tried to reproduce the same splits into training, development, and test sets that were introduced in the original studies published by the authors of the corpora, and they turned out to be approximately the same: the train, development, and test sets have a proportion of roughly five to one to four. |
---|
0:12:40 | First, we conducted some preliminary analysis with the linear model, the FUNC model. We perform feature selection by means of recursive feature elimination: we iteratively exclude a small portion of the ComParE features with the lowest SVM weights, |
---|
0:12:57 | and then we measure the performance of the reduced feature set in terms of unweighted average recall. A feature set is considered optimal if further dimensionality reduction leads to a significant information loss. |
---|
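A sketch of that selection loop using scikit-learn's RFE, which implements exactly this "drop the lowest-weighted features, then re-score" procedure; the step size and the number of features to keep are invented for illustration, and UAR is computed here as balanced accuracy, which is the same quantity:

```python
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.metrics import balanced_accuracy_score

def select_features(X_train, y_train, X_dev, y_dev, n_keep=500, step=0.05):
    """Rank ComParE features by |SVM weight|, drop the lowest `step` fraction
    per iteration until `n_keep` remain, then score the reduced set on the
    development data with unweighted average recall (= balanced accuracy)."""
    svm = LinearSVC(max_iter=10000)
    rfe = RFE(estimator=svm, n_features_to_select=n_keep, step=step)
    rfe.fit(X_train, y_train)
    uar = balanced_accuracy_score(y_dev, rfe.predict(X_dev))
    return rfe.support_, uar
```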
0:13:13 | In this figure we see that the optimal feature sets vary significantly. It is also very interesting that the size of the optimal feature set on the SVC is much greater than on the other two corpora. |
---|
0:13:30 | This may be explained by the Wizard-of-Oz modelling: probably some of the participants did not really believe that they were interacting with a real technical system, and this issue resulted in less distinctive acoustic addressee patterns. |
---|
0:13:47 | Another sequence of experiments that we conducted are the LOCO and inverse LOCO experiments. LOCO means leave-one-corpus-out, everyone knows what that means, and inverse LOCO just means that we train our model on one corpus and test it on each of the other corpora separately. |
---|
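A small sketch of how the LOCO and inverse-LOCO protocols can be organised; `train_fn` and `eval_fn` are placeholders for whichever of the three models is being evaluated (my own illustration, not the authors' code):

```python
def loco_and_inverse_loco(corpora, train_fn, eval_fn):
    """`corpora` maps corpus name -> dataset.

    LOCO: train on all corpora except one, test on the held-out corpus.
    Inverse LOCO: train on a single corpus, test on each of the others separately.
    """
    loco, inverse = {}, {}
    for held_out in corpora:
        rest = [data for name, data in corpora.items() if name != held_out]
        model = train_fn(rest)
        loco[held_out] = eval_fn(model, corpora[held_out])
    for source, data in corpora.items():
        model = train_fn([data])
        inverse[source] = {target: eval_fn(model, corpora[target])
                           for target in corpora if target != source}
    return loco, inverse
```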
0:14:08 | In this figure there is a pretty clear relation between VACC and SVC, as we can see. |
---|
0:14:14 | It is pretty natural that these corpora are perceived as similar by our system, because their domains are pretty close and they were both uttered in German, |
---|
0:14:28 | in contrast to HomeBank, which was uttered in English. As we can see from this figure, our linear model fails to find any direct relation between this corpus and the other two. |
---|
0:14:43 | But let's take a look at the next figure. |
---|
0:14:47 | Here we notice a very interesting trend: even though HomeBank differs significantly from the other two corpora, the linear model trained on any two corpora performs equally well on each of them, as if it had been trained on each corpus separately and tested on it separately. |
---|
0:15:14 | So it means that the datasets that we have are compatible, or at least not contradictory. |
---|
0:15:22 | Now let's take a look at our experiments with the LLD model and various context lengths per example. |
---|
0:15:33 | Here, in each of the three cases, red, green, and blue, we see that the dashed line is located above the solid one, meaning that mixup results in an additional performance improvement even when applied within the same corpus. |
---|
0:15:53 | and |
---|
0:15:54 | it is also interesting to note that a context length of two seconds turns out to be optimal for each of the corpora, given that they have very different utterance length distributions. So two seconds is sufficient to predict addressees using the acoustic modality alone. |
---|
0:16:16 | well |
---|
0:16:16 | unfortunately, mixup gives no performance improvement for the end-to-end model; probably we just don't have enough data to train it properly. |
---|
0:16:28 | We reproduced the same LOCO and inverse LOCO experiments with the neural network-based models, |
---|
0:16:37 | and they both show the same trend: SVC and VACC seem quite similar to them, and actually the end-to-end model managed to capture this similarity even better than the LLD one. |
---|
0:16:51 | But there is an issue with multitask learning. In particular, the issue is that our neural network, regardless of which one we start with, drifts to the easiest task, the one with the highest correlation between features and labels. |
---|
0:17:06 | Here you can see that the model trained on these two datasets behaves like this: it completely ignores HomeBank, even though it was trained on this corpus, and it even starts discriminating against this dataset. |
---|
0:17:25 | But the situation changes if we start applying mixup over all the corpora: the model then perceives both corpora really efficiently, as if it had been trained on each of the corpora separately and tested on each of them separately. |
---|
0:17:47 | We also conducted a similar experiment, just merging all three datasets, with and without mixup, using all three models. |
---|
0:17:58 | Here we can see that mixup regularises both the LLD and the end-to-end models and also prevents overfitting to a specific corpus, the one with the highest correlation between features and labels, as it is the easiest task for our system. |
---|
0:18:13 | But unfortunately, mixup does not provide any improvement for the FUNC model, though actually this model does not suffer from overfitting to a specific task and does not need to be regularised, due to its very simple architecture. |
---|
0:18:30 | The last series of experiments is with ASR confidence features. The idea behind them is that system-directed utterances tend to match the ASR acoustic and language models much better than human-directed utterances. |
---|
0:18:51 | This definitely works in the human-machine setting, but it does not seem to work in the adult-child setting. We analysed the data itself in depth and noticed that, when addressing children, people sometimes do not even use words; instead they just use some separate intonations or sounds, without any words. |
---|
0:19:23 | This causes real problems for our ASR, meaning that the ASR confidence is roughly equal for both of the target classes, and this is the reason why it performs so poorly on the HomeBank problem. |
---|
0:19:41 | So here we come to the conclusions. We can conclude that mixup improves classification performance for models with predefined feature sets, acts as a regulariser, and also enables multitask learning for both end-to-end models and models with predefined feature sets. |
---|
0:20:03 | Two-second-long speech fragments allow us to capture addressees with sufficient quality, and actually the same conclusion was drawn by another group of researchers for the English language. |
---|
0:20:21 | And, as I said a couple of slides before, ASR confidence is not representative for adult-child addressee detection, but it is still useful for human-machine addressee detection. Throughout our experiments we also beat a couple of baselines: we introduced the first official baseline for the VACC corpus, and we beat the HomeBank end-to-end baseline. |
---|
0:20:43 | As future directions, I would propose extending our experiments by applying mixup to two-dimensional spectrograms and to features extracted by the convolutional component. |
---|
0:20:54 | thank you |
---|
0:21:01 | we have time for some questions |
---|
0:21:04 | Hi. I was wondering why you chose to treat adult-child interaction like human-machine interaction. Is there any literature that led to this decision, or was it just sort of an intuition? |
---|
0:21:23 | It was our assumption, without any particular background. I mean, it was an interesting assumption, something to prove or disprove. |
---|
0:21:35 | Yes, and conceptually it should be like this: sometimes we perceive a system as an infant, or as a person having a lack of communication skills, and that is what we took as the basic assumption for our study. |
---|
0:21:55 | So for humans they are actually similar, conceptually not distinct? |
---|
0:22:05 | I think... yes, they probably overlap, but only partially. What counts is that in our experiments a single system is capable of solving both tasks simultaneously. |
---|
0:22:17 | But it performs far worse on the adult-child corpus. |
---|
0:22:22 | Yes, but that is because the baseline performance is far worse. I mean, the highest baseline on HomeBank is like 0.64, or 0.66, or something like this. |
---|
0:22:34 | Okay. |
---|
0:22:36 | So it is just a matter of the data quality. |
---|
0:22:48 | Hi, thanks for the interesting talk. I was wondering, and maybe I missed something: did you use any language features? No? Can you speculate whether they would have an impact on the performance? |
---|
0:23:09 | What do you mean by language features, you mean separate words, or...? |
---|
0:23:14 | For instance, if I'm talking to a child, I might address the child in a different way than I address adults. |
---|
0:23:17 | Okay, well, it's a difficult question. As I mentioned, sometimes when talking to a child we don't even use real words, and this is a problem for language modelling. |
---|
0:23:25 | Right, I mean, my hypothesis is that you would simplify the language you use when you're addressing a child compared to when you address an adult. |
---|
0:23:35 | Yes, we do, we do. My speculation on this would be: yes, we can try to leverage both the textual and the acoustic modalities to solve the same problem. |
---|
0:23:48 | Okay, next. |
---|
0:23:52 | Time for one more. |
---|
0:23:56 | More of a comment, really. I just wondered: have you checked how well you do with respect to the results of the competition? |
---|
0:24:07 | The same dataset, or a similar dataset, was used as part of the Interspeech ComParE challenge, and the accuracy they obtained, I think, was seventy point something. |
---|
0:24:17 | So I'm curious: did you look at the majority baseline? Are you predicting the majority class? Because it's essentially a binary class prediction you're doing, and one worry is that your model just learns how to predict the majority class. |
---|
0:24:31 | I mean, no: I use unweighted average recall, and if the model predicted just the majority class, meaning it assigned all the examples to one single class, |
---|
0:24:49 | it would mean that your performance metric would be no higher than 0.5, because it's a class-balanced metric. |
---|
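For reference, a tiny illustration of the point about unweighted average recall: because UAR averages the per-class recalls, a classifier that always outputs the majority class scores 0.5 on a binary task no matter how skewed the data are (the toy numbers below are made up):

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0] * 90 + [1] * 10)   # imbalanced binary labels (90/10)
y_pred = np.zeros_like(y_true)           # always predict the majority class
uar = recall_score(y_true, y_pred, average="macro")   # (1.0 + 0.0) / 2
print(uar)                               # 0.5
```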
0:25:02 | Sure, but for instance, if you look at the baseline from the challenge, that's about seventy point something. |
---|
0:25:16 | So you mean the baseline for the ComParE corpus obtained using the end-to-end model, or...? |
---|
0:25:23 | No, actually the end-to-end baseline was the worst baseline, not the sixty-four one. |
---|
0:25:26 | I remember the article released right before the submission deadline for the challenge, and the result there for the baseline end-to-end model was like 0.59 or so. |
---|
0:25:42 | All right, and if you mean the entire multimodal thing, that baseline was like 0.7 or so, but they used much larger feature sets for that, and several models, like an ensemble of models including bag-of-audio-words, end-to-end, LLDs, and all that stuff. |
---|
0:26:13 | okay let's thank our speaker again |
---|