Okay, so, hi everyone. I would like to talk about the new dataset that we have created at Heriot-Watt University. It is a dataset designed for end-to-end natural language generation. By that we mean generating text fully from data, from unaligned data pairs: a pair of a meaning representation and the corresponding textual reference, with no other additional annotation.
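To make that concrete, here is a minimal sketch of what such an unaligned pair might look like and how the meaning representation could be parsed; the specific slots and values are illustrative, not an exact excerpt from the dataset:

```python
import re

# One unaligned data pair: a meaning representation (MR) given as
# attribute[value] slots, and a free-text reference describing it.
# (Slot names and values here are illustrative.)
mr = "name[The Eagle], eatType[coffee shop], food[Japanese], priceRange[cheap]"
ref = "The Eagle is a cheap Japanese coffee shop."

# Parse the MR string into a dict of slot -> value.
slots = dict(re.findall(r"(\w+)\[(.*?)\]", mr))
print(slots)  # {'name': 'The Eagle', 'eatType': 'coffee shop', ...}
```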
This has already been done, but so far all the approaches were limited to relatively small datasets, and all of them used delexicalization; a sketch of that follows.
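Delexicalization replaces sparse slot values, such as venue names, with placeholder tokens, so the generator only has to learn sentence templates rather than every individual value. A minimal sketch of the idea, with my own placeholder naming:

```python
def delexicalize(text, slots):
    """Replace slot values in the text with placeholder tokens, so the
    model does not have to learn every venue name or cuisine type."""
    for slot, value in slots.items():
        text = text.replace(value, f"X-{slot}")
    return text

print(delexicalize("The Eagle serves Japanese food.",
                   {"name": "The Eagle", "food": "Japanese"}))
# -> 'X-name serves X-food food.'
```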
These are the datasets you can see on the slide. Our goal here is to go a bit further with the data-driven approach and to replicate the rich dialogue and discourse phenomena that had been targeted by the earlier, non-end-to-end rule-based or statistical approaches.
What we have done is collect a new training dataset that should be challenging enough to elicit more interesting outputs, more interesting sentences. It is also much bigger than all the previous datasets: we have over fifty thousand pairs of meaning representations and textual references. The textual references are longer, so we usually have several sentences describing one meaning representation, and the sentences themselves are also longer than in previous datasets.
We have also made the effort to collect the dataset in as diverse a way as possible, and that is why we used pictorial instructions for crowd workers on a crowdsourcing website. We have found that this leads to more diverse descriptions. If you look at these two examples, here we have "low-cost Japanese-style cuisine" and here we have "cheap Japanese food", so the descriptions are quite diverse. There are also more of them on average than in most previous NLG datasets: we have more than eight reference texts per meaning representation.
We have evaluated the dataset in various ways and compared it with the previous datasets in the same domain. We have found that it has higher lexical richness, which means more diverse text in terms of the words used, and a higher proportion of rare words in the data.
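Lexical richness can be quantified in several ways; as one illustration (not necessarily the exact measures used for the dataset), a type-token ratio and a rare-word proportion could be computed like this, with an arbitrary rarity threshold:

```python
from collections import Counter

def lexical_stats(tokenized_refs, rare_threshold=2):
    """Type-token ratio and the proportion of word types occurring
    at most `rare_threshold` times in the whole corpus."""
    counts = Counter(tok for ref in tokenized_refs for tok in ref)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    rare_types = sum(1 for c in counts.values() if c <= rare_threshold)
    return n_types / n_tokens, rare_types / n_types

refs = [["the", "eagle", "serves", "cheap", "japanese", "food"],
        ["the", "eagle", "is", "a", "coffee", "shop"]]
ttr, rare = lexical_stats(refs)
print(f"type-token ratio: {ttr:.2f}, rare-word proportion: {rare:.2f}")
```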
The sentences are also on average more syntactically complex, so we have longer and more complex sentences.
We have also set up a kind of semantic challenge, because we asked the crowd workers to verbalise only the information that seemed relevant given the picture in the instructions. So the task actually requires content selection as part of natural language generation, which is not present in the previous datasets of the same type.
We are also organising a shared challenge with this dataset. You can all register for the challenge, and we would like to encourage you to do so: try to train your own NLG system and submit your results by the end of October. We provide the data and also a baseline system, along with the baseline system outputs and the metric scripts that will be used for the challenge, along with some human evaluation.
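Just to illustrate the kind of word-overlap scoring such metric scripts perform (this is an NLTK-based sketch, not the challenge's own evaluation script), each system output is scored against all human references for its meaning representation:

```python
from nltk.translate.bleu_score import corpus_bleu

# All human references for one meaning representation...
references = [[["the", "eagle", "serves", "cheap", "japanese", "food"],
               ["the", "eagle", "is", "a", "cheap", "japanese", "restaurant"]]]
# ...and one system output to score against them.
hypotheses = [["the", "eagle", "serves", "cheap", "japanese", "sushi"]]

print(f"BLEU: {corpus_bleu(references, hypotheses):.3f}")
```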
So, that is it. I would like to invite you to come and see our poster later on, and we can talk about this some more. And definitely do download the data and take part in our challenge. Thank you.