Okay, so, hi everyone. I would like to talk about the new dataset that we have created at Heriot-Watt University. It is a dataset designed for end-to-end natural language generation. By that we mean generating text fully from data, from unaligned data pairs: that is, pairs of a meaning representation and the corresponding textual reference, with no other additional annotation.
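To make that concrete, here is a minimal sketch of what one such unaligned pair might look like; the attribute names and values are hypothetical illustrations, not the exact dataset format.

```python
# A hypothetical meaning representation (MR): a set of attribute-value pairs.
mr = {
    "name": "Loco",
    "eatType": "restaurant",
    "food": "Japanese",
    "priceRange": "cheap",
}

# The corresponding textual reference: a free-form human-written description.
# Note there is no word-level alignment between MR attributes and the text.
reference = "Loco is a restaurant serving cheap Japanese food."
```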
This has already been done, but so far all the approaches were limited to relatively small datasets, and all of them used delexicalization. These are the datasets you can see on the slide.
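As a rough illustration of what delexicalization means here: attribute values in the text are replaced with slot placeholders, so the generator only has to learn templates rather than the values themselves. The slot names below are assumptions for the sketch, not the conventions of any particular system.

```python
def delexicalize(text: str, mr: dict) -> str:
    """Replace MR attribute values in the text with slot placeholders."""
    for slot, value in mr.items():
        text = text.replace(value, f"X-{slot}")
    return text

mr = {"name": "Loco", "food": "Japanese"}
print(delexicalize("Loco serves Japanese food.", mr))
# -> "X-name serves X-food food."
```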
Our goal here is to go a bit further with the data-driven approach and to replicate the rich dialogue and discourse phenomena that had been targeted by earlier, non-end-to-end rule-based or statistical approaches.
What we have done is collect a new training dataset that should be challenging enough to elicit more interesting outputs, more interesting sentences. It is also much bigger than all the previous datasets: we have over fifty thousand pairs of meaning representations and textual references.
The textual references are longer, so we usually have more sentences describing one meaning representation, and the sentences themselves are also longer than in previous datasets.
We have also made the effort to collect the dataset in as diverse a way as possible, which is why we used pictorial instructions for crowd workers on a crowdsourcing website.
We have found that this leads to more diverse descriptions. If you look at these two examples, in one we have "a low-cost Japanese-style cuisine" and in the other "cheap Japanese food", so the descriptions are very diverse. There are also more of them on average than in most previous NLG datasets: we have more than eight reference texts per meaning representation.
We have evaluated the dataset in various ways and compared it with previous datasets in the same domain. We found that we have higher lexical richness, which means more diverse text in terms of the words used and a higher proportion of rare words in the data.
The sentences are also more syntactically complex on average: we have longer and more complex sentences.
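A minimal sketch of how such corpus statistics can be computed; the type-token ratio and mean sentence length used here are simple stand-ins for the richer lexical and syntactic measures one would use for a real comparison.

```python
import re

def corpus_stats(texts):
    """Compute simple diversity/complexity proxies for a list of references."""
    tokens = [t.lower() for text in texts for t in re.findall(r"\w+", text)]
    type_token_ratio = len(set(tokens)) / len(tokens)  # lexical richness proxy
    sentences = [s for text in texts for s in re.split(r"[.!?]+", text) if s.strip()]
    mean_sent_len = len(tokens) / len(sentences)       # complexity proxy
    return type_token_ratio, mean_sent_len

ttr, msl = corpus_stats(["Loco is a restaurant serving cheap Japanese food.",
                         "You can get low-cost Japanese-style cuisine at Loco."])
print(f"TTR: {ttr:.2f}, mean sentence length: {msl:.1f} tokens")
```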
We have also set up a kind of semantic challenge, because we asked the crowd workers to verbalise only the information that seemed relevant given the instruction picture. This actually requires content selection as part of natural language generation, which is not present in previous datasets of the same type.
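A small sketch of why this matters: one can check which MR values actually appear in a reference, and when some are omitted, a generator must learn which content to select. The matching below is naive string containment, purely for illustration.

```python
mr = {"name": "Loco", "food": "Japanese", "priceRange": "cheap", "area": "riverside"}
reference = "Loco serves cheap Japanese food."

# Naive check: which MR values are verbalised in the reference?
realised = {slot for slot, value in mr.items() if value.lower() in reference.lower()}
omitted = set(mr) - realised
print("realised:", realised)  # {'name', 'food', 'priceRange'}
print("omitted:", omitted)    # {'area'} -> the worker chose not to mention it
```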
We are also organising a shared challenge with this dataset, so you can all register for the challenge, and we would like to encourage you to do so: try to train your own NLG system and submit your results by the end of October. We provide the data and also a baseline system, along with the baseline system outputs and the metric scripts that will be used for the challenge, together with some human evaluation.
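As an illustration of the kind of automatic scoring involved, here is a hedged sketch using corpus-level BLEU from NLTK against multiple references per meaning representation; BLEU is just one common NLG metric, not necessarily the exact scripts provided for the challenge.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Multiple tokenised human references per MR, one system output per MR.
references = [
    [["loco", "serves", "cheap", "japanese", "food", "."],
     ["you", "can", "get", "low-cost", "japanese-style", "cuisine", "at", "loco", "."]],
]
hypotheses = [["loco", "offers", "cheap", "japanese", "food", "."]]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```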
So, that is it. I would like to invite you to come see our poster later on, and we can talk about this some more. And definitely download the data and take part in our challenge. Thank you.