Okay, so, hi everyone. I would like to talk about the new dataset that we have created at Heriot-Watt University. It is a dataset designed for end-to-end natural language generation. By that we mean generating fully from data, from unaligned data pairs, that is, pairs of a meaning representation and the corresponding textual reference, with no further additional annotation.
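Such a pair can be sketched as follows; this is a minimal illustration of a flat attribute-value meaning representation in the restaurant domain, and the attribute names, values, and restaurant name here are hypothetical examples, not taken from the actual dataset.

```python
# Minimal sketch of an unaligned MR/reference pair, assuming a flat
# "attr[value], attr[value], ..." meaning representation (illustrative
# attribute names; not the dataset's actual format).
def parse_mr(mr: str) -> dict:
    """Parse 'attr[value], attr[value], ...' into an attribute dict."""
    pairs = {}
    for slot in mr.split(", "):
        attr, _, rest = slot.partition("[")
        pairs[attr] = rest.rstrip("]")
    return pairs

mr = "name[Wildwood], food[Japanese], priceRange[cheap]"
reference = "Wildwood serves cheap Japanese food."

print(parse_mr(mr))
# {'name': 'Wildwood', 'food': 'Japanese', 'priceRange': 'cheap'}
```

Note that nothing aligns individual words of the reference to individual attributes; the pair as a whole is the training signal.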

This has already been done, but so far all the approaches were limited to relatively small datasets, and all of them used delexicalization. These are the datasets you can see on the slide.
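Delexicalization, as used in those earlier approaches, replaces concrete slot values in the text with placeholder tokens so the generator effectively learns templates. A minimal sketch, where the slot names, placeholder format, and example sentence are assumptions for illustration:

```python
# Delexicalization sketch: replace MR slot values occurring in the text
# with placeholder tokens, yielding a template-like training sentence.
def delexicalize(text: str, slots: dict) -> str:
    for attr, value in slots.items():
        text = text.replace(value, "X-" + attr)
    return text

slots = {"name": "Wildwood", "food": "Japanese"}
print(delexicalize("Wildwood serves Japanese food.", slots))
# X-name serves X-food food.
```

The placeholders are filled back in after generation; avoiding this step is part of what makes a fully data-driven setup harder.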

Our goal here is to go a bit further with the data-driven approach and to replicate the rich dialogue and discourse phenomena that had been targeted by earlier, non-end-to-end rule-based or statistical approaches.

What we have done is collect a new training dataset that should be challenging enough to elicit more interesting outputs, more interesting sentences. It is also much bigger than all the previous datasets: we have over fifty thousand pairs of meaning representations and textual references.

The textual references are longer, so we usually have more sentences describing one meaning representation, and the sentences themselves are also longer than in previous datasets. We have also made the effort to collect the dataset in as diverse a way as possible, and that is why we used pictorial instructions for crowd workers on a crowdsourcing website.

and

we have found out that this leads to more divers descriptions so

if you if you look at these two examples

you we have a low cost

japanese-style cuisine and

you we have cheap japanese food so the

descriptions are very diapers and

also there's more of them on average than in most previous nlg datasets we have

more than eight

our preference texts better meaning representation

We have evaluated the dataset in various ways and compared it with the previous datasets in the same domain. We have found that it has higher lexical richness, which means more diverse text in terms of the words used and a higher proportion of rare words in the data. The sentences are also on average more syntactically complex, so we have longer and more complex sentences.
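Lexical richness in this sense can be quantified, for instance, as the number of distinct word types and the share of rare (low-frequency) types. A rough stdlib sketch, where the tiny example corpus and the rarity threshold are made up for illustration:

```python
from collections import Counter

def lexical_stats(sentences, rare_max_count=1):
    """Return (distinct word types, proportion of rare word types)."""
    tokens = [w for s in sentences for w in s.lower().split()]
    counts = Counter(tokens)
    rare = sum(1 for c in counts.values() if c <= rare_max_count)
    return len(counts), rare / len(counts)

corpus = ["cheap japanese food", "low cost japanese style cuisine"]
types, rare_share = lexical_stats(corpus)
print(types, round(rare_share, 2))
# 7 0.86
```

Real lexical-richness measures also normalise for corpus size (e.g. type-token ratios over fixed-length samples), since raw type counts grow with the amount of text.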

We have also set up a kind of semantic challenge, because we asked the crowd workers to verbalise only the information that seemed relevant given the instruction picture. This actually requires content selection as part of natural language generation, which is not present in the previous datasets of the same type.

We are organising a shared challenge with this dataset. You can all register for it, and we would like to encourage you to do so: try to train your own NLG system and submit your results by the end of October. We provide the data and also a baseline system, along with the baseline system outputs and the metric scripts that will be used for the challenge, together with some human evaluation.
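The talk does not name the exact metrics, but automatic NLG metrics of the usual kind score a system output against the multiple human references. As a simplified stand-in for metrics like BLEU (not the challenge's actual scripts), here is a minimal multi-reference unigram-precision sketch:

```python
# Simplified multi-reference metric sketch: the fraction of hypothesis
# words that appear in at least one reference. Real BLEU additionally
# clips repeated words, uses higher-order n-grams, and penalises brevity.
def unigram_precision(hypothesis: str, references: list) -> float:
    hyp = hypothesis.lower().split()
    ref_words = set(w for r in references for w in r.lower().split())
    if not hyp:
        return 0.0
    return sum(w in ref_words for w in hyp) / len(hyp)

refs = ["cheap japanese food", "low cost japanese style cuisine"]
print(unigram_precision("cheap japanese cuisine", refs))
# 1.0
```

Having more than eight references per meaning representation makes such overlap-based scores considerably more forgiving of legitimate paraphrases.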

So, that is it. I would like to invite you to come and see our poster later on, and we can talk about this some more. And definitely download the data and take part in the challenge. Thank you.