Good morning everyone. Our next speaker is a senior researcher at the French National Centre for Scientific Research (CNRS), working in computer science. Her research interests include databases and statistical models for natural language, the acquisition of lexical resources for NLP, syntactic and semantic parsing, and technology for language learning. She works in particular on natural language generation from syntactic and semantic representations and has led a shared task on generating text from structured data. This morning her talk will be about neural models for natural language generation across a variety of different types of NLG tasks.
Can you hear me? Can you hear in the back? Okay, so good morning, and thank you for being here after last night. When I was invited to this workshop I was of course pleased to come and give a talk, but then I was a bit worried about the title of the workshop: synthesis. As the introduction showed, I don't do synthesis, I don't do speech; in fact I only work on text. But of course there is a link between text-to-speech synthesis and generation, which is what I have been working on in recent years: natural language generation is the task of producing text, so you can see natural language generation as a step that precedes text-to-speech synthesis. What I am going to talk about today are different types of natural language generation tasks. So let's start.
I will start with an introduction to how generation was done before deep learning, then I will show how the deep learning paradigm completely changed the approach to natural language generation, and finally I will talk about some issues with current neural approaches to text generation.
The work I present here is joint work with PhD students and colleagues whom I want to acknowledge here: a PhD student at the lab where I am based, a colleague from Bar-Ilan University, a PhD student jointly supervised between FAIR in Paris and our lab, colleagues in our group and at Darmstadt University, and finally other PhD students under joint supervision.
Okay, so first: what is natural language generation? It is the task of producing text, but it is very different from natural language understanding because of the input. In natural language understanding the input is text; it is well defined, everybody agrees on that, and large quantities of text are available. In natural language generation it is very different, in that the input can be many different things. This is actually one of the reasons why the natural language generation community stayed very small for a very long time: compared to NLU, the number of papers on NLG was very small.
So what are the types of input? There are three basic types: data, meaning representations and text. Data would be, for example, data from databases or knowledge bases, structured data. Then you have meaning representations, which are devised by linguists and can be produced by computational linguistics tools; they are basically designed to represent the meaning of a sentence, sometimes of a text, but more generally of a sentence or a dialogue turn. Sometimes you want to generate from these meaning representations, for example in the context of a dialogue system: the dialogue manager produces a meaning representation of the system dialogue turn, and the generation task is to generate the text of that turn, the system turn, in response to this meaning representation. And finally you can generate from text, in applications such as text summarization, text simplification or sentence compression.
Those are the main types of input. Another complicating factor is that what we call the communicative goal can be very different. Sometimes you want to verbalise: for instance, if you have a knowledge base, you might want the system to simply verbalise its content so that it is readable by human users. But other goals would be to respond to a dialogue turn, to summarize a text, to simplify a text, or even to summarize the content of a knowledge base. These two factors meant that, until recently, natural language generation was divided into many different subfields, which did not help given that the community was already pretty small, and there was not much communication between those subfields.
So why did we have these different subfields? Because essentially the problems are very different. When you are generating from data, there is a big gap between the input and the output text: the input is a data structure that does not look like text at all; it can even be the result of signal processing, numbers from a sensor, raw numbers. The input data is very different from text, and to bridge this gap you have to do many things. Essentially, you have to decide what to say and how to say it. "What to say" is more of an AI problem: deciding which parts of the data you want to select and actually verbalise, because if you verbalised all the numbers given by a sensor it would just make no sense at all; the resulting text would make no sense at all. So you have a content selection problem. Then you usually have to structure the selected content into something that resembles the structure of a text; this is more like an AI planning problem, and it was in fact often handled with planning techniques. And then there is the more linguistic part: once you have the text structure, how do you convert it into well-formed text? There you have to make many choices, so generation is really a choice problem, because there are many different ways of realizing things.
So you have problems such as lexicalisation: which word do you choose for a given symbol? Referring expression generation: how do you describe an entity, are you going to use a pronoun or a proper name? Aggregation: how do you avoid repetitions, basically deciding when to use constructions such as ellipsis or coordination to avoid redundancy in the output text, even if there is redundancy in your knowledge base. So, for generating from data, the consensus was that there was this big NLG pipeline where you had to model all of these subproblems.
If you generate from a meaning representation, the task was seen as completely different, mainly because the gap between the meaning representation and the sentence is much smaller; in fact these meaning representations were devised by linguists. The consensus here was that you could write a grammar that describes the mapping between text and meaning. People wrote reversible grammars describing exactly this association, and because it is a grammar it also includes the notion of syntax, so it ensures that the text will be well formed. The idea was: you have a grammar that defines this association between text and meaning, and you can use it in both directions. Either you have a text and you use the grammar to derive its meaning, or you use it for generation: you start from the meaning and you use the grammar to decide what the corresponding sentences licensed by the grammar are. Of course, as soon as these grammars have large coverage they become very ambiguous, so there is a huge ambiguity problem; it is basically not tractable, you get thousands of intermediate results, thousands of outputs, and the search space is huge. So usually you combine the grammar with statistical modules that are designed to reduce the search space and to limit the output to one or a few outputs.
And finally, generating from text: here again the approach was very different. The main consensus was that when you generate from text there are basically four operations you want to model, or some of them depending on the application: split, rewrite, reorder and delete. Split is about learning when to split a long sentence into several sentences, for example in simplification, where you want to simplify a text. Reordering is moving constituents or words around, again because maybe you want to simplify or to paraphrase; paraphrasing is another text-to-text generation application. Rewriting, again maybe to simplify or to paraphrase, means rewriting a word or a phrase. And you want to decide what you can delete, in particular if you are doing simplification. So in general there were three very different approaches to those three tasks, depending on what the input is.
And this completely changed with the neural approach; it really changed the field. Before, generation was a very small field, and now at ACL, the main computational linguistics conference, generation is one of the topics with the highest number of submissions, I think the second-highest. So it has changed completely. Why? Because the encoder-decoder framework allows you to model all three tasks in the same way. All the techniques and methods that you develop to improve the encoder-decoder framework may be novel, but it is a common framework, which makes it much easier to take ideas from one subfield to another. The encoder-decoder framework is very simple: you have your input, which can be data, text or a meaning representation; you encode it into a vector representation; and then you use the power of a neural language model to decode, so the decoder produces the text one word at a time using a recurrent network. And we know that neural language models are much more powerful than previous language models because they can take an unlimited amount of context into account.
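To make this concrete, here is a minimal sketch of that encoder-decoder idea in PyTorch, my own illustration rather than any of the actual systems discussed in this talk: a recurrent encoder compresses the (linearised) input into a vector, and a recurrent decoder conditioned on that vector produces the output one word at a time.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: any linearised input (data, meaning
    representation or text) goes in as a token sequence; the decoder then
    emits the output text one word at a time."""
    def __init__(self, in_vocab, out_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(in_vocab, dim)
        self.tgt_emb = nn.Embedding(out_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, out_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))       # encode the input into a vector
        out, _ = self.decoder(self.tgt_emb(tgt), state)   # condition the decoder LM on it
        return self.proj(out)                             # next-word scores at every step

model = Seq2Seq(in_vocab=5000, out_vocab=5000)
logits = model(torch.randint(0, 5000, (2, 12)), torch.randint(0, 5000, (2, 9)))
print(logits.shape)  # (2, 9, 5000)
```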
Okay, so we have this unifying framework, but, and this is what I want to stress in this talk, the problems still remain: the tasks are different and you still have to handle those differences somehow. So in this talk, based on some work we have been doing, I will focus on two main points: how to improve encoding, or how to adapt encoding to the various NLG tasks, and, if I have time, a little bit about training data. There is often a problem of data sparsity: these are usually supervised approaches, and supervised means you need training data, and in this case the training data has to pair texts with the inputs. But these inputs can be very hard to get: these meaning representations, where do you get them from? Even getting an alignment, a parallel corpus between database fragments and the corresponding texts, is very difficult to get right. So often you do not have much data, and of course neural networks want a lot of data, so often you have to be clever about what you do with the training data.
Okay, so on encoding I will talk about three different points. The first is modelling graph-structured input. In the encoder-decoder framework, at least in the first years, the encoder was usually a recurrent network: no matter whether the input was a text, a meaning representation, a graph or a knowledge base, people were using a recurrent network, partly, I think, because the encoder-decoder framework was very successful for machine translation and people were building on that. But of course, after a while, people realised that some of these inputs are graphs, so maybe it is not such a good idea to model them as sequences; let's do something else. So we will talk about how to model graph-structured input. Then I will talk about generating from text, where I will focus on an application where the input is a very large quantity of text. The problem is that neural networks are only so good at encoding large quantities of text; it is a known issue, for instance in machine translation, that the longer the input, the worse the performance. And here we are not talking about long sentences; we are talking about really long texts of hundreds of thousands of tokens. So what do you do in that case if you still want to do text-to-text generation? And I will talk a little bit about generalisation: some devices that can be used in certain applications, again because the data is not so big, to improve your model so that it generalises better.
Okay, so first: encoding graphs. As I said, graph-shaped inputs occur, for example, when the input is a meaning representation. Here is an example from the AMR-to-text 2017 challenge. AMR means Abstract Meaning Representation; you can more or less consider it a standard. It is a meaning representation that can be written the way it is written on the right, but basically you can see it as a graph where the nodes are the concepts and the edges are the relations between the concepts. So this meaning representation would correspond to the sentence "US officials held an expert group meeting in January 2002 in New York": at the top of the tree you have the "hold" concept, then the ARG0 is a "person", and then "country", which basically refers to the United States, and then there are the other concepts. So the task was to generate from this AMR, and the AMR can be seen as a graph. There was another challenge in 2017, the WebNLG challenge, which was about generating from sets of RDF triples.
Here, what we did is we extracted sets of RDF triples from DBpedia, we had a method to ensure that these sets of triples could be matched to a meaningful, coherent text, and then we had crowdworkers associate the sets of triples with the corresponding text. So the dataset in this case is a parallel dataset where the input is a set of triples and the output is a text that verbalises the content of these triples. You probably cannot read it here, but in the example I show you have three triples, each of the form subject, property, object: the first one is roughly "John Blaha, dateOfBirth, 1942", then "John Blaha, birthPlace, San Antonio", and then "John Blaha, occupation, fighter pilot". So you have these three triples, and the task would be to generate something like "John Blaha, born in San Antonio in 1942, worked as a fighter pilot". That was the task. And the point here, again, is that when you are generating from data like this, the data can be seen as a graph, where the nodes are the subjects and objects, the entities in your triples, and the edges are the relations between them.
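As a toy illustration, my own sketch with property names that only approximate the slide, such a set of triples can be read directly as a labelled graph whose nodes are the entities and whose edges are the properties:

```python
# Triple set in the spirit of the example above (property names are indicative only).
triples = [
    ("John_Blaha", "dateOfBirth", "1942"),
    ("John_Blaha", "birthPlace", "San_Antonio"),
    ("John_Blaha", "occupation", "Fighter_pilot"),
]

# Read the triples as a graph: entities become nodes, properties become labelled edges.
nodes = {entity for subj, _, obj in triples for entity in (subj, obj)}
edges = {(subj, obj): prop for subj, prop, obj in triples}

print(nodes)  # {'John_Blaha', '1942', 'San_Antonio', 'Fighter_pilot'}
print(edges)  # {('John_Blaha', '1942'): 'dateOfBirth', ...}
# Target text: "John Blaha, born in San Antonio in 1942, worked as a fighter pilot."
```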
Okay. As I said, initially, for these two tasks, people were simply using recurrent networks: they linearised the graph, just doing a traversal of the graph using some kind of traversal method, which gave them a sequence of tokens, and then they encoded that sequence with a recurrent network. So here you have an example where the tokens input to the RNN are basically the concepts and the relations that are present in the meaning representation, and then you decode from that. Okay.
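A sketch of what such a baseline linearisation might look like, my own simplification rather than the exact traversal used in those systems, and the concept names are only indicative: a depth-first traversal flattens the graph into a sequence of concept and relation tokens, which is then fed to the sequential encoder.

```python
def linearise(graph, node, visited=None):
    """Depth-first traversal of a rooted graph given as {node: [(relation, child), ...]}.
    Produces the token sequence that a sequential (RNN) encoder would consume."""
    visited = set() if visited is None else visited
    visited.add(node)
    tokens = [node]
    for relation, child in graph.get(node, []):
        tokens.append(relation)
        if child in visited:
            tokens.append(child)  # re-entrant node: just repeat its label
        else:
            tokens += ["("] + linearise(graph, child, visited) + [")"]
    return tokens

# Toy AMR-like graph, loosely inspired by the "hold an expert group meeting" example.
amr = {
    "hold-04": [(":ARG0", "person"), (":ARG1", "meet-03"), (":time", "date-entity")],
    "person": [(":ARG0-of", "have-org-role-91")],
}
print(" ".join(linearise(amr, "hold-04")))
```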
So, of course, there are problems. Intuitively it is not very nice to model a graph as a sequence, and technically there are also problems: local dependencies in the graph can become long-range in the sequence. These two edges here are at the same distance from their node in the initial graph, but once the graph is linearised you see that the first one ends up much closer to the node it attaches to than the second one. So the linearisation is creating long-range dependencies, and we know that LSTMs are not very good at dealing with long-range dependencies. So technically, too, you start thinking that maybe it is not such a great idea.
So people have looked at this and proposed various graph encoders. The idea is that, instead of using an LSTM to encode your linearised graph, you use a graph encoder that directly models the relations between the nodes inside the graph, and then you decode from the output of the graph encoder. There were several proposals, which I will not go into in detail: Damonte and Cohen proposed a graph convolutional network, and the other two approaches use some kind of graph recurrent network.
Okay, so we built on this idea. I started with this introduction on pre-neural NLG because I think it is quite important to know about the history of NLG in order to have ideas about how to improve the neural approaches, and here our proposal was really based on previous work on grammar-based generation, this idea that you have a grammar that you can use to produce text. In this previous work, what people showed is: you have a grammar and a meaning representation, and you use the grammar to tell you which sentences are associated with this meaning representation. You can see that this is like a reverse parsing problem. If I say: you have a sentence, you have a grammar, and you want to determine which meaning representations or syntactic trees the grammar associates with that sentence, that is a parsing problem. What we are doing here simply reverses the problem: instead of starting from the text, I start from the meaning representation and ask which sentences the grammar associates with it. So it was a reverse parsing problem, and people started working on this reverse parsing problem to generate sentences, and they found it was a very hard problem because of all the ambiguity. They had two types of algorithms, bottom-up and top-down: either you start from the meaning representation and try to build the syntactic tree that is allowed by the grammar, and out of that you get the sentence, or you go top-down, so you use the grammar and try to build the derivation that is going to map onto your meaning representation. There were these two approaches, they both had problems, and what people did in the end was combine them: they used hybrid algorithms that exploit both top-down and bottom-up information.
So here is what we did. Those graph encoders produce a single representation, a single graph encoding of the input graph, the input meaning representation. What we wanted was to use this idea that both bottom-up and top-down information are important. So we encode each node in the graph using two encoders: one that goes top-down through the graph, and another that goes bottom-up through the graph. What this gives us is that each node in the graph ends up with two encodings, one reflecting the top-down view of the graph and the other the bottom-up view of the graph.
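A rough sketch of the idea, my own simplification rather than the exact architecture: run one message-passing pass along the edge directions (top-down), a second pass along the reversed edges (bottom-up), and give every node the pair of resulting encodings next to its label embedding.

```python
import torch
import torch.nn as nn

class DirectedGraphEncoder(nn.Module):
    """One round of message passing along the given edge direction."""
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, node_emb, edges):
        # edges: list of (source_index, target_index) pairs
        messages = torch.zeros_like(node_emb)
        for src, tgt in edges:
            messages[tgt] += node_emb[src]        # pass information along the edge
        return torch.relu(self.combine(torch.cat([node_emb, messages], dim=-1)))

dim, num_nodes = 64, 5
edges = [(0, 1), (0, 2), (2, 3), (2, 4)]          # a small rooted graph (parent, child)
node_emb = torch.randn(num_nodes, dim)            # label (concept) embeddings

top_down = DirectedGraphEncoder(dim)(node_emb, edges)                        # parents -> children
bottom_up = DirectedGraphEncoder(dim)(node_emb, [(c, p) for p, c in edges])  # children -> parents
node_repr = torch.cat([node_emb, top_down, bottom_up], dim=-1)  # label + both views per node
print(node_repr.shape)  # (5, 192)
```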
In terms of numbers, we could show that we outperformed the state of the art. Those were the state-of-the-art results at the time; this one is more recent, so of course we are no longer state of the art, but at the time we were, so we were improving a little bit over the previous approaches. More important than those numbers, BLEU scores I think, is the question of evaluation, which is always very difficult for generated text, because you cannot look at the outputs one by one. You would have to do human evaluation; but if you have large quantities of output, and if you want to compare many systems, you need automatic metrics. What people use is BLEU, borrowed from machine translation, and it has well-known problems: you can generate a perfectly correct sentence that matches the input, but if it does not look like the reference sentence, which is what you compute your BLEU against, it will get a very low score. So you need some other kind of evaluation, which is what we tried to do here. One known problem with neural networks is semantic adequacy: they generate very nice-looking texts, because these neural language models are very powerful, but often the output does not match the input. That is problematic, because in a generation application the output has to match the input; otherwise generation is dangerous to use, in a way.
So what we tried to do here is measure the semantic adequacy of a generator, meaning how well the generated text matches the input. We used a textual entailment system, which, given two sentences, basically tells you whether the first one entails the other, whether the second sentence is implied by, entailed by, the first. If you check both directions on two sentences T and Q, so T entails Q and Q entails T, then T and Q are semantically equivalent; logically, that would be the thinking. So we did something similar: we wanted to check semantic equivalence on text, so we used these tools that have been developed in computational linguistics to determine whether two sentences stand in an entailment relation, and we looked at both directions, comparing the reference and the generated sentence. What you see here is that the graph approach is much better at producing sentences that are entailed by the reference, and also much better at producing sentences that entail the reference.
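The check itself is simple once you have an entailment classifier; here is a sketch where `entails(premise, hypothesis)` is a placeholder for whatever textual entailment (NLI) system you plug in, since the talk does not name a specific one.

```python
def entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for a textual-entailment (NLI) classifier: returns True
    if the premise entails the hypothesis."""
    raise NotImplementedError("plug in your favourite entailment model here")

def semantic_adequacy(reference: str, generated: str) -> dict:
    """Bidirectional entailment check between reference and generated text.
    If each entails the other, the two texts are semantically equivalent."""
    ref_entails_gen = entails(reference, generated)   # nothing unsupported was added
    gen_entails_ref = entails(generated, reference)   # nothing from the reference was lost
    return {
        "entailed_by_reference": ref_entails_gen,
        "entails_reference": gen_entails_ref,
        "semantically_equivalent": ref_entails_gen and gen_entails_ref,
    }
```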
We also did a human evaluation, where we basically asked the human evaluators two questions: is the output semantically adequate, does the output text match the input, and is it readable? Again, the orange bars are our system and the others are the sequential systems, and you see that there is a large improvement. So this all points in the same direction: using a graph encoder when you have a graph, at least a meaning representation graph, is a good idea.
Okay, another thing we found is that it is often valuable to combine local and global information: local information meaning local to the node in the graph, and global meaning information about the structure of the surrounding graph. This is still the same dual top-down/bottom-up model; this is a picture of the system. We have this graph encoder that encodes the top-down view and the bottom-up view of the graph, and these are the resulting encodings of the nodes. So you end up with three embeddings for each node: one is basically the embedding of the node label, the concept, so it is essentially a word embedding, and the other two are the bottom-up and the top-down embeddings of the node. What we then do is pass them through an LSTM, so we have a notion of context for each node, given by the preceding nodes in the graph, and we found that this also improves our results.
We also applied this local-plus-global idea to another task, the Surface Realisation shared task, another challenge, on generating from unordered dependency trees. Here the input meaning representation is an unordered dependency tree whose nodes are decorated with lemmas, so it looks something like this, and what you have to do is generate a sentence from it. Basically the task has two subtasks: one is how to reorder the lemmas into a correct sentence, and then, once you have the correct order, how to inflect the words; for example, you want "apple" to become "apples" in this case. So we worked on that; this was also joint work with PhD students and colleagues.
The way we handled this was as follows, and here I am only focusing on the word ordering problem, how to map this unordered tree to a sequence of elements; I am not talking about the inflection problem. We basically binarise the tree first, so everything becomes binary, and then we use a multi-layer perceptron to decide on the order of each child with respect to its head. We build a training corpus where we say: I know from the reference corpus that "I" precedes "likes", et cetera. So this is the training corpus, and the task is basically: given the child and the head, the parent, do you order the parent first or the parent second? And again we found that combining local and global information helps. This is a picture of the model: you have the embeddings of your two nodes and you concatenate them, so you build a new representation, and then you also add the embedding of the nodes that are below the parent node, the subtree dominated by the parent node. And again we found that if you combine these two kinds of information, you get a much better result on the word ordering task.
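A sketch of this ordering classifier, my own rendering of the idea rather than the exact model: concatenate the head embedding, the child embedding and an embedding of the subtree dominated by the head, and let a small MLP decide whether the child is realised before or after its head.

```python
import torch
import torch.nn as nn

class OrderClassifier(nn.Module):
    """For a (head, child) pair in the binarised tree, decide whether the child
    should be realised before or after its head."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2),          # 0 = child before head, 1 = child after head
        )

    def forward(self, head_emb, child_emb, subtree_emb):
        # local information: the two nodes; global information: the dominated subtree
        return self.mlp(torch.cat([head_emb, child_emb, subtree_emb], dim=-1))

dim = 100
clf = OrderClassifier(dim)
head, child = torch.randn(1, dim), torch.randn(1, dim)
subtree = torch.randn(1, dim)                  # e.g. mean of the embeddings below the head
print(clf(head, child, subtree).softmax(-1))   # P(child before head), P(child after head)
```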
What this shows is that taking into account, in this case, the subtree, the top-down view from that node, really helps. Here again you have BLEU scores: this is the basic version, this one is with data expansion, which I will talk about later, and this is the one with the new encoder, where you do take this additional global information into account, and you see there is quite a big improvement.
Okay, so that was about encoding graphs. What I want to talk about now is what you can do when the input text is very large. In particular, we looked at two different tasks: one is question answering on free-form web text, and the other is multi-document summarization. In the first task you have a query, and what you are going to do is retrieve a lot of information from the web, a lot meaning something like two hundred thousand tokens, basically the first hundred or so web hits, and then you use all this text, plus the question, as input to generation, and the task is to generate a summary that answers the question. So it is quite a difficult text-to-text generation task. The other one is multi-document summarization: we use a Wikipedia dataset where you take the title of a Wikipedia article, you retrieve information from the web using this title as the query, and the goal is to generate a paragraph about this title, basically the first paragraph of that Wikipedia page. Here is an example of the first task. This is the ELI5 setup, "Explain Like I'm Five": you ask the question as if you were five years old, so the answer should be in simple language. The question is: why are consumers still terrified of genetically modified organisms, when there is little debate in the scientific community over whether they are safe or not? Then you retrieve some documents from a web search, and this here would be the target answer. So not only is the input text very long, the output text is also not short: it is not a single sentence, it is really a paragraph.
So the question is how to encode two hundred thousand tokens and then generate from them. Previous work took a way out, in a sense: it used TF-IDF to select the most relevant web hits, or even sentences, so it did not take the whole result of the web search; it limited the input to a few thousand words using basically a TF-IDF score. What we wanted to see is whether there was a way to encode all of these two hundred thousand words retrieved from the web, encode them and use them for generation. The idea was to convert the text into a graph, in this case not a graph encoding, but rather a knowledge graph of the kind we used to build in information extraction, and see whether that helps.
So let's see how we do this. Here is an example: the query is "explain the theory of relativity", and this is a toy example with two documents. The idea is that building this graph allows us to reduce the size of the input drastically; we will see why later. We use two existing tools from computational linguistics: coreference resolution and information extraction. Coreference resolution tells us which mentions in the text refer to the same entity, and once we know this we group them into a single node in the knowledge graph. The information extraction tool transforms the text into sets of triples, basically binary relations between entities, and these relations become the edges of the graph, while the entities become the nodes. So in the example with those two documents, in blue you have the four mentions of Albert Einstein, and they all get combined into one node. Then the information extraction tool tells us which triples can be extracted: from the sentence "Albert Einstein, a German theoretical physicist, published the theory of relativity", the open information extraction tool tells you that you can derive two triples, one saying that Einstein was a German theoretical physicist, which gives you this triple here, and one saying that he published the theory of relativity, which gives you this other triple. And that is how you build the graph: basically by combining coreference resolution and information extraction.
Another important thing is that building this graph gives us a notion of which information is important, information that is repeated in the input. Every time we see another mention of the same entity, we keep count of how many times that entity is mentioned, and we use this in the graph representation to give a weight to each node and each edge of the graph. In a bit more detail: you construct the graph incrementally by going through the sentences. You first add this sentence here, adding the corresponding triples to the graph; then you add this one, and you see that "theory of relativity" was already mentioned, so the weight of the corresponding node is incremented by one, and you go on like this. We also have a filter operation that says: if a sentence has nothing to do with the query, we do not include it, and we do this using TF-IDF, to avoid adding information to the graph that is totally irrelevant.
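Put together, the construction loop looks roughly like this, a sketch under the assumption that a coreference resolver, an open information extraction tool and a TF-IDF relevance filter are available; `resolve_mentions`, `extract_triples` and `relevant` below are placeholders for those tools, not specific libraries.

```python
from collections import defaultdict

def resolve_mentions(sentence):
    """Placeholder for coreference resolution: maps each mention in the sentence
    to a canonical entity name (e.g. 'He' -> 'Albert Einstein')."""
    raise NotImplementedError

def extract_triples(sentence):
    """Placeholder for open information extraction: returns (subject, relation, object)
    triples found in the sentence."""
    raise NotImplementedError

def relevant(sentence, query):
    """Placeholder for the TF-IDF filter: is this sentence related to the query?"""
    raise NotImplementedError

def build_knowledge_graph(sentences, query):
    node_weight = defaultdict(int)                 # entity -> number of mentions
    edge_weight = defaultdict(int)                 # (subject, relation, object) -> count
    for sentence in sentences:
        if not relevant(sentence, query):
            continue                               # drop sentences unrelated to the query
        canonical = resolve_mentions(sentence)     # merge mentions of the same entity
        for subj, rel, obj in extract_triples(sentence):
            subj = canonical.get(subj, subj)
            obj = canonical.get(obj, obj)
            node_weight[subj] += 1                 # repeated mentions raise the weight,
            node_weight[obj] += 1                  # giving a notion of importance
            edge_weight[(subj, rel, obj)] += 1
    return node_weight, edge_weight
```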
Okay, so we build this graph, and then we linearise it. So it is different from the previous approach: there we went from a sequence encoder to a graph encoder, here we go from the graph to a sequence, because the graph is too big. You could try a graph encoder, and that might be the next step, but these are quite big graphs, so I am not sure how well those graph encoders would work; we did not do it. So we linearise the graph, but, to keep some information about the graph structure, we use two additional embeddings per token. The encoder is a transformer in this case, so, since it is not a recurrent network, we have a position embedding added to the word embedding to keep track of where in the sequence a token is, and we add two further embeddings that encode the weight of the corresponding node or edge and its relevance to the query. The global view of the model is then: you have your linearised graph, as I said with four different embeddings for each node or edge token; you pass it through a transformer, where we use memory-compressed attention, which is for better scaling, and top-k attention, where you only look at the positions in the encoder that get the top attention scores; so we encode the graph as a sequence, we encode the query, we combine both using attention, and then we decode from that.
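The input side of that model can be sketched as follows, again my own simplification: each token of the linearised graph gets the sum of a word embedding, a position embedding, an embedding of its (bucketed) node or edge weight, and an embedding of its (bucketed) relevance to the query, before going into the transformer encoder.

```python
import torch
import torch.nn as nn

class GraphTokenEmbedding(nn.Module):
    """Embeds one token of the linearised knowledge graph as the sum of four embeddings:
    word identity, position, node/edge weight bucket, and query-relevance bucket."""
    def __init__(self, vocab, dim, max_len=10000, weight_buckets=20, rel_buckets=20):
        super().__init__()
        self.word = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.weight = nn.Embedding(weight_buckets, dim)
        self.rel = nn.Embedding(rel_buckets, dim)

    def forward(self, tokens, weights, relevance):
        positions = torch.arange(tokens.size(1), device=tokens.device).unsqueeze(0)
        return (self.word(tokens) + self.pos(positions)
                + self.weight(weights) + self.rel(relevance))

emb = GraphTokenEmbedding(vocab=30000, dim=512)
tokens = torch.randint(0, 30000, (1, 6))   # linearised graph tokens
weights = torch.randint(0, 20, (1, 6))     # how often the node/edge occurred (bucketed)
relevance = torch.randint(0, 20, (1, 6))   # TF-IDF-style relevance to the query (bucketed)
print(emb(tokens, weights, relevance).shape)  # (1, 6, 512) -> into the transformer encoder
```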
These plots show, first, the amount of reduction you get from the graph construction and, second, the proportion of missing answer tokens, because you might think that by compressing the text into this graph, by reducing the redundancy, you lose some important information; but that is actually not the case. What the first plot shows is that if you take the web hits you have something like two hundred thousand tokens, and if you run our graph construction process this is reduced to roughly ten thousand tokens. We compare this with just extracting triples from the text without constructing the graph, and you see that you still have a lot, so that alone would not be enough to reduce the size. The second plot shows the proportion of missing answer tokens; the lower, the better. We are comparing with the reference answer, and you want as many tokens in your output as possible to come from the reference, so you do not want too many missing tokens. What it shows is: this is the previous approach using TF-IDF filtering, where you do not consider the whole two hundred thousand tokens but only, I think, around eight hundred; this is what happens if you build the graph from those roughly eight hundred and fifty tokens that the TF-IDF approach uses; and this is what happens when we encode everything, the whole input text, all the web pages, all the information we take from the web, and you see that the coverage is actually better. And these are the generation results, in this case using ROUGE, comparing against the reference answer; again, there are known issues with this metric. Here we compare with the TF-IDF approach, with extracting the triples but not constructing the graph, and with the full graph, and you see you always get some improvement. The important points are, mainly, that we get an improvement with respect to the TF-IDF approach, going from twenty-eight-something to twenty-nine-something, but also that we can really scale to the whole set of retrieved web pages.
Here is an example of the output of the system. It is not very impressive, but it also illustrates some problems with the automatic evaluation metrics. The question is why touching microfibre gives such an uncomfortable feeling. You have the reference, and this is the generated answer. The generated answer more or less makes sense: microfibre is made up of a bunch of tiny fibres that attach to each other; when you touch them, the fibres that make up the microfibre are attracted to each other; "they are attracted to the other end of the fibre, which is what makes them uncomfortable" is a bit strange, but overall it makes sense, it is relevant to the question, and you have to remember it is generated from two hundred thousand tokens, so it is not so bad. But what it also shows is that there are almost no overlapping words between the generated answer and the reference, so it is an example where the automatic metrics will give it a bad score, whereas in fact this is a pretty reasonable answer. How much time do I have? Fifteen minutes? Okay, so let's move on.
Okay, one last thing about encoding: sometimes, again, you do not have much data, so abstracting away over the data may help your model generalise. Here I am going back to the task of generating from unordered dependency trees: this is the input and this is what you have to produce as output. This was other work, with another PhD student. The idea is the following. Before, in the other approach, we turned the tree into a binary tree and used a multi-layer perceptron for the local ordering of the nodes; here we simply have an encoder-decoder that learns to map a linearised version of the unordered dependency tree onto the correct order of the lemmas, so a different approach. And we also thought that the word ordering problem is not so much determined by the words themselves; it mostly depends on syntax. So maybe we can abstract away from the words, we can just get rid of them, and this would reduce data sparsity and make the model more general: we do not want specific words to have an impact, basically.
So what we did is we actually got rid of the words. Here you have your input dependency tree, which is unordered, say the unordered tree for "John eats the apple", for example. We linearise this tree and we remove the words, so we have a factored representation of each node where we keep track of the POS tag, the parent node, and, I cannot remember what this one is, I guess the position: zero, one, two... okay, I do not remember. Anyway, the important point is that we linearise the tree and remove the words, so we only keep the POS tag information, the structural information, namely which node is the parent, and the grammatical relation, the dependency relation, between the node and its parent. So here, "eats" for example is replaced by this ID 1: it is a verb, its parent is the root, and it is related to the root by the root relation. To take another, perhaps clearer example: "John", which is the subject, is replaced by ID 4; it is a proper noun, that is the POS tag, and it is the subject; I think the parent node is missing here. Okay, so we linearised and delexicalised the tree, and then we build a corpus where the target is the delexicalised sequence in the correct order: here you see that you have the proper noun first, then the verb, the determiner and the noun. We train a seq2seq model to go from here to here, and then we relexicalise: we keep a mapping of what ID 1 is, and once you have generated you can just use the mapping to relexicalise the sentence.
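A sketch of the delexicalisation step, my own rendering, since the exact factored features used in the shared-task systems differ: each node of the unordered tree becomes an ID plus its POS tag, its parent and its dependency relation, a seq2seq model reorders those factored tokens, and the saved ID-to-word mapping relexicalises the output.

```python
def delexicalise(tree):
    """tree: list of nodes, each {'id': int, 'word': str, 'pos': str,
    'head': int, 'deprel': str}, i.e. an unordered dependency tree.
    Returns factored, word-free tokens plus the mapping needed to relexicalise."""
    mapping = {f"ID{n['id']}": n['word'] for n in tree}
    tokens = [f"ID{n['id']}|{n['pos']}|head=ID{n['head']}|{n['deprel']}" for n in tree]
    return tokens, mapping

def relexicalise(ordered_tokens, mapping):
    return " ".join(mapping[tok.split("|")[0]] for tok in ordered_tokens)

tree = [{'id': 1, 'word': 'eats',  'pos': 'VERB',  'head': 0, 'deprel': 'root'},
        {'id': 2, 'word': 'the',   'pos': 'DET',   'head': 3, 'deprel': 'det'},
        {'id': 3, 'word': 'apple', 'pos': 'NOUN',  'head': 1, 'deprel': 'obj'},
        {'id': 4, 'word': 'John',  'pos': 'PROPN', 'head': 1, 'deprel': 'nsubj'}]

source, mapping = delexicalise(tree)
# A seq2seq model is trained to map `source` to the correctly ordered ID sequence;
# here we simply assume it returned the right order:
predicted = ["ID4|PROPN|head=ID1|nsubj", "ID1|VERB|head=ID0|root",
             "ID2|DET|head=ID3|det", "ID3|NOUN|head=ID1|obj"]
print(relexicalise(predicted, mapping))  # John eats the apple
```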
And what you see is that it really helps. The surface realisation task has data for about ten languages, Czech, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, Russian and so on, and you see here the difference between doing the seq2seq with the tree containing all the words, where we have not changed anything, and doing it without the words, with the delexicalised version: for all languages you get quite a big improvement in terms of BLEU score. We used a similar idea here: that was generating from unordered dependency trees, but there is also the other task of generating from abstract meaning representations, and in fact here we built a new dataset for French. The idea is the same: we represent the nodes with a factored model, where each node is the concatenation of different types of embeddings, for example the POS tag, number, and morphological and syntactic features, and again we delexicalise everything. And again we found that you get this improvement: this is the baseline, which is not delexicalised, and this is when we delexicalise, so you get two points of improvement.
Okay, so, as I mentioned at the beginning, the datasets are often not very big; for example, the surface realisation challenge has a training set of something like two thousand items. So sometimes you have to be clever, or creative, about what you do with the training data. One thing we found is that it is often useful to extend your training data with information that is only implicit in the available training data. Going back to the example where the problem is to order the nodes of the tree: we attacked the problem with this classifier that decides on the relative order of a parent and a child. The training data looked like this: you had the parent, you had the child, and you had the position of the child with respect to the parent. That is what we had, and we thought the model should learn that if this is true, then the converse is also true: if the child is to the left of the parent, it should also learn, somehow, that the parent is to the right of the child. But in fact we found that it did not learn that. So what we did is simply add these pairs: whenever we had such a pair in the training data, we added the mirrored pair as well. We thereby doubled the size of the training data, but we also gave more explicit information about the ordering constraints; usually, you know, the subject is before the verb, and the verb is after the subject. And again you see that there is a large improvement.
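The expansion itself is tiny, sketched here: for every (parent, child, position) example in the training data we add the mirrored example, which doubles the data and makes the symmetric constraint explicit.

```python
def expand_with_mirrored_pairs(examples):
    """examples: list of (parent, child, position), where position is 'before'
    if the child is realised before the parent, else 'after'.
    For each pair, also add the same fact seen from the other side."""
    flip = {"before": "after", "after": "before"}
    expanded = list(examples)
    for parent, child, position in examples:
        expanded.append((child, parent, flip[position]))  # mirrored pair
    return expanded

data = [("likes", "I", "before"), ("likes", "apples", "after")]
print(expand_with_mirrored_pairs(data))
# The model now also sees explicitly that the parent is on the other side of the child.
```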
Another way to expand the data is to use the computational linguistics tools that are available; this was done already in 2017 by Konstas and colleagues. The idea, this was for generating from AMR data, is that the training data was manually validated or constructed for the shared task, but there are semantic parsers which, given a sentence, will give you the AMR. They are not one hundred percent reliable, but they can produce an AMR. So you can generate a lot of training data simply by running a semantic parser on available text. That is what they did: you parse, say, two hundred thousand sentences with the semantic parser, then you pre-train on this data, and then you fine-tune on the shared-task data.
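Schematically, the recipe looks like this, a sketch with placeholder functions rather than the actual pipeline of that work: parse a large amount of raw text with an off-the-shelf semantic parser to obtain silver (AMR, sentence) pairs, pre-train on those, then fine-tune on the small gold shared-task data.

```python
def parse_to_amr(sentence):
    """Placeholder for an off-the-shelf semantic parser: sentence -> AMR graph
    (not 100% reliable, but good enough to create silver training data)."""
    raise NotImplementedError

def train(model, pairs, epochs):
    """Placeholder for the usual maximum-likelihood training loop over (AMR, text) pairs."""
    raise NotImplementedError

def augment_and_train(model, raw_sentences, gold_pairs):
    # Silver data: automatically parse a large amount of raw text.
    silver_pairs = [(parse_to_amr(s), s) for s in raw_sentences]
    train(model, silver_pairs, epochs=5)    # pre-train on the silver data
    train(model, gold_pairs, epochs=20)     # fine-tune on the small gold shared-task data
    return model
```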
We used this as well: for the first approach I showed, the graph encoder with the dual top-down/bottom-up encoding, we applied the same data expansion, and again, like for the other approaches, we saw that it really improves performance.
Okay, I am getting to the end. I have mentioned some things you can do for better encoding of your input and for better training data; there are of course many open issues. One that I find particularly interesting is multilingual generation. We saw that the surface realisation shared task covers ten languages, but it is still a reasonably simple task and you can get some data, derived from the Universal Dependencies treebanks. What would be interesting is: how can you generate in multiple languages from data, from knowledge bases, or even from text? If you are doing simplification, can you simplify in different languages? There are also what I suppose you could call interpretability issues. As I said at the beginning, the standard approach is now this encoder-decoder, end-to-end approach, which does away with all those modules we had before. But one way to make the model more interpretable is to reconstruct those modules: instead of having a single end-to-end system, you have different networks for each of the subtasks, and people are starting to work on this, in particular with coarse-to-fine approaches where you first generate, for example, the structure of the text and then fill in the details. And then there is generalisation versus memorising. There have been problems with datasets that are very repetitive, and it is really important to have very good test sets, to control for the test set. A lot of the datasets and shared tasks do not provide an unseen test set, in the sense that if you are generating, say, newspaper text, you would like the test set also to test what happens if you apply your model to a different type of text. So I think having this kind of out-of-domain test set is really important for testing the generalisation of the system. This is also linked to what you can do with transfer learning and domain adaptation to go from one type of text to another. And that's it, thank you.
Are there any questions?
Q: You showed some results on the acceptability of generated text; it was something like seventy-five percent versus sixty percent, somewhere in the middle of the talk. That is annotation, and what I want to ask is: is it a zero/one judgement, where people say accept or do not accept, or do you have a graded scale?
A: You mean the human evaluation? Human annotation is usually on a scale from one to five.
Q: Okay, but you showed percentages at some point, I think, so I am wondering whether that was readability.
A: Sorry, in this case they just compare the outputs of two systems: they compare the output of the seq2seq system with the output of the graph system and say which one they prefer. So the percentage means, for example, sixty-four percent of people prefer this one.
Q: If you have a scale from one to five... because this is a preference test, right? Do we know the reliability, is the score between one and five?
A: No, I think, I would have to go back to the paper, I do not remember exactly, but I think it was not one to five, I was wrong there. It was a pairwise comparison between the outputs of two different systems: the evaluators do not know which system is which, and they have to say which output they prefer.
Q: Hi. First, thank you for the great talk, since you covered all kinds of generation tasks: generating from structured input, text-to-text, summarization. I am very much into conversational systems and I was curious: is that a very different kind of problem? Do the architectures and the main state-of-the-art approaches converge, or is it a different architecture, very different, with different datasets?
A: So the question is whether we have very different neural approaches depending on the input, depending on the task, here the conversational task. Initially, for years, everybody was using this encoder-decoder, often with a recurrent encoder, and the difference was only in what the input was. In dialogue, for example, you take as input the current dialogue, the user turn, plus maybe the context or some information about the dialogue state. If you are doing question answering, you take the question plus some supporting evidence. So it was really more about which kind of input you had, and that was the only difference. But now, more and more, people are paying attention to the fact that there are real differences between these tasks: what is the structure of the input, what is the goal, do you want to focus on identifying important information, and so on. The problems, I think, remain very different, so you have to cater to very different problems in a way; that is what I was trying to show, in fact.
Q: But in dialogue and generation people have at times taken different approaches to the encoder and the decoder, and one problem is that the encoding may be fine, but the decoder is not able to generate things that are very varied or informative.
A: Yes, there is a known problem that dialogue systems tend to generate very generic answers, like "I don't know", or answers that are not very informative. We are actually working on using external information to build dialogue systems that produce more informative responses. The idea is that the key is what you retrieve: you have your dialogue context, you have the user question, the user turn, and, a bit like in the text-to-text approach, you look on the web or in some other sources for information that is relevant to what is being discussed; you then enrich the input with this additional information, and that basically gives you more informative dialogue. So instead of producing these empty utterances, the system now has all this knowledge it can use to generate more informative responses. There are a number of datasets for this: the ConvAI2 challenge, for example, provides this kind of data, where you have a dialogue plus some additional information related to the topic of the dialogue, or Image-Chat, where you have an image and the dialogue is based on the image, so the dialogue system should actually use the content of the image to say something informative.
Q: This slide, where you have real human evaluation, I think speaks to a lot of people in this room, because, at least for speech, it has been shown that you really need humans to judge whether something is adequate and natural and so on. To my understanding this was perhaps the only subjective human evaluation result in all your slides, so mostly people are optimising towards objective metrics. Do you think there is a risk of overfitting to these metrics, maybe in particular tasks? Where do you see the role of humans judging generated text in your field, now and in the future?
A: Human evaluation stays very important. The automatic metrics, you need them: you need them to develop a system, and you need them to compare exhaustively when you have the output of many systems. But they are imperfect, so you also need human evaluation. The shared tasks often organise a human evaluation, and I think it is getting better and better, because people are getting more experienced and there are better platforms and guidelines on how to do this. We are not optimising with respect to those human judgements, that is just impossible, so the overfitting would be with respect to the training data, where you do maximum likelihood: you try to maximise the likelihood of the training data, mostly using cross-entropy. That said, there is some work using reinforcement learning where you optimise with respect to your evaluation metric, for example the ROUGE score.
Q: To me it seems that the main problem you tackle is something like the question answering task: looking on the internet, finding the information that you want, and so on. My question is: very often the type of answer you give depends on the type of person who is going to receive it. You would not use the same answer, the same sort of wording, for a young child as for an expert in the field. Is there any research on how you can tune or constrain the answers so that they fit the user?
A: Not that I know of directly, but I can think of ways. People often find that if you have some kind of parameter like this that you want to use to influence the output, so you have one input and you want different outputs depending on this parameter, then just adding it to your training data actually helps a lot. People do this with emotions, for example: should the text be upset or should it be happy? They might use an emotion detector, something that gives you an emotion tag for each sentence, and then condition on that. You do need the training data, but if you have it and you can label the training examples with the personas you want to generate for, then it works reasonably well. In fact the Image-Chat data is a nice case: for the same image you may have different dialogues depending on the personality, so the input includes a personality, and there are something like two hundred and fifty personalities; they can be joking, serious, whatever, and the training data takes this personality into account, so you can generate dialogues about the same image with a different tone depending on the personality.
Q: But, for example, would it be possible to put a constraint on the vocabulary that can be used in the output?
A: Yes, in the encoder-decoder you could do that. It is not something people normally do; they just use the whole vocabulary and hope that the model learns to focus on the part of the vocabulary that corresponds to a certain feature. But you could constrain it.
Q: You already mentioned it somewhat, but this also raises ethical questions about generated text, maybe more than in synthesis: you really need to get it right, or you have other problems with consistency or something like that. Is this inherent to the statistical approach, or can you solve it?
A: Well, I think one problem that I see with the current neural approaches to generation is that they are not necessarily semantically faithful to the input: they can produce text that has nothing to do with the input, so we can see the issue. I am not sure it is an ethical problem in the sense that these generators are not really out in the wild yet, but they are not super useful either in an application. For people who want to develop applications, industrial people, it is clearly a problem, because you do not want to ship a generator that is not faithful. But ethical problems, we have plenty of those in NLP in general.
That is all the time we have, so let's thank the speaker once more.