Good morning everyone. Our next speaker is a senior researcher at the French National Centre for Scientific Research (CNRS), working in computer science. Her research interests include databases and statistical models for natural language, the acquisition of lexical resources for NLP, syntactic and semantic parsing, and technology for language learning. She works in particular on natural language generation from syntactic and semantic representations and has led a shared task on generating text from structured data. This morning her talk will be about neural models for natural language generation across a variety of different types of NLG tasks.
Can you hear me? Can you hear in the back? Okay, so good morning, and thank you for being here after last night. When I was invited to this workshop I was of course pleased to come and give a talk, but then I was a bit worried about the title of the workshop: synthesis. As the introduction showed, I don't do synthesis, I don't do speech; in fact I only work on text. But of course there is a link between text-to-speech synthesis and generation, which is what I have been working on in recent years: natural language generation is the task of producing text, so you can see natural language generation as a step that precedes text-to-speech synthesis. What I am going to talk about today are different types of natural language generation tasks. So let's start.
I will start with an introduction to how generation was done before deep learning, then I will show how the deep learning paradigm completely changed the approach to natural language generation, and finally I will talk about some issues with current neural approaches to text generation.
The work I present here is joint work with PhD students and colleagues whom I want to acknowledge here: a PhD student at the lab where I am based, a colleague from Bar-Ilan University, a PhD student jointly supervised between FAIR in Paris and our lab, colleagues in our group and at Darmstadt University, and finally other PhD students under joint supervision.
Okay, so first: what is natural language generation? It is the task of producing text, but it is very different from natural language understanding because of the input. In natural language understanding the input is text; it is well defined, everybody agrees on that, and large quantities of text are available. In natural language generation it is very different, in that the input can be many different things. This is actually one of the reasons why the natural language generation community stayed very small for a very long time: compared to NLU, the number of papers on NLG was very small.
So what are the types of input? There are three basic types: data, meaning representations and text. Data would be, for example, data from databases or knowledge bases, structured data. Then you have meaning representations, which are devised by linguists and can be produced by computational linguistics tools; they are basically designed to represent the meaning of a sentence, sometimes of a text, but more generally of a sentence or a dialogue turn. Sometimes you want to generate from these meaning representations, for example in the context of a dialogue system: the dialogue manager produces a meaning representation of the system dialogue turn, and the generation task is to generate the text of that turn, the system turn, in response to this meaning representation. And finally you can generate from text, in applications such as text summarization, text simplification or sentence compression.
Those are the main types of input. Another complicating factor is that what we call the communicative goal can be very different. Sometimes you want to verbalise: for instance, if you have a knowledge base, you might want the system to simply verbalise its content so that it is readable by human users. But other goals would be to respond to a dialogue turn, to summarize a text, to simplify a text, or even to summarize the content of a knowledge base. These two factors meant that, until recently, natural language generation was divided into many different subfields, which did not help given that the community was already pretty small, and there was not much communication between those subfields.
So why did we have these different subfields? Because essentially the problems are very different. When you are generating from data, there is a big gap between the input and the output text: the input is a data structure that does not look like text at all; it can even be the result of signal processing, numbers from a sensor, raw numbers. The input data is very different from text, and to bridge this gap you have to do many things. Essentially, you have to decide what to say and how to say it. "What to say" is more of an AI problem: deciding which parts of the data you want to select and actually verbalise, because if you verbalised all the numbers given by a sensor it would just make no sense at all; the resulting text would make no sense at all. So you have a content selection problem. Then you usually have to structure the selected content into something that resembles the structure of a text; this is more like an AI planning problem, and it was in fact often handled with planning techniques. And then there is the more linguistic part: once you have the text structure, how do you convert it into well-formed text? There you have to make many choices, so generation is really a choice problem, because there are many different ways of realizing things.
So you have problems such as lexicalisation: which word do you choose for a given symbol? Referring expression generation: how do you describe an entity, are you going to use a pronoun or a proper name? Aggregation: how do you avoid repetitions, basically deciding when to use constructions such as ellipsis or coordination to avoid redundancy in the output text, even if there is redundancy in your knowledge base. So, for generating from data, the consensus was that there was this big NLG pipeline where you had to model all of these subproblems.
If you generate from a meaning representation, the task was seen as completely different, mainly because the gap between the meaning representation and the sentence is much smaller; in fact these meaning representations were devised by linguists. The consensus here was that you could write a grammar that describes the mapping between text and meaning. People wrote reversible grammars describing exactly this association, and because it is a grammar it also includes the notion of syntax, so it ensures that the text will be well formed. The idea was: you have a grammar that defines this association between text and meaning, and you can use it in both directions. Either you have a text and you use the grammar to derive its meaning, or you use it for generation: you start from the meaning and you use the grammar to decide what the corresponding sentences licensed by the grammar are. Of course, as soon as these grammars have large coverage they become very ambiguous, so there is a huge ambiguity problem; it is basically not tractable, you get thousands of intermediate results, thousands of outputs, and the search space is huge. So usually you combine the grammar with statistical modules that are designed to reduce the search space and to limit the output to one or a few outputs.
And finally, generating from text: here again the approach was very different. The main consensus was that when you generate from text there are basically four operations you want to model, or some of them depending on the application: split, rewrite, reorder and delete. Split is about learning when to split a long sentence into several sentences, for example in simplification, where you want to simplify a text. Reordering is moving constituents or words around, again because maybe you want to simplify or to paraphrase; paraphrasing is another text-to-text generation application. Rewriting, again maybe to simplify or to paraphrase, means rewriting a word or a phrase. And you want to decide what you can delete, in particular if you are doing simplification. So in general there were three very different approaches to those three tasks, depending on what the input is.
And this completely changed with the neural approach; it really changed the field. Before, generation was a very small field, and now at ACL, the main computational linguistics conference, generation is one of the topics with the highest number of submissions, I think the second-highest. So it has changed completely. Why? Because the encoder-decoder framework allows you to model all three tasks in the same way. All the techniques and methods that you develop to improve the encoder-decoder framework may be novel, but it is a common framework, which makes it much easier to take ideas from one subfield to another. The encoder-decoder framework is very simple: you have your input, which can be data, text or a meaning representation; you encode it into a vector representation; and then you use the power of a neural language model to decode, so the decoder produces the text one word at a time using a recurrent network. And we know that neural language models are much more powerful than previous language models because they can take an unlimited amount of context into account.
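To make this concrete, here is a minimal sketch of that encoder-decoder idea in PyTorch, my own illustration rather than any of the actual systems discussed in this talk: a recurrent encoder compresses the (linearised) input into a vector, and a recurrent decoder conditioned on that vector produces the output one word at a time.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: any linearised input (data, meaning
    representation or text) goes in as a token sequence; the decoder then
    emits the output text one word at a time."""
    def __init__(self, in_vocab, out_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(in_vocab, dim)
        self.tgt_emb = nn.Embedding(out_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, out_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))       # encode the input into a vector
        out, _ = self.decoder(self.tgt_emb(tgt), state)   # condition the decoder LM on it
        return self.proj(out)                             # next-word scores at every step

model = Seq2Seq(in_vocab=5000, out_vocab=5000)
logits = model(torch.randint(0, 5000, (2, 12)), torch.randint(0, 5000, (2, 9)))
print(logits.shape)  # (2, 9, 5000)
```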
Okay, so we have this unifying framework, but, and this is what I want to stress in this talk, the problems still remain: the tasks are different and you still have to handle those differences somehow. So in this talk, based on some work we have been doing, I will focus on two main points: how to improve encoding, or how to adapt encoding to the various NLG tasks, and, if I have time, a little bit about training data. There is often a problem of data sparsity: these are usually supervised approaches, and supervised means you need training data, and in this case the training data has to pair texts with the inputs. But these inputs can be very hard to get: these meaning representations, where do you get them from? Even getting an alignment, a parallel corpus between database fragments and the corresponding texts, is very difficult to get right. So often you do not have much data, and of course neural networks want a lot of data, so often you have to be clever about what you do with the training data.
Okay, so on encoding I will talk about three different points. The first is modelling graph-structured input. In the encoder-decoder framework, at least in the first years, the encoder was usually a recurrent network: no matter whether the input was a text, a meaning representation, a graph or a knowledge base, people were using a recurrent network, partly, I think, because the encoder-decoder framework was very successful for machine translation and people were building on that. But of course, after a while, people realised that some of these inputs are graphs, so maybe it is not such a good idea to model them as sequences; let's do something else. So we will talk about how to model graph-structured input. Then I will talk about generating from text, where I will focus on an application where the input is a very large quantity of text. The problem is that neural networks are only so good at encoding large quantities of text; it is a known issue, for instance in machine translation, that the longer the input, the worse the performance. And here we are not talking about long sentences; we are talking about really long texts of hundreds of thousands of tokens. So what do you do in that case if you still want to do text-to-text generation? And I will talk a little bit about generalisation: some devices that can be used in certain applications, again because the data is not so big, to improve your model so that it generalises better.
Okay, so first: encoding graphs. As I said, graph-shaped inputs occur, for example, when the input is a meaning representation. Here is an example from the AMR-to-text 2017 challenge. AMR means Abstract Meaning Representation; you can more or less consider it a standard. It is a meaning representation that can be written the way it is written on the right, but basically you can see it as a graph where the nodes are the concepts and the edges are the relations between the concepts. So this meaning representation would correspond to the sentence "US officials held an expert group meeting in January 2002 in New York": at the top of the tree you have the "hold" concept, then the ARG0 is a "person", and then "country", which basically refers to the United States, and then there are the other concepts. So the task was to generate from this AMR, and the AMR can be seen as a graph. There was another challenge in 2017, the WebNLG challenge, which was about generating from sets of RDF triples.
Here, what we did is we extracted sets of RDF triples from DBpedia, we had a method to ensure that these sets of triples could be matched to a meaningful, coherent text, and then we had crowdworkers associate the sets of triples with the corresponding text. So the dataset in this case is a parallel dataset where the input is a set of triples and the output is a text that verbalises the content of these triples. You probably cannot read it here, but in the example I show you have three triples, each of the form subject, property, object: the first one is roughly "John Blaha, dateOfBirth, 1942", then "John Blaha, birthPlace, San Antonio", and then "John Blaha, occupation, fighter pilot". So you have these three triples, and the task would be to generate something like "John Blaha, born in San Antonio in 1942, worked as a fighter pilot". That was the task. And the point here, again, is that when you are generating from data like this, the data can be seen as a graph, where the nodes are the subjects and objects, the entities in your triples, and the edges are the relations between them.
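As a toy illustration, my own sketch with property names that only approximate the slide, such a set of triples can be read directly as a labelled graph whose nodes are the entities and whose edges are the properties:

```python
# Triple set in the spirit of the example above (property names are indicative only).
triples = [
    ("John_Blaha", "dateOfBirth", "1942"),
    ("John_Blaha", "birthPlace", "San_Antonio"),
    ("John_Blaha", "occupation", "Fighter_pilot"),
]

# Read the triples as a graph: entities become nodes, properties become labelled edges.
nodes = {entity for subj, _, obj in triples for entity in (subj, obj)}
edges = {(subj, obj): prop for subj, prop, obj in triples}

print(nodes)  # {'John_Blaha', '1942', 'San_Antonio', 'Fighter_pilot'}
print(edges)  # {('John_Blaha', '1942'): 'dateOfBirth', ...}
# Target text: "John Blaha, born in San Antonio in 1942, worked as a fighter pilot."
```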
Okay. As I said, initially, for these two tasks, people were simply using recurrent networks: they linearised the graph, just doing a traversal of the graph using some kind of traversal method, which gave them a sequence of tokens, and then they encoded that sequence with a recurrent network. So here you have an example where the tokens input to the RNN are basically the concepts and the relations that are present in the meaning representation, and then you decode from that. Okay.
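A sketch of what such a baseline linearisation might look like, my own simplification rather than the exact traversal used in those systems, and the concept names are only indicative: a depth-first traversal flattens the graph into a sequence of concept and relation tokens, which is then fed to the sequential encoder.

```python
def linearise(graph, node, visited=None):
    """Depth-first traversal of a rooted graph given as {node: [(relation, child), ...]}.
    Produces the token sequence that a sequential (RNN) encoder would consume."""
    visited = set() if visited is None else visited
    visited.add(node)
    tokens = [node]
    for relation, child in graph.get(node, []):
        tokens.append(relation)
        if child in visited:
            tokens.append(child)  # re-entrant node: just repeat its label
        else:
            tokens += ["("] + linearise(graph, child, visited) + [")"]
    return tokens

# Toy AMR-like graph, loosely inspired by the "hold an expert group meeting" example.
amr = {
    "hold-04": [(":ARG0", "person"), (":ARG1", "meet-03"), (":time", "date-entity")],
    "person": [(":ARG0-of", "have-org-role-91")],
}
print(" ".join(linearise(amr, "hold-04")))
```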
So, of course, there are problems. Intuitively it is not very nice to model a graph as a sequence, and technically there are also problems: local dependencies in the graph can become long-range in the sequence. These two edges here are at the same distance from their node in the initial graph, but once the graph is linearised you see that the first one ends up much closer to the node it attaches to than the second one. So the linearisation is creating long-range dependencies, and we know that LSTMs are not very good at dealing with long-range dependencies. So technically, too, you start thinking that maybe it is not such a great idea.
So people have looked at this and proposed various graph encoders. The idea is that, instead of using an LSTM to encode your linearised graph, you use a graph encoder that directly models the relations between the nodes inside the graph, and then you decode from the output of the graph encoder. There were several proposals, which I will not go into in detail: Damonte and Cohen proposed a graph convolutional network, and the other two approaches use some kind of graph recurrent network.
Okay, so we built on this idea. I started with this introduction on pre-neural NLG because I think it is quite important to know about the history of NLG in order to have ideas about how to improve the neural approaches, and here our proposal was really based on previous work on grammar-based generation, this idea that you have a grammar that you can use to produce text. In this previous work, what people showed is: you have a grammar and a meaning representation, and you use the grammar to tell you which sentences are associated with this meaning representation. You can see that this is like a reverse parsing problem. If I say: you have a sentence, you have a grammar, and you want to determine which meaning representations or syntactic trees the grammar associates with that sentence, that is a parsing problem. What we are doing here simply reverses the problem: instead of starting from the text, I start from the meaning representation and ask which sentences the grammar associates with it. So it was a reverse parsing problem, and people started working on this reverse parsing problem to generate sentences, and they found it was a very hard problem because of all the ambiguity. They had two types of algorithms, bottom-up and top-down: either you start from the meaning representation and try to build the syntactic tree that is allowed by the grammar, and out of that you get the sentence, or you go top-down, so you use the grammar and try to build the derivation that is going to map onto your meaning representation. There were these two approaches, they both had problems, and what people did in the end was combine them: they used hybrid algorithms that exploit both top-down and bottom-up information.
So here is what we did. Those graph encoders produce a single representation, a single graph encoding of the input graph, the input meaning representation. What we wanted was to use this idea that both bottom-up and top-down information are important. So we encode each node in the graph using two encoders: one that goes top-down through the graph, and another that goes bottom-up through the graph. What this gives us is that each node in the graph ends up with two encodings, one reflecting the top-down view of the graph and the other the bottom-up view of the graph.
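A rough sketch of the idea, my own simplification rather than the exact architecture: run one message-passing pass along the edge directions (top-down), a second pass along the reversed edges (bottom-up), and give every node the pair of resulting encodings next to its label embedding.

```python
import torch
import torch.nn as nn

class DirectedGraphEncoder(nn.Module):
    """One round of message passing along the given edge direction."""
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, node_emb, edges):
        # edges: list of (source_index, target_index) pairs
        messages = torch.zeros_like(node_emb)
        for src, tgt in edges:
            messages[tgt] += node_emb[src]        # pass information along the edge
        return torch.relu(self.combine(torch.cat([node_emb, messages], dim=-1)))

dim, num_nodes = 64, 5
edges = [(0, 1), (0, 2), (2, 3), (2, 4)]          # a small rooted graph (parent, child)
node_emb = torch.randn(num_nodes, dim)            # label (concept) embeddings

top_down = DirectedGraphEncoder(dim)(node_emb, edges)                        # parents -> children
bottom_up = DirectedGraphEncoder(dim)(node_emb, [(c, p) for p, c in edges])  # children -> parents
node_repr = torch.cat([node_emb, top_down, bottom_up], dim=-1)  # label + both views per node
print(node_repr.shape)  # (5, 192)
```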
In terms of numbers, we could show that we outperformed the state of the art. Those were the state-of-the-art results at the time; this one is more recent, so of course we are no longer state of the art, but at the time we were, so we were improving a little bit over the previous approaches. More important than those numbers, BLEU scores I think, is the question of evaluation, which is always very difficult for generated text, because you cannot look at the outputs one by one. You would have to do human evaluation; but if you have large quantities of output, and if you want to compare many systems, you need automatic metrics. What people use is BLEU, borrowed from machine translation, and it has well-known problems: you can generate a perfectly correct sentence that matches the input, but if it does not look like the reference sentence, which is what you compute your BLEU against, it will get a very low score. So you need some other kind of evaluation, which is what we tried to do here. One known problem with neural networks is semantic adequacy: they generate very nice-looking texts, because these neural language models are very powerful, but often the output does not match the input. That is problematic, because in a generation application the output has to match the input; otherwise generation is dangerous to use, in a way.
So what we tried to do here is measure the semantic adequacy of a generator, meaning how well the generated text matches the input. We used a textual entailment system, which, given two sentences, basically tells you whether the first one entails the other, whether the second sentence is implied by, entailed by, the first. If you check both directions on two sentences T and Q, so T entails Q and Q entails T, then T and Q are semantically equivalent; logically, that would be the thinking. So we did something similar: we wanted to check semantic equivalence on text, so we used these tools that have been developed in computational linguistics to determine whether two sentences stand in an entailment relation, and we looked at both directions, comparing the reference and the generated sentence. What you see here is that the graph approach is much better at producing sentences that are entailed by the reference, and also much better at producing sentences that entail the reference.
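The check itself is simple once you have an entailment classifier; here is a sketch where `entails(premise, hypothesis)` is a placeholder for whatever textual entailment (NLI) system you plug in, since the talk does not name a specific one.

```python
def entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for a textual-entailment (NLI) classifier: returns True
    if the premise entails the hypothesis."""
    raise NotImplementedError("plug in your favourite entailment model here")

def semantic_adequacy(reference: str, generated: str) -> dict:
    """Bidirectional entailment check between reference and generated text.
    If each entails the other, the two texts are semantically equivalent."""
    ref_entails_gen = entails(reference, generated)   # nothing unsupported was added
    gen_entails_ref = entails(generated, reference)   # nothing from the reference was lost
    return {
        "entailed_by_reference": ref_entails_gen,
        "entails_reference": gen_entails_ref,
        "semantically_equivalent": ref_entails_gen and gen_entails_ref,
    }
```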
We also did a human evaluation, where we basically asked the human evaluators two questions: is the output semantically adequate, does the output text match the input, and is it readable? Again, the orange bars are our system and the others are the sequential systems, and you see that there is a large improvement. So this all points in the same direction: using a graph encoder when you have a graph, at least a meaning representation graph, is a good idea.
Okay, another thing we found is that it is often valuable to combine local and global information: local information meaning local to the node in the graph, and global meaning information about the structure of the surrounding graph. This is still the same dual top-down/bottom-up model; this is a picture of the system. We have this graph encoder that encodes the top-down view and the bottom-up view of the graph, and these are the resulting encodings of the nodes. So you end up with three embeddings for each node: one is basically the embedding of the node label, the concept, so it is essentially a word embedding, and the other two are the bottom-up and the top-down embeddings of the node. What we then do is pass them through an LSTM, so we have a notion of context for each node, given by the preceding nodes in the graph, and we found that this also improves our results.
We also applied this local-plus-global idea to another task, the Surface Realisation shared task, another challenge, on generating from unordered dependency trees. Here the input meaning representation is an unordered dependency tree whose nodes are decorated with lemmas, so it looks something like this, and what you have to do is generate a sentence from it. Basically the task has two subtasks: one is how to reorder the lemmas into a correct sentence, and then, once you have the correct order, how to inflect the words; for example, you want "apple" to become "apples" in this case. So we worked on that; this was also joint work with PhD students and colleagues.
The way we handled this was as follows, and here I am only focusing on the word ordering problem, how to map this unordered tree to a sequence of elements; I am not talking about the inflection problem. We basically binarise the tree first, so everything becomes binary, and then we use a multi-layer perceptron to decide on the order of each child with respect to its head. We build a training corpus where we say: I know from the reference corpus that "I" precedes "likes", et cetera. So this is the training corpus, and the task is basically: given the child and the head, the parent, do you order the parent first or the parent second? And again we found that combining local and global information helps. This is a picture of the model: you have the embeddings of your two nodes and you concatenate them, so you build a new representation, and then you also add the embedding of the nodes that are below the parent node, the subtree dominated by the parent node. And again we found that if you combine these two kinds of information, you get a much better result on the word ordering task.
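A sketch of this ordering classifier, my own rendering of the idea rather than the exact model: concatenate the head embedding, the child embedding and an embedding of the subtree dominated by the head, and let a small MLP decide whether the child is realised before or after its head.

```python
import torch
import torch.nn as nn

class OrderClassifier(nn.Module):
    """For a (head, child) pair in the binarised tree, decide whether the child
    should be realised before or after its head."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2),          # 0 = child before head, 1 = child after head
        )

    def forward(self, head_emb, child_emb, subtree_emb):
        # local information: the two nodes; global information: the dominated subtree
        return self.mlp(torch.cat([head_emb, child_emb, subtree_emb], dim=-1))

dim = 100
clf = OrderClassifier(dim)
head, child = torch.randn(1, dim), torch.randn(1, dim)
subtree = torch.randn(1, dim)                  # e.g. mean of the embeddings below the head
print(clf(head, child, subtree).softmax(-1))   # P(child before head), P(child after head)
```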
What this shows is that taking into account, in this case, the subtree, the top-down view from that node, really helps. Here again you have BLEU scores: this is the basic version, this one is with data expansion, which I will talk about later, and this is the one with the new encoder, where you do take this additional global information into account, and you see there is quite a big improvement.
Okay, so that was about encoding graphs. What I want to talk about now is what you can do when the input text is very large. In particular, we looked at two different tasks: one is question answering on free-form web text, and the other is multi-document summarization. In the first task you have a query, and what you are going to do is retrieve a lot of information from the web, a lot meaning something like two hundred thousand tokens, basically the first hundred or so web hits, and then you use all this text, plus the question, as input to generation, and the task is to generate a summary that answers the question. So it is quite a difficult text-to-text generation task. The other one is multi-document summarization: we use a Wikipedia dataset where you take the title of a Wikipedia article, you retrieve information from the web using this title as the query, and the goal is to generate a paragraph about this title, basically the first paragraph of that Wikipedia page. Here is an example of the first task. This is the ELI5 setup, "Explain Like I'm Five": you ask the question as if you were five years old, so the answer should be in simple language. The question is: why are consumers still terrified of genetically modified organisms, when there is little debate in the scientific community over whether they are safe or not? Then you retrieve some documents from a web search, and this here would be the target answer. So not only is the input text very long, the output text is also not short: it is not a single sentence, it is really a paragraph.
So the question is how to encode two hundred thousand tokens and then generate from them. Previous work took a way out, in a sense: it used TF-IDF to select the most relevant web hits, or even sentences, so it did not take the whole result of the web search; it limited the input to a few thousand words using basically a TF-IDF score. What we wanted to see is whether there was a way to encode all of these two hundred thousand words retrieved from the web, encode them and use them for generation. The idea was to convert the text into a graph, in this case not a graph encoding, but rather a knowledge graph of the kind we used to build in information extraction, and see whether that helps.
So let's see how we do this. Here is an example: the query is "explain the theory of relativity", and this is a toy example with two documents. The idea is that building this graph allows us to reduce the size of the input drastically; we will see why later. We use two existing tools from computational linguistics: coreference resolution and information extraction. Coreference resolution tells us which mentions in the text refer to the same entity, and once we know this we group them into a single node in the knowledge graph. The information extraction tool transforms the text into sets of triples, basically binary relations between entities, and these relations become the edges of the graph, while the entities become the nodes. So in the example with those two documents, in blue you have the four mentions of Albert Einstein, and they all get combined into one node. Then the information extraction tool tells us which triples can be extracted: from the sentence "Albert Einstein, a German theoretical physicist, published the theory of relativity", the open information extraction tool tells you that you can derive two triples, one saying that Einstein was a German theoretical physicist, which gives you this triple here, and one saying that he published the theory of relativity, which gives you this other triple. And that is how you build the graph: basically by combining coreference resolution and information extraction.
Another important thing is that building this graph gives us a notion of which information is important, information that is repeated in the input. Every time we see another mention of the same entity, we keep count of how many times that entity is mentioned, and we use this in the graph representation to give a weight to each node and each edge of the graph. In a bit more detail: you construct the graph incrementally by going through the sentences. You first add this sentence here, adding the corresponding triples to the graph; then you add this one, and you see that "theory of relativity" was already mentioned, so the weight of the corresponding node is incremented by one, and you go on like this. We also have a filter operation that says: if a sentence has nothing to do with the query, we do not include it, and we do this using TF-IDF, to avoid adding information to the graph that is totally irrelevant.
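Put together, the construction loop looks roughly like this, a sketch under the assumption that a coreference resolver, an open information extraction tool and a TF-IDF relevance filter are available; `resolve_mentions`, `extract_triples` and `relevant` below are placeholders for those tools, not specific libraries.

```python
from collections import defaultdict

def resolve_mentions(sentence):
    """Placeholder for coreference resolution: maps each mention in the sentence
    to a canonical entity name (e.g. 'He' -> 'Albert Einstein')."""
    raise NotImplementedError

def extract_triples(sentence):
    """Placeholder for open information extraction: returns (subject, relation, object)
    triples found in the sentence."""
    raise NotImplementedError

def relevant(sentence, query):
    """Placeholder for the TF-IDF filter: is this sentence related to the query?"""
    raise NotImplementedError

def build_knowledge_graph(sentences, query):
    node_weight = defaultdict(int)                 # entity -> number of mentions
    edge_weight = defaultdict(int)                 # (subject, relation, object) -> count
    for sentence in sentences:
        if not relevant(sentence, query):
            continue                               # drop sentences unrelated to the query
        canonical = resolve_mentions(sentence)     # merge mentions of the same entity
        for subj, rel, obj in extract_triples(sentence):
            subj = canonical.get(subj, subj)
            obj = canonical.get(obj, obj)
            node_weight[subj] += 1                 # repeated mentions raise the weight,
            node_weight[obj] += 1                  # giving a notion of importance
            edge_weight[(subj, rel, obj)] += 1
    return node_weight, edge_weight
```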
Okay, so we build this graph, and then we linearise it. So it is different from the previous approach: there we went from a sequence encoder to a graph encoder, here we go from the graph to a sequence, because the graph is too big. You could try a graph encoder, and that might be the next step, but these are quite big graphs, so I am not sure how well those graph encoders would work; we did not do it. So we linearise the graph, but, to keep some information about the graph structure, we use two additional embeddings per token. The encoder is a transformer in this case, so, since it is not a recurrent network, we have a position embedding added to the word embedding to keep track of where in the sequence a token is, and we add two further embeddings that encode the weight of the corresponding node or edge and its relevance to the query. The global view of the model is then: you have your linearised graph, as I said with four different embeddings for each node or edge token; you pass it through a transformer, where we use memory-compressed attention, which is for better scaling, and top-k attention, where you only look at the positions in the encoder that get the top attention scores; so we encode the graph as a sequence, we encode the query, we combine both using attention, and then we decode from that.
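The input side of that model can be sketched as follows, again my own simplification: each token of the linearised graph gets the sum of a word embedding, a position embedding, an embedding of its (bucketed) node or edge weight, and an embedding of its (bucketed) relevance to the query, before going into the transformer encoder.

```python
import torch
import torch.nn as nn

class GraphTokenEmbedding(nn.Module):
    """Embeds one token of the linearised knowledge graph as the sum of four embeddings:
    word identity, position, node/edge weight bucket, and query-relevance bucket."""
    def __init__(self, vocab, dim, max_len=10000, weight_buckets=20, rel_buckets=20):
        super().__init__()
        self.word = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.weight = nn.Embedding(weight_buckets, dim)
        self.rel = nn.Embedding(rel_buckets, dim)

    def forward(self, tokens, weights, relevance):
        positions = torch.arange(tokens.size(1), device=tokens.device).unsqueeze(0)
        return (self.word(tokens) + self.pos(positions)
                + self.weight(weights) + self.rel(relevance))

emb = GraphTokenEmbedding(vocab=30000, dim=512)
tokens = torch.randint(0, 30000, (1, 6))   # linearised graph tokens
weights = torch.randint(0, 20, (1, 6))     # how often the node/edge occurred (bucketed)
relevance = torch.randint(0, 20, (1, 6))   # TF-IDF-style relevance to the query (bucketed)
print(emb(tokens, weights, relevance).shape)  # (1, 6, 512) -> into the transformer encoder
```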
These plots show, first, the amount of reduction you get from the graph construction and, second, the proportion of missing answer tokens, because you might think that by compressing the text into this graph, by reducing the redundancy, you lose some important information; but that is actually not the case. What the first plot shows is that if you take the web hits you have something like two hundred thousand tokens, and if you run our graph construction process this is reduced to roughly ten thousand tokens. We compare this with just extracting triples from the text without constructing the graph, and you see that you still have a lot, so that alone would not be enough to reduce the size. The second plot shows the proportion of missing answer tokens; the lower, the better. We are comparing with the reference answer, and you want as many tokens in your output as possible to come from the reference, so you do not want too many missing tokens. What it shows is: this is the previous approach using TF-IDF filtering, where you do not consider the whole two hundred thousand tokens but only, I think, around eight hundred; this is what happens if you build the graph from those roughly eight hundred and fifty tokens that the TF-IDF approach uses; and this is what happens when we encode everything, the whole input text, all the web pages, all the information we take from the web, and you see that the coverage is actually better. And these are the generation results, in this case using ROUGE, comparing against the reference answer; again, there are known issues with this metric. Here we compare with the TF-IDF approach, with extracting the triples but not constructing the graph, and with the full graph, and you see you always get some improvement. The important points are, mainly, that we get an improvement with respect to the TF-IDF approach, going from twenty-eight-something to twenty-nine-something, but also that we can really scale to the whole set of retrieved web pages.
Here is an example of the output of the system. It is not very impressive, but it also illustrates some problems with the automatic evaluation metrics. The question is why touching microfibre gives such an uncomfortable feeling. You have the reference, and this is the generated answer. The generated answer more or less makes sense: microfibre is made up of a bunch of tiny fibres that attach to each other; when you touch them, the fibres that make up the microfibre are attracted to each other; "they are attracted to the other end of the fibre, which is what makes them uncomfortable" is a bit strange, but overall it makes sense, it is relevant to the question, and you have to remember it is generated from two hundred thousand tokens, so it is not so bad. But what it also shows is that there are almost no overlapping words between the generated answer and the reference, so it is an example where the automatic metrics will give it a bad score, whereas in fact this is a pretty reasonable answer. How much time do I have? Fifteen minutes? Okay, so let's move on.
Okay, one last thing about encoding: sometimes, again, you do not have much data, so abstracting away over the data may help your model generalise. Here I am going back to the task of generating from unordered dependency trees: this is the input and this is what you have to produce as output. This was other work, with another PhD student. The idea is the following. Before, in the other approach, we turned the tree into a binary tree and used a multi-layer perceptron for the local ordering of the nodes; here we simply have an encoder-decoder that learns to map a linearised version of the unordered dependency tree onto the correct order of the lemmas, so a different approach. And we also thought that the word ordering problem is not so much determined by the words themselves; it mostly depends on syntax. So maybe we can abstract away from the words, we can just get rid of them, and this would reduce data sparsity and make the model more general: we do not want specific words to have an impact, basically.
So what we did is we actually got rid of the words. Here you have your input dependency tree, which is unordered, say the unordered tree for "John eats the apple", for example. We linearise this tree and we remove the words, so we have a factored representation of each node where we keep track of the POS tag, the parent node, and, I cannot remember what this one is, I guess the position: zero, one, two... okay, I do not remember. Anyway, the important point is that we linearise the tree and remove the words, so we only keep the POS tag information, the structural information, namely which node is the parent, and the grammatical relation, the dependency relation, between the node and its parent. So here, "eats" for example is replaced by this ID 1: it is a verb, its parent is the root, and it is related to the root by the root relation. To take another, perhaps clearer example: "John", which is the subject, is replaced by ID 4; it is a proper noun, that is the POS tag, and it is the subject; I think the parent node is missing here. Okay, so we linearised and delexicalised the tree, and then we build a corpus where the target is the delexicalised sequence in the correct order: here you see that you have the proper noun first, then the verb, the determiner and the noun. We train a seq2seq model to go from here to here, and then we relexicalise: we keep a mapping of what ID 1 is, and once you have generated you can just use the mapping to relexicalise the sentence.
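A sketch of the delexicalisation step, my own rendering, since the exact factored features used in the shared-task systems differ: each node of the unordered tree becomes an ID plus its POS tag, its parent and its dependency relation, a seq2seq model reorders those factored tokens, and the saved ID-to-word mapping relexicalises the output.

```python
def delexicalise(tree):
    """tree: list of nodes, each {'id': int, 'word': str, 'pos': str,
    'head': int, 'deprel': str}, i.e. an unordered dependency tree.
    Returns factored, word-free tokens plus the mapping needed to relexicalise."""
    mapping = {f"ID{n['id']}": n['word'] for n in tree}
    tokens = [f"ID{n['id']}|{n['pos']}|head=ID{n['head']}|{n['deprel']}" for n in tree]
    return tokens, mapping

def relexicalise(ordered_tokens, mapping):
    return " ".join(mapping[tok.split("|")[0]] for tok in ordered_tokens)

tree = [{'id': 1, 'word': 'eats',  'pos': 'VERB',  'head': 0, 'deprel': 'root'},
        {'id': 2, 'word': 'the',   'pos': 'DET',   'head': 3, 'deprel': 'det'},
        {'id': 3, 'word': 'apple', 'pos': 'NOUN',  'head': 1, 'deprel': 'obj'},
        {'id': 4, 'word': 'John',  'pos': 'PROPN', 'head': 1, 'deprel': 'nsubj'}]

source, mapping = delexicalise(tree)
# A seq2seq model is trained to map `source` to the correctly ordered ID sequence;
# here we simply assume it returned the right order:
predicted = ["ID4|PROPN|head=ID1|nsubj", "ID1|VERB|head=ID0|root",
             "ID2|DET|head=ID3|det", "ID3|NOUN|head=ID1|obj"]
print(relexicalise(predicted, mapping))  # John eats the apple
```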
And what you see is that it really helps. The surface realisation task has data for about ten languages, Czech, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, Russian and so on, and you see here the difference between doing the seq2seq with the tree containing all the words, where we have not changed anything, and doing it without the words, with the delexicalised version: for all languages you get quite a big improvement in terms of BLEU score. We used a similar idea here: that was generating from unordered dependency trees, but there is also the other task of generating from abstract meaning representations, and in fact here we built a new dataset for French. The idea is the same: we represent the nodes with a factored model, where each node is the concatenation of different types of embeddings, for example the POS tag, number, and morphological and syntactic features, and again we delexicalise everything. And again we found that you get this improvement: this is the baseline, which is not delexicalised, and this is when we delexicalise, so you get two points of improvement.
Okay, so, as I mentioned at the beginning, the datasets are often not very big; for example, the surface realisation challenge has a training set of something like two thousand items. So sometimes you have to be clever, or creative, about what you do with the training data. One thing we found is that it is often useful to extend your training data with information that is only implicit in the available training data. Going back to the example where the problem is to order the nodes of the tree: we attacked the problem with this classifier that decides on the relative order of a parent and a child. The training data looked like this: you had the parent, you had the child, and you had the position of the child with respect to the parent. That is what we had, and we thought the model should learn that if this is true, then the converse is also true: if the child is to the left of the parent, it should also learn, somehow, that the parent is to the right of the child. But in fact we found that it did not learn that. So what we did is simply add these pairs: whenever we had such a pair in the training data, we added the mirrored pair as well. We thereby doubled the size of the training data, but we also gave more explicit information about the ordering constraints; usually, you know, the subject is before the verb, and the verb is after the subject. And again you see that there is a large improvement.
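The expansion itself is tiny, sketched here: for every (parent, child, position) example in the training data we add the mirrored example, which doubles the data and makes the symmetric constraint explicit.

```python
def expand_with_mirrored_pairs(examples):
    """examples: list of (parent, child, position), where position is 'before'
    if the child is realised before the parent, else 'after'.
    For each pair, also add the same fact seen from the other side."""
    flip = {"before": "after", "after": "before"}
    expanded = list(examples)
    for parent, child, position in examples:
        expanded.append((child, parent, flip[position]))  # mirrored pair
    return expanded

data = [("likes", "I", "before"), ("likes", "apples", "after")]
print(expand_with_mirrored_pairs(data))
# The model now also sees explicitly that the parent is on the other side of the child.
```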
Another way to expand the data is to use the computational linguistics tools that are available; this was done already in 2017 by Konstas and colleagues. The idea, this was for generating from AMR data, is that the training data was manually validated or constructed for the shared task, but there are semantic parsers which, given a sentence, will give you the AMR. They are not one hundred percent reliable, but they can produce an AMR. So you can generate a lot of training data simply by running a semantic parser on available text. That is what they did: you parse, say, two hundred thousand sentences with the semantic parser, then you pre-train on this data, and then you fine-tune on the shared-task data.
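Schematically, the recipe looks like this, a sketch with placeholder functions rather than the actual pipeline of that work: parse a large amount of raw text with an off-the-shelf semantic parser to obtain silver (AMR, sentence) pairs, pre-train on those, then fine-tune on the small gold shared-task data.

```python
def parse_to_amr(sentence):
    """Placeholder for an off-the-shelf semantic parser: sentence -> AMR graph
    (not 100% reliable, but good enough to create silver training data)."""
    raise NotImplementedError

def train(model, pairs, epochs):
    """Placeholder for the usual maximum-likelihood training loop over (AMR, text) pairs."""
    raise NotImplementedError

def augment_and_train(model, raw_sentences, gold_pairs):
    # Silver data: automatically parse a large amount of raw text.
    silver_pairs = [(parse_to_amr(s), s) for s in raw_sentences]
    train(model, silver_pairs, epochs=5)    # pre-train on the silver data
    train(model, gold_pairs, epochs=20)     # fine-tune on the small gold shared-task data
    return model
```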
We used this as well: for the first approach I showed, the graph encoder with the dual top-down/bottom-up encoding, we applied the same data expansion, and again, like for the other approaches, we saw that it really improves performance.
Okay, I am getting to the end. I have mentioned some things you can do for better encoding of your input and for better training data; there are of course many open issues. One that I find particularly interesting is multilingual generation. We saw that the surface realisation shared task covers ten languages, but it is still a reasonably simple task and you can get some data, derived from the Universal Dependencies treebanks. What would be interesting is: how can you generate in multiple languages from data, from knowledge bases, or even from text? If you are doing simplification, can you simplify in different languages? There are also what I suppose you could call interpretability issues. As I said at the beginning, the standard approach is now this encoder-decoder, end-to-end approach, which does away with all those modules we had before. But one way to make the model more interpretable is to reconstruct those modules: instead of having a single end-to-end system, you have different networks for each of the subtasks, and people are starting to work on this, in particular with coarse-to-fine approaches where you first generate, for example, the structure of the text and then fill in the details. And then there is generalisation versus memorising. There have been problems with datasets that are very repetitive, and it is really important to have very good test sets, to control for the test set. A lot of the datasets and shared tasks do not provide an unseen test set, in the sense that if you are generating, say, newspaper text, you would like the test set also to test what happens if you apply your model to a different type of text. So I think having this kind of out-of-domain test set is really important for testing the generalisation of the system. This is also linked to what you can do with transfer learning and domain adaptation to go from one type of text to another. And that's it, thank you.
Are there any questions?
Q: You showed some results on the acceptability of generated text; it was something like seventy-five percent versus sixty percent, somewhere in the middle of the talk. That is annotation, and what I want to ask is: is it a zero/one judgement, where people say accept or do not accept, or do you have a graded scale?
A: You mean the human evaluation? Human annotation is usually on a scale from one to five.
Q: Okay, but you showed percentages at some point, I think, so I am wondering whether that was readability.
A: Sorry, in this case they just compare the outputs of two systems: they compare the output of the seq2seq system with the output of the graph system and say which one they prefer. So the percentage means, for example, sixty-four percent of people prefer this one.
Q: If you have a scale from one to five... because this is a preference test, right? Do we know the reliability, is the score between one and five?
A: No, I think, I would have to go back to the paper, I do not remember exactly, but I think it was not one to five, I was wrong there. It was a pairwise comparison between the outputs of two different systems: the evaluators do not know which system is which, and they have to say which output they prefer.
Q: Hi. First, thank you for the great talk, since you covered all kinds of generation tasks: generating from structured input, text-to-text, summarization. I am very much into conversational systems and I was curious: is that a very different kind of problem? Do the architectures and the main state-of-the-art approaches converge, or is it a different architecture, very different, with different datasets?
A: So the question is whether we have very different neural approaches depending on the input, depending on the task, here the conversational task. Initially, for years, everybody was using this encoder-decoder, often with a recurrent encoder, and the difference was only in what the input was. In dialogue, for example, you take as input the current dialogue, the user turn, plus maybe the context or some information about the dialogue state. If you are doing question answering, you take the question plus some supporting evidence. So it was really more about which kind of input you had, and that was the only difference. But now, more and more, people are paying attention to the fact that there are real differences between these tasks: what is the structure of the input, what is the goal, do you want to focus on identifying important information, and so on. The problems, I think, remain very different, so you have to cater to very different problems in a way; that is what I was trying to show, in fact.
Q: But in dialogue and generation people have at times taken different approaches to the encoder and the decoder, and one problem is that the encoding may be fine, but the decoder is not able to generate things that are very varied or informative.
A: Yes, there is a known problem that dialogue systems tend to generate very generic answers, like "I don't know", or answers that are not very informative. We are actually working on using external information to build dialogue systems that produce more informative responses. The idea is that the key is what you retrieve: you have your dialogue context, you have the user question, the user turn, and, a bit like in the text-to-text approach, you look on the web or in some other sources for information that is relevant to what is being discussed; you then enrich the input with this additional information, and that basically gives you more informative dialogue. So instead of producing these empty utterances, the system now has all this knowledge it can use to generate more informative responses. There are a number of datasets for this: the ConvAI2 challenge, for example, provides this kind of data, where you have a dialogue plus some additional information related to the topic of the dialogue, or Image-Chat, where you have an image and the dialogue is based on the image, so the dialogue system should actually use the content of the image to say something informative.
Q: This slide, where you have real human evaluation, I think speaks to a lot of people in this room, because, at least for speech, it has been shown that you really need humans to judge whether something is adequate and natural and so on. To my understanding this was perhaps the only subjective human evaluation result in all your slides, so mostly people are optimising towards objective metrics. Do you think there is a risk of overfitting to these metrics, maybe in particular tasks? Where do you see the role of humans judging generated text in your field, now and in the future?
A: Human evaluation stays very important. The automatic metrics, you need them: you need them to develop a system, and you need them to compare exhaustively when you have the output of many systems. But they are imperfect, so you also need human evaluation. The shared tasks often organise a human evaluation, and I think it is getting better and better, because people are getting more experienced and there are better platforms and guidelines on how to do this. We are not optimising with respect to those human judgements, that is just impossible, so the overfitting would be with respect to the training data, where you do maximum likelihood: you try to maximise the likelihood of the training data, mostly using cross-entropy. That said, there is some work using reinforcement learning where you optimise with respect to your evaluation metric, for example the ROUGE score.
Q: To me it seems that the main problem you tackle is something like the question answering task: looking on the internet, finding the information that you want, and so on. My question is: very often the type of answer you give depends on the type of person who is going to receive it. You would not use the same answer, the same sort of wording, for a young child as for an expert in the field. Is there any research on how you can tune or constrain the answers so that they fit the user?
A: Not that I know of directly, but I can think of ways. People often find that if you have some kind of parameter like this that you want to use to influence the output, so you have one input and you want different outputs depending on this parameter, then just adding it to your training data actually helps a lot. People do this with emotions, for example: should the text be upset or should it be happy? They might use an emotion detector, something that gives you an emotion tag for each sentence, and then condition on that. You do need the training data, but if you have it and you can label the training examples with the personas you want to generate for, then it works reasonably well. In fact the Image-Chat data is a nice case: for the same image you may have different dialogues depending on the personality, so the input includes a personality, and there are something like two hundred and fifty personalities; they can be joking, serious, whatever, and the training data takes this personality into account, so you can generate dialogues about the same image with a different tone depending on the personality.
Q: But, for example, would it be possible to put a constraint on the vocabulary that can be used in the output?
A: Yes, in the encoder-decoder you could do that. It is not something people normally do; they just use the whole vocabulary and hope that the model learns to focus on the part of the vocabulary that corresponds to a certain feature. But you could constrain it.
Q: You already mentioned it somewhat, but this also raises ethical questions about generated text, maybe more than in synthesis: you really need to get it right, or you have other problems with consistency or something like that. Is this inherent to the statistical approach, or can you solve it?
A: Well, I think one problem that I see with the current neural approaches to generation is that they are not necessarily semantically faithful to the input: they can produce text that has nothing to do with the input, so we can see the issue. I am not sure it is an ethical problem in the sense that these generators are not really out in the wild yet, but they are not super useful either in an application. For people who want to develop applications, industrial people, it is clearly a problem, because you do not want to ship a generator that is not faithful. But ethical problems, we have plenty of those in NLP in general.
That is all the time we have, so let's thank the speaker once more.