0:00:13Thank you, first of all, to the organizers for the invitation to speak on biological pathway inference.
0:00:24What I'm going to do is describe an approach to this problem of combining, in this case, gene expression data,
0:00:35that is, the continuously valued abundance of messenger RNA as measured on a microarray chip,
0:00:46with ontological data, which describes something about gene function,
0:00:53in particular the products that are generated or induced by gene expression.
0:01:02Okay, so the standard approach to array data analysis in bioinformatics is to be data-driven: ignore any kind of functional or biological-system type of priors,
0:01:20and then, after you've done your analysis and pulled out particular genes or factors, let's say, from the data,
0:01:28you go and try to validate it, or to make some inferences on what's actually going on, what the functional relationships between these genes are, and whether you can somehow incorporate them into a functional pathway.
0:01:42So here we're going to do this simultaneously: clustering, variable selection, and functional annotation.
0:01:56So I think everybody here has at least some vague notion, at a minimum, of the fact that a gene is a segment of the DNA that codes for a protein.
0:02:10Of course, not all of the nucleotides on the DNA code for proteins, but the genes in particular are the ones that biologists understand and can ascribe some sort of function to.
0:02:28These functions, which are primarily production of proteins through the ribosome and the process of translation, are organized into what biologists call pathways.
0:02:42Pathways are sequences of activation of different genes or protein products that lead from one state to another.
0:02:52Right, so a pathway for the inflammatory response starts with some infectious agent or some insult to the immune system and ends up with production of cytokines that basically induce inflammation.
0:03:10There is a very complicated sequence of gene expression that is associated with that process.
0:03:19So one of the principal problems is the discovery of how these pathways become perturbed or deregulated under, for example, disease states.
0:03:30The principal fact of the matter is that these functions are not just expressed by a particular gene expression factor, but they are expressed over time and over space.
0:03:47That's what we're going to talk about, in particular in the context of the cellular response to infection. In addition I'll show some data later on for these techniques, as a fusion of expression data and gene ontology data.
0:04:06So this just shows a typical picture that you'll find in the literature, in this particular case in Nature Reviews in 2005,
0:04:15which describes the molecular biologist's understanding of how a cell responds to infection in terms of protein production.
0:04:25In the nucleus you have the process of DNA transcription and replication, and that generates proteins that are located at particular regions within the cell, close to binding sites, receptor sites, production of cytokines, interferon and so forth.
0:04:46So there is a very complicated diagram here.
0:04:51Pathways can be characterized as these sequences of events that lead, for example, to cell death, programmed cell death, apoptosis, which is an immune response.
0:05:04So the point, though, is that we want to somehow compress all of this complicated and relatively vague information in this picture into some kind of topological constraint on how, say, two genes can be related or not.
0:05:23So this shows what's called a gene ontology semantic graph, and this captures the function of different genes.
0:05:32In particular, it captures one of the three gene ontology classifications, which is the cellular location of the protein that's produced by a particular gene.
0:05:45So you have here, for example, in the membrane, the cellular membrane, versus in the protein complex or in the nucleus.
0:05:54Down here you'll have different genes that are associated with this particular location in terms of the proteins that they produce. The larger the circle, the more genes are in that particular functional body for this particular process, this particular pathway.
0:06:17Okay, so this is the diagram that basically we're going to use to merge with the raw expression data. This comes from the literature.
0:06:29Gene Ontology is a database which collects, from different databases that represent experimental and validated results, in this case the cellular location of protein production.
0:06:45We're going to take this semantic description of relations between genes that are attached to a particular component, and we're going to use that to sort of precondition the clustering of the gene expression.
0:06:59That's, in a nutshell, what we're doing.
0:07:02So this just shows how it's put together in a more graphical context.
0:07:10We have a gene microarray here, with genes expressed, say, over different treatments: class one, which might be healthy, and class two, the tuberculosis subject. These would be different genes along the rows.
0:07:27And we take these ontological pictures that I showed on the previous slide. So this might be the nucleus, this one might be the cytoplasm, this might be the ribosome,
0:07:40and that then gives a prior on how closely related these genes are in terms of function.
0:07:51Right, so going from clusters to functional pathways is a very difficult problem,
0:07:58and the problem is that genes with similar expression do not necessarily have similar function.
0:08:05So if we use correlation, correlate gene expression from two different genes, and say that they have the same function simply because they seem to have the same shape over their temporal expression profile, that may be completely spurious: they may not share function.
0:08:23How do you incorporate function as additional information? You use gene ontology.
0:08:32So in order to capture this ontological functional relationship between two genes, we're going to use a manifold learning type of approach,
0:08:46a Laplacian eigenmaps approach, which is going to embed the genes into a lower-dimensional manifold, where closeness in that manifold is directly related to the ontological similarity between those two genes.
0:09:01So if the genes live in one of the common locations within the cell in terms of protein production, then the similarity W_ij between the two genes i and j will be larger.
0:09:18And that will be used as a weighting in this Laplacian eigenmaps clustering procedure, which will give us a lower-dimensional, graph-Laplacian-induced embedding of the data.
0:09:33Okay, now where does the gene expression come in?
0:09:36It comes in in a very weak sense. This embedding, if you look at it, is being driven by the ontology, it is being driven by function, similar functions in this case being similar locations within the cell for these genes. That's our W_ij.
0:09:55But the gene expression controls the neighborhood.
0:10:01We're going to basically assign zero weight if the expression profiles are too dissimilar.
0:10:08Okay, so it's a way of conditioning the embedding, which would otherwise be based on pure ontology, on the similarity of the gene expression profiles.
0:10:23I'm not going to go through the Laplacian eigenmaps here; Belkin and Niyogi published a very nice paper on it in 2002. Suffice it to say that we're going to find coordinates Y_i and Y_j in some lower-dimensional space, maybe two dimensions to visualize, such that you basically preserve distance in the ontological space.
0:10:43Now, of course, we prune neighbors if their expression profiles are too dissimilar.
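The embedding described in this part of the talk can be sketched in code. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function name `go_expression_eigenmap`, the Euclidean distance between expression profiles, and the hard threshold `max_dist` are all hypothetical choices. The sketch zeroes the ontological weight W_ij whenever two expression profiles are too dissimilar, then solves the standard Laplacian eigenmaps problem on the pruned graph.

```python
import numpy as np

def go_expression_eigenmap(W_go, profiles, max_dist, dim=2):
    """Laplacian eigenmaps on ontology weights pruned by expression distance.

    W_go     : (n, n) symmetric ontological similarity between genes
    profiles : (n, t) expression profile of each gene over t time points
    max_dist : hypothetical threshold; pairs farther apart get zero weight
    """
    W_go = np.asarray(W_go, dtype=float)
    X = np.asarray(profiles, dtype=float)

    # Pairwise Euclidean distance between expression profiles (an assumption).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # Keep the ontological weight only for genes with similar expression.
    W = np.where(D <= max_dist, W_go, 0.0)
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian; isolated genes get a zero scaling factor.
    deg = W.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    L_sym = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]

    # Eigenvalues come back in ascending order.
    _, vecs = np.linalg.eigh(L_sym)

    # Map back to solutions of L y = lambda * D y, then drop the first
    # (trivial) eigenvector. If pruning disconnects the graph there are
    # several trivial eigenvectors and this simple rule is not enough.
    Y = vecs * inv_sqrt[:, None]
    return Y[:, 1:dim + 1]
```

Calling `go_expression_eigenmap(W_go, profiles, max_dist=2.0)` returns one two-dimensional coordinate per gene. A k-nearest-neighbor rule on expression distance would be an equally plausible way to control the neighborhood; the talk does not say which one was used.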
0:10:51Okay, so this embedding has been applied to a particular dataset that's out there, called the Young dataset,
0:11:02looking at, in this case, in vitro tuberculosis infection of macrophage cells, which are then assayed using microarrays and RT-PCR to produce that gene map, that heat map actually.
0:11:25There are eight time points. They want to basically look at how these dendritic and macrophage cells respond after the tuberculosis infection has been introduced.
0:11:40So here's the data for the control group, and here's the data for the tuberculosis group, again over time.
0:11:48What we're going to be trying to do is determine changes, and associate those changes from the control to the tuberculosis group with functional protein production.
0:12:04Okay, so first of all we're going to take a difference between the control and tuberculosis, so that we have a baseline, which is the control,
0:12:13and then we're going to embed those expression profiles, as differential expression profiles, into a two-dimensional space using this Laplacian eigenmap.
0:12:24So I'm first going to show you the standard functional PCA embedding, which simply applies a singular value decomposition on a spline basis interpolation over time of those expression profiles I showed on the previous slide,
0:12:46and then afterwards applies Gaussian mixture model clustering to try and associate these different genes, hopefully, with different pathways or different functions.
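As a rough illustration of the baseline just described, the sketch below computes a two-dimensional embedding by plain PCA via the singular value decomposition of the sampled profiles. It is a simplification and an assumption on my part: the spline basis interpolation and the Gaussian mixture clustering mentioned in the talk are left out, and the name `pca_scores` is hypothetical.

```python
import numpy as np

def pca_scores(profiles, dim=2):
    """Project genes onto the leading principal components of their profiles.

    profiles : (n_genes, n_timepoints) array, e.g. differential expression
               (infected minus control) sampled at each time point
    """
    X = np.asarray(profiles, dtype=float)

    # Center each time point across genes before the decomposition.
    Xc = X - X.mean(axis=0)

    # Thin SVD; singular values are returned in descending order.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)

    # Coordinates of each gene on the first `dim` components.
    return U[:, :dim] * s[:dim]
```

The resulting scores are what a mixture model would then be fitted to. Note that nothing in this step uses ontology information, which is the point of the comparison in the talk.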
0:13:04So what you see is this classic concentration of measure here, where the Gaussian mixture models are basically trying to match a two-dimensional domain: they are trying to simultaneously capture clusters which might actually be two-dimensional, but there are also clusters that are probably just one-dimensional. So it's a very heterogeneous picture.
0:13:38If you go and look at it after you've clustered, you wouldn't expect these clusters to do very well, right, because for this cluster, obviously, the centroid isn't even near any particular gene.
0:13:53We can quantify how good this clustering is simply by looking at the percentage of genes in a given cluster that are in the same location in the cell.
0:14:06Over fifty percent of all the genes in any cluster are not co-located in the cell, which indicates, again, that gene expression profiles over time do not discriminate accurately between genes with different function.
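The co-location percentage mentioned here can be computed along the following lines. This is a sketch under assumptions: each gene is given a single cellular location label, impurity is taken as the fraction of genes in a cluster that are not in that cluster's most common location, and the name `cluster_impurity` is hypothetical. The talk does not give the exact definition used.

```python
from collections import Counter

def cluster_impurity(cluster_labels, locations):
    """Fraction of genes per cluster outside the cluster's dominant location.

    cluster_labels : cluster id for each gene
    locations      : cellular location label for each gene (same order)
    """
    by_cluster = {}
    for cluster, location in zip(cluster_labels, locations):
        by_cluster.setdefault(cluster, []).append(location)

    impurity = {}
    for cluster, locs in by_cluster.items():
        # Size of the largest group of co-located genes in this cluster.
        dominant = Counter(locs).most_common(1)[0][1]
        impurity[cluster] = 1.0 - dominant / len(locs)
    return impurity
```

For example, a cluster holding two nuclear genes and one membrane gene has an impurity of one third under this definition, while a cluster whose genes all share one location has an impurity of zero.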
0:14:26On the other hand, if you use this manifold embedding that I described, you get a much nicer spread of these clusters into well-defined groups.
0:14:38These are four different clusters. You drop to a much, much lower percentage of impurity: many more of the genes within the cluster groups, labeled green, blue, turquoise and so forth, are close to each other within the cell,
0:14:57which is the sign, of course, that the method that we implemented is actually capturing this co-location ontology.
0:15:07And this just shows some clustering indices. I'm not going to bother dwelling on that, I'm running out of time,
0:15:14but it indicates that we have improved performance, not surprisingly, because we're using gene ontology data to condition these clusters.
0:15:26If you just use functional PCA, you get cluster indices for these various pathways that have much lower quality than if you use our method, which you can see just by comparing these numbers.
0:15:43This is a quality index between zero and one that, again, indicates the purity of each one of these clusters in terms of the number of genes that lie within the same class.
0:15:58Okay, so in conclusion, I've described this method, which deals with co-embedding of both expression data and functional annotation of genes, in terms of their expression under, in this particular case, tuberculosis infection.
0:16:17I've basically described how we do this using a Laplacian eigenmap. It allows us to, if you like, couple the variable selection, clustering, and functional annotation in one package,
0:16:36and as a result we can improve pathway analysis by doing that.
0:16:57Could you elaborate a little bit on the distance between GO terms? In your case, are you using the hierarchical structure of the GO terms?
0:17:09Yes, so that distance,
0:17:13we have to go back to this equation.
0:17:19I skipped over this, partially because there's a typo,
0:17:28but this distance is defined in terms of the number of GO terms which are common to the two genes and the number of GO terms that are marginal.
0:17:44So what that's actually doing, if you look at this graph here, is saying: if I have two genes and I look at where they lie, let's say you have two genes that lie within this particular membrane co-location,
0:18:01then you go back and look at the parent statistics, and they give you this topological mapping.
0:18:10Now granted, it's only the parents; we don't look at the grandparents and great-grandparents and so forth,
0:18:17but we are taking account of the topology in at least a first-order sense.
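One way to read the similarity described in this answer is as a ratio of shared GO terms to all GO terms of the two genes, after adding each term's direct parents. The sketch below uses a Jaccard-style ratio, which is an assumption: the actual equation on the slide is not recoverable from the transcript, and the names `go_similarity` and `parents` are hypothetical.

```python
def go_similarity(terms_i, terms_j, parents=None):
    """Jaccard-style similarity between the GO term sets of two genes.

    terms_i, terms_j : iterables of GO term identifiers for each gene
    parents          : optional dict mapping a term to its direct parents;
                       only one level is followed, no grandparents
    """
    parents = parents or {}

    def expand(terms):
        # Add direct parents so that sibling terms share some overlap.
        expanded = set(terms)
        for term in terms:
            expanded.update(parents.get(term, ()))
        return expanded

    a, b = expand(terms_i), expand(terms_j)
    union = a | b
    # Shared terms relative to all terms attached to either gene.
    return len(a & b) / len(union) if union else 0.0
```

With no parent information, a membrane gene and a nuclear gene have similarity zero. If both terms have a common direct parent, the two genes share one of three terms and the similarity becomes one third, which is the first-order use of the topology mentioned in the answer.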
0:18:21Yeah, sure.
0:18:24Another question?
0:18:33Thank you. What can you do if you have only partial knowledge of the ontology?
0:18:37Which is actually what we have, because one can't believe all published results.
0:18:46We would like to have measures of confidence, in terms of the degree to which the ontology can be relied upon. That just doesn't exist yet.
0:19:03I think that it's one of the main deficiencies of the functional annotations we have today, that there is no figure of merit, no figure of confidence that one can use
0:19:17that allows you to know, in a systematic way, what kind of weighting you need to apply to that ontological information in order to balance the uncertainty of the ontology versus the uncertainty of gene expression.
0:19:34So the answer is not a satisfactory one; unfortunately I can't give you a better answer there.
0:19:47A follow-up to that question: is it possible that, under a virus or a disease, the location of the gene expression changes?
0:19:56Yes, yes, absolutely.
0:19:59Then how do you, I mean, that means that with the virus you completely relocate a gene, and the ontology is maybe not the one corresponding to the disease.
0:20:13Yeah, that's an excellent question, and it's related, of course, to the fact that the ontology really should be a temporal database.
0:20:23It's not. It collapses the entire time course of functional activation into a summary.
0:20:31The only way that we account for that is by the fact that the ontological annotation of each one of these genes is not unique; it's not just a one-dimensional quantity.
0:20:45So a gene may belong simultaneously to several locations, if the protein production that it's responsible for occurs in two different phases, say in the nucleus under one phase and over in the membrane in another phase.
0:21:09But again, the ontology is not rich enough yet to be able to capture that temporal information.
0:21:14If it was, we could obviously do much better, and we could really start talking about pathways which are temporally modulated.
0:21:22Excellent question.
0:21:30basically
0:21:31you know with slide
0:21:33saw
0:21:33for or change if the data
0:21:35or
0:21:36right
0:21:37yeah just one at this time approach change to the
0:21:40yeah station
0:21:42right
0:21:43Right. I'd like to agree with you there that we've done that, but we haven't.
0:21:47We could do that if we computed distance, a sort of short-time distance over a window, between gene expression profiles,
0:21:56but we actually compute the distance over the entire temporal period that's collected, so time again is collapsed.
0:22:06But if we truly had ontological data that was temporally specific, we could develop a much more sophisticated model, and frankly a more useful one.
0:22:18But this is the beginning.
0:22:20Right, exactly.
0:22:22so yeah that's that uh al again
0:22:25i