<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Raphaël's research website</title>
    <description></description>
    <link>/</link>
    <atom:link href="/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 30 Nov 2022 07:37:38 -0500</pubDate>
    <lastBuildDate>Wed, 30 Nov 2022 07:37:38 -0500</lastBuildDate>
    <generator>Jekyll v4.2.0</generator>
    
      <item>
        <title>The Change Of Variable Formula and the Gaussian Integral</title>
        <description>&lt;p&gt;In this post, I present a simple way to calculate the Gaussian Integral, that is a very appealing application of the change of variable formula.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;the-gaussian-integral&quot;&gt;The Gaussian Integral&lt;/h1&gt;

&lt;p&gt;The Gaussian Integral is a very important result that gives the value of the integral $$I=\int\limits_{-\infty}^{\infty} e^{-t^2} dt=\sqrt{\pi}$$.&lt;/p&gt;

&lt;p&gt;This integral is famous since it relates two important constants, $e$ and $\pi$.
Moreover, it is widely used, for instance in probability theory, where it gives (up to a rescaling of the variable) the partition function of the standard Normal Distribution $\mathcal{N}(0,1)$.&lt;/p&gt;

&lt;p&gt;However, it can be shown that $e^{-t^2}$ has no antiderivative expressible with elementary functions, so the integral cannot be computed by the usual means. In this article I will describe how a simple trick allows us to calculate it in a few lines of derivations.&lt;/p&gt;

&lt;h1 id=&quot;the-jacobian-and-the-jacobian-matrix&quot;&gt;The Jacobian and the Jacobian Matrix&lt;/h1&gt;

&lt;p&gt;Let $\mathcal{U}$ and $\mathcal{V}$ be two open subsets of $\mathbb{R}^n$ and $\mathbb{R}^m$ respectively and $\phi:\mathcal{V}\rightarrow\mathcal{U}$ be a differentiable map between them.&lt;/p&gt;

&lt;p&gt;$\phi$ maps every vector $v=(v_1,…,v_m)\in \mathcal{V}$ to a unique vector $\phi(v)=u=(u_1,…,u_n)\in \mathcal{U}$.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Jacobian Matrix&lt;/strong&gt; of $\phi$ at point $v$, denoted $J_{\phi}(v)$, is the $n\times m$ matrix whose entry $J_{\phi}^{i,j}(v)$ is the infinitesimal change of the $i$-th output $u_i$ of $\phi$ obtained by applying an infinitesimal change to the $j$-th input $v_j$.&lt;/p&gt;

&lt;p&gt;In mathematical terms we have $J_{\phi}(v) = (\frac{\partial u_i}{\partial v_j})_{i,j}$.&lt;/p&gt;

&lt;p&gt;When $n=m$, so that the Jacobian Matrix is square, the &lt;strong&gt;Jacobian&lt;/strong&gt; is the determinant of the Jacobian Matrix, i.e. $Jac_\phi(v) =det(J_{\phi}(v))$.
The Jacobian has an important geometrical interpretation.&lt;/p&gt;

&lt;p&gt;Indeed, the $j$-th column of the Jacobian Matrix contains the coordinates of the image of the $j$-th canonical basis vector $e_j$ of $\mathbb{R}^m$ through the first order (linear) approximation of $\phi$ around $v$.
When $n=m$, the images of these basis vectors span a parallelotope in $\mathbb{R}^n$, whose signed hypervolume is given by $det(J_{\phi}(v))$.&lt;/p&gt;

&lt;p&gt;Thus, the absolute value of the Jacobian tells us by how much the volume of the unit hypercube in $\mathbb{R}^m$ is scaled when passed through the linear transformation $J_{\phi}(v)$.&lt;/p&gt;

&lt;h1 id=&quot;the-change-of-variable-formula&quot;&gt;The Change of Variable Formula.&lt;/h1&gt;

&lt;p&gt;Let’s suppose that we want to calculate the integral of a real-valued continuous function $f$ over an open set $\mathcal{U}$ of a measure space $(\mathcal{X}, \mu)$:
$$\int\limits_{u\in\mathcal{U}} f(u)\mu(du),$$ but that there’s no easy way to do this due to a discrepancy between the structure of $\mathcal{U}$ and the form of $f$.&lt;/p&gt;

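&lt;p&gt;Suppose moreover that there exists another open set $\mathcal{V}$ on a space $(\mathcal{Z}, \nu)$ and a continuously differentiable, bijective map $\phi:\mathcal{V}\rightarrow\mathcal{U}$ between the two sets, whose inverse is also continuously differentiable. Here $\mathcal{X}=\mathcal{Z}=\mathbb{R}^n$ and $\mu$, $\nu$ are the Lebesgue measure.&lt;/p&gt;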

&lt;p&gt;Then we have&lt;/p&gt;

&lt;p&gt;$$
\int\limits_{u\in\mathcal{U}} f(u)\mu(du) =
\int\limits_{v\in\mathcal{V}} f\circ\phi(v)  \vert Jac_\phi(v) \vert \nu(dv)
$$&lt;/p&gt;

&lt;p&gt;In this expression, $\vert Jac_{\phi}(v) \vert$ is the absolute value of the Jacobian of $\phi$ at $v$.&lt;/p&gt;

&lt;p&gt;This tells us that we can compute the integral in the space $\mathcal{Z}$ if it is simpler, provided that we rescale the differential by the Jacobian.&lt;/p&gt;

&lt;h2 id=&quot;example-the-polar-coordinates&quot;&gt;Example: the polar coordinates&lt;/h2&gt;

&lt;p&gt;The polar coordinates system allows describing points in $\mathbb{R}^2$ not as a pair of Cartesian coordinates but instead using a radius and an angle.&lt;/p&gt;

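&lt;p&gt;It is represented by the transformation $\phi:(u,\theta)\mapsto (u\cos(\theta), u\sin(\theta))$, with $u&amp;gt;0$ the radius and $\theta\in[0,2\pi[$ the angle. Its inverse transformation is $\psi:(x,y)\mapsto(\sqrt{x^2+y^2},\operatorname{atan2}(y,x))$, where $\operatorname{atan2}(y,x)$ is the angle of the point $(x,y)$, taken in $[0,2\pi[$.&lt;/p&gt;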

&lt;p&gt;The Jacobian of $\phi$ has the very simple form $Jac_{\phi}(u,\theta) = u$.&lt;/p&gt;
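
&lt;p&gt;As a quick check of this value, the Jacobian Matrix of $\phi(u,\theta) = (u\cos(\theta), u\sin(\theta))$ can be written out explicitly, with one column per input variable $u$ and $\theta$:&lt;/p&gt;

&lt;p&gt;$$
J_{\phi}(u,\theta) =
\begin{pmatrix}
\cos(\theta) &amp;amp; -u\sin(\theta) \\ \sin(\theta) &amp;amp; u\cos(\theta)
\end{pmatrix},
\qquad
Jac_{\phi}(u,\theta) = u\cos^2(\theta) + u\sin^2(\theta) = u.
$$&lt;/p&gt;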

&lt;p&gt;For any integrable function $f :\mathbb{R}^2 \rightarrow \mathbb{R}$ such that the integral $\int\limits f(x,y)dxdy$ exists, the following holds:&lt;/p&gt;

&lt;p&gt;$$
\int f(x,y)dxdy =\int f\circ \phi(u,\theta) u du d\theta
$$&lt;/p&gt;

&lt;p&gt;Intuitively, if we want to use the polar coordinates instead of the Cartesian ones, we just need to scale the differential $dxdy$ by a factor $u$.&lt;/p&gt;

&lt;p&gt;One can think of $dxdy$ (resp. $du d\theta$) as the area of the parallelogram spanned by the pair of vectors $(dx \vec{e_x}, dy \vec{e_y})$ (respectively $(du \vec{e_u}, d\theta \vec{e_\theta})$), where $\vec{e_{.}}$ are the unit vectors of the canonical basis of each coordinate space.&lt;/p&gt;

&lt;p&gt;In essence, the change of variable formula tells us how these areas relate to each other, provided that we know a differentiable transformation going from one space to the other.&lt;/p&gt;

&lt;h1 id=&quot;application-to-the-gaussian-integral&quot;&gt;Application to the Gaussian Integral.&lt;/h1&gt;

&lt;p&gt;To apply the previous result to the Gaussian Integral $I=\int\limits_{-\infty}^{\infty} e^{-t^2} dt$, we will look instead at a similar integral but in dimension 2:&lt;/p&gt;

&lt;p&gt;$$
J=\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty} e^{-(x^2+y^2)} dx dy.
$$&lt;/p&gt;

&lt;p&gt;Surprisingly, this integral turns out to be easier to calculate. Moreover, using Fubini’s theorem, we can easily show that&lt;/p&gt;

&lt;p&gt;$$
J
=\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty} e^{-(x^2+y^2)} dx dy
=\int\limits_{-\infty}^{\infty} e^{-x^2} dx \int\limits_{-\infty}^{\infty}e^{-y^2}dy
= I^2
$$&lt;/p&gt;

&lt;p&gt;Now in order to calculate it, we first notice that if we denote $S_R =[-R,R]^2$ the square of side length $2R$ centered at the origin for any $R&amp;gt;0$, and $J_R= \int\limits_{S_R} e^{-(x^2+y^2)} dx dy$, we have:&lt;/p&gt;

&lt;p&gt;$$
J= \lim_{R \rightarrow \infty} J_R
$$&lt;/p&gt;

&lt;p&gt;Moreover, since the half-diagonal of the square $S_R$ has length $\sqrt{2} R$, if we denote $D_R = \{ (x,y)\vert x^2+y^2 \leq R^2\}$ the disk of radius $R$, we have that $D_R \subset S_R \subset D_{\sqrt{2}R}$.&lt;/p&gt;

&lt;p&gt;Hence, denoting $K_R= \int\limits_{D_R} e^{-(x^2+y^2)} dx dy$, we have&lt;/p&gt;

&lt;p&gt;$$K_R \leq J_{R} \leq K_{\sqrt{2}R}.$$&lt;/p&gt;

&lt;p&gt;Now we will show that $K_R \rightarrow \pi$ when $R \rightarrow \infty$. As a consequence, $J_R$ will also tend to $\pi$, since it is bounded below and above by expressions that tend to $\pi$.&lt;/p&gt;

&lt;p&gt;This is where the change of variable formula comes into play: we will express the integral in polar coordinates.
Denoting $(x,y)=\phi(u,\theta) = (u\cos(\theta), u\sin(\theta))$ and $f(x,y)=e^{-(x^2+y^2)}$,
we have that $$\int f(x,y) dxdy = \int f\circ \phi (u,\theta) Jac_{\phi}(u,\theta) du d\theta$$&lt;/p&gt;

&lt;p&gt;Thus:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}K_R &amp;amp;= \int\limits_{D_R} e^{-(x^2+y^2)} dx dy \\ &amp;amp;=
\int\limits_{[0,R[\times [0,2\pi[} e^{-u^2} u du d\theta  \\ &amp;amp;=
\frac{1}{2}\int\limits_{[0,R^2[\times [0,2\pi[} e^{-s} ds d\theta\\ &amp;amp;=
\pi (1-e^{-R^2})
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;The third line uses the substitution $s=u^2$, for which $ds = 2u du$.&lt;/p&gt;

&lt;p&gt;So $K_{R}\rightarrow \pi$ when $R \rightarrow \infty$.&lt;/p&gt;

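&lt;p&gt;Together with the inequality $K_R \leq J_{R} \leq K_{\sqrt{2}R}$, this gives $J = \lim_{R \rightarrow \infty} J_R = \pi$. Since $J=I^2$ and $I&amp;gt;0$, this shows that $I=\sqrt{\pi}$.&lt;/p&gt;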
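
&lt;p&gt;The result can be sanity-checked numerically. The sketch below is only an illustration: it approximates $I$ with a plain midpoint rule on a truncated interval, and evaluates the closed form $K_R=\pi(1-e^{-R^2})$ given by the polar substitution. The function names, the truncation at $\pm 10$ and the number of steps are arbitrary choices made for this example.&lt;/p&gt;

```python
import math


def gaussian_integral(half_width=10.0, steps=200_000):
    """Midpoint-rule approximation of the integral of exp(-t^2) over [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        t = -half_width + (k + 0.5) * h  # midpoint of the k-th sub-interval
        total += math.exp(-t * t)
    return total * h


def disk_integral(radius):
    """Closed form of K_R, the integral of exp(-(x^2+y^2)) over the disk of radius R."""
    return math.pi * (1.0 - math.exp(-radius ** 2))


approx_I = gaussian_integral()
print(approx_I, math.sqrt(math.pi))   # the two values should agree closely
print(disk_integral(5.0), math.pi)    # K_R is already very close to pi for R = 5
```

&lt;p&gt;The tails beyond $\pm 10$ contribute less than $e^{-100}$, which is why truncating the interval is harmless here.&lt;/p&gt;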

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;We have seen that the change of variable formula is a powerful tool to calculate integrals whose integrand has no closed-form antiderivative.
The Jacobian expresses the ratio between the infinitesimal volumes used in the integrals.
This technique can be applied to calculate seemingly complicated integrals in a few lines.
The change of variable formula has been used extensively in Machine Learning, especially in the reparametrization of distributions in Representation Learning.&lt;/p&gt;

&lt;p&gt;For instance Normalizing Flows or Variational Autoencoders rely heavily on it.&lt;/p&gt;

&lt;ol class=&quot;bibliography&quot;&gt;&lt;/ol&gt;
</description>
        <pubDate>Wed, 30 Nov 2022 00:00:00 -0500</pubDate>
        <link>/articles/22/gaussian-integral</link>
        <guid isPermaLink="true">/articles/22/gaussian-integral</guid>
        
        
      </item>
    
      <item>
        <title>Investigating Different Distance Functions for Latent Distance Graph Models</title>
        <description>&lt;p&gt;In this post, we consider Latent Space models for graphs, and investigate the impact of the distance function used on the embedding space.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Latent Space models for graphs are defined such that the independent edge link probabilities are given by functions of the distance between some embeddings. $\newcommand{\norm}[1]{\vert\vert #1 \vert\vert}$Denoting $i,j$ some node indices and $z_i \in \mathbb{R}^d$ some node embeddings of dimension $d&amp;gt;0$, the model writes:&lt;/p&gt;

&lt;p&gt;$$
\begin{align}
\label{model}
y_{ij} &amp;amp;\sim Bernoulli(\theta_{ij}) \\ \theta_{ij} &amp;amp;= \sigma(x_{ij}) \\  x_{ij} &amp;amp;= \gamma- g(\norm{z_i - z_j}^2)
\end{align}
$$&lt;/p&gt;

&lt;p&gt;where $x_{ij}$ are the logits of the model, $\theta_{ij}$ are the edge link probabilities and $\sigma$ is the sigmoid link function.&lt;/p&gt;

&lt;p&gt;The function $g$ can be any non-decreasing smooth function. For instance:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If g is the identity function: $x_{ij} = \gamma- \norm{z_i - z_j}^2$&lt;/li&gt;
  &lt;li&gt;If g is the square root function: $x_{ij} = \gamma- \norm{z_i - z_j}$&lt;/li&gt;
  &lt;li&gt;If g is the log: $x_{ij} = \gamma- 2\log(\norm{z_i - z_j})$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We would like to investigate the impact of this distance function on the embeddings found by performing Maximum Likelihood Estimation of the model, given an observed graph.&lt;/p&gt;

&lt;h1 id=&quot;likelihood-and-gradient-of-the-model&quot;&gt;Likelihood and gradient of the model&lt;/h1&gt;

&lt;p&gt;The likelihood of a given observed undirected graph $\hat{G}$ with adjacency matrix $A=\{y_{ij}\}$ is given by:&lt;/p&gt;

&lt;p&gt;$$p(\hat{G}) = \prod\limits_{i&amp;lt;j} \theta_{ij}^{y_{ij}} (1-\theta_{ij})^{1-y_{ij}}$$&lt;/p&gt;

&lt;p&gt;Hence we get the following negative log-likelihood, as a function of the embeddings $z_i$:&lt;/p&gt;

&lt;p&gt;$$L(z) = \sum\limits_{i&amp;lt;j} \left[ \log(1+ \exp(x_{ij}))  - y_{ij} x_{ij} \right]$$&lt;/p&gt;

&lt;h1 id=&quot;gradient&quot;&gt;Gradient&lt;/h1&gt;

&lt;p&gt;For a given node $i$, we compute the gradient of the loss function with respect to the embedding $z_i$.&lt;/p&gt;

&lt;p&gt;Since the derivative of $\log(1+\exp(x)) - yx$ with respect to $x$ is $\sigma(x) - y$, it is given by:&lt;/p&gt;

&lt;p&gt;$$\nabla_{z_i}L(z) = \sum_{j\neq i} (\nabla_{z_i}x_{ij}) (\sigma(x_{ij}) - y_{ij})$$&lt;/p&gt;

&lt;p&gt;Moreover, using the chain rule gives us the gradient of the logit $x_{ij}$ with respect to the embeddings:&lt;/p&gt;

&lt;p&gt;$$\nabla_{z_i}x_{ij} = -2(z_i - z_j) g'(\norm{z_i-z_j}^2)$$&lt;/p&gt;

&lt;p&gt;So finally we get the following gradient:&lt;/p&gt;

&lt;p&gt;$$\nabla_{z_i}L(z) = \sum_{j\neq i} 2(z_i - z_j) g'(\norm{z_i-z_j}^2) (y_{ij} - \sigma(x_{ij}))$$&lt;/p&gt;
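
&lt;p&gt;A gradient formula like this one is easy to get wrong by a sign, so it can be worth comparing it with finite differences. The following sketch is an assumption-laden illustration in pure Python: the helper names, the three toy embeddings, the toy adjacency matrix and the value $\gamma=1$ are all made up for the example, and $g$ is taken to be the identity.&lt;/p&gt;

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def loss(z, y, gamma, g):
    """Negative log-likelihood L(z), summed over unordered pairs of nodes."""
    total = 0.0
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            x = gamma - g(sq_dist(z[i], z[j]))
            total += math.log(1.0 + math.exp(x)) - y[i][j] * x
    return total


def grad_i(z, y, gamma, g, g_prime, i):
    """Analytic gradient of L with respect to z[i], following the final formula."""
    grad = [0.0] * len(z[i])
    for j in range(len(z)):
        if j == i:
            continue
        d2 = sq_dist(z[i], z[j])
        x = gamma - g(d2)
        coeff = 2.0 * g_prime(d2) * (y[i][j] - sigmoid(x))
        for k in range(len(grad)):
            grad[k] += coeff * (z[i][k] - z[j][k])
    return grad


def numeric_grad_i(z, y, gamma, g, i, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for k in range(len(z[i])):
        plus = [list(p) for p in z]
        minus = [list(p) for p in z]
        plus[i][k] += eps
        minus[i][k] -= eps
        grad.append((loss(plus, y, gamma, g) - loss(minus, y, gamma, g)) / (2 * eps))
    return grad


# Toy data (hypothetical): three nodes embedded in the plane, edges 0-1 and 1-2.
z = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]
y = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
analytic = grad_i(z, y, 1.0, lambda t: t, lambda t: 1.0, 0)
numeric = numeric_grad_i(z, y, 1.0, lambda t: t, 0)
print(analytic, numeric)
```

&lt;p&gt;If the two printed vectors agree up to the finite-difference error, the sign conventions of the derivation are consistent.&lt;/p&gt;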

&lt;h1 id=&quot;interpretation-in-terms-of-forces&quot;&gt;Interpretation in terms of forces&lt;/h1&gt;

&lt;p&gt;As we see in the previous expression, the gradients with respect to the embeddings can be viewed as a set of forces pulling the embeddings towards each other or pushing them away from each other, depending on whether the corresponding nodes are linked in the graph or not.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If the nodes $i$ and $j$ are connected (i.e. $y_{ij}=1$), we get:
$$y_{ij} - \sigma(x_{ij}) = 1 - \sigma(x_{ij}) = \frac{1}{1+\exp(x_{ij})}&amp;gt;0.$$
So the associated gradient term will be the following &lt;em&gt;attractive force&lt;/em&gt;
$$
\begin{aligned}
\vec{f_{ij}^{+}} &amp;amp;= 2(z_i - z_j) g'(\norm{z_i-z_j}^2) (y_{ij} - \sigma(x_{ij})) \\ &amp;amp;= (z_i - z_j) (\frac{2 g'(\norm{z_i-z_j}^2)}{1+\exp(x_{ij})})
\end{aligned}
$$
Indeed, since $g$ is non-decreasing we have $\frac{2 g'(\norm{z_i-z_j}^2)}{1+\exp(x_{ij})} \geq 0$, so this vector is oriented from the embedding $z_j$ to the embedding $z_i$. A gradient descent step moves $z_i$ in the opposite direction, i.e. towards $z_j$, hence the term “attractive”.
Later, we might be interested in how the intensity of this force scales with the distance between embeddings.&lt;/li&gt;
  &lt;li&gt;If the nodes $i$ and $j$ are not connected (i.e. $y_{ij}=0$), we have $y_{ij} - \sigma(x_{ij}) = -\sigma(x_{ij}) = -\frac{1}{1+\exp(-x_{ij})}$, and the embeddings are connected by the following &lt;em&gt;repulsive force&lt;/em&gt; (essentially pushing $z_i$ away from $z_j$):
$$
\begin{aligned}
\vec{f_{ij}^{-}} = - (z_i - z_j) (\frac{2 g'(\norm{z_i-z_j}^2)}{1+\exp(-x_{ij})})
\end{aligned}
$$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Denoting the &lt;em&gt;sign&lt;/em&gt; variable $s_{ij} = 1$ if $y_{ij}=1$ and $s_{ij} = -1$ if $y_{ij}=0$, we get the following compact formula for this force term:&lt;/p&gt;

&lt;p&gt;$$
  \begin{aligned}
  \vec{f_{ij}} = s_{ij} (z_i - z_j) (\frac{2 g'(\norm{z_i-z_j}^2)}{1+\exp(s_{ij} x_{ij})})
  \end{aligned}
$$&lt;/p&gt;

&lt;h1 id=&quot;examples&quot;&gt;Examples&lt;/h1&gt;

&lt;p&gt;Using different distance functions, we can derive the attractive and repulsive forces to have an idea of their intensity.&lt;/p&gt;

&lt;h3 id=&quot;identity-distance-function&quot;&gt;Identity distance function&lt;/h3&gt;

&lt;p&gt;In the case where $g$ is simply the identity function, we get a signed force equal to
$$\vec{f_{ij}} =  s_{ij}(z_i-z_j)\frac{2}{1+\exp(s_{ij} (\gamma - \norm{z_i-z_j}^2))}$$
Thus, in that case the norm of the force is given by&lt;/p&gt;

&lt;p&gt;$$\norm{\vec{f_{ij}}} = \frac{2\norm{z_i-z_j}}{1+\exp(s_{ij} (\gamma - \norm{z_i-z_j}^2))}$$&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For positive edges ($s_{ij}=1$), this force vanishes when the embeddings match, and grows linearly with the distance between embeddings, since the exponential term tends to $0$:
$$\norm{\vec{f_{ij}}} \sim 2\norm{z_i-z_j}$$ when $\norm{z_i-z_j} \rightarrow +\infty$.&lt;/li&gt;
  &lt;li&gt;For negative edges ($s_{ij}=-1$), this force also vanishes when the embeddings match, and tends to $0$ when $\norm{z_i-z_j} \rightarrow +\infty$, since $\norm{\vec{f_{ij}}} \sim 2\norm{z_i-z_j}\exp(\gamma - \norm{z_i-z_j}^2)$.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;square-root-distance-functions&quot;&gt;Square root distance functions&lt;/h3&gt;

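&lt;p&gt;If $g$ is the square root function, we have $g'(t) = \frac{1}{2\sqrt{t}}$ and $x_{ij} = \gamma - \norm{z_i-z_j}$, so we get a signed force equal to
$$\vec{f_{ij}} =  \frac{s_{ij}(z_i-z_j)}{\norm{z_i - z_j}(1+\exp(s_{ij} (\gamma - \norm{z_i-z_j})))}$$&lt;/p&gt;

&lt;p&gt;The norm of this force term writes:
$$\norm{\vec{f_{ij}}} = \frac{1}{1+\exp(s_{ij} (\gamma - \norm{z_i-z_j}))}$$
This has the following asymptotic behavior when $\norm{z_i-z_j} \rightarrow +\infty$:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For positive edges ($s_{ij}=1$), the norm of the force tends to $1$.&lt;/li&gt;
  &lt;li&gt;For negative edges ($s_{ij}=-1$), the norm of the force tends to $0$, since $\norm{\vec{f_{ij}}} \sim \exp(\gamma - \norm{z_i-z_j})$.&lt;/li&gt;
&lt;/ul&gt;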

&lt;h3 id=&quot;log-distance-functions&quot;&gt;Log distance functions&lt;/h3&gt;

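&lt;p&gt;If $g$ is the log, we have $g'(t) = \frac{1}{t}$ and $x_{ij} = \gamma - 2\log(\norm{z_i-z_j})$, so we get&lt;/p&gt;

&lt;p&gt;$$
\vec{f_{ij}} =  \frac{2s_{ij}(z_i-z_j)}{\norm{z_i - z_j}^2(1+\exp(s_{ij} (\gamma - 2\log(\norm{z_i-z_j}))))}
$$&lt;/p&gt;

&lt;p&gt;In that case, the norm of the force is&lt;/p&gt;

&lt;p&gt;$$
\norm{\vec{f_{ij}}} = \frac{2}{\norm{z_i - z_j}(1+\exp(s_{ij} (\gamma - 2\log(\norm{z_i-z_j}))))}
$$&lt;/p&gt;

&lt;p&gt;For negative edges ($s_{ij}=-1$), the exponential term equals $e^{-\gamma}\norm{z_i-z_j}^2$, so the norm tends to $+\infty$ when the distance tends to $0$. For positive edges ($s_{ij}=1$), the exponential term equals $e^{\gamma}/\norm{z_i-z_j}^2$, so the norm simplifies to $\frac{2\norm{z_i-z_j}}{\norm{z_i-z_j}^2+e^{\gamma}}$, which tends to $0$ when the distance tends to $0$.&lt;/p&gt;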
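
&lt;p&gt;To compare the three cases side by side, one can evaluate the norm of the compact force term $\vec{f_{ij}}$ directly from $g$ and $g'$, without using the simplified expressions. The sketch below does this in pure Python; the dictionary of distance functions, the function name, the value $\gamma=1$ and the sample distances are assumptions chosen only for illustration.&lt;/p&gt;

```python
import math

# (g, g') pairs for the three distance functions; both take t = squared distance.
DISTANCE_FUNCTIONS = {
    "identity": (lambda t: t, lambda t: 1.0),
    "sqrt": (lambda t: math.sqrt(t), lambda t: 0.5 / math.sqrt(t)),
    "log": (lambda t: math.log(t), lambda t: 1.0 / t),
}


def force_norm(dist, sign, gamma, name):
    """Norm of f_ij = s (z_i - z_j) 2 g'(d^2) / (1 + exp(s x_ij)) at distance d > 0."""
    g, g_prime = DISTANCE_FUNCTIONS[name]
    x = gamma - g(dist ** 2)  # logit x_ij for this distance
    return dist * 2.0 * g_prime(dist ** 2) / (1.0 + math.exp(sign * x))


for name in DISTANCE_FUNCTIONS:
    for sign in (1, -1):
        values = [force_norm(d, sign, 1.0, name) for d in (0.1, 1.0, 5.0)]
        print(name, sign, ["%.4f" % v for v in values])
```

&lt;p&gt;Printing the values for a small, a medium and a large distance gives a rough picture of how each choice of $g$ behaves near $0$ and far away.&lt;/p&gt;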

&lt;!-- We see that the first order derivative of the distance function has an impact on the type of force  --&gt;

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this article, we evaluate how the distance function in latent space models impacts the form of the attractive and repulsive forces that govern the Maximum Likelihood Estimation.
We see that depending on the form of the first order derivative of the distance function, the forces will have different behaviors when the distances go to $0$ or to $\infty$.&lt;/p&gt;

&lt;ol class=&quot;bibliography&quot;&gt;&lt;/ol&gt;
</description>
        <pubDate>Sat, 01 Oct 2022 00:00:00 -0400</pubDate>
        <link>/articles/22/latent_space_functions</link>
        <guid isPermaLink="true">/articles/22/latent_space_functions</guid>
        
        
      </item>
    
      <item>
        <title>About train-test splitting for link prediction</title>
        <description>&lt;p&gt;In this post, we describe the link prediction task for networks, and the strategies to evaluate link prediction methods&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Link prediction is a machine learning task that consists in predicting missing links on an observed network. Applications of this task include recommender systems, friend recommendations on social media, and prediction of protein-protein interactions in biology, among others.&lt;/p&gt;

&lt;p&gt;A link prediction method can be seen as a score function that takes as input an edge $e$ and yields a score $s(e) \in [0,1]$.&lt;/p&gt;

&lt;h1 id=&quot;evaluations-strategy&quot;&gt;Evaluation strategy&lt;/h1&gt;

&lt;p&gt;We want to evaluate a Link prediction method on a given graph $G= (U,E)$, $U$ being the set of nodes in the graph and $E\subset U\times U$ the set of edges.&lt;/p&gt;

&lt;h1 id=&quot;pruning---traintest-splitting&quot;&gt;Pruning - Train/test splitting&lt;/h1&gt;

&lt;p&gt;The first step is to remove some of the observed edges.&lt;/p&gt;

&lt;p&gt;Doing so yields a pruned graph $\newcommand{\tE}{\tilde{E}}$ $\newcommand{\tG}{\tilde{G}}$ $\tG = (U,\tE)$, where $\tE\subset E$. Note that this pruned graph can be disconnected and have isolated nodes.&lt;/p&gt;

&lt;p&gt;We denote $\newcommand{\mE}{E_{missing}}$$\mE$ the set of edges that were removed during this pruning operation, i.e. $\mE=E \backslash \tE$.&lt;/p&gt;

&lt;p&gt;In contrast, we denote $\newcommand{\nE}{E_{neg}}$$\nE$ the set of true negative edges, i.e. the edges that were actually absent from the original graph.&lt;/p&gt;

&lt;p&gt;This set is disjoint from the set of missing edges: $\nE \cap \mE = \emptyset$&lt;/p&gt;

&lt;p&gt;The goal of link prediction is to score the edges that were removed from the graph higher than the edges that were not in the graph in the first place.&lt;/p&gt;

&lt;p&gt;In other words, we construct a binary classification dataset composed of edges $e\in \mE \cup \nE$, and a response variable $y=1$ if $e \in \mE$ and $y=0$ otherwise.&lt;/p&gt;
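
&lt;p&gt;The construction above can be sketched in a few lines. This is a minimal illustration under assumptions that are not part of the description: the function name and its parameters are hypothetical, edges are undirected pairs of sortable node identifiers, and as many negative edges as missing edges are sampled (taking all of $\nE$ would often be too large in practice).&lt;/p&gt;

```python
import random


def split_edges(nodes, edges, test_fraction=0.2, seed=0):
    """Prune a fraction of the observed edges and sample as many true negative edges."""
    rng = random.Random(seed)
    observed = sorted({tuple(sorted(e)) for e in edges})
    rng.shuffle(observed)
    n_missing = max(1, int(test_fraction * len(observed)))
    missing = observed[:n_missing]   # E_missing, labelled y = 1
    kept = observed[n_missing:]      # edges of the pruned graph
    observed_set = set(observed)
    # Candidate negatives: node pairs that are not edges of the original graph.
    candidates = [
        (u, v)
        for idx, u in enumerate(nodes)
        for v in nodes[idx + 1:]
        if tuple(sorted((u, v))) not in observed_set
    ]
    negatives = rng.sample(candidates, min(n_missing, len(candidates)))  # subset of E_neg
    test_set = [(e, 1) for e in missing] + [(e, 0) for e in negatives]
    return kept, test_set


nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (1, 4)]
kept, test_set = split_edges(nodes, edges, test_fraction=0.3)
print(kept)
print(test_set)
```

&lt;p&gt;The link prediction method is then fitted on the pruned graph only, and its scores on the test set are compared with the labels.&lt;/p&gt;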

&lt;h1 id=&quot;metrics&quot;&gt;Metrics&lt;/h1&gt;

&lt;p&gt;Within the context described before, the missing edges can be regarded as the ground-truth &lt;em&gt;positive&lt;/em&gt; examples, while the negative edges can be regarded as the ground-truth &lt;em&gt;negative&lt;/em&gt; examples. The link prediction method can then be evaluated as a binary classifier, trying to discriminate edges that were removed during the pruning process from edges that were already absent from the original graph.&lt;/p&gt;

&lt;p&gt;Given a threshold value $\tau$, we can define a decision rule on the set of edges: an edge $e$ is predicted positive if its score $s(e)$ is higher than $\tau$, and negative otherwise.&lt;/p&gt;

&lt;h2 id=&quot;example&quot;&gt;Example&lt;/h2&gt;

&lt;p&gt;Let’s suppose that our test set is composed of 7 edges $e_1,…,e_7$, and that we set the decision threshold to $\tau=0.89$.&lt;/p&gt;

&lt;p&gt;In the table below, we order the edges in order of their link prediction scores, and predict them to be positive if their score is higher than the threshold.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Edge&lt;/th&gt;
      &lt;th&gt;$e_1$&lt;/th&gt;
      &lt;th&gt;$e_2$&lt;/th&gt;
      &lt;th&gt;$e_3$&lt;/th&gt;
      &lt;th&gt;$e_4$&lt;/th&gt;
      &lt;th&gt;$e_5$&lt;/th&gt;
      &lt;th&gt;$e_6$&lt;/th&gt;
      &lt;th&gt;$e_7$&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Score&lt;/td&gt;
      &lt;td&gt;0.95&lt;/td&gt;
      &lt;td&gt;0.94&lt;/td&gt;
      &lt;td&gt;0.92&lt;/td&gt;
      &lt;td&gt;0.9&lt;/td&gt;
      &lt;td&gt;0.8&lt;/td&gt;
      &lt;td&gt;0.79&lt;/td&gt;
      &lt;td&gt;0.73&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Label&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Prediction&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As we can see, using this threshold, we have selected $P(\tau)=4$ edges as positive.&lt;/p&gt;

&lt;p&gt;Comparing the predictions with the labels:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;$e_1$, $e_2$ and $e_3$ are correctly labeled as positive, so the number of true positives is $TP(\tau)=3$.&lt;/li&gt;
  &lt;li&gt;$e_4$ is wrongly labeled as positive, so the number of false positives is $FP(\tau)=1$.&lt;/li&gt;
  &lt;li&gt;$e_6$ is wrongly labeled as negative, so the number of false negatives is $FN(\tau)=1$.&lt;/li&gt;
  &lt;li&gt;$e_5$ and $e_7$ are correctly labeled as negative, so the number of true negatives is $TN(\tau)=2$.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on these counts, we can derive the following confusion matrix: &lt;label for=&quot;note&quot; class=&quot;margin-toggle sidenote-number&quot;&gt;&lt;/label&gt;&lt;input type=&quot;checkbox&quot; id=&quot;note&quot; class=&quot;margin-toggle&quot; /&gt;&lt;span class=&quot;sidenote&quot;&gt;Note that all the values above are piecewise constant functions of the decision threshold, where the cutoff points are the score values of the samples. &lt;/span&gt;&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;Predicted Positive&lt;/th&gt;
      &lt;th&gt;Predicted Negative&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Positive&lt;/td&gt;
      &lt;td&gt;TP=3&lt;/td&gt;
      &lt;td&gt;FN=1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Negative&lt;/td&gt;
      &lt;td&gt;FP=1&lt;/td&gt;
      &lt;td&gt;TN=2&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
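
&lt;p&gt;The same counts can be recomputed from the scores and labels of the example table. The snippet below is a small sketch in pure Python; the function name is an assumption, and the strict comparison with $\tau$ mirrors the rule of predicting positive when the score is higher than the threshold.&lt;/p&gt;

```python
scores = [0.95, 0.94, 0.92, 0.90, 0.80, 0.79, 0.73]
labels = [1, 1, 1, 0, 0, 1, 0]
tau = 0.89


def confusion_counts(scores, labels, tau):
    """Count TP, FP, FN and TN for the decision rule 'positive if score > tau'."""
    predictions = [1 if s > tau else 0 for s in scores]
    pairs = list(zip(predictions, labels))  # (prediction, ground truth)
    return {
        "TP": pairs.count((1, 1)),
        "FP": pairs.count((1, 0)),
        "FN": pairs.count((0, 1)),
        "TN": pairs.count((0, 0)),
    }


counts = confusion_counts(scores, labels, tau)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 2}
```

&lt;p&gt;Changing the value of the threshold in this snippet shows how the four counts only change when $\tau$ crosses one of the scores.&lt;/p&gt;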

&lt;h3 id=&quot;precision-and-recall&quot;&gt;Precision and Recall&lt;/h3&gt;

&lt;p&gt;The precision is the fraction of positively predicted edges that have a positive ground truth label.&lt;/p&gt;

&lt;p&gt;The recall is the fraction of edges having a ground truth positive label that were actually classified, or “recalled”, as positive.&lt;/p&gt;
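
&lt;p&gt;In terms of the counts of the confusion matrix, and with the values of the example ($TP=3$, $FP=1$, $FN=1$), these two definitions write:&lt;/p&gt;

&lt;p&gt;$$
Precision(\tau) = \frac{TP(\tau)}{TP(\tau)+FP(\tau)} = \frac{3}{4},
\qquad
Recall(\tau) = \frac{TP(\tau)}{TP(\tau)+FN(\tau)} = \frac{3}{4}.
$$&lt;/p&gt;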

&lt;p&gt;To remember these notions, one can think of the score as the result of a search on Google. Given a query term, Google looks for the most relevant articles and returns them, ordered by relevance. One can then select the first $K$ of these articles, and decide whether they are actually relevant or not (i.e. the ground truth is given by the user). The precision score tells us how precise the results are, ensuring that Google doesn’t yield too many irrelevant articles. The recall, on the other hand, ensures that Google will give a high rank to a good proportion of the articles that are relevant for the user.&lt;/p&gt;

&lt;ol class=&quot;bibliography&quot;&gt;&lt;/ol&gt;
</description>
        <pubDate>Fri, 25 Mar 2022 00:00:00 -0400</pubDate>
        <link>/articles/22/on_link_prediction</link>
        <guid isPermaLink="true">/articles/22/on_link_prediction</guid>
        
        
      </item>
    
      <item>
        <title>The Expectation-Maximization algorithm</title>
        <description>&lt;p&gt;In this post, I explain the popular Expectation-Maximization algorithm under a variational methods perspective.$\newcommand{\set}[1]{\left\{ #1 \right\}}$&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Let’s consider an observation $x\in\mathcal{X}$. In a traditional statistical setting, we posit a model $\set{ p_{\theta} \vert \theta \in \Theta} $ describing the generation process of the variable $x$, and we would like to make inference on the parameter $\theta$.&lt;/p&gt;

&lt;p&gt;However, observations are often best described by introducing an unobserved latent variable $z\in \mathcal{Z}$. This unobserved variable can either correspond to an existing unobserved real-world attribute of the data, or to an abstract hypothesis made to describe the data (for instance in clustering, we define clusters and cluster assignments as latent variables).&lt;/p&gt;

&lt;p&gt;In that context, computing the log-likelihood of the data involves integrating over the latent variable, since:&lt;/p&gt;

&lt;p&gt;$$
p_{\theta}(x) = \int p_{\theta}(x,z) dz
$$&lt;/p&gt;

&lt;h1 id=&quot;maximum-likelihood-estimation&quot;&gt;Maximum Likelihood Estimation&lt;/h1&gt;

&lt;p&gt;Let’s suppose we aim to provide a Maximum Likelihood Estimate $\hat{\theta}^{(MLE)}$ of the parameter $\theta$. Classically, to do this we maximize the log-likelihood of the data:&lt;/p&gt;

&lt;p&gt;$$
\max\limits_{\theta} \log \left(\int p_{\theta}(x,z) dz \right)
$$&lt;/p&gt;

&lt;p&gt;Unfortunately, due to the presence of the integral, this function of $\theta$ is often non-concave, and thus difficult to maximize as such.&lt;/p&gt;

&lt;p&gt;However, in many cases, once the latent variable is known, the log-likelihood becomes concave and easy to maximize, &lt;em&gt;for this particular value of the latent variable&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;$$\max\limits_{\theta} \log \left(p_{\theta}(x,z) \right)$$&lt;/p&gt;

&lt;p&gt;The Expectation-Maximization algorithm &lt;a class=&quot;citation&quot; href=&quot;#Dempster1977&quot;&gt;(Dempster et al., 1977)&lt;/a&gt; provides a solution in this particular case, by essentially performing alternate optimization on the parameter, and a variational distribution.&lt;/p&gt;

&lt;h1 id=&quot;variational-expression-of-the-log-likelihood&quot;&gt;Variational Expression of the log-likelihood&lt;/h1&gt;

&lt;p&gt;$\newcommand{\PZ}{\mathcal{P}(\mathcal{Z})}$ Let $\PZ$ denote the set of all possible densities defined on the latent space $\mathcal{Z}$.&lt;/p&gt;

&lt;p&gt;As is common in variational methods, we use Jensen’s inequality to rewrite the log-likelihood of the data as:&lt;/p&gt;

&lt;p&gt;$$
\log(p_{\theta}(x)) = \underset{q \in \PZ}{\max} \space F(x, q;\theta)
$$&lt;/p&gt;

&lt;p&gt;Where $F(x,q; \theta)$ is the &lt;em&gt;variational free energy&lt;/em&gt; defined as&lt;/p&gt;

&lt;p&gt;$$
F(x, q;\theta) = \int
\log(
  \frac{p_{\theta}(x,z)}{q(z)}
)
  q(z)dz
$$&lt;/p&gt;
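
&lt;p&gt;To see why the maximum of $F$ over $q$ is the log-likelihood, one can factor the joint density as $p_{\theta}(x,z) = p_{\theta}(z \vert x) p_{\theta}(x)$ inside the logarithm. The notation $KL(q \Vert p)$ for the Kullback-Leibler divergence is introduced here only for this remark:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
F(x, q;\theta) &amp;amp;= \int \log\left(\frac{p_{\theta}(z \vert x) p_{\theta}(x)}{q(z)}\right) q(z)dz \\ &amp;amp;=
\log(p_{\theta}(x)) - \int \log\left(\frac{q(z)}{p_{\theta}(z \vert x)}\right) q(z)dz \\ &amp;amp;=
\log(p_{\theta}(x)) - KL(q \Vert p_{\theta}(. \vert x))
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;Since the Kullback-Leibler divergence is non-negative and vanishes exactly when $q = p_{\theta}(. \vert x)$, $F$ is a lower bound of the log-likelihood that is tight for this choice of $q$.&lt;/p&gt;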

&lt;p&gt;As a consequence, the maximization of the log likelihood becomes a double maximization problem:&lt;/p&gt;

&lt;p&gt;$$
\max \limits_{\theta} \log \left(p_{\theta}(x) \right) = \max\limits_{\theta} \max\limits_{q\in \PZ} F(x,q;\theta)
$$&lt;/p&gt;

&lt;h1 id=&quot;alternate-optimization-of-the-variational-free-energy&quot;&gt;Alternate optimization of the Variational Free Energy&lt;/h1&gt;

&lt;p&gt;Since the optimization problem involves two maximizations, the fundamental idea of the EM algorithm is to perform alternating optimization.&lt;/p&gt;

&lt;p&gt;Supposing that at step $t$ we have obtained a parameter value $\theta^{t}$ and a variational distribution $q^t$, we update these values one after the other in two steps.&lt;/p&gt;

&lt;h3 id=&quot;the-expectation-step&quot;&gt;The Expectation Step&lt;/h3&gt;

&lt;p&gt;In that step, we fix the value of the parameter $\theta^{t}$ and maximize $F$ with respect to the variational distribution:&lt;/p&gt;

&lt;p&gt;$$q^{t+1}=\underset{q}{\operatorname{argmax}}\space F(x,q;\theta^{t}).$$&lt;/p&gt;

&lt;p&gt;Using the Lagrange multiplier methods, one can easily find that the optimal distribution is given by the conditional distribution of the latent variable given the data $x$, under the current set of parameters $\theta^t$:&lt;/p&gt;

&lt;p&gt;$$q^{t+1}(z) = p_{\theta^t}(z \vert x)$$&lt;/p&gt;

&lt;p&gt;Injecting this optimal distribution back into the variational free energy yields a new function of $\theta$, which is concave in the favorable cases mentioned above. It is given by:$\newcommand{\E}{\mathbb{E}}$&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}F(x,q^{t+1}; \theta) &amp;amp;=
F(x,p_{\theta^t}(. \vert x); \theta)\\ &amp;amp;=
\int
\log\left(
  \frac{p_{\theta}(x,z)}{p_{\theta^t}(z \vert x)}
\right)
  p_{\theta^t}(z \vert x) dz \\ &amp;amp;=
\E_{Z\sim p_{\theta^t}(.\vert x)}[\log(p_{\theta}(x,Z))] + H(p_{\theta^t}(. \vert x))
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;where $H(p_{\theta^t}(. \vert x))$ is the entropy of the conditional latent distribution $p_{\theta^t}(. \vert x)$.&lt;/p&gt;

&lt;p&gt;Since the entropy term doesn’t depend on the parameter $\theta$, the auxiliary function to optimize is given by the expectation of the joint log-likelihood, under the conditional distribution of the latent variable given the observation and the current parameter:&lt;/p&gt;

&lt;p&gt;$$
\theta \mapsto E(\theta)=\E_{Z\sim p_{\theta^t}(.\vert x)}[\log(p_{\theta}(x,Z))]
$$&lt;/p&gt;

&lt;p&gt;Thus this step is called the &lt;strong&gt;Expectation&lt;/strong&gt; step.&lt;/p&gt;

&lt;h3 id=&quot;the-maximization-step&quot;&gt;The Maximization step&lt;/h3&gt;

&lt;p&gt;Once the auxiliary function has been derived in the Expectation step, its concavity allows us to easily maximize it with respect to the model parameter:&lt;/p&gt;

&lt;p&gt;$$
\theta^{t+1} = \underset{\theta}{\operatorname{argmax}}\space E(\theta)
$$&lt;/p&gt;

&lt;p&gt;This step is thus called the &lt;strong&gt;Maximization step&lt;/strong&gt;.&lt;/p&gt;

&lt;h1 id=&quot;example-the-gaussian-mixture-model&quot;&gt;Example: the Gaussian Mixture Model.&lt;/h1&gt;

&lt;p&gt;In the Gaussian Mixture Model, we are given a dataset $x=\{x_1,…,x_n\}$, and our goal is to assign a cluster label $c_i\in\{1,…,K\}$ to each datapoint.&lt;/p&gt;

&lt;p&gt;To do so, an unobserved cluster assignment variable $z_i \in \{1,…,K\}$ is introduced.&lt;/p&gt;

&lt;p&gt;Moreover, each cluster $k\in \{1,…,K\}$ is associated with a Gaussian Distribution $\mathcal{N}(\mu_k, \Sigma_k)$.&lt;/p&gt;

&lt;p&gt;Given this, we assume the following generating process for the datapoints:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
z_i &amp;amp;\sim \mathcal{M}(1, (\lambda_1,…,\lambda_K)) \\ x_i\vert z_i=k &amp;amp;\sim \mathcal{N}(\mu_k, \Sigma_k)
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;The goal is to find the set of parameters
$\theta=(\lambda_1,…,\lambda_K, \mu_1,…,\mu_K, \Sigma_1,…,\Sigma_K )$ that maximizes the likelihood $ p(x| \theta) $.&lt;/p&gt;

&lt;h3 id=&quot;e-step&quot;&gt;E-step&lt;/h3&gt;

&lt;p&gt;As mentioned before, in the E-step we compute a surrogate function that is an expectation with respect to $z$, distributed under its conditional distribution given the data and the current set of parameters $\theta^{t}$.&lt;/p&gt;

&lt;p&gt;Using Bayes’ Rule, we get that this distribution is&lt;/p&gt;

&lt;p&gt;$$
\alpha_{i,k}^{t} \overset{\Delta}{=} p(z_i=k|x_i, \theta^{t}) = \frac{\lambda_k^{t}\mathcal{N}(x_i;\mu_k^{t}, \Sigma_k^{t})}{\sum_{k'=1}^{K}\lambda_{k'}^{t}\mathcal{N}(x_i;\mu_{k'}^{t}, \Sigma_{k'}^{t})}
$$&lt;/p&gt;

&lt;p&gt;We deduce that the surrogate function to maximize is&lt;/p&gt;

&lt;p&gt;$$E(\theta)= \sum_{i=1}^{n} \sum_{k=1}^{K} \alpha_{i,k}^{t}\left[ \log(\lambda_k)+\log(\mathcal{N}(x_i;\mu_k,\Sigma_k)) \right]$$&lt;/p&gt;

&lt;p&gt;Note that this maximization has to be done on the set of admissible parameters.
In particular, we should have $\sum_{k=1}^{K}\lambda_k = 1$.&lt;/p&gt;

&lt;h3 id=&quot;m-step&quot;&gt;M-step&lt;/h3&gt;

&lt;p&gt;In the M-step, we maximize the surrogate function derived at the E-step, with respect to the model parameters.
In the case of the Gaussian Mixture Model, this has a closed form that writes as follows.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Maximizing $E(\theta)$ with respect to $\lambda_1,…,\lambda_K$ under the constraint $\sum_{k=1}^{K}\lambda_k = 1$ yields :&lt;/p&gt;

    &lt;p&gt;$$
\lambda_{k}^{t+1} = \frac{1}{n}\sum_{i=1}^{n}\alpha_{i,k}^{t}
$$&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Maximizing $E(\theta)$ with respect to $\mu_1,…,\mu_{K}$ yields the following re-weighted empirical mean of the data points in each cluster:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;$$
\mu_{k}^{t+1} = \frac{\sum_{i=1}^n\alpha_{i,k}^{t}x_i}{\sum_{i=1}^n\alpha_{i,k}^{t}}
$$&lt;/p&gt;

&lt;ul&gt;
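  &lt;li&gt;Similarly, maximizing $E(\theta)$ with respect to $\Sigma_1,…,\Sigma_K$ yields the re-weighted empirical covariance of the data points in each cluster:&lt;/li&gt;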
&lt;/ul&gt;

&lt;p&gt;$$
\Sigma_{k}^{t+1} = \frac{\sum_{i=1}^n\alpha_{i,k}^{t}(x_i-\mu_k^{t+1})(x_i-\mu_k^{t+1})^T}{\sum_{i=1}^n\alpha_{i,k}^{t}}
$$&lt;/p&gt;
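
&lt;p&gt;As a sketch of how the update of the mixture weights can be recovered, one can introduce a Lagrange multiplier for the constraint $\sum_{k=1}^{K}\lambda_k = 1$. The symbols $\mathcal{L}$ and $\nu$ are notation introduced only for this illustration, and the last line uses the fact that $\sum_{k=1}^{K}\alpha_{i,k}^{t}=1$ for every $i$:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
\mathcal{L}(\lambda, \nu) &amp;amp;= \sum_{i=1}^{n}\sum_{k=1}^{K}\alpha_{i,k}^{t}\log(\lambda_k) + \nu\left(1-\sum_{k=1}^{K}\lambda_k\right) \\
\frac{\partial \mathcal{L}}{\partial \lambda_k} = 0 &amp;amp;\implies \lambda_k = \frac{1}{\nu}\sum_{i=1}^{n}\alpha_{i,k}^{t} \\
\sum_{k=1}^{K}\lambda_k = 1 &amp;amp;\implies \nu = \sum_{i=1}^{n}\sum_{k=1}^{K}\alpha_{i,k}^{t} = n
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;Plugging $\nu = n$ back into the second line gives the update of $\lambda_k^{t+1}$ stated above.&lt;/p&gt;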

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this article we have presented the Expectation-Maximization algorithm and its close connection with variational methods. While its name suggests otherwise, this algorithm is simply a form of alternating optimization of the Variational Free Energy, with respect to the model parameters on the one hand, and a variational distribution defined on the latent space on the other hand.&lt;/p&gt;

&lt;p&gt;It can be noted that although this algorithm is described in the context where we optimize a log-likelihood, it applies more generally to &lt;em&gt;any setup&lt;/em&gt; where the objective function involves integrating over a latent variable.&lt;/p&gt;

&lt;ol class=&quot;bibliography&quot;&gt;&lt;li&gt;&lt;span id=&quot;Dempster1977&quot;&gt;Dempster, A. P., Laird, N. M., &amp;amp; Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data Via the EM Algorithm . In &lt;i&gt;Journal of the Royal Statistical Society: Series B (Methodological)&lt;/i&gt; (Vol. 39, Number 1, pp. 1–22). https://doi.org/10.1111/j.2517-6161.1977.tb01600.x&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;
</description>
        <pubDate>Mon, 14 Mar 2022 00:00:00 -0400</pubDate>
        <link>/articles/22/expectation-maximization</link>
        <guid isPermaLink="true">/articles/22/expectation-maximization</guid>
        
        
      </item>
    
      <item>
        <title>Variational Methods</title>
        <description>&lt;p&gt;In this post, I give an overview of variational methods in the context of Bayesian inference.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;We consider a setup where we observe a random variable $x\in\mathcal{X}$ that is conditioned on an unobserved variable $z \in \mathcal{Z}$. Examples of such setups include Latent Dirichlet allocation or Latent space models for graphs.&lt;/p&gt;

&lt;h1 id=&quot;bayesian-setting&quot;&gt;Bayesian Setting&lt;/h1&gt;

&lt;p&gt;We adopt a Bayesian view, where we provide a prior distribution $p(z)$ on the hidden variable $z$.&lt;/p&gt;

&lt;p&gt;We would like to perform inference on the variable $z$ conditioned on the observation of $x$, namely our goal is to find a posterior distribution $p(z\vert x)$.&lt;/p&gt;

&lt;p&gt;Using Bayes’ rule, the latter is given by:&lt;/p&gt;

&lt;p&gt;$$p(z\vert x) = \frac{p(x\vert z) p(z)}{p(x)}$$&lt;/p&gt;

&lt;h2 id=&quot;the-evidence-and-its-intractability&quot;&gt;The evidence and its intractability&lt;/h2&gt;

&lt;p&gt;Evaluating the posterior above involves evaluating the denominator, also called the &lt;em&gt;evidence&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;$$p(x) = \int\limits p(x\vert z) p(z) dz$$&lt;/p&gt;

&lt;p&gt;This evaluation requires integrating over a high-dimensional latent space. In some cases the integrand $p(x\vert z) p(z)$ might adopt a nice form, making the integral tractable, possibly in closed form. However, in the general case, computing this high-dimensional integral is difficult.&lt;/p&gt;

&lt;h2 id=&quot;a-possible-approach-monte-carlo-markov-chain&quot;&gt;A possible approach: Markov Chain Monte Carlo&lt;/h2&gt;

&lt;p&gt;In order to tackle the intractability of the evidence, a traditional method involves approximating this integral by sampling from a Markov Chain, and using the obtained samples ($z_1,…,z_n$) to compute a Monte Carlo estimate of the form:&lt;/p&gt;

&lt;p&gt;$$p(x) \approx \frac{1}{n}\sum\limits_{i=1}^{n} p(x\vert z_i) p(z_i) $$&lt;/p&gt;

&lt;p&gt;In the most common approaches (for instance the Metropolis-Hastings algorithm), the Markov transitions only require evaluating the numerator $p(x \vert z_i)p(z_i)$, and under some hypotheses the Markov chain is guaranteed to cover the latent space after a certain number of iterations.&lt;/p&gt;

&lt;p&gt;While this approach allows us to estimate the exact posterior distribution, it suffers from the curse of dimensionality, since the number of samples required to get a good Monte-Carlo estimate scales exponentially with the latent space dimension.&lt;/p&gt;

&lt;h1 id=&quot;variational-inference&quot;&gt;Variational inference&lt;/h1&gt;

&lt;p&gt;In order to counter the effects of dimensionality, a different approach is to compute an approximation of the posterior.&lt;/p&gt;

&lt;p&gt;As we will see, such an approximation casts the inference problem into an optimization problem, where the optimization variable is a density function. The term &lt;em&gt;variational&lt;/em&gt; comes from the fact that we use a function $q$ as the optimization variable in that formulation.&lt;/p&gt;

&lt;h2 id=&quot;jensens-inequality-and-the-elbo&quot;&gt;Jensen’s inequality and the ELBO&lt;/h2&gt;

&lt;p&gt;This is done by using Jensen’s inequality: for any positive density $z \mapsto q(z)$ we have:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}\log(p(x)) &amp;amp;= \log\left(\int\limits \frac{p(x, z)}{q(z)} q(z) dz\right)\\ &amp;amp;\geq
\int\limits \log\left(\frac{p(x, z)}{q(z)}\right) q(z) dz \\ &amp;amp;=
F(x, q)
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;where we define the functional $$q \mapsto F(x, q) = \int\limits \log\left(\frac{p(x, z)}{q(z)}\right) q(z) dz.$$ $F$ is commonly known as the ELBO (Evidence Lower BOund) in the variational inference literature. As we can see, it is a function of both the observation $x$ and the density $q$.&lt;/p&gt;
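
&lt;p&gt;A common way to read this functional, sketched here as a side computation that only uses $p(x,z)=p(x\vert z)p(z)$, is to split it into a data-fit term and a term that keeps $q$ close to the prior. The notation $\mathbb{E}_{q}$ for the expectation under $q$ is introduced here for illustration:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
F(x, q) &amp;amp;= \int\limits \log(p(x\vert z))\, q(z) dz - \int\limits \log\left(\frac{q(z)}{p(z)}\right) q(z) dz \\ &amp;amp;=
\mathbb{E}_{q}\left[\log(p(x\vert z))\right] - \int\limits \log\left(\frac{q(z)}{p(z)}\right) q(z) dz
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;The second integral is the Kullback-Leibler divergence between $q$ and the prior $p(z)$, so a density $q$ scores well when it explains the observation while staying close to the prior.&lt;/p&gt;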

&lt;h2 id=&quot;link-with-the-kullback-leibler-divergence&quot;&gt;Link with the Kullback-Leibler divergence&lt;/h2&gt;

&lt;p&gt;The Kullback-Leibler divergence between the variational density $q$ and the posterior distribution $p(. \vert x)$ writes:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
KL(q\vert \vert p(.\vert x)) &amp;amp;=
\int\limits \log(\frac{q(z)}{p(z \vert x)}) q(z) dz \\ &amp;amp;=
\int\limits \log(\frac{p(x)q(z)}{p(x,z)}) q(z) dz \\ &amp;amp;=
\log(p(x)) - \int\limits \log(\frac{p(x,z)}{q(z)}) q(z) dz \\ &amp;amp;=
\log(p(x)) - F(x,q)
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;Thus, the ELBO can be rewritten as $$F(x,q) = \log(p(x)) - KL(q\vert \vert p(.\vert x)).$$&lt;/p&gt;

&lt;p&gt;Since $\log(p(x))$ does not depend on $q$, maximizing $F$ with respect to $q$ is the same as minimizing the divergence between $q$ and the posterior distribution.&lt;/p&gt;

&lt;h2 id=&quot;approximating-the-posterior-distribution&quot;&gt;Approximating the posterior distribution&lt;/h2&gt;

&lt;p&gt;$\newcommand{\PZ}{\mathcal{P}(\mathcal{Z})}$ Let $\PZ$ denote the set of all possible densities defined on the latent space $\mathcal{Z}$. The previous formula gives us a variational definition of the posterior:&lt;/p&gt;

&lt;p&gt;$$
p(. \vert x) = \underset{q \in \PZ}{\operatorname{argmax}} \; F(x, q)
$$&lt;/p&gt;

&lt;p&gt;In variational inference we approximate this true posterior by instead optimizing over a subset of $\PZ$, denoted $\newcommand{\Q}{\mathcal{Q}}\Q$:&lt;/p&gt;

&lt;p&gt;$$
p(. \vert x) \approx \underset{q \in \Q}{\operatorname{argmax}} \; F(x, q)
$$&lt;/p&gt;

&lt;p&gt;For instance, $\mathcal{Q}$ is often taken to be a set of Gaussian distributions on $\mathcal{Z}$.&lt;/p&gt;

&lt;p&gt;Using different tricks (e.g. the Mean-Field approximation) allows this family of methods to scale better than Monte-Carlo estimation, but in contrast it doesn’t yield an estimate of the exact posterior.&lt;/p&gt;
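
&lt;p&gt;To illustrate the Mean-Field approximation mentioned above, here is a sketch of the standard construction, assuming that the latent variable splits into $m$ blocks $z=(z_1,…,z_m)$. The number of blocks $m$, the factors $q_j$ and the notation $q_{-j}=\prod_{l\neq j}q_l$ are introduced only for this example. The variational family is restricted to fully factorized densities, and maximizing $F$ with respect to a single factor while keeping the others fixed gives the optimal factor $q_j^{*}$ up to an additive constant:&lt;/p&gt;

&lt;p&gt;$$
q(z) = \prod_{j=1}^{m} q_j(z_j), \qquad \log(q_j^{*}(z_j)) = \mathbb{E}_{q_{-j}}\left[\log(p(x, z))\right] + \text{const}
$$&lt;/p&gt;

&lt;p&gt;Cycling through the factors with this update only involves expectations over the other blocks, which is one reason why this family can scale to high-dimensional latent spaces.&lt;/p&gt;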

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;In this article we have seen how to use variational inference to approximate the posterior distribution in models having unobserved variables.&lt;/p&gt;

&lt;p&gt;Note that the hidden variables $z$ can be root nodes in the graphical model, for instance in the case where $z$ are the parameters of the model, or interior nodes, as is the case for instance in Variational Autoencoders. In the latter case the ELBO is used as a computational vehicle to backpropagate to the parameters of a neural network, using the reparameterization trick.&lt;/p&gt;

</description>
        <pubDate>Tue, 01 Mar 2022 00:00:00 -0500</pubDate>
        <link>/articles/22/variational_methods</link>
        <guid isPermaLink="true">/articles/22/variational_methods</guid>
        
        
      </item>
    
      <item>
        <title>A visualization of latent space distance models for Graphs</title>
        <description>&lt;p&gt;In this article I give a visualization of latent space distance models for graphs, and of how they allow us to disentangle the metric structure of the graph from prior information such as node/edge attributes.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Latent space distance models for graphs form a broad class of statistical models for graphs. In what follows we give a visual explanation of the mechanisms that allow these models to disentangle different effects in the network generation process.&lt;/p&gt;

&lt;h1 id=&quot;latent-space-models-for-graphs-lsm&quot;&gt;Latent space models for graphs (LSM)&lt;/h1&gt;

&lt;p&gt;$\newcommand{\Zcal}{\mathcal{Z}}$ Given a graph $G=(U,E)$ where $U$ is a set of nodes and $E\subset U\times U$ is the set of edges of $G$, an LSM assumes for each node $i\in U$ the existence of an underlying latent representation $z_{i}$ in a metric space $\Zcal$. We denote by $y_{ij}$ the (binary) indicator that the edge $ij$ is in $E$.&lt;/p&gt;

&lt;p&gt;Then, we suppose that the edges $ij$ in the network are independently generated from Bernoulli distributions $Bern(\theta_{ij})$, such that: $$\theta_{ij} = f(z_i, z_j)$$ for a certain similarity function $f$.&lt;/p&gt;

&lt;p&gt;Examples of Latent Space models include the stochastic block model, where the embeddings are discrete, the graphon model &lt;a class=&quot;citation&quot; href=&quot;#Lovsz2012LargeNA&quot;&gt;(Lovász, 2012)&lt;/a&gt;, and the Latent space distance model introduced in &lt;a class=&quot;citation&quot; href=&quot;#Hoff2002&quot;&gt;(Hoff et al., 2002)&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;latent-space-distance-models-lsdm&quot;&gt;Latent space distance models (LSDM)&lt;/h1&gt;

&lt;p&gt;Here we focus on the generic latent space distance models, presented in &lt;a class=&quot;citation&quot; href=&quot;#Hoff2002&quot;&gt;(Hoff et al., 2002)&lt;/a&gt;. In those models, the link probability is obtained by passing a logit through an activation function $h$ (usually the sigmoid function):&lt;/p&gt;

&lt;p&gt;$$y_{ij} \sim Bernoulli(\theta_{ij}) $$
 $$\theta_{ij} = h(2\gamma +\alpha_i + \alpha_j+  \lambda^Tx_{ij} - d(z_i,z_j))$$&lt;/p&gt;

&lt;p&gt;where $\alpha_i$ and $\alpha_j$ are &lt;em&gt;sociality&lt;/em&gt; parameters, $x_{ij}$ are predefined edge features with coefficient vector $\lambda$, $\gamma$ is a bias parameter and $d$ is a distance, such as the Euclidean distance.&lt;/p&gt;

&lt;!-- While using a similarity measure that is not a distance can also lead to interesting models, here we suppose that $d$ is the euclidean distance --&gt;

&lt;h3 id=&quot;deterministic-version-of-the-random-graphs-above&quot;&gt;Deterministic version of the random graphs above.&lt;/h3&gt;

&lt;p&gt;In order to explain geometrically how LSDMs disentangle these effects, a possible approach is to make the (random) Bernoulli edge variables deterministic by changing the link function. Indeed, the sigmoid function is a smooth version of a discontinuous function, the Heaviside step function, given by $h(x) = \mathbb{1}_{\{x&amp;gt;0\}}$. The latter yields an activation equal to 1 for positive inputs and 0 for negative inputs.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/assets/img/sigmoid_vs_heaviside.png&quot; /&gt;&lt;figcaption class=&quot;maincolumn-figure&quot;&gt;&lt;em&gt;The Heaviside function in red, and the sigmoid function in green&lt;/em&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The deterministic graph is given by the following link indicators:&lt;/p&gt;

&lt;p&gt;$\newcommand{\ind}{\mathbb{1}}$&lt;/p&gt;

&lt;p&gt;$$
y_{ij} = \ind_{ d(z_i,z_j) \leq 2\gamma +\alpha_i + \alpha_j+  \lambda^Tx_{ij}}
$$&lt;/p&gt;

&lt;p&gt;This one has a natural visual interpretation, as shown in the following image.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/assets/img/cne_deg1.png&quot; /&gt;&lt;figcaption class=&quot;maincolumn-figure&quot;&gt;Disks associated with each node&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;As can be seen, each embedding $z_i$ is endowed with a disk $D_i$ of radius $\alpha_i+\gamma$, and nodes $i$ and $j$ connect whenever the gap between $D_i$ and $D_j$ is at most $\lambda^T x_{ij}$. If a given node has a large disk, it will naturally form more connections, independently of the position of the disk center.&lt;/p&gt;

&lt;p&gt;Moreover, if the prior similarity between nodes $i$ and $j$ is high, then the disks need not be close for the connection to form. As a consequence, the embeddings will not encode the prior information contained in the term $\lambda^T x_{ij}$.&lt;/p&gt;
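
&lt;p&gt;The disk picture can be read directly off the link indicator. As a short sketch, write $r_i = \alpha_i + \gamma$ for the radius of $D_i$ (the symbol $r_i$ is shorthand introduced here) and regroup the terms of the connection condition:&lt;/p&gt;

&lt;p&gt;$$
d(z_i,z_j) \leq 2\gamma +\alpha_i + \alpha_j+ \lambda^Tx_{ij} \iff d(z_i,z_j) - r_i - r_j \leq \lambda^Tx_{ij}
$$&lt;/p&gt;

&lt;p&gt;The left-hand side of the second inequality is the signed gap between the two disks (it is negative when they overlap), so the prior term $\lambda^T x_{ij}$ acts as the largest gap that still produces an edge.&lt;/p&gt;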

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this article we focused on latent space distance models, and provided a visual interpretation of the mechanism that allows these models to learn vector representations of nodes that do not encode information known in advance in the form of node and edge attributes.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ol class=&quot;bibliography&quot;&gt;&lt;li&gt;&lt;span id=&quot;Lovsz2012LargeNA&quot;&gt;Lovász, L. M. (2012). Large Networks and Graph Limits. &lt;i&gt;Colloquium Publications&lt;/i&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;Hoff2002&quot;&gt;Hoff, P. D., Raftery, A. E., &amp;amp; Handcock, M. S. (2002). Latent space approaches to social network analysis. &lt;i&gt;Journal of the American Statistical Association&lt;/i&gt;, &lt;i&gt;97&lt;/i&gt;(460), 1090–1098. https://doi.org/10.1198/016214502388618906&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;
</description>
        <pubDate>Tue, 01 Feb 2022 00:00:00 -0500</pubDate>
        <link>/articles/22/latent-space-viz</link>
        <guid isPermaLink="true">/articles/22/latent-space-viz</guid>
        
        
      </item>
    
      <item>
        <title>Conditional Network Embedding, a Latent Space Distance perspective</title>
        <description>&lt;p&gt;In this article I summarize the Conditional Network Embedding model, and underline its connection with the broader class of Latent Space Distance models for graphs.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Conditional Network Embedding (CNE) &lt;a class=&quot;citation&quot; href=&quot;#KangLB19&quot;&gt;(Kang et al., 2019)&lt;/a&gt; is a node embedding method for graphs that has been successfully applied to visualization and prediction. It allows the user to generate node embeddings that respect the network structure, while factoring out prior knowledge. Applications of this include visualizing the nodes in a network without representing undesired effects, such as having the high degree nodes concentrated in the center of the embedding space. The resulting embeddings can also be used to predict links while controlling the influence of sensitive node attributes on the predictions. This is of great interest for producing fair link predictions on social networks, such as in &lt;a class=&quot;citation&quot; href=&quot;#buyl20a&quot;&gt;(Buyl &amp;amp; De Bie, 2020)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In what follows we aim to give a comprehensive view of the underlying mechanisms that make CNE good at producing embeddings that factor out prior information.&lt;/p&gt;

&lt;!--

In what follows we express the Conditional Network Embeddings model as a
statistical model for which the parameter space is the cartesian product
of the space of embedding matrices and regression parameters w.r.t. edge
features \$\$f_{ij}\$\$. --&gt;

&lt;h1 id=&quot;conditional-network-embedding&quot;&gt;Conditional network embedding&lt;/h1&gt;

&lt;p&gt;Conditional Network Embedding is a graph embedding method.&lt;/p&gt;

&lt;p&gt;Given an undirected graph $G=(U,E)$, where $U$ is the set of nodes and $E\subset U\times U$ is the set of edges, it yields a mapping from the set of nodes to a $d$-dimensional space:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
CNE \colon U &amp;amp;\rightarrow &amp;amp; \mathbb{R}^d \\ u &amp;amp;\mapsto &amp;amp; z_u
\end{aligned}
$$&lt;/p&gt;

&lt;h1 id=&quot;factoring-out-prior-information-in-embeddings&quot;&gt;Factoring out prior information in embeddings&lt;/h1&gt;

&lt;p&gt;$\newcommand{\norm}[1]{\vert \vert #1 \vert \vert }$ In CNE, we suppose that we have encoded our prior expectations about an observed graph $\hat{G}$ into a MaxEnt distribution (see &lt;a href=&quot;//articles/20/maxent&quot;&gt;my post about MaxEnt&lt;/a&gt; or the paper &lt;a class=&quot;citation&quot; href=&quot;#debie2010maximum&quot;&gt;(Bie, 2010)&lt;/a&gt;). Moreover, we suppose that each node $i \in U$ is represented by an (unknown) embedding vector $z_i \in \mathbb{R}^d$, and that for two nodes $i \neq j$, their connection only depends on the embeddings through the Euclidean distance $d_{ij} = \norm{z_i-z_j}$.&lt;/p&gt;

&lt;p&gt;Based on that, CNE uses Bayes’ rule to define the link probability conditioned on the MaxEnt distribution:&lt;/p&gt;

&lt;p&gt;$$
P_{ij}(a_{ij}|z_i, z_j)= \frac{
\mathcal{N}_{+}(d_{ij} | s(a_{ij}))
P_{ij}(a_{ij})
}{
\sum\limits_{a \in \{0,1\}}
\mathcal{N}_{+}(d_{ij} | s(a))
P_{ij}(a)
}
$$&lt;/p&gt;

&lt;p&gt;where&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;$d_{ij} = \vert\vert z_i-z_j \vert\vert$ is the Euclidean distance between embeddings $z_i$ and $z_j$.&lt;/li&gt;
  &lt;li&gt;$\mathcal{N}_{+}(d\vert s(a))$ denotes a half-normal density with spread parameter $s(a)$.&lt;/li&gt;
  &lt;li&gt;$s$ is a spread function such that $s_0=s(0)&amp;gt;s(1)=s_1$.&lt;/li&gt;
  &lt;li&gt;$P_{ij}(a)$ is the MaxEnt prior Bernoulli distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, CNE postulates a distribution over the distance between embeddings, such that the distances between embeddings of non-edges are more spread out around 0 than those of edges.&lt;/p&gt;

&lt;p&gt;Finally, the probability of the full graph $G$ is defined as the product of the independent link probabilities:&lt;/p&gt;

&lt;p&gt;$$
P(G\vert Z) =\prod_{i\neq j}P_{ij}(a_{ij}|z_i, z_j)
$$&lt;/p&gt;

&lt;h3 id=&quot;retrieving-the-link-bernoulli-probabilities&quot;&gt;Retrieving the link Bernoulli probabilities&lt;/h3&gt;

&lt;p&gt;As seen before, the full likelihood of a graph under the CNE model can be written as a product of independent probabilities, one for each node pair. As the link indicator $a_{ij}$ between each node pair $ij$ is a Bernoulli random variable, one can transform the expression in order to retrieve the Bernoulli probabilities.&lt;/p&gt;

&lt;p&gt;Indeed, it can be shown that the link probabilities can be rewritten as: $$P_{ij}(a_{ij} \vert z_i, z_j) =  Q_{ij}^{a_{ij}}(1-Q_{ij})^{(1-a_{ij})}$$&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;$Q_{ij} = \sigma \left(\alpha + \lambda^Tf_{ij} - \beta \frac{d_{ij}^2}{2} \right)$&lt;/li&gt;
  &lt;li&gt;$\alpha=\log(\frac{s_0}{s_1})$ is a positive constant, since $s_0&amp;gt;s_1$.&lt;/li&gt;
  &lt;li&gt;$\beta=(\frac{1}{s_1^2} - \frac{1}{s_0^2}) \geq 0$ is a scaling constant.&lt;/li&gt;
  &lt;li&gt;$\sigma$ denotes the sigmoid function.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;proof&quot;&gt;Proof&lt;/h4&gt;

&lt;p&gt;In order to retrieve this form, we first recall the form of the Half-Normal distribution:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
p_{\mathcal{N}_{+}(.\vert s)}(d) = \sqrt{\frac{2}{\pi s^2}} \exp\left(- \frac{d^2}{2 s^2}\right)
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;Moreover, the MaxEnt prior distribution writes:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
P_{ij}(a_{ij})=\frac{\exp(a_{ij}\lambda^Tf_{ij}(G))}{1+\exp(\lambda^Tf_{ij}(G))}
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;Since $P_{ij}(a_{ij} \vert z_i, z_j)$ is a Bernoulli probability, we have $Q_{ij} = P_{ij}(1 \vert z_i, z_j)$.&lt;/p&gt;

&lt;p&gt;Substituting $a_{ij}=1$ in the expression of $P_{ij}(a_{ij} \vert z_i, z_j)$ and simplifying gives:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}Q_{ij}=&amp;amp; \frac{
\sqrt{\frac{2}{\pi s_1^2}}
\exp(- \frac{d_{ij}^2}{2 s_1^2} + \lambda^Tf_{ij})
}{
\sqrt{\frac{2}{\pi s_1^2}}
\exp(- \frac{d_{ij}^2}{2 s_1^2} + \lambda^Tf_{ij}) +
\sqrt{\frac{2}{\pi s_0^2}}
\exp(- \frac{d_{ij}^2}{2 s_0^2})
} \\ = &amp;amp;
\frac{1}{
  1 +
\exp\left(- \frac{d_{ij}^2}{2}(\frac{1}{s_0^2} - \frac{1}{s_1^2}) - \lambda^Tf_{ij} - \log(\frac{s_0}{s_1})\right)
} \\ =&amp;amp;
\sigma\left(\lambda^Tf_{ij} + \log(\frac{s_0}{s_1}) - \frac{d_{ij}^2}{2}(\frac{1}{s_1^2} - \frac{1}{s_0^2})\right)
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;where $\sigma:x \mapsto \frac{1}{1+\exp(-x)}$ is the sigmoid function.&lt;/p&gt;

&lt;h3 id=&quot;connection-with-latent-space-models-for-graphs&quot;&gt;Connection with Latent space models for graphs&lt;/h3&gt;

&lt;p&gt;As we see, the independent link logits in CNE are obtained by subtracting the scaled squared distance between embeddings from the sum of the prior term and a constant bias: $$\operatorname{logit}(Q_{ij})=C+ \lambda^Tf_{ij} - D \, d_{ij}^2$$ where $C= \log(\frac{s_0}{s_1})$ and $D=\frac{1}{2}(\frac{1}{s_1^2} - \frac{1}{s_0^2})$.&lt;/p&gt;

&lt;p&gt;(The logit is defined as the inverse of the sigmoid function, $\operatorname{logit}(p)=\log(\frac{p}{1-p})$, so that $\sigma(\operatorname{logit}(p))=p$ and $\operatorname{logit}(\sigma(x))=x$.)&lt;/p&gt;

&lt;p&gt;Intuitively, the term $\lambda^Tf_{ij}$ encodes a prior similarity value between $i$ and $j$ that doesn’t need to be represented by a small distance between the embeddings $z_i$ and $z_j$.&lt;/p&gt;

&lt;p&gt;This type of statistical model has been studied in a variety of previous work, under the name of Latent Space Distance Models &lt;a class=&quot;citation&quot; href=&quot;#Hoff2002&quot;&gt;(Hoff et al., 2002; Turnbull &amp;amp; Hons, 2019; Ma et al., 2020)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The common principle of this type of method is to use the latent distance between vector representations as a sufficient statistic for the link indicator variable.&lt;/p&gt;

&lt;h1 id=&quot;example-with-the-degree-and-edge-features-as-prior&quot;&gt;Example with the degree and edge features as prior.&lt;/h1&gt;

&lt;p&gt;Here we give an example of a CNE model where we retrieve the Bernoulli probabilities $Q_{ij}$ given some prior statistics.&lt;/p&gt;

&lt;p&gt;We consider a simple example of CNE, where the MaxEnt statistics used are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The degree of each node $i$: $f_i^{(degree)}(G) = \sum\limits_{j\in \mathcal{N}(i)} a_{ij}$ where $\mathcal{N}(i)$ is the set of neighbors of $i$. This leads to $n$ statistics at the graph level. For each edge $ij$, the corresponding edge-level statistics vector $f_{ij}$ is given by $E_i^n + E_j^n$, where for each node $i$, $E_i^n$ is the $n$-dimensional one-hot encoding of the node $i$. Denoting by $\alpha \in \mathbb{R}^{n}$ the vector of coefficients associated with these degree statistics, the corresponding logit value is equal to $$\alpha^Tf_{ij}=\alpha_i + \alpha_j$$&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Some edge-level features $x_{ij}$. We denote by $\theta$ the associated coefficient vector, and the logit values coming from it are equal to: $$\theta^T x_{ij}$$&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So by stacking all these features, we get the following prior term:&lt;/p&gt;

&lt;p&gt;$$\lambda^Tf_{ij}=\alpha_i + \alpha_j + \theta^T x_{ij}$$&lt;/p&gt;

&lt;p&gt;The CNE Bernoulli probabilities are thus equal to:&lt;/p&gt;

&lt;p&gt;$$Q_{ij} = \sigma \left(C + \alpha_i + \alpha_j + \theta^T x_{ij} - D. d_{ij}^2 \right) $$&lt;/p&gt;
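
&lt;p&gt;To get a feeling for the orders of magnitude, here is a small numerical illustration. The spread values $s_0=2$ and $s_1=1$, the distances and the values of the prior term are hypothetical and chosen only for this example; the results are rounded to two decimals:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
C &amp;amp;= \log(2) \approx 0.69, \qquad D = \frac{1}{2}\left(1 - \frac{1}{4}\right) = 0.375 \\
d_{ij} = 1,\ \alpha_i + \alpha_j + \theta^T x_{ij} = 0 &amp;amp;: \quad Q_{ij} = \sigma(0.69 - 0.375) \approx 0.58 \\
d_{ij} = 2,\ \alpha_i + \alpha_j + \theta^T x_{ij} = 0 &amp;amp;: \quad Q_{ij} = \sigma(0.69 - 1.5) \approx 0.31 \\
d_{ij} = 2,\ \alpha_i + \alpha_j + \theta^T x_{ij} = 1 &amp;amp;: \quad Q_{ij} = \sigma(0.69 + 1 - 1.5) \approx 0.55
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;In this toy setting, a prior term equal to $1$ brings a pair at distance $2$ back to roughly the link probability of a pair at distance $1$ with no prior term, which is the sense in which the prior can stand in for closeness in the embedding space.&lt;/p&gt;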

&lt;!--
# Visual explanation

In order to geometrically explain how CNE factors out prior knowledge, a possible approach is to imagine the (random) edges as Bernoulli random variables, to make them deterministic variables conditioned on the embeddings.

### Deterministic version of the random graphs above.

The sigmoid function is a smooth version of a non-continuous function, the Heaviside step function, given by $h(x) = \mathbb{1}_{\{x&gt;0\}}$.
This one yields an activation equal to 1 for positive inputs and 0 for negative inputs.

![Heaviside](/figures/sigmoid_vs_heaviside.png)
_The heaviside function in red, and the sigmoid function in green_

Let's consider a CNE model, where we use as constraints the degrees of each nodes, as well as other features.
The CNE expression looks like:


\$\$
Q_{ij} = \sigma \left(2 \gamma +\alpha_i + \alpha_j+ \theta^T x_{ij} - \vert\vert z_i-z_j\vert\vert^2 \right)
\$\$

In the deterministic CNE expression, the link indicators would then look like:

\$\$
a_{ij} =h\left(2\gamma +\alpha_i + \alpha_j+ \theta^Tx_{ij} - \vert\vert z_i-z_j\vert\vert \right)
\$\$

This has a natural visual interpretation, as shown in the following image
![CNE-DEG](/figures/cne_deg1.png)

As can be seen, each embedding $z_i$ is endowed with a disk $D_i$of radius $\alpha_i+\gamma$ such that the minimum distance between $D_i$ and $D_j$ in order for the nodes to connect is $\theta^T x_{ij}$.

If the prior similarity is high, the the disk need not be too close for the connection to form. As a consequence, the embeddings will not encode the prior information. --&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We have seen that writing the posterior distribution of CNE as a product of Bernoulli distributions, and identifying the Bernoulli parameters, allows us to express the CNE model as a Latent Space model for graphs. Such an observation is useful to analyze the theoretical properties (consistency, convergence bounds) of the models, as well as to generalize the approach to different types of graphs (weighted or temporal graphs, for instance).&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ol class=&quot;bibliography&quot;&gt;&lt;li&gt;&lt;span id=&quot;KangLB19&quot;&gt;Kang, B., Lijffijt, J., &amp;amp; Bie, T. D. (2019). Conditional Network Embeddings. &lt;i&gt;7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019&lt;/i&gt;. https://openreview.net/forum?id=ryepUj0qtX&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;buyl20a&quot;&gt;Buyl, M., &amp;amp; De Bie, T. (2020). DeBayes: a Bayesian Method for Debiasing Network Embeddings. &lt;i&gt;Proceedings of the 37th International Conference on Machine Learning&lt;/i&gt;, &lt;i&gt;119&lt;/i&gt;, 1220–1229. https://proceedings.mlr.press/v119/buyl20a.html&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;debie2010maximum&quot;&gt;Bie, T. D. (2010). &lt;i&gt;Maximum entropy models and subjective interestingness: an application to tiles in binary databases&lt;/i&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;Hoff2002&quot;&gt;Hoff, P. D., Raftery, A. E., &amp;amp; Handcock, M. S. (2002). Latent space approaches to social network analysis. &lt;i&gt;Journal of the American Statistical Association&lt;/i&gt;, &lt;i&gt;97&lt;/i&gt;(460), 1090–1098. https://doi.org/10.1198/016214502388618906&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;Turnbull2019&quot;&gt;Turnbull, K. R., &amp;amp; Hons, M. (2019). &lt;i&gt;Advancements in Latent Space Network Modelling&lt;/i&gt;. &lt;i&gt;December&lt;/i&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;Ma2020a&quot;&gt;Ma, Z., Ma, Z., &amp;amp; Yuan, H. (2020). Universal latent space model fitting for large networks with edge covariates. &lt;i&gt;Journal of Machine Learning Research&lt;/i&gt;, &lt;i&gt;21&lt;/i&gt;, 1–67.&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;
</description>
        <pubDate>Sun, 01 Nov 2020 00:00:00 -0400</pubDate>
        <link>/articles/20/cne_latent</link>
        <guid isPermaLink="true">/articles/20/cne_latent</guid>
        
        
      </item>
    
      <item>
        <title>Maximum Entropy models for Graphs</title>
        <description>&lt;p&gt;In this post, we give an overview of Maximum Entropy models for graphs, as presented in previous work &lt;a class=&quot;citation&quot; href=&quot;#debie2010maximum&quot;&gt;(Bie, 2010)&lt;/a&gt; and &lt;a class=&quot;citation&quot; href=&quot;#adriaens&quot;&gt;(Adriaens et al., 2017)&lt;/a&gt;. We show how these models can be used to derive prior distributions on graphs.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h3 id=&quot;introduction&quot;&gt;Introduction&lt;/h3&gt;

&lt;p&gt;Many real-world phenomena can be (at least partially) described in the form of networks. Examples include social networks, user behavior online, neurons in the brain, ecological networks etc…&lt;/p&gt;

&lt;p&gt;However, while the set of all possible networks with a given number of nodes $n$ is very large ($2^{\frac{n(n-1)}{2}}$), real-world networks lie in a very small subset of these, meaning that the &lt;em&gt;majority&lt;/em&gt; of possible networks have a negligible probability of occurring in practice.&lt;/p&gt;

&lt;p&gt;While defining the prior probability of every possible graph is infeasible due to their huge number, one can easily define prior expectations on the &lt;em&gt;properties&lt;/em&gt; of this graph, depending on its nature.&lt;/p&gt;

&lt;p&gt;For instance, one might have an idea of the total number of links, or of the number of links &lt;em&gt;per node&lt;/em&gt; (their degree).&lt;/p&gt;

&lt;p&gt;In a social network, one might have prior expectations about the number of links connecting any two communities. Indeed, the fact that people from the same community tend to connect more than people from different ones is commonly observed in real-world social networks and often referred to as &lt;em&gt;homophily&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Based on these prior expectations about the structural properties of the graph, the &lt;em&gt;Maximum Entropy&lt;/em&gt; (MaxEnt) principle can be used to cast these expectations into a fully-fledged probability distribution on the combinatorial space of all possible graphs.&lt;/p&gt;

&lt;h1 id=&quot;formalizing-prior-expectations&quot;&gt;Formalizing prior expectations&lt;/h1&gt;

&lt;p&gt;In this paragraph, we describe mathematically the aforementioned &lt;em&gt;prior expectations&lt;/em&gt;. To do this let’s first introduce some notations.&lt;/p&gt;

&lt;h2 id=&quot;notations&quot;&gt;Notations&lt;/h2&gt;

&lt;p&gt;$\newcommand{\Gcal}{\mathcal{G}}$ $\newcommand{\R}{\mathbb{R}}$ Let $U$ be a set of nodes of size $n$. A graph is a tuple $G=(U, E)$ where $E\subset U \times U$ is the set of edges of the graph. For a fixed set of nodes $U$, we denote by $\Gcal$ the set of possible undirected graphs connecting the nodes in $U$. Each graph $G\in\Gcal$ can be fully described by its adjacency matrix: $A=(a_{ij})\in \{0,1\}^{n\times n}$, such that $a_{ij}=1$ if and only if the nodes $i$ and $j$ are connected. For each node $i \in U$, we denote by $\mathcal{N}(i)$ its set of neighbors. We denote by $\newcommand{\PG}{\mathcal{P}(\mathcal{G})}$ $\PG$ the set of graph distributions, i.e. the set of all probability distributions on the set of graphs.&lt;/p&gt;

&lt;h2 id=&quot;prior-statistics&quot;&gt;Prior statistics&lt;/h2&gt;

&lt;p&gt;The properties of a graph can be expressed as &lt;em&gt;graph statistics&lt;/em&gt;, which are measurable functions taking a graph as input and yielding a real number:&lt;/p&gt;

&lt;p&gt;$$
\begin{align} f&amp;amp;: &amp;amp;G  &amp;amp;\mapsto&amp;amp; &amp;amp;f(G)\\ &amp;amp;&amp;amp;\Gcal &amp;amp;\rightarrow&amp;amp; &amp;amp;\R
\end{align}
$$&lt;/p&gt;

&lt;p&gt;Examples of such statistics include for instance:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The degree of each node $i$: $$f_i^{(degree)}(G) = \sum\limits_{j\in U} a_{ij} = |\mathcal{N}(i)|$$&lt;/li&gt;
  &lt;li&gt;The number of connections between two node subsets $W, W' \subset U$: $$f_{W,W'}^{(block)}(G) = \sum\limits_{(i,j) \in W \times W'} a_{ij}$$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we see, any graph property that can be mathematically computed as a real number can be defined as a graph statistic.&lt;/p&gt;
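&lt;p&gt;As a minimal sketch, the two statistics above can be computed directly from an adjacency matrix. The 4-node path graph and the function names &lt;code&gt;degree&lt;/code&gt; and &lt;code&gt;block_count&lt;/code&gt; are illustrative assumptions, not part of the original text.&lt;/p&gt;

```python
# Minimal sketch: graph statistics computed from an adjacency matrix.
# The example graph and the function names are illustrative assumptions.

def degree(A, i):
    """Degree of node i: sum of row i of the adjacency matrix."""
    return sum(A[i])

def block_count(A, W, W_prime):
    """Number of links between the node subsets W and W_prime."""
    return sum(A[i][j] for i in W for j in W_prime)

# Undirected path graph 0 - 1 - 2 - 3 (symmetric matrix, zero diagonal).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

degrees = [degree(A, i) for i in range(len(A))]
links_between = block_count(A, [0, 1], [2, 3])
print(degrees)        # [1, 2, 2, 1]
print(links_between)  # 1
```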

&lt;h2 id=&quot;expected-value-of-a-prior-statistics&quot;&gt;Expected value of a prior statistic&lt;/h2&gt;

&lt;p&gt;$\newcommand{\Ebb}{\mathbb{E}}$ For a given graph distribution $P \in \PG$, and a graph statistic $f$, one can define the expectation of this graph statistic as: $$\Ebb[f(G)]= \sum_{G\in \Gcal} f(G)P(G)$$&lt;/p&gt;

&lt;p&gt;This is the mathematical definition of what we mean when we expect a given property about the graph to have a certain value.&lt;/p&gt;

&lt;p&gt;The above value tells us what a given observer, whose subjectivity is encoded in the prior distribution $P$, expects the graph property $f$ to be.&lt;/p&gt;
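&lt;p&gt;For a very small node set, this expectation can be evaluated by brute-force enumeration. The sketch below assumes $n=3$ nodes, a uniform prior, and the number of links as the statistic; each graph is represented by its edge indicators over the unordered node pairs. All of these choices are illustrative assumptions.&lt;/p&gt;

```python
from itertools import combinations, product

# Illustrative assumptions: n = 3 nodes, a uniform prior P, and the
# statistic f(G) = number of links. A graph is the tuple of its edge
# indicators over the unordered node pairs.
n = 3
pairs = list(combinations(range(n), 2))              # (0,1), (0,2), (1,2)
graphs = list(product([0, 1], repeat=len(pairs)))    # all 2^3 = 8 graphs

def num_links(graph):
    return sum(graph)

P = {g: 1.0 / len(graphs) for g in graphs}           # uniform prior

# E[f(G)] = sum over all graphs of f(G) * P(G)
expected_links = sum(num_links(g) * P[g] for g in graphs)
print(expected_links)  # 1.5
```

&lt;p&gt;Enumeration is only feasible for tiny graphs, since the number of graphs doubles with every additional node pair.&lt;/p&gt;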


&lt;h1 id=&quot;maximum-entropy-models&quot;&gt;Maximum Entropy models&lt;/h1&gt;

&lt;p&gt;Supposing that we encode our prior expectations into $K$ statistics $f_1,\ldots,f_K$ where each $f_k$ is a real-valued graph function, then the maximum entropy principle can be used to convert those into a &lt;em&gt;prior distribution&lt;/em&gt; on the set of possible graphs.&lt;/p&gt;

&lt;h2 id=&quot;graph-distributions&quot;&gt;Graph distributions&lt;/h2&gt;

&lt;p&gt;A &lt;em&gt;graph distribution&lt;/em&gt; is a probability distribution defined on the set of graphs $\Gcal$. In other words, it can be identified with a function $P$ that gives, for each graph $G \in \Gcal$, the probability $P(G)$ of observing this particular graph.&lt;/p&gt;

&lt;h2 id=&quot;entropy-of-a-graph-distribution&quot;&gt;Entropy of a graph distribution&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;entropy&lt;/em&gt; of a graph distribution $P$ is defined as $$H(P) = -\sum_{G\in\Gcal} P(G)\log(P(G))$$&lt;/p&gt;

&lt;p&gt;This quantity measures the average amount of information provided by the observation of a graph, under the distribution $P$.&lt;/p&gt;

&lt;p&gt;For instance, if for a given observer all the graphs are equiprobable, the information provided by the observation of a graph is high. In other words this observer will be very &lt;em&gt;surprised&lt;/em&gt; on average by the observation.&lt;/p&gt;

&lt;p&gt;In contrast, an observer who gives a non-zero probability to a single graph $G_0$, and zero probability to all the other graphs, doesn’t get any information when observing a graph sampled from this prior: $H(P)=0$ in that case.&lt;/p&gt;
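&lt;p&gt;These two extreme cases can be worked out explicitly. As an assumption made only for this example, take $\Gcal$ to be the set of undirected graphs without self-loops on $n$ nodes, so that $|\Gcal| = 2^{n(n-1)/2}$, and use the convention $0 \log 0 = 0$:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
\text{uniform prior:}\quad H(P) &amp;amp;= -\sum_{G\in\Gcal} \frac{1}{|\Gcal|}\log\frac{1}{|\Gcal|} = \log|\Gcal| = \frac{n(n-1)}{2}\log 2\\ \text{prior concentrated on } G_0\text{:}\quad H(P) &amp;amp;= -1\cdot\log 1 = 0
\end{aligned}
$$&lt;/p&gt;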


&lt;h2 id=&quot;maximizing-the-entropy-under-statistics-based-constraints&quot;&gt;Maximizing the entropy under statistics-based constraints&lt;/h2&gt;

&lt;p&gt;Recall that our prior expectations are encoded as $K$ real-valued graph statistics $f_1,\ldots,f_K$. Since these expectations are provided as values of graph statistics, we would like to define a distribution over the set of graphs such that the expected value of each statistic under this distribution is equal to the value that we expect. In other words, we want to impose &lt;em&gt;soft constraints&lt;/em&gt; on the graph distribution.&lt;/p&gt;

&lt;p&gt;Namely, we want our distribution to satisfy for all $k=1,\ldots,K$: $$\Ebb[f_k(G)]= c_k$$&lt;/p&gt;

&lt;p&gt;where $c_k$ is the value we expect for the statistic $f_k$.&lt;/p&gt;

&lt;p&gt;Under these constraints, we use the Maximum Entropy principle to derive the &lt;em&gt;least informative&lt;/em&gt; graph prior distribution satisfying the soft constraints.&lt;/p&gt;

&lt;p&gt;Achieving this amounts to solving the Maximum Entropy constrained optimization problem:&lt;/p&gt;


&lt;p&gt;$$
\begin{array}{cc}
\max\limits_{P} &amp;amp; H(P) \\ \text{such that}  &amp;amp;\Ebb[f_k(G)]= c_k , \quad k=1,\ldots,K\\ &amp;amp;\sum_{G\in\Gcal}P(G)=1
\end{array}
$$&lt;/p&gt;


&lt;p&gt;It can be shown that the maximum entropy distribution can be written, for a certain parameter vector $\lambda \in \mathbb{R}^K$ and each graph $G\in \mathcal{G}$:&lt;/p&gt;

&lt;p&gt;$$
P^*_{\lambda}(G) =
\frac{
\exp(\lambda^T f(G))
}{
\sum_{G' \in \mathcal{G}}\exp(\lambda^T f(G'))
}
$$&lt;/p&gt;

&lt;p&gt;where $f(G)=(f_1(G), \ldots, f_K(G))$ is the vector of graph statistics.&lt;/p&gt;
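&lt;p&gt;To make this exponential form concrete, here is a brute-force sketch for a tiny graph. It assumes $n=3$ nodes, a single statistic (the number of links) and an arbitrary parameter value; the function name &lt;code&gt;maxent_distribution&lt;/code&gt; is hypothetical. It is only meant to illustrate the formula, not to be an efficient implementation.&lt;/p&gt;

```python
import math
from itertools import combinations, product

# Illustrative assumptions: n = 3 nodes, K = 1 statistic (the number of
# links) and an arbitrary parameter value lam = [0.7].
n = 3
pairs = list(combinations(range(n), 2))
graphs = list(product([0, 1], repeat=len(pairs)))

def f(graph):
    """Vector of graph statistics; here a single one, the number of links."""
    return [sum(graph)]

def maxent_distribution(lam):
    """P(G) proportional to exp(lam . f(G)), normalized over all graphs."""
    weights = {
        g: math.exp(sum(l * s for l, s in zip(lam, f(g)))) for g in graphs
    }
    Z = sum(weights.values())   # normalization constant (the denominator)
    return {g: w / Z for g, w in weights.items()}

lam = [0.7]
P = maxent_distribution(lam)

total = sum(P.values())
expected_links = sum(f(g)[0] * P[g] for g in graphs)
print(total)            # sums to 1 up to floating-point error
print(expected_links)   # equals 1.5 when lam = [0.0], larger for lam = [0.7]
```

&lt;p&gt;Changing &lt;code&gt;lam&lt;/code&gt; shifts the expected number of links, which is how the parameter vector is tuned to meet the soft constraints.&lt;/p&gt;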

&lt;h2 id=&quot;link-with-maximum-likelihood-estimation&quot;&gt;Link with Maximum Likelihood Estimation&lt;/h2&gt;

&lt;p&gt;There is a strong connection between the above Maximum Entropy problem and Maximum Likelihood estimation. First we note that these two problems are distinct: while the first is a variational optimization problem (the optimization variable is the probability distribution $P$), the second is a simple convex optimization problem where the optimization variable is the parameter vector $\lambda$.&lt;/p&gt;

&lt;p&gt;Their common point is that they are dual to each other. Indeed, for any distribution $P$ the Lagrangian associated with the MaxEnt problem reads:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
\mathcal{L}(P, \lambda)
=&amp;amp;-\sum\limits_{G \in \mathcal{G}} P(G) \log(P(G))\\ &amp;amp;+ \sum\limits_{k=1}^{K} \lambda_k \left(\sum\limits_{G \in \mathcal{G}} P(G)  f_k(G) -  c_k \right)
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;$\newcommand{\Lcal}{\mathcal{L}}$ $\newcommand{\Ghat}{\hat{G}}$ $\newcommand{\Pstar}{P^*_{\lambda}}$&lt;/p&gt;

&lt;p&gt;In the context of statistics where we observe a graph $\Ghat$ and set $c_k=f_k(\Ghat)$ for all the statistics $k=1,\ldots,K$, it can be easily shown that&lt;/p&gt;

&lt;p&gt;$$\Lcal(\Pstar, \lambda) = -\log(\Pstar(\Ghat)).$$ Hence the Lagrangian, evaluated at the MaxEnt distribution $\Pstar$, is exactly equal to the negative log-likelihood of the model.&lt;/p&gt;
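&lt;p&gt;The step behind this identity can be sketched as follows. Here $Z(\lambda)$ is a shorthand introduced only for this sketch, denoting the normalization constant of $\Pstar$, and the constraint term of the Lagrangian is taken with the sign that matches the $\exp(\lambda^T f(G))$ form of $\Pstar$, i.e. $+\sum_k \lambda_k (\Ebb[f_k(G)] - c_k)$:&lt;/p&gt;

&lt;p&gt;$$
\begin{aligned}
\log \Pstar(G) &amp;amp;= \lambda^T f(G) - \log Z(\lambda), \qquad Z(\lambda)=\sum\limits_{G' \in \mathcal{G}}\exp(\lambda^T f(G'))\\ H(\Pstar) &amp;amp;= \log Z(\lambda) - \lambda^T \Ebb_{\Pstar}[f(G)]\\ \Lcal(\Pstar, \lambda) &amp;amp;= H(\Pstar) + \lambda^T\left(\Ebb_{\Pstar}[f(G)] - c\right) = \log Z(\lambda) - \lambda^T c
\end{aligned}
$$&lt;/p&gt;

&lt;p&gt;Setting $c = f(\Ghat)$ then gives $\log Z(\lambda) - \lambda^T f(\Ghat) = -\log \Pstar(\Ghat)$.&lt;/p&gt;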

&lt;h2 id=&quot;factorized-form&quot;&gt;Factorized form&lt;/h2&gt;

&lt;p&gt;A broad range of graph statistics can be decomposed as sums of edge-specific statistics, i.e.: $\newcommand{\fijk}{f_{ij}^{(k)}}$ $$f_k(G)= \sum\limits_{i \neq j} \fijk(a_{ij}).$$&lt;/p&gt;

&lt;p&gt;For instance, the degree of a node is equal to the sum of the corresponding row of the adjacency matrix, and the volume of interaction between two communities is the sum of the entries located in a block of the adjacency matrix.&lt;/p&gt;

&lt;p&gt;It can be shown that for these statistics the MaxEnt distribution factorizes over the set of edges. More precisely, in that case we can derive edge-specific statistic vectors, denoted $f_{ij}(G)$, such that:&lt;/p&gt;

&lt;p&gt;$$\Pstar(G)=\prod\limits_{i\neq j} P_{ij}(a_{ij})$$ where for each edge $ij$, $P_{ij}$ is a Bernoulli distribution with parameter $$\frac{1}{1+\exp(-\lambda^T f_{ij}(G))}$$ This expression allows us to write the graph distribution as a joint distribution of independent edge-specific Bernoulli variables $a_{ij}$. Moreover, the Bernoulli probability of each edge is given by a linear logit $\lambda^T f_{ij}(G)$, passed through the sigmoid function $\sigma :x\mapsto \frac{1}{1+\exp(-x)}$.&lt;/p&gt;
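&lt;p&gt;The factorization can be checked numerically in a simple case. The sketch below assumes $n=3$ nodes, a single statistic (the number of links), so that every node pair has the scalar feature 1, and an arbitrary value of $\lambda$. It also assumes that each unordered node pair is counted once, which is a simplification of the product over $i\neq j$ above.&lt;/p&gt;

```python
import math
from itertools import combinations, product

# Illustrative assumptions: n = 3 nodes, a single statistic (the number of
# links) so that every node pair has the scalar feature 1, and lam = 0.7.
n = 3
lam = 0.7
pairs = list(combinations(range(n), 2))
graphs = list(product([0, 1], repeat=len(pairs)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Joint form: P(G) proportional to exp(lam * number of links).
Z = sum(math.exp(lam * sum(g)) for g in graphs)
joint = {g: math.exp(lam * sum(g)) / Z for g in graphs}

# Factorized form: product of independent Bernoulli edge probabilities.
p_edge = sigmoid(lam * 1.0)
factorized = {
    g: math.prod(p_edge if a == 1 else 1.0 - p_edge for a in g)
    for g in graphs
}

max_gap = max(abs(joint[g] - factorized[g]) for g in graphs)
print(max_gap)  # close to zero: both forms agree up to rounding error
```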

&lt;h2 id=&quot;maxent-in-practice-how-to-turn-prior-knowledge-statistics-into-a-maxent-distribution&quot;&gt;MaxEnt in practice: how to turn prior knowledge statistics into a MaxEnt distribution&lt;/h2&gt;

&lt;p&gt;In practice, such a distribution can be used to extract prior information from an observed graph $\hat{G}$. We recall that the input of this procedure is a set of graph statistic functions, each of which quantifies an aspect of our expectations about the graph. One applies the statistics $f_k$ to the observed graph and uses the obtained values $f_k(\hat{G})$ as the constraint values $c_k$. To fit the model, one just needs to maximize the likelihood of the observed graph with respect to the parameter vector $\lambda$:&lt;/p&gt;

&lt;p&gt;$$
\max\limits_{\lambda\in \mathbb{R}^K} P^*_{\lambda}(\hat{G})
$$&lt;/p&gt;

&lt;p&gt;It can be noted that this Maximum Likelihood problem can be solved using logistic regression. Indeed, for each pair of nodes $(i,j)$, we have access to a feature vector $f_{ij}(\hat{G})$ and use it to predict the presence or absence of a link between nodes $i$ and $j$.&lt;/p&gt;
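&lt;p&gt;A minimal sketch of this fitting step is given below, using plain gradient ascent on the Bernoulli log-likelihood rather than a dedicated logistic regression library. The observed graph, the single link-count statistic (scalar feature 1 for every unordered node pair), the learning rate and the number of iterations are all illustrative assumptions.&lt;/p&gt;

```python
import math
from itertools import combinations

# Illustrative assumptions: an observed 4-node graph with links 0-1 and
# 1-2, a single statistic (the number of links) giving every node pair the
# scalar feature 1, and a hand-picked learning rate and iteration count.
A_hat = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
n = len(A_hat)
pairs = list(combinations(range(n), 2))
features = [1.0 for _ in pairs]              # f_ij for each node pair
labels = [A_hat[i][j] for i, j in pairs]     # observed a_ij

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Gradient ascent on the log-likelihood of independent Bernoulli edges,
# i.e. a one-parameter logistic regression without intercept.
lam = 0.0
learning_rate = 0.5
for _ in range(2000):
    gradient = sum((y - sigmoid(lam * x)) * x for x, y in zip(features, labels))
    lam += learning_rate * gradient / len(pairs)

density = sum(labels) / len(labels)          # 2 links out of 6 pairs
print(lam, sigmoid(lam), density)
```

&lt;p&gt;In this one-parameter case the fitted edge probability $\sigma(\lambda)$ matches the observed link density, so the expected number of links under the fitted model equals the observed one, as required by the soft constraint.&lt;/p&gt;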

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We have seen how Maximum Entropy models for graphs can be used to formalize prior knowledge about a graph, encoded as soft constraints.&lt;/p&gt;

&lt;p&gt;The resulting model has been widely studied in the network science literature, under the name of P* (p-star) models, or Exponential Random Graph Models.&lt;/p&gt;

&lt;p&gt;The dyad-independent expression has served as the basis of later work such as Conditional Network Embeddings &lt;a class=&quot;citation&quot; href=&quot;#KangLB19&quot;&gt;(Kang et al., 2019)&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ol class=&quot;bibliography&quot;&gt;&lt;li&gt;&lt;span id=&quot;debie2010maximum&quot;&gt;De Bie, T. (2010). &lt;i&gt;Maximum entropy models and subjective interestingness: an application to tiles in binary databases&lt;/i&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;adriaens&quot;&gt;Adriaens, F., Lijffijt, J., &amp;amp; De Bie, T. (2017). Subjectively interesting connecting trees. In M. Ceci, J. Hollmén, L. Todorovski, &amp;amp; C. Vens (Eds.), &lt;i&gt;Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part II&lt;/i&gt; (Vol. 10535, Number 2, pp. 53–69). Springer International Publishing. http://dx.doi.org/10.1007/978-3-319-71246-8_4&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span id=&quot;KangLB19&quot;&gt;Kang, B., Lijffijt, J., &amp;amp; De Bie, T. (2019). Conditional Network Embeddings. &lt;i&gt;7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019&lt;/i&gt;. https://openreview.net/forum?id=ryepUj0qtX&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;

</description>
        <pubDate>Thu, 01 Oct 2020 00:00:00 -0400</pubDate>
        <link>/articles/20/maxent</link>
        <guid isPermaLink="true">/articles/20/maxent</guid>
        
        
      </item>
    
  </channel>
</rss>
