Maximum Entropy models for Graphs

October 1, 2020

In this post, we give an overview of Maximum Entropy models for graphs, as presented in previous work (Bie, 2010; Adriaens et al., 2017). We show how these models can be used to derive prior distributions on graphs.

Introduction

Many real-world phenomena can be (at least partially) described in the form of networks. Examples include social networks, online user behavior, neural connections in the brain, and ecological networks.

However, while the set of all possible networks on a given number of nodes $n$ is very large (it contains $2^{\frac{n(n-1)}{2}}$ undirected graphs), real-world networks lie in a very small subset of it, meaning that the vast majority of possible networks have a negligible probability of occurring in practice.

While assigning a prior probability to each possible graph individually is infeasible due to their huge number, one can easily define prior expectations about the properties of the graph, depending on its nature.

For instance, one might have an idea of the total number of links, or of the number of links attached to each node (its degree).

In a social network, one might have prior expectations about the number of links connecting any two communities. Indeed, people from the same community tend to connect more than people from different ones, a pattern commonly observed in real-world social networks and often referred to as homophily.

Based on these prior expectations about the structural properties of the graph, the Maximum Entropy (MaxEnt) principle can be used to cast them into a fully-fledged probability distribution on the combinatorial space of all possible graphs.

Formalizing prior expectations

In this section, we describe the aforementioned prior expectations mathematically. To do this, let us first introduce some notation.

Notations

$\newcommand{\Gcal}{\mathcal{G}}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\PG}{\mathcal{P}(\mathcal{G})}$Let $U$ be a set of nodes of size $n$. A graph is a tuple $G=(U, E)$ where $E\subset U \times U$ is the set of edges of the graph. For a fixed set of nodes $U$, we denote by $\Gcal$ the set of possible undirected graphs connecting the nodes in $U$. Each graph $G\in\Gcal$ can be fully described by its adjacency matrix $A=(a_{ij})\in \{0,1\}^{n\times n}$, such that $a_{ij}=1$ if and only if the nodes $i$ and $j$ are connected; since the graphs are undirected, $A$ is symmetric. For each node $i \in U$, we denote by $\mathcal{N}(i)$ its set of neighbors. Finally, we denote by $\PG$ the set of graph distributions, i.e. the set of all probability distributions on the set of graphs $\Gcal$.

Prior statistics

The properties of a graph can be expressed as graph statistics, which are functions taking a graph as input and yielding a real number:

$$ \begin{aligned} f : \Gcal &\rightarrow \R\\ G &\mapsto f(G) \end{aligned} $$

Examples of such statistics include the total number of edges, the degree of a given node, or the number of edges connecting two given communities.

As we see, any graph property that can be mathematically computed as a real number can be defined as a graph statistic.
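To make this concrete, here is a minimal Python sketch of the three example statistics mentioned above, computed from an adjacency matrix; the function names and the toy graph are ours, chosen purely for illustration.

```python
import numpy as np

def num_edges(A):
    """Total number of edges in an undirected graph (A symmetric, zero diagonal)."""
    return int(A.sum() // 2)

def degree(A, i):
    """Degree of node i, i.e. its number of neighbors."""
    return int(A[i].sum())

def inter_community_edges(A, C1, C2):
    """Number of edges connecting two disjoint sets of nodes C1 and C2."""
    return int(A[np.ix_(C1, C2)].sum())

# Example on a small 4-node path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(num_edges(A), degree(A, 1), inter_community_edges(A, [0, 1], [2, 3]))
# -> 3 2 1
```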

Expected value of a prior statistic

$\newcommand{\Ebb}{\mathbb{E}}$ For a given graph distribution $P \in \PG$, and a graph statistic $f$, one can define the expectation of this graph statistic as: $$\Ebb[f(G)]= \sum_{G\in \Gcal} f(G)P(G)$$

This is the mathematical formalization of what we mean when we expect a given property of the graph to have a certain value.

This quantity captures what a given observer, whose subjective beliefs are encoded in the prior distribution $P$, expects the graph property $f$ to be.
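As a simple illustration, under the uniform distribution on $\Gcal$ (every graph equally likely), each possible edge is present with probability $1/2$, so the expected number of edges is $$\Ebb[\#\text{edges}] = \frac{1}{2}\binom{n}{2} = \frac{n(n-1)}{4}.$$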

Maximum Entropy models

Suppose that we encode our prior expectations into $K$ statistics $f_1,…,f_K$, where each $f_k$ is a real-valued graph function. The maximum entropy principle can then be used to convert these expectations into a prior distribution on the set of possible graphs.

Graph distributions

A graph distribution is a probability distribution defined on the set of graphs $\Gcal$. In other words, it can be identified with a function $P$ that gives, for each graph $G \in \Gcal$, the probability $P(G)$ of observing this particular graph.

Entropy of a graph distribution

The entropy of a distribution $P$ is defined as $$H(P) = -\sum_{G\in\Gcal} P(G)\log(P(G))$$

This quantity measures the average amount of information provided by the observation of a graph, under the distribution $P$.

For instance, if for a given observer all the graphs are equiprobable, the average information provided by the observation of a graph is maximal. In other words, this observer will on average be very surprised by the observation.

In contrast, an observer who assigns a non-zero probability to a single graph $G_0$, and zero probability to all other graphs, gains no information when observing a graph sampled from this prior: $H(P)=0$ in that case.
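These two extremes can be quantified: the uniform distribution attains the maximal entropy $$H(P) = \log|\Gcal| = \frac{n(n-1)}{2}\log 2,$$ while the distribution concentrated on a single graph attains the minimal entropy $H(P)=0$.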

Maximizing the entropy under statistics-based constraints

As before, suppose our prior expectations are encoded into $K$ real-valued graph statistics $f_1,…,f_K$.

While prior expectations about the graph are provided in the form of graph statistic values, we would like to define a distribution over the set of graphs such that the expected values of the statistics under this distribution are equal to the ones we expect. In other words, we want to impose soft constraints on the graph distribution.

Namely, we want our distribution to satisfy for all $k=1,…,K$: $$\Ebb[f_k(G)]= c_k$$

where $c_k$ is our prior expectation value for the statistic $k$.

Under these constraints, we use the Maximum Entropy principle to derive the least informative graph prior distribution satisfying the soft constraints.

Achieving this amounts to solving the constrained Maximum Entropy optimization problem:

$$ \begin{array}{cc} \max\limits_{P} & H(P) \\ \text{such that} &\Ebb[f_k(G)]= c_k , \quad k=1,…,K\\ &\sum_{G\in\Gcal}P(G)=1 \end{array} $$

It can be shown that the maximum entropy distribution can be written, for a certain parameter vector $\lambda \in \mathbb{R}^K$ and each graph $G\in \mathcal{G}$:

$$ P^*_{\lambda}(G) = \frac{ \exp(\lambda^T f(G)) }{ \sum_{G' \in \mathcal{G}}\exp(\lambda^T f(G')) } $$

where $f(G)=(f_1(G), …, f_K(G))$ is the vector of graph statistics.
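To see this formula in action, here is an illustrative brute-force computation of $P^*_{\lambda}$ on a tiny node set, using the number of edges as the single statistic; the variable names and the choice of $\lambda$ are ours, and this enumeration is only feasible for very small $n$.

```python
import itertools
import numpy as np

n = 4                                                 # number of nodes
pairs = list(itertools.combinations(range(n), 2))     # candidate undirected edges

def all_graphs():
    """Enumerate all undirected graphs on n nodes as adjacency matrices."""
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        A = np.zeros((n, n), dtype=int)
        for (i, j), b in zip(pairs, bits):
            A[i, j] = A[j, i] = b
        yield A

def f(A):
    """Vector of graph statistics; here a single one, the number of edges."""
    return np.array([A.sum() // 2])

lam = np.array([-1.0])                                # example parameter vector
graphs = list(all_graphs())
weights = np.array([np.exp(lam @ f(A)) for A in graphs])
P = weights / weights.sum()                           # MaxEnt probabilities P*_lambda(G)

print(len(graphs), P.sum())                           # 64 graphs, probabilities sum to 1
print(sum(p * f(A)[0] for p, A in zip(P, graphs)))    # expected number of edges
```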

There is a strong connection between the above Maximum Entropy problem and Maximum Likelihood estimation. First, we note that these two problems are distinct: while the first is a variational optimization problem (the optimization variable is the probability distribution $P$), the second is a simple convex optimization problem where the optimization variable is the parameter vector $\lambda$.

Their connection is that they are dual to each other. Indeed, for any distribution $P$, the Lagrangian associated with the MaxEnt problem reads:

$$ \begin{aligned} \mathcal{L}(P, \lambda) =&-\sum\limits_{G \in \mathcal{G}} P(G) \log(P(G))\\ &+ \sum\limits_{k=1}^{K} \lambda_k \Big(\sum\limits_{G \in \mathcal{G}} P(G) f_k(G) - c_k \Big) \end{aligned} $$

$\newcommand{\Lcal}{\mathcal{L}}$ $\newcommand{\Ghat}{\hat{G}}$ $\newcommand{\Pstar}{P^*_{\lambda}}$

In the statistical setting where we observe a graph $\Ghat$ and set $c_k=f_k(\Ghat)$ for all statistics $k=1,…,K$, it can easily be shown that

$$\Lcal(\Pstar, \lambda) = -\log(\Pstar(\Ghat)).$$ Hence the Lagrangian, evaluated at $\Pstar$, is exactly the negative log-likelihood of the model.
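For completeness, here is the short calculation behind this identity, writing $Z(\lambda)=\sum_{G\in\Gcal}\exp(\lambda^T f(G))$ for the normalizing constant and using $c_k=f_k(\Ghat)$:

$$ \begin{aligned} \Lcal(\Pstar, \lambda) &= -\sum\limits_{G \in \Gcal} \Pstar(G)\big(\lambda^T f(G) - \log Z(\lambda)\big) + \sum\limits_{k=1}^{K} \lambda_k\big(\Ebb[f_k(G)] - f_k(\Ghat)\big)\\ &= \log Z(\lambda) - \lambda^T \Ebb[f(G)] + \lambda^T \Ebb[f(G)] - \lambda^T f(\Ghat)\\ &= \log Z(\lambda) - \lambda^T f(\Ghat) = -\log(\Pstar(\Ghat)). \end{aligned} $$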

Factorized form

A broad range of graph statistics can be decomposed as a sum of edge-specific statistics, i.e.: $\newcommand{\fijk}{f_{ij}^{(k)}}$ $$f_k(G)= \sum\limits_{i \neq j} \fijk(a_{ij}).$$

For instance, the degree of a node is equal to the sum of the corresponding row of the adjacency matrix, and the volume of interaction between two communities is the sum of the entries located in a block of the adjacency matrix.

It can be shown that, for such statistics, the MaxEnt distribution factorizes over the set of node pairs. More precisely, in that case we can derive edge-specific statistic vectors, denoted $f_{ij}(G)$, such that:

$$\Pstar(G)=\prod\limits_{i\neq j} P_{ij}(a_{ij})$$ where for each node pair $ij$, $P_{ij}$ is a Bernoulli distribution with parameter $$\frac{1}{1+\exp(-\lambda^T f_{i,j}(G))}.$$ This expresses the graph distribution as a joint distribution of independent edge-specific Bernoulli variables $a_{ij}$. Moreover, the Bernoulli probability of each edge is given by a linear logit $\lambda^T f_{i,j}(G)$ passed through the sigmoid function $\sigma :x\mapsto \frac{1}{1+\exp(-x)}$.
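Concretely, given the per-pair feature vectors and a parameter vector $\lambda$, these Bernoulli parameters can be computed in a vectorized way. The sketch below uses our own (hypothetical) array layout: a tensor `F` of shape `(n, n, K)` holding the vectors $f_{ij}$.

```python
import numpy as np

def edge_probabilities(F, lam):
    """Bernoulli parameters P(a_ij = 1) = sigmoid(lambda^T f_ij).

    F   : array of shape (n, n, K), per-pair feature vectors f_ij
    lam : array of shape (K,), MaxEnt parameter vector
    """
    logits = F @ lam                     # shape (n, n): lambda^T f_ij for every pair
    return 1.0 / (1.0 + np.exp(-logits)) # element-wise sigmoid
```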

MaxEnt in practice: how to turn prior knowledge statistics into a MaxEnt distribution

In practice, such a distribution can be used to extract prior information from an observed graph $\hat{G}$. We recall that the input of this procedure is a set of graph statistic functions, each of which quantifies an aspect of our expectations about the graph distribution. Based on this, one applies the statistics $f_k$ to the observed graph and uses the obtained values as the constraint constants $c_k=f_k(\hat{G})$. It then remains to maximize the likelihood of the observed graph with respect to the parameter vector $\lambda$:

$$ \max\limits_{\lambda\in \mathbb{R}^K} P^*_{\lambda}(\hat{G}) $$

It can be noted that this Maximum Likelihood problem can be solved using logistic regression. Indeed, for each node pair we have access to a feature vector $f_{i,j}(\hat{G})$ and use it to predict the presence or absence of a link between nodes $i$ and $j$, as sketched below.
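As an illustration (a sketch under our own assumptions, not the authors' code), one can build, for every node pair, the feature vector $f_{ij}(\hat{G})$ and the binary label $\hat{a}_{ij}$, and fit a plain logistic regression without intercept. The degree-based feature construction at the end is one simple, hypothetical choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_maxent(A_hat, F):
    """Estimate the MaxEnt parameter vector lambda by logistic regression.

    A_hat : (n, n) observed adjacency matrix (symmetric, zero diagonal)
    F     : (n, n, K) per-pair feature vectors f_ij
    """
    n = A_hat.shape[0]
    iu = np.triu_indices(n, k=1)                 # each unordered pair counted once
    X = F[iu]                                    # (n*(n-1)/2, K) features
    y = A_hat[iu]                                # (n*(n-1)/2,) labels: link or no link
    clf = LogisticRegression(fit_intercept=False, C=1e6)  # ~no regularization
    clf.fit(X, y)
    return clf.coef_.ravel()                     # estimated lambda

# Illustrative degree-based features: f_ij^(k) = 1 if node k is an endpoint of (i, j).
n = 30
rng = np.random.default_rng(0)
A_hat = rng.integers(0, 2, size=(n, n))
A_hat = np.triu(A_hat, 1); A_hat = A_hat + A_hat.T
F = np.zeros((n, n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            F[i, j, i] = F[i, j, j] = 1.0
lam = fit_maxent(A_hat, F)
print(lam.shape)                                 # (30,): one parameter per degree constraint
```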

Conclusion

We have seen how Maximum Entropy models for graphs can be used to formalize prior knowledge about a graph, encoded as soft constraints on its statistics.

The resulting model has been widely studied in the network science literature, under the names of $p^*$ (p-star) model or Exponential Random Graph Model (ERGM).

The dyad-independent expression has served as the basis of later work such as Conditional Network Embeddings (Kang et al., 2019).

References

  1. Bie, T. D. (2010). Maximum entropy models and subjective interestingness: an application to tiles in binary databases.
  2. Adriaens, F., Lijffijt, J., & De Bie, T. (2017). Subjectively interesting connecting trees. In M. Ceci, J. Hollmén, L. Todorovski, & C. Vens (Eds.), Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part II (Vol. 10535, Number 2, pp. 53–69). Springer International Publishing. http://dx.doi.org/10.1007/978-3-319-71246-8_4
  3. Kang, B., Lijffijt, J., & Bie, T. D. (2019). Conditional Network Embeddings. 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https://openreview.net/forum?id=ryepUj0qtX