About train-test splitting for link prediction
March 25, 2022
In this post, we describe the link prediction task for networks, and the strategies used to evaluate link prediction methods.
Introduction
Link prediction is a machine learning task that consists in predicting missing links in an observed network. Applications of this task include recommender systems, friend recommendations on social media, and the prediction of protein-protein interactions in biology, among others.
A link prediction method can be seen as a score function that takes as input an edge $e$ and yields a score $s(e) \in [0,1]$.
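As a concrete illustration, here is a minimal sketch of such a score function, using the Jaccard coefficient of the two endpoints' neighborhoods. This particular scorer, and the `networkx` dependency, are illustrative assumptions of this sketch, not the only possible choice.

```python
import networkx as nx

def jaccard_score(G: nx.Graph, e: tuple) -> float:
    """Score the edge e = (u, v) by the Jaccard coefficient
    of the neighborhoods of u and v; the result lies in [0, 1]."""
    u, v = e
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0
```

Any function mapping candidate edges to $[0,1]$ fits this interface; the Jaccard coefficient is used here only because it is simple.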
Evaluation strategy
We want to evaluate a link prediction method on a given graph $G = (U, E)$, $U$ being the set of nodes of the graph and $E \subset U \times U$ the set of edges.
Pruning - Train/test splitting
The first step is to remove some of the observed edges.
Doing so yields a pruned graph $\newcommand{tE}{\tilde{E}}$ $\newcommand{\tG}{\tilde{G}}$ $\tG = (U,\tE)$, where $\tE\subset E$. Note that this pruned graph can be disconnected and have isolated nodes.
We denote $\newcommand{\mE}{E_{missing}}$$\mE$ the set of edges that were removed during this pruning operation, i.e. $\mE=E \backslash \tE$.
In contrast, we denote $\newcommand{\nE}{E_{neg}}$$\nE$ the set of true negative edges, i.e. a sample of node pairs that are not edges of the original graph: $\nE \subset (U \times U) \backslash E$.
By construction, this set is disjoint from the set of missing edges: $\nE \cap \mE = \emptyset$.
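A minimal sketch of this pruning step is given below, assuming a `networkx` graph, an arbitrary pruning fraction of 10%, and rejection sampling for the negatives (all illustrative choices rather than the only possible strategy):

```python
import random
import networkx as nx

def prune_graph(G: nx.Graph, frac: float = 0.1, seed: int = 0):
    """Remove a fraction `frac` of the edges of G (E_missing) and
    sample an equal number of true negative pairs (E_neg)."""
    rng = random.Random(seed)
    edges = list(G.edges())
    n_removed = int(frac * len(edges))
    E_missing = rng.sample(edges, n_removed)

    G_pruned = G.copy()
    G_pruned.remove_edges_from(E_missing)  # may disconnect the graph

    # Rejection-sample node pairs absent from the *original* graph,
    # so that E_neg and E_missing are disjoint by construction.
    nodes = list(G.nodes())
    E_neg = set()
    while len(E_neg) < n_removed:
        u, v = rng.sample(nodes, 2)
        u, v = min(u, v), max(u, v)  # canonical order (assumes comparable labels)
        if not G.has_edge(u, v):
            E_neg.add((u, v))
    return G_pruned, E_missing, list(E_neg)
```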
The goal of link prediction is to score the edges that were removed from the graph higher than the edges that were not in the graph in the first place.
In other words, we construct a binary classification dataset composed of the edges $e \in \mE \cup \nE$, with a response variable $y=1$ if $e \in \mE$ and $y=0$ otherwise.
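Continuing the sketch above, and using the Zachary karate club graph as a toy example, this classification dataset can be built as follows:

```python
G = nx.karate_club_graph()  # toy observed graph
G_pruned, E_missing, E_neg = prune_graph(G)

# One (edge, label) pair per test edge: y = 1 for removed edges,
# y = 0 for true negatives.
dataset = [(e, 1) for e in E_missing] + [(e, 0) for e in E_neg]
```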
Metrics
Within the context described before, the missing edges can be reframed as true positives, while the negative edges can be called true negatives. A link prediction method can then be evaluated as a binary classifier that tries to discriminate the edges that were removed during the pruning process from the edges that were already negative in the original graph.
Given a threshold value $\tau$, we can define a decision rule on the edges: an edge $e$ is predicted positive if its score reaches the threshold, i.e. $\hat{y}_\tau(e) = 1$ if $s(e) \geq \tau$ and $\hat{y}_\tau(e) = 0$ otherwise.
Example
Let’s suppose that our test set is composed of 7 edges $e_1,…,e_7$, and that we set the decision threshold to $\tau=0.89$.
In the table below, we sort the edges in decreasing order of their link prediction scores, and predict them to be positive if their score is higher than the threshold.
Edge | $e_1$ | $e_2$ | $e_3$ | $e_4$ | $e_5$ | $e_6$ | $e_7$ |
---|---|---|---|---|---|---|---|
Score | 0.95 | 0.94 | 0.92 | 0.9 | 0.8 | 0.79 | 0.73 |
Label | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
Prediction | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
As we can see, using this threshold, we have selected $P(\tau)=4$ edges as positive.
Among these 4 edges:
- $e_1$, $e_2$ and $e_3$ are correctly labeled as positive, so the number of true positives is $TP(\tau)=3$.
- $e_4$ is wrongly labeled as positive, so the number of false positives is $FP(\tau)=1$.

Among the 3 edges predicted as negative:
- $e_6$ is wrongly labeled as negative, so the number of false negatives is $FN(\tau)=1$.
- $e_5$ and $e_7$ are correctly labeled as negative, so the number of true negatives is $TN(\tau)=2$.
Based on these counts, we can derive the following confusion matrix. Note that all the values above are piecewise constant functions of the decision threshold $\tau$, where the cutoff points are the score values of the samples.
 | Predicted positive | Predicted negative |
---|---|---|
Actual positive | TP=3 | FN=1 |
Actual negative | FP=1 | TN=2 |
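To make these computations explicit, the following snippet recomputes the confusion matrix entries from the scores and labels of the table above (pure Python, no dependencies):

```python
scores = [0.95, 0.94, 0.92, 0.90, 0.80, 0.79, 0.73]
labels = [1, 1, 1, 0, 0, 1, 0]
tau = 0.89

# Decision rule: predict positive when the score reaches the threshold.
preds = [1 if s >= tau else 0 for s in scores]

TP = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # 3
FP = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # 1
FN = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # 1
TN = sum(p == 0 and y == 0 for p, y in zip(preds, labels))  # 2
```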
Precision and Recall
The precision is the ratio of positively predicted edges that have a positive ground truth label: $\text{precision}(\tau) = \frac{TP(\tau)}{TP(\tau) + FP(\tau)}$.
The recall is the proportion of edges with a positive ground truth label that were effectively classified, or "recalled", as positive: $\text{recall}(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)}$.
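Reusing the counts computed in the snippet above, both metrics come out to 0.75 in our example:

```python
precision = TP / (TP + FP)  # 3 / (3 + 1) = 0.75
recall = TP / (TP + FN)     # 3 / (3 + 1) = 0.75
```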
To get an intuition for these notions, one can think of the score as the result of a search on Google. Given a query, Google looks for the most relevant articles and returns them, ordered by relevance. One can then select the first $K$ of these articles and decide whether each is actually relevant or not (i.e. the ground truth is given by the user). The precision tells us how precise the results are, ensuring that Google doesn't return too many irrelevant articles. The recall, on the other hand, ensures that Google gives a high rank to a good proportion of the articles that are relevant for the user.