Temporal Graph Benchmark. Challenging and realistic datasets for… | by Shenyang(Andy) Huang | Dec, 2023

The goal of dynamic link property prediction is to predict the property (often the existence) of a link between a node pair at a timestamp.

Negative Edge Sampling. In real applications, the true edges are not known in advance. Therefore, a large number of node pairs are queried, and only pairs with the highest scores are treated as edges. Motivated by this, we frame the link prediction task as a ranking problem and sample multiple negative edges for each positive edge. In particular, for a given positive edge (s,d,t), we fix the source node s and timestamp t and sample q different negative destination nodes. For each dataset, q is selected based on the trade-off between evaluation completeness and test set inference time. Out of the q negative samples, half are sampled uniformly at random, while the other half are historic negative edges (edges that were observed in the training set but are not present at time t).
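The sampling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TGB implementation: the function name, its arguments, and the fallback of filling any shortfall in historic negatives with random ones are all hypothetical choices made for the example.

```python
import random


def sample_negative_destinations(src, true_dsts_at_t, train_dsts_by_src,
                                 all_nodes, q, rng):
    """Sample up to q negative destinations for a fixed source and timestamp.

    src: source node s of the positive edge.
    true_dsts_at_t: destinations that s really links to at time t (excluded).
    train_dsts_by_src: dict mapping a source to destinations seen in training.
    all_nodes: iterable of candidate destination nodes.
    rng: a random.Random instance, passed in for reproducibility.
    """
    excluded = set(true_dsts_at_t)

    # Historic negatives: seen with src in training, but not true at time t.
    hist_pool = sorted(train_dsts_by_src.get(src, set()) - excluded)
    n_hist = min(q // 2, len(hist_pool))
    historic = rng.sample(hist_pool, n_hist)

    # Random negatives fill the remainder (assumption: they also cover any
    # shortfall when fewer than q // 2 historic negatives exist).
    rand_pool = sorted(set(all_nodes) - excluded - set(historic))
    n_rand = min(q - n_hist, len(rand_pool))
    uniform = rng.sample(rand_pool, n_rand)

    return historic + uniform


negatives = sample_negative_destinations(
    src=0,
    true_dsts_at_t={1},
    train_dsts_by_src={0: {1, 2, 3}},
    all_nodes=range(10),
    q=4,
    rng=random.Random(0),
)
print(negatives)
```

Here the first half of the returned list holds historic negatives and the second half holds uniformly sampled ones; the true destination at time t never appears.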

Performance metric. We use the filtered Mean Reciprocal Rank (MRR) as the metric for this task, as it is designed for ranking problems. The MRR computes the reciprocal rank of the true destination node among the negative or fake destinations and is commonly used in recommendation and knowledge graph literature.
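As a rough illustration of the metric, the sketch below computes the reciprocal rank of one positive edge against its negative destinations and averages over queries. The function names are hypothetical, and the tie-handling rule (a tie counts as half a rank) is an assumption made for this example rather than something stated in the post.

```python
def reciprocal_rank(pos_score, neg_scores):
    """Reciprocal rank of the true destination among sampled negatives."""
    higher = sum(1 for s in neg_scores if s > pos_score)
    ties = sum(1 for s in neg_scores if s == pos_score)
    # Assumption: each tie contributes half a rank position.
    rank = 1 + higher + 0.5 * ties
    return 1.0 / rank


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (pos_score, neg_scores) pairs."""
    ranks = [reciprocal_rank(pos, negs) for pos, negs in queries]
    return sum(ranks) / len(ranks)


queries = [
    (0.9, [0.1, 0.5]),       # true destination ranked first
    (0.5, [0.9, 0.1, 0.7]),  # true destination ranked third
]
mrr = mean_reciprocal_rank(queries)
print(mrr)
```

An MRR of 1 means the true destination is always ranked first; the value drops toward 0 as more negatives outscore it.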

MRR performance on tgbl-wiki and tgbl-review datasets

Results on datasets. On the small tgbl-wiki and tgbl-review datasets, we observe that the best performing models are quite different. In addition, the top performing models on tgbl-wiki, such as CAWN and NAT, show a significant reduction in performance on tgbl-review. One possible explanation is that the tgbl-review dataset has a much higher surprise index than the tgbl-wiki dataset. The high surprise index shows that a high ratio of test set edges is never observed in the training set, so tgbl-review requires more inductive reasoning. On tgbl-review, GraphMixer and TGAT are the best performing models. Because these datasets are small, we are able to sample all possible negatives for tgbl-wiki and one hundred negatives for tgbl-review per positive edge.
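To make the surprise index concrete, here is a minimal sketch that follows the description above: the fraction of test edges never observed in training. Treating edges as unique (source, destination) pairs and ignoring timestamps is an assumption of this example, and the function name is hypothetical.

```python
def surprise_index(train_pairs, test_pairs):
    """Fraction of unique test edges that never appear in the training set."""
    seen = set(train_pairs)
    test = set(test_pairs)
    if not test:
        raise ValueError("test_pairs must not be empty")
    unseen = test - seen
    return len(unseen) / len(test)


train = [(0, 1), (0, 2)]
test = [(0, 1), (1, 2), (2, 3), (1, 2)]
index = surprise_index(train, test)
print(index)
```

A value close to 1 means that memorizing training edges helps very little, which is consistent with the weaker results of memorization-friendly models on tgbl-review.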

MRR performance on tgbl-coin, tgbl-comment and tgbl-flight datasets.

Most methods run out of GPU memory on these datasets, so we compare TGN, DyRep and Edgebank, which have lower GPU memory requirements. Note that some datasets, such as tgbl-comment and tgbl-flight, span multiple years, potentially resulting in distribution shift over their long time spans.

Effect of the number of negative samples on tgbl-wiki

Insights. As seen above on tgbl-wiki, the number of negative samples used for evaluation can significantly impact model performance: we see a significant performance drop across most methods when the number of negative samples increases from 20 to all possible destinations. This confirms that more negative samples are required for robust evaluation. Curiously, methods such as CAWN and Edgebank show a relatively minor drop in performance, and we leave it as future work to investigate why certain methods are less impacted.

Total training and validation time of TG models

Next, we observe up to two orders of magnitude difference in the training and validation time of TG methods, with the heuristic Edgebank always being the fastest (as it is implemented simply as a hashtable). This shows that improving model efficiency and scalability is an important future direction, so that novel and existing models can be tested on the large datasets provided in TGB.
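To see why a hashtable-based heuristic is so cheap, consider the following sketch of an edge memory. It is not the Edgebank code itself: the class and method names are hypothetical, and it assumes the simplest variant, which remembers every observed (source, destination) pair and scores a query 1 if the pair was seen before and 0 otherwise.

```python
class EdgeMemory:
    """Hypothetical hashtable baseline that memorizes observed edges."""

    def __init__(self):
        self.seen = set()  # hash set of (src, dst) pairs

    def update(self, src, dst):
        """Record an observed edge; average O(1) per edge."""
        self.seen.add((src, dst))

    def score(self, src, dst):
        """Return 1.0 if the edge was observed before, else 0.0."""
        return 1.0 if (src, dst) in self.seen else 0.0


memory = EdgeMemory()
for s, d in [(0, 1), (0, 2), (3, 4)]:
    memory.update(s, d)

print(memory.score(0, 1), memory.score(1, 0))
```

There are no parameters to train, and both updates and lookups are constant-time on average, so the cost grows only with the number of edges processed.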
