Machine Learning on Knowledge Graphs at NeurIPS 2020

NeurIPS is a major venue covering a wide range of ML & AI topics. Of course, there is something interesting for Graph ML aficionados and knowledge graph connoisseurs 🧐. Tune in to find out!

This year, NeurIPS had 1900 accepted papers 😳 and 100+ of them are on graphs. Add to that several prominent workshops like KR2ML, DiffGeo4ML, and LMCA. Be sure to check their proceedings, as such workshop papers are likely to appear at future venues like ICLR, ICML, or ACL. Furthermore, 👉 the GML Newsletter #4 👈 by Sergey Ivanov presents an overview of Graph ML papers from NeurIPS, including theory, oversmoothing, scalability, and much more, so check it out as well.

In this post, I’d like to put an emphasis on a particular type of graph, knowledge graphs (KGs), and explore with you 10 papers that might be quite influential in 2021. NeurIPS papers tend to be more high-level and theory-oriented than the NLP applications at *CL conferences, so I’d summarize them as:

Behind the curtain of transductive link prediction we step into the valley of logical reasoning tasks for KGs to excel at

Get some ☕️, 🍵, or even some Glühwein – today on the agenda:

  1. Query Embedding: Beyond Query2Box
  2. KG Embeddings: NAS, 📦 vs 🔮, Meta-Learning
  3. SPARQL and Compositional Generalization
  4. Benchmarking: OGB, GraphGym, KeOps
  5. Wrapping up


Query Embedding: Beyond Query2Box

Query embedding (QE) is about answering queries against a KG directly in the embedding space, without any SPARQL or graph database engines. Given that most KGs are sparse and incomplete, query embedding algorithms are able to infer missing links (with a certain probability). This is one of the hottest topics in Graph ML so far! 🔥
In the ICLR 2020 post, we covered Query2Box, a strong QE baseline capable of answering logical queries with conjunction (∧), disjunction (∨), and existential quantifiers (∃) by modeling queries as d-dimensional boxes 📦.

Ren and Leskovec (the authors of the original Query2Box) finally add a negation operator (¬) in the BetaE framework. Neither points nor boxes have a usable notion of negation, so BetaE models entities and queries as Beta distributions. Projection and intersection are modeled nicely with Beta distributions as well (negation is a distribution with reciprocal alpha and beta parameters). In addition to DNF, we can use De Morgan’s law to replace disjunctions with negations and conjunctions. Check out the nice illustration of the approach below 👇

🧪 BetaE slightly outperforms Query2Box on existing query patterns 📈 while deriving and experimenting with new patterns with negations that cannot be answered by any existing QE approach yet 💀. Two more 🔑 differences to Q2B: BetaE better captures query uncertainty (the correlation between the differential entropy of the Beta embedding and the cardinality of the answer set is up to 77% better), and it can estimate whether a given query has zero answers.

Answering a FOL query “List the presidents of European countries that have never held the World Cup” with conjunction, disjunction, and negation operators. Source: Ren and Leskovec
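To make the negation trick more concrete, here is a minimal sketch (my own toy code under simplifying assumptions, not the authors’ implementation) of per-dimension Beta embeddings with reciprocal-parameter negation and a uniformly weighted intersection; BetaE learns the intersection weights with attention and handles projection with a neural network.

```python
import torch
from torch.distributions import Beta

def negate(alpha, beta, eps=1e-6):
    # BetaE-style negation: take the reciprocal of both parameters, so dense
    # regions (alpha, beta > 1) become sparse ones (alpha, beta < 1) and vice versa.
    return 1.0 / (alpha + eps), 1.0 / (beta + eps)

def intersect(alphas, betas):
    # Simplified intersection: uniform interpolation of the parameters
    # (the real model computes these weights with an attention module).
    return alphas.mean(dim=0), betas.mean(dim=0)

# a toy 4-dimensional query embedding: one Beta distribution per dimension
alpha = torch.tensor([2.0, 5.0, 0.5, 1.5])
beta = torch.tensor([3.0, 1.0, 0.5, 2.0])

neg_alpha, neg_beta = negate(alpha, beta)

# query uncertainty can be read off the differential entropy of the embedding
print("entropy:", Beta(alpha, beta).entropy().sum().item())
print("entropy after negation:", Beta(neg_alpha, neg_beta).entropy().sum().item())

# intersection of two queries
alpha2, beta2 = torch.tensor([1.0, 2.0, 2.0, 1.0]), torch.tensor([2.0, 2.0, 1.0, 1.0])
a_int, b_int = intersect(torch.stack([alpha, alpha2]), torch.stack([beta, beta2]))
print("intersection parameters:", a_int, b_int)
```

The entropy printout hints at why BetaE can capture query uncertainty: broader answer sets correspond to higher-entropy Beta embeddings.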

On the other hand, Sun et al identified that Q2B and other systems are not logically faithful, that is, not all logically entailed query answers can be retrieved by the QE system. To bridge this 🕳 gap, the authors introduce EmQL (Embedding Query Language). EmQL still embeds entities into a d-dimensional space and supports ∧, ∨, and ∃, but takes a different approach to modelling sets 🤔. Instead of boxes or Beta distributions, the authors encode each set X with a pair (a_x, b_x), where a_x is a weighted centroid of the set elements and b_x is a count-min sketch (CM sketch). Each sketch consists of D hash functions of width W (thus a D×W matrix; the authors pick 20×2000).
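For reference, here is a tiny, generic count-min sketch (a standalone illustration of the data structure, not EmQL’s differentiable version): D hash functions map entity ids into a width-W array of counters, and a membership weight is read back as the minimum over the D rows.

```python
import numpy as np

class CountMinSketch:
    def __init__(self, depth=20, width=2000, seed=0):
        rng = np.random.default_rng(seed)
        self.depth, self.width = depth, width
        self.table = np.zeros((depth, width))
        # one random hash per row: h_d(i) = ((a_d * i + b_d) mod p) mod width
        self.a = rng.integers(1, 2**31 - 1, size=depth)
        self.b = rng.integers(0, 2**31 - 1, size=depth)
        self.p = 2**31 - 1  # a Mersenne prime

    def _rows_cols(self, item):
        cols = (self.a * item + self.b) % self.p % self.width
        return np.arange(self.depth), cols

    def add(self, item, weight=1.0):
        rows, cols = self._rows_cols(item)
        self.table[rows, cols] += weight

    def query(self, item):
        rows, cols = self._rows_cols(item)
        return self.table[rows, cols].min()

sketch = CountMinSketch()
for entity_id in [3, 17, 42]:
    sketch.add(entity_id)
print(sketch.query(17), sketch.query(999))  # ≈1.0 for members, ≈0.0 otherwise
```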

How does it work?
  • Using centroids, a top-k MIPS a_x.T.mm(E) retrieves k candidate entities that might belong to X;
  • Importantly, for CM sketches we have a differentiable retrieval operator CM(i, b_x) that returns the weight of entity i in the set X;
  • We can then combine MIPS with CM-based filtering 🥂;
  • The authors then define ∧, ∨, and ∃ as operators over centroids and CM sketches (a rough sketch of the retrieval step follows the list below).
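Putting the pieces together, answering a set query could look roughly like the sketch below; this is my own simplification that reuses the CountMinSketch class from the previous snippet and a random entity-embedding matrix E, whereas EmQL’s actual operators are differentiable end to end.

```python
import numpy as np

def retrieve_set(a_x, sketch, E, k=10):
    # 1) top-k MIPS with the centroid: which entities align best with the set embedding?
    scores = E @ a_x                       # (num_entities,)
    candidates = np.argsort(-scores)[:k]   # top-k candidate entity ids
    # 2) re-weight candidates by their count-min sketch membership weight
    weights = np.array([sketch.query(int(i)) for i in candidates])
    return candidates, scores[candidates] * weights

num_entities, dim = 1000, 64
E = np.random.randn(num_entities, dim)
members = [3, 17, 42]
a_x = E[members].mean(axis=0)              # centroid of the set (uniform weights here)

sketch = CountMinSketch()
for entity_id in members:
    sketch.add(entity_id)

candidates, weights = retrieve_set(a_x, sketch, E)
print(candidates[:5], weights[:5])
```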

🧪 In experiments, the authors probe EmQL on generalization (answering queries, the standard QE task) and entailment (when the full KG is given and no link prediction is required). On average, EmQL outperforms Q2B by 10–15 H@3 points on FB15k-237 and NELL on the generalization task, and completely dominates on the entailment task (94.2 vs 36.2) 👀.
Furthermore, EmQL was tested on multi-hop QA benchmarks like MetaQA and WebQSP, where it outperforms even the recent EmbedKGQA from ACL 2020 💪
Note that EmQL does not support negations (¬), which are allowed by BetaE. Yet? 😉

Source: Sun et al

KG Embeddings: NAS, 📦 vs 🔮, Meta-Learning

Something really interesting is happening at NeurIPS this year beyond ‘yet-another-KG-embedding-algorithm’ papers. You’ve probably heard about Neural Architecture Search (NAS) and its successes in computer vision — for instance, recent architectures like EfficientNet are not designed by humans. Instead, a NAS system generates a neural net from a bunch of smaller building blocks 🧱 while optimizing certain metrics. Can we have a NAS that generates efficient architectures for KG-related tasks?

When NAS for KG embeddings actually works

Zhang et al say yes! They propose Interstellar, an RNN-based NAS approach for relational paths. Interstellar first requires sampling paths from a KG (biased random walks in this case), and those paths are then fed into an RNN. The whole RNN net (cell and weights) is the subject 🎯 of NAS. The process is split into two parts: macro-level (e.g., scoring functions) and micro-level (activations and weights), which are governed by a controller 🤖. As Hits@10 and related metrics are non-differentiable, the authors resort to policy gradients to optimize the controller.
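The full Interstellar pipeline will not fit here, so below is only a generic policy-gradient (REINFORCE) controller sketch, a toy example of mine rather than the authors’ code, showing how discrete architecture choices can be optimized against a non-differentiable reward such as validation Hits@10; evaluate_hits_at_10 is a hypothetical stand-in for training and scoring a sampled architecture.

```python
import torch
import torch.nn as nn

# toy search space: one categorical choice per design dimension
SEARCH_SPACE = {
    "score_fn": ["TransE-like", "DistMult-like", "ComplEx-like"],
    "activation": ["tanh", "relu", "identity"],
}

class Controller(nn.Module):
    def __init__(self, space):
        super().__init__()
        self.logits = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(len(opts))) for name, opts in space.items()}
        )

    def sample(self):
        choices, log_prob = {}, 0.0
        for name, opts in SEARCH_SPACE.items():
            dist = torch.distributions.Categorical(logits=self.logits[name])
            idx = dist.sample()
            log_prob = log_prob + dist.log_prob(idx)
            choices[name] = opts[idx.item()]
        return choices, log_prob

def evaluate_hits_at_10(arch):
    # hypothetical stand-in: train the sampled architecture briefly, return validation Hits@10
    return 0.4 if arch["score_fn"] == "ComplEx-like" else 0.2

controller = Controller(SEARCH_SPACE)
opt = torch.optim.Adam(controller.parameters(), lr=0.05)
baseline = 0.0
for step in range(200):
    arch, log_prob = controller.sample()
    reward = evaluate_hits_at_10(arch)          # non-differentiable signal
    baseline = 0.9 * baseline + 0.1 * reward    # moving-average baseline
    loss = -(reward - baseline) * log_prob      # REINFORCE
    opt.zero_grad()
    loss.backward()
    opt.step()
```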

🧪 In experiments, Interstellar is probed on link prediction and entity matching tasks, showing competitive results. Each task requires certain seed architectures (like in the picture 👇), and finding a good net might take a while ⏳ (about 30 hours for search and 70 hours for fine-tuning on FB15k-237), but look, it’s the first step showing that NAS is generally applicable to KG-related tasks and can create new RNN architectures! 🤩
Besides, let’s see how fast Nolan’s latest movie, Tenet, gets some traction in the model-naming world 😉

Source: Zhang et al

Geometric embedding models enjoy ever-increasing attention 👀 from the community! Last year, in the NeurIPS’19 post, we noticed a surge in approaches that use hyperbolic geometry 🔮 for graph representation learning. This year, we have a new strong geometric competitor: hyper-rectangles, aka boxes 📦!

Whereas Query2Box used boxes for query embedding, Abboud et al develop the idea further and design BoxE, a provably fully expressive KG embedding model where entities are points in a vector space and relations are boxes 📦. Each relation is modeled with as many boxes as its arity, e.g., for a binary predicate capitalOf(Berlin, Germany) there will be two boxes for the head and tail entities, and for n-ary predicates there will be n boxes. Each entity, in addition to its base position, has an additional parameter, a translational bump, which aims to bring entities occurring in the same relation closer together 🎳 (check out the illustrated example 👇).
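As a rough illustration (a simplified sketch, not the paper’s exact piecewise distance function), scoring a binary fact amounts to measuring how far each bumped entity point lands from its relation box:

```python
import torch

def boxe_like_score(head, tail, bump_head, bump_tail, centers, widths):
    # centers, widths: (2, dim) -- one box per argument position of the binary relation
    points = torch.stack([head + bump_tail,   # the head point is translated by the tail's bump
                          tail + bump_head])  # and vice versa
    # distance of each point to its box (0 inside the box); BoxE uses a
    # piecewise function that also rewards points deep inside the box
    dist = torch.clamp(torch.abs(points - centers) - widths / 2, min=0.0)
    return -dist.sum()

dim = 8
head, tail = torch.randn(dim), torch.randn(dim)
bump_h, bump_t = torch.randn(dim), torch.randn(dim)
centers, widths = torch.randn(2, dim), torch.rand(2, dim) + 0.5
print(boxe_like_score(head, tail, bump_h, bump_t, centers, widths))
```

For an n-ary fact the same idea applies with n boxes and n bumped points.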

The authors do invest in theory 📚 and prove several important properties of BoxE: it can model many inference patterns except composition, it allows for rule injection 💉 (and hence for injecting ontological axioms), and it is fully expressive. However, it is fully expressive only when the embedding dimension is |E|×|R| for binary relations and |E|^(n-1)×|R| for n-ary predicates, which is, hmm, a bit too much 😕 (interestingly, the authors of Query2Box also showed that you need about |E| embedding dimensions to model an arbitrary FOL query).

āš—ļø BoxEĀ was evaluated on triple-based benchmarks like FB15k-237 as well as on n-ary graphs like JF17K. Although the embedding dimension varies in the range 200–1000 (not 15000Ɨ237 as theory needs for FB15k-237, for example),Ā BoxEĀ is still quite competitive and performs on par āš–ļø with current SOTA on the graphs w/o many compositional patterns. The authors also compiled a nice experiment on injecting logical rules over the NELL-sports dataset and showed impressive >25 MRR points gains šŸ’Ŗ.

As 2020 is the year of boxes 📦, do not miss the work of Dasgupta et al, published here at NeurIPS, who study boxes more deeply with respect to local identifiability and come up with the idea of using Gumbel distributions to model box parameters.

Source: Abboud et al

We also remember E2R from NeurIPS 2019, a KG embedding model based on quantum logic with interesting properties (either very high 👍 or very low 👎 performance). At that time, E2R only worked in the transductive setup (which means the whole graph is seen during training). This year, Srivastava et al further extend the model and come up with IQE (Inductive Quantum Embedding). 🔑 Essentially, IQE now accepts node features, so an entity embedding has to correlate with its feature vector. Furthermore, IQE is now optimized with a novel Alternating Minimization scheme which the authors find to be approximately 9 times faster 🚀 than vanilla E2R. The authors also provide a solid theoretical justification of the model’s properties and of when one should expect the underlying problem to be NP-hard.

šŸ‘©ā€šŸ”¬ Conceptually, the model supports binary predicates, but the authors concentrate on the fine-grained entity typing task (FIGER, Ontonotes, TypeNet) using BiLSTM as a context encoder. Note that IQE needs only about 6 epochs to converge (on FIGER — in comparison, E2R required 1000 iterations)! Qualitatively, IQE outperforms the original transductive model by 25–30 accuracy and F1 points šŸ“ˆ

Source: Srivastava et al

Continuing on inductive tasks, Baek et al study two particular link prediction setups: 1) given a seen training graph, a new unseen 👻 node arrives, and you need to predict its connections to seen 👓 nodes (👻 -> 👓); 2) more unseen nodes arrive, and you need to predict links among the unseen nodes themselves (👻 -> 👻). Sounds pretty complex, right? Usually, in transductive tasks, a model learns entity and relation embeddings of all seen nodes, and inference is performed on the set of seen nodes. Here, we have unseen nodes and, often, no node features.

Source: Baek et al

The authors resort to meta-learning and propose Graph Extrapolation Networks (GEN), designed to extrapolate knowledge from seen entities to unseen ones. Furthermore, the authors define the task in the few-shot setting, that is, unseen new nodes might have only K (3–5) links to existing nodes or to other unseen nodes 🤔.

The meta-learning 👩‍🏫 task for GEN relies mostly on relations: given a support set of K triples for an unseen node e_i, apply neighborhood aggregation through a learnable relation-specific weight W_r. In fact, 👉 any relation-aware GNN architecture might be plugged in here. In other words, we meta-learn an embedding of an unseen entity using its neighbors’ representations. To cater for the uncertainty of the few-shot scenario, the authors stochastically embed unseen entities as a distribution whose parameters are learned with 2 GEN layers through MC sampling (somewhat resembling GraphVAEs).
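Here is a minimal sketch of that relation-specific aggregation step, a simplification of mine with hypothetical names; GEN additionally weights the support triples, stacks a second (stochastic) GEN layer, and applies MC sampling.

```python
import torch
import torch.nn as nn

class RelationAwareAggregator(nn.Module):
    """Embed an unseen entity from its K support triples (relation_k, neighbor_k)."""
    def __init__(self, num_relations, dim):
        super().__init__()
        # one weight matrix W_r per relation
        self.W = nn.Parameter(torch.randn(num_relations, dim, dim) * 0.01)

    def forward(self, rel_ids, neighbor_embs):
        # rel_ids: (K,) relation of each support triple
        # neighbor_embs: (K, dim) embeddings of the (seen) neighbor entities
        messages = torch.einsum("kij,kj->ki", self.W[rel_ids], neighbor_embs)
        return torch.relu(messages.mean(dim=0))   # embedding of the unseen entity

dim, num_relations, K = 32, 10, 3
agg = RelationAwareAggregator(num_relations, dim)
support_rels = torch.tensor([2, 5, 2])
support_neighbors = torch.randn(K, dim)
unseen_embedding = agg(support_rels, support_neighbors)
print(unseen_embedding.shape)  # torch.Size([32])
```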

🧪 GEN has been evaluated on 1- and 3-shot LP tasks on FB15k-237 and NELL-995, yielding significant 👍 improvements when considering unseen-to-unseen links. In addition, GEN has been applied to relation prediction on the DeepDDI and BioSNAP-sub datasets with impressive gains over baselines, e.g., 0.708 vs 0.397 AUPRC on DeepDDI.

🔥 Overall, NeurIPS’20 opened up several prospects in the KG embedding area: look, Neural Architecture Search 🔎 works, Meta-Learning works, Quantum and 📦 models are getting more expressive! Thanks to that, we can now solve much more complex tasks than vanilla transductive link prediction.

SPARQL and Compositional Generalization

šŸ“ In question answering over KGs (KGQA), semantic parsing transforms a question into a structured query (say, in SPARQL) which is then executed against a database. One of the šŸ”‘ problems there is compositional generalization, that is, can we build complex query patterns after observing simple atoms? In theĀ ICLR’20 post, we reviewed a new large-scale datasetĀ Complex Freebase Question (CFQ) (let’s forgive them for šŸ§Ÿā€ā™‚ļø Freebase) that was designed to measure compositional generalization capabilities of NL 2 SPARQL approaches. Notably, baselines like LSTMs and Transformers perform quite poorly: <20% accuracy on average šŸ˜•

🚒 Guo et al present a thorough study of potential caveats; one of the biggest issues is sequential decoding ⛓, or any kind of ordering bias when generating queries or logical forms, including tree decoding. Instead, they propose to leverage partially ordered sets (posets) and, building on them, Hierarchical Poset Decoding (HPD). Posets allow us to enforce permutation invariance in the decoding process (for instance, predicting the two branches of a logical AND operator independently) so that the model can concentrate on generalization. Posets can be represented as DAGs, and the components of such a DAG can be predicted by a simple RNN (which is what the authors resort to).

However, direct prediction of posets does not bring benefits (it works even worse than LSTMs and Transformers 📉). The essential part is hierarchical decoding (check the 🖼 below), which consists of 4 steps. 1️⃣ First, we predict a poset sketch (a de-lexicalized DAG). 2️⃣ Independently, we predict the primitives of our query (a sort of entity and relation recognition). 3️⃣ Then, we fill the primitives into the poset sketch in all possible permutations, and 4️⃣ predict which particular paths actually belong to the correct target poset.
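To give a flavour of steps 3️⃣ and 4️⃣, here is a toy enumeration (my own illustration with made-up primitives, not the authors’ code) that fills a de-lexicalized sketch with predicted primitives; a learned classifier would then keep only the candidate paths that belong to the target poset.

```python
from itertools import product

# step 1: a de-lexicalized poset sketch -- here a single path with two open slots
sketch_path = ("?x", "RELATION", "ENTITY")   # e.g. "?x <RELATION> <ENTITY>"
# step 2: independently predicted primitives
primitives = {"ENTITY": ["M0", "M1"],
              "RELATION": ["ns:film.directed_by", "ns:film.produced_by"]}

# step 3: fill the sketch with every combination of primitives
slots = [tok for tok in sketch_path if tok in primitives]
candidates = []
for combo in product(*(primitives[s] for s in slots)):
    values = iter(combo)
    filled = tuple(next(values) if tok in primitives else tok for tok in sketch_path)
    candidates.append(filled)

print(candidates)   # 4 candidate paths
# step 4 (not shown): a binary classifier scores each path and keeps the correct ones
```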

🧪 Experimentally, HPD performs surprisingly well 👀 — on average, 70% accuracy on 3 MCD splits compared to 20% by the Universal Transformer and 40%-ish by the mighty T5-11B. Ablations show that seq2seq and seq2tree sketch prediction only worsens the performance, and the hierarchical component is crucial (otherwise minus 50% accuracy). 🔥 Hopefully, this work will inspire more research on compositional generalization and complex KGQA!

Source: Guo et al

Benchmarking: OGB, GraphGym, KeOps

Tired of seeing Cora/Citeseer/Pubmed in every other GNN paper? You should be: they are small, expose certain biases, and models’ performance has pretty much saturated. Time for a big change!

Source: Hu et al

Open Graph Benchmark (OGB) (paper by Hu et al) is a great new effort by the Graph ML community to create a set of complex and diverse tasks on different forms of graphs (leaderboards included 🏆). OGB offers node classification, graph classification, and link prediction tasks on graphs of various sizes (as of now, the biggest graph contains ~100M nodes and ~1.6B edges) and domains (KGs are here, too 😍: Wikidata-based and BioKG link prediction datasets).

🔥 OGB leaderboards have already generated several Twitter storms: for instance, suddenly, a simple label propagation algorithm with 10K–100K parameters outperforms big and slow GNNs with 1M+ parameters by a large margin on transductive node classification tasks 🙈. Clearly, there is still a lot of unexplored room in the capabilities and limitations of GNNs. Could Cora/Citeseer/Pubmed demonstrate it? Probably not 🤷‍♀️.
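If you want to try the KG-flavoured part of OGB, loading a link prediction dataset takes a few lines; the sketch below follows my reading of the ogb package, so dataset names and return formats may differ slightly across versions, and the official docs are the source of truth.

```python
# pip install ogb
from ogb.linkproppred import LinkPropPredDataset, Evaluator

dataset = LinkPropPredDataset(name="ogbl-biokg")   # the Wikidata-based KG is also available
split_edge = dataset.get_edge_split()              # dict with 'train' / 'valid' / 'test' triples

print(split_edge["train"].keys())                  # head, relation, tail (+ entity types for biokg)

evaluator = Evaluator(name="ogbl-biokg")           # standardized MRR / Hits@k evaluation
print(evaluator.expected_input_format)
```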

Okay, we now have a great variety of tasks! On the other hand, we have dozens of GNN architectures and hundreds of hyperparameters to tune. Is there a sweet spot, a good starting point to dig into a certain task? The space is so large! 🤯 You, Ying, and Leskovec tackle exactly this problem of exploring the design space of GNNs and introduce GraphGym, a comprehensive suite for creating and evaluating GNNs (and flexing your GNN muscles 💪). The authors define GNN design and task spaces, each consisting of fine-grained details, e.g., 12 design dimensions: batch norm, dropout rates, aggregation functions, activation functions, node feature pre-/postprocessing layers, number of message passing layers, skip connections, batch size, learning rate, optimizers, and training epochs. Couple that with dozens of tasks, and the Cartesian product of possible combinations surpasses 10M options! 👀
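Just to illustrate how quickly such a design space explodes, here is a toy enumeration over a handful of hypothetical dimensions and values; GraphGym itself drives this via configs and controlled random sampling of setups rather than a full grid search.

```python
from itertools import product
import random

design_space = {
    "num_mp_layers": [2, 4, 6, 8],
    "aggregation": ["mean", "max", "sum"],
    "activation": ["relu", "prelu", "swish"],
    "batch_norm": [True, False],
    "dropout": [0.0, 0.3, 0.6],
    "skip_connections": ["none", "skip-sum", "skip-cat"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

total = 1
for options in design_space.values():
    total *= len(options)
print(f"{total} combinations from only {len(design_space)} dimensions")

# instead of a full grid search, sample a manageable number of configurations
random.seed(0)
sampled = [{k: random.choice(v) for k, v in design_space.items()} for _ in range(96)]
print(sampled[0])
```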

In a rich experimental agenda, the authors find the best-working combinations that you can adopt as good starting points, and they produce very insightful charts 👇. The repo is openly available, so you can start experimenting pretty much right away!

😉 By the way, if you’re looking for something similar in the KG embeddings domain, our team has recently completed a huge survey of models and hyperparameters concentrating on the link prediction task.

96 setups sampled from 10M possible combinations. Source: You, Ying, and Leskovec

āš”ļø Finally, I would like to outline the work byĀ Feydy et alĀ onĀ KeOps, a blazing fast kernel operations library with NumPy, PyTorch, R, and Matlab bindings. In addition to widely used dense and sparse matrices, the authors supportĀ symbolic matricesĀ (whereĀ ij-thĀ member is computed via a certain formulaĀ F,Ā often a matrix reduction formula). Symbolic matrices are computed on the fly šŸš€ and optimized for CUDA computations. The authors do invest into benchmarking: on a rather standard server workstation with a 8-core Xeon, 128 Gb RAM, RTX 2080 Ti, KeOps outperforms isĀ 5x — 20xĀ fasterĀ than PyTorch implementation on the same tasks (then PyTorch crashes with OOM while KeOps works fine).

  • You can also perform a kNN search and be competitive with FAISS!
  • Some implementations in PyTorch-Geometric already work well with KeOps
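Here is roughly what the symbolic, lazy style looks like for a brute-force kNN search; this is a small sketch of the pykeops.torch API as I remember it, so double-check the names against the official docs.

```python
# pip install pykeops
import torch
from pykeops.torch import LazyTensor

N, M, dim, K = 100_000, 100_000, 64, 10
x = torch.randn(N, dim)          # queries
y = torch.randn(M, dim)          # database

X_i = LazyTensor(x[:, None, :])  # (N, 1, dim) symbolic
Y_j = LazyTensor(y[None, :, :])  # (1, M, dim) symbolic

D_ij = ((X_i - Y_j) ** 2).sum(-1)     # (N, M) symbolic distance matrix, never materialized
knn_indices = D_ij.argKmin(K, dim=1)  # indices of the K nearest neighbors for each query
print(knn_indices.shape)              # torch.Size([100000, 10])
```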

Personally, I’ve been using PyKeOps since the summer and find it extremely helpful when working with large-scale KGs. Besides, I compiled the library on a PowerPC + CUDA cluster, so please feel my pain 😅

KeOps uses symbolic matrices! Source: Feydy et al

Wrapping Up

NeurIPS concludes the line-up of top AI conferences, but ICLR 2021 scores are already out there 😉. If you want to stay updated on Graph ML topics, you can subscribe to the regular newsletter by Sergey Ivanov or join the GraphML Telegram channel!

Merry Christmas, happy New Year, and stay safe 🤗

This article was originally published on Medium and re-published to TOPBOTS with permission from the author.
