NeurIPS is a major venue covering a wide range of ML & AI topics. Of course, there is something interesting for Graph ML aficionados and knowledge graph connoisseurs š§. Tune in to find out!

This year, NeurIPS had 1900 accepted papers š³ and 100+ among them are on graphs. Plus take into account several prominent workshops likeĀ KR2ML,Ā DiffGeo4ML, andĀ LMCA. Be sure to check their proceedings as such workshop papers are likely to appear at future venues like ICLR, ICML, or ACL. Furthermore, š theĀ GML Newsletter #4Ā š by Sergey Ivanov presents an overview of Graph ML papers from NeurIPS including theory, oversmoothing, scalability, and many more, so check it out as well.

In this post, Iād like to put an emphasis on a particular type of graphs,Ā *knowledge graphs (KGs)*, and explore with you 10 papers that might be quite influential in 2021. NeurIPS papers are often a bit more high level with more theory than NLP applications in *CL conferences, so Iād summarize them as:

Behind the curtain of transductive link prediction we step into the valley of logical reasoning tasks for KGs to excel at

Get some āļø, šµ, or even some GlĆ¼hwein ā today in our agenda:

- Query Embedding: Beyond Query2Box
- KG Embeddings: NAS, š¦ vs š®, Meta-Learning
- SPARQL and Compositional Generalization
- Benchmarking: OGB, GraphGYM, KeOps
- Wrapping up

**If this in-depth educational content on is useful for you, you canĀ subscribe to our AI research mailing listĀ to be alerted when we release new material.Ā **

## Query Embedding: Beyond Query2Box

Query embedding (QE) is about answering queries against a KG directly in the embedding space without any SPARQL or graph database engines. Given that most of KGs are sparse and incomplete, query embedding algorithms are able to infer missing links (with a certain probability). This is one of the hottest topics in Graph ML so far! š„

In theĀ ICLR 2020 post, we coveredĀ **Query2Box**, a strong QE baseline capable of answering logical queries withĀ *conjunction*Ā (ā§),Ā *disjunction*Ā (āØ), andĀ *existential quantifiers*Ā (ā) by modeling entities as d-dimensional boxes š¦.

**Ren and Leskovec**Ā (the authors of the original Query2Box) finally add aĀ *negation*Ā operator (Ā¬) in theĀ **BetaE**Ā framework. Neither points nor boxes have a usable denotation of negation so thatĀ *BetaE*Ā models entities and queries asĀ Beta distributions. Projection and intersection are modeled nicely with beta distributions as well (negation is a distribution with reciprocal alpha and beta parameters). In addition toĀ DNF, we can useĀ De Morganās lawĀ for replacing disjunctions with negations and conjunction. Check out the nice illustration of the approach below š

*š§ŖĀ BetaE*Ā slightly outperformsĀ

*Query2Box*Ā on existing query patterns š while deriving and experimenting on new patterns with negations that can not be answered by any existing QE approach yet š. Two more š differences toĀ

*Q2B*:Ā

*BetaE*Ā better captures query uncertainty (correlation between the differential entropy of the Beta embedding and the cardinality of the answer set, up to 77% better), and can estimate if a given query has zero answers.

On the other hand, Sun et al identified that Q2B and other systems are notĀ *logically faithful*, that is, not allĀ *logically entailed*Ā query answers can be retrieved by the QE system. To bridge this š³ gap, the authors introduceĀ **EmQL**Ā (*Embedding Query Language*).Ā *EmQL*Ā still embeds entities into a d-dimensional space and supports ā§, āØ, and ā, but takes a different approach on modelling sets š¤. Instead of boxes or Beta distributions, the authors encode each setĀ *X*Ā with a pairĀ `(a_x,b_x)`

Ā whereĀ `a_x`

Ā is a weighted centroid of set elements, andĀ `b_x`

Ā is aĀ count-min sketchĀ (CM sketch). Each sketch consists ofĀ *D*Ā hash functions of depthĀ *W*Ā (thus aĀ *DĆW*Ā matrix, the authors select 20*Ć*2000).

How does it work?

* Using centroids, top-k MIPSĀ `a_x.T.mm(E)`

Ā retrievesĀ *k*Ā possible candidate entities that belong toĀ *X*;

* Importantly, for CM sketches we have a differentiable retrieval operatorĀ `CM(i, b_x)`

Ā that returns a weight of entityĀ *i*Ā in a setĀ *X;*

* We then can combine MIPS with CM-based filtering š„

* The authors then define ā§, āØ, and ā as operators over centroids and CM sketches

š§Ŗ InĀ experiments, the authors probeĀ *EmQL*Ā onĀ *generalization*Ā (answering queries, standard QE task) andĀ *entailment*Ā (when a full KG is given, no link prediction required). On average,Ā *EmQL*Ā outperformsĀ *Q2B*Ā by 10ā15 H@3 points on FB15k-237 and NELL on the generalization task, and completely dominates on the entailment task (94.2 vs 36.2) š.

Furthermore, EmQL was tested on multi-hop QA benchmarks like MetaQA and WebQSP where it outperforms evenĀ the recent EmbedKGQA from the ACL 2020Ā šŖ

Note thatĀ *EmQL*Ā does not support negations (Ā¬) allowed byĀ *BetaE.Ā *Yet? š

## KG Embeddings: NAS, š¦ vs š®, Meta-Learning

Something really interesting this year at NeurIPS going beyond ā*yet-another-KG-embedding-algorithm*ā. Youāve probably heard aboutĀ **Neural Architecture Search (NAS)**Ā and its successes in computer vision ā for instance, recent architectures likeĀ EfficientNetĀ are not designed by humans. Instead, a NAS system generates a neural net from a bunch of smaller building blocks š§±optimizing certain metrics. Can we have a NAS to generate efficient architectures for KG-related tasks?

**Zhang et al**Ā say yes! They proposeĀ **Interstellar**, an RNN-based NAS approach for relational paths.Ā *Interstellar*Ā first requires sampling paths from a KG (biased random walks in this case), and then those paths are fed into an RNN. The whole RNN net (cell and weights) is a subject šÆ for NAS. The process is split into two parts: macro-level (e.g., scoring functions) and micro-level (activations and weights) which are governed by a controller š¤. As Hits@10 and related metrics are non-differentiable, the authors resort to policy gradients to optimize the controller.

š§Ŗ InĀ experiments,Ā *Interstellar*Ā is probed on link prediction and entity matching tasks showing competitive results. Each task requires certain seed architectures (like on a picture š), and finding a good net might take a while ā³ (about 30 hours on search and 70 hours on fine-tuning for FB15k-237), but look, itās the first step showing that NAS is generally applicable for KG-related tasks and can create new RNN architectures! š¤©

Besides, letās see how fast the next Nolanās movie, Tenet, will get some traction in the model-naming world š

Geometric embedding models enjoy ever-increasing attention š from the community! Last year,Ā in the NeurIPSā19 post, we noticed a surge in approaches that use hyperbolic geometry š® for graph representation learning. This year, we have a new strong geometric competitor: hyper-rectangles, aka boxes š¦!

WhereasĀ Query2BoxĀ used boxes for query embedding,Ā **Abboud et al**Ā develop the idea further and designĀ **BoxE**, a provably fully-expressive KG embedding model where entities are points in a vector space and relations are boxes š¦. Each relation is modeled with as many boxes as theĀ *arity*Ā of the relation is, e.g., for aĀ ** binary**Ā predicateĀ

`capitalOf(Berlin, Germany)`

Ā there will beĀ **Ā boxes for head and tail entities, and forĀ**

*two***Ā predicates, there will beĀ**

*n-ary***Ā boxes. Each entity, in addition to the base position, has an additional parameterĀ**

*n***Ā which aims at bringing entities occurring in the same relation closer š³ (check out an illustrated example š).**

*translational bump*The authors do invest into theory š and prove several important properties ofĀ *BoxE*: it can model many inference patterns except composition, it allows for rule injection š (and hence, for injecting ontological axioms), and it is fully expressive. However, itās fully expressive only when the embedding dimension isĀ **|E|x|R|**Ā for binary relations andĀ **|E|^(n-1)x|R|**Ā for n-ary predicates, which is, hmm, a bit too much š (interestingly, the authors of Query2Box also showed that you need aboutĀ **|E|**Ā embedding dimension for modeling an arbitrary FOL query).

ā*ļø BoxE*Ā was evaluated on triple-based benchmarks like FB15k-237 as well as on n-ary graphs like JF17K. Although the embedding dimension varies in the range 200ā1000 (not 15000Ć237 as theory needs for FB15k-237, for example),Ā *BoxE*Ā is still quite competitive and performs on par āļø with current SOTA on the graphs w/o many compositional patterns. The authors also compiled a nice experiment on injecting logical rules over the NELL-sports dataset and showed impressive >25 MRR points gains šŖ.

As 2020 is the year of boxes š¦, do not miss the work ofĀ **Dasgupta et al**Ā published here at NeurIPS who study boxes deeper on the subject of local identifiability and come up with an idea of using Gumbel distributions to model box parameters.

We also rememberĀ E2R from NeurIPS 2019, a KG embedding model based on quantum logic with interesting properties (either very high š or very low š performance). By that time, E2R only worked in the transductive setup (which means the whole graph is seen during training). This year,Ā **Srivastava et al**Ā further extend the model and come up with **IQE (Inductive Quantum Embedding)**. š Essentially,Ā *IQE*Ā now accepts node features so that an entity embedding has to correlate with its feature vector. Furthermore,Ā *IQE*Ā is now optimized with a novelĀ *Alternating Minimization*Ā scheme which the authors find to be approximately 9 times faster š than vanilla E2R. The authors also provide a solid theoretical justification of modelās properties and when one should expect the model to be NP-hard.

š©āš¬ Conceptually, the model supports binary predicates, but the authors concentrate on the fine-grained entity typing task (FIGER, Ontonotes, TypeNet) using BiLSTM as a context encoder. Note that IQE needs only about 6 epochs to converge (on FIGER ā in comparison, E2R required 1000 iterations)! Qualitatively, IQE outperforms the original transductive model by 25ā30 accuracy and F1 points š

Continuing on inductive tasks,Ā ** Baek et al** study two particular link prediction setups: 1) given a training seen graph, a newĀ

*unseen*Ā š» node arrives, and you need to predict its connections toĀ

*seen*Ā š nodes (š» -> š ); 2) moreĀ

*unseen*Ā nodes arrive and you need to predict links amongĀ

*unseen*Ā nodes themselves (š» -> š» ). Sounds pretty complex, right? Usually, in transductive tasks, a model learns entity and relation embeddings of all seen nodes, and inference is performed on a set of seen nodes. Here, we have unseen nodes, and, often, without node features.

The authors resort toĀ **meta-learning**Ā and proposeĀ **Graph Extrapolation Networks (GEN)**designed toĀ *extrapolate*Ā the knowledge from the seen entities to unseen. Furthermore, the authors define the task in theĀ *few-shot*Ā setting, that is, unseen new nodes might have 3ā5 (**K**) links to existing nodes or between other unseen nodes š¤.

The meta-learning š©āš« task for GEN relies mostly onĀ **relations**: given a support set ofĀ **K**Ā triples for an unseen nodeĀ *e_i*, apply neighborhood aggregation through a learnable relation-specific weightĀ **Wr**. In fact, š any relation-aware GNN architecture might be plugged in here. In other words, we meta-learn an embedding of an unseen entity using its neighborsā representations. To cater for the uncertainty of the few-shot scenario, the authors stochastically embed unseen entities as a distribution which parameters are learned with 2 GEN layers through MC sampling (somewhat resemblesĀ GraphVAEs).

š§Ŗ GENĀ has been evaluated on 1- and 3- shot LP tasks on FB15k-237 and NELL-995 yielding significant š improvements when considering unseen-to-unseen links. In addition, GEN has been applied to relation prediction onĀ *DeepDDI*Ā andĀ *BioSNAP-sub*Ā datasets with impressive gains over baselines, e.g., 0.708 vs 0.397 AUPRC on DeepDDI.

š„ Overall, NeurIPSā20 opened up several prospects in the KG embedding area: look, Neural Architecture Search š works, Meta-Learning works, Quantum and š¦ models are getting more expressive! Thanks to that, we can now solve much more complex tasks than vanilla transductive link prediction.

## SPARQL and Compositional Generalization

š In question answering over KGs (KGQA), semantic parsing transforms a question into a structured query (say, in SPARQL) which is then executed against a database. One of the š problems there is compositional generalization, that is, can we build complex query patterns after observing simple atoms? In theĀ ICLRā20 post, we reviewed a new large-scale datasetĀ *Complex Freebase Question (**CFQ*) (letās forgive them for š§āāļø Freebase) that was designed to measure compositional generalization capabilities of NL 2 SPARQL approaches. Notably, baselines like LSTMs and Transformers perform quite poorly: <20% accuracy on average š

šĀ **Guo et al**Ā present a thorough study of potential caveats, i.e., one of the biggest issues is sequential decoding ā or any kind of ordering bias when generating queries or logical forms including tree decoding. Instead, they propose to leverageĀ **partially ordered sets ( posets)**Ā and, conversely,Ā

**Hierarchical Poset Decoding (HPD)**.Ā

*Posets*Ā allow us to enforce permutation invariance in the decoding process (for instance, predicting two branches of aĀ

*logical AND*Ā operator independently) so that a model could concentrate on generalization. Posets can be represented as DAGs. Components of that DAG can be predicted by a simple RNN (which the authors resort to).

However, direct prediction of posets does not bring benefits (works even worse than LSTMs and Transformers š ). The essential part is hierarchical decoding (check the š¼ below) which consists of 4 steps. 1ļøā£ First, we predict a post sketch (de-lexicalized DAG). 2ļøā£ Independently, we predict primitives of our query (sort of entity and relation recognition). 3ļøā£ Then, we fill in the primitives into the poset sketch in all possible permutations, and 4ļøā£, predict which particular paths actually do belong to the correct target poset.

š§Ŗ Experimentally,Ā **HPD**Ā performs surprisingly well š ā on average, 70% accuracy on 3 MCD splits compared to 20% by Universal Transformer andĀ 40%-ish by mighty T5ā11B. Ablations show that seq2seq and seq2tree sketch predictions only worsen the performance, and the hierarchical component is crucial (otherwise minus 50% accuracy). š„ Hopefully, this work will inspire more research on compositional generalization and complex KGQA!

## Benchmarking: OGB, GraphGYM, KeOps

Tired of seeing Cora/Citeseer/Pubmed in every other GNN paper? You should be: they are small, expose certain biases, and modelsā performance has pretty much saturated. Time for a big change!

**Open Graph Benchmark (OGB)**Ā (**paper by Hu et al**) is a great new effort by the Graph ML community to create a set of complex and diverse tasks on different forms of graphs (leaderboards included š ). OGB offersĀ *node classification*,Ā *graph classification*,Ā *link prediction*Ā tasks on graphs of various sizes (as of now, the biggest graph contains ~100M nodes and ~1.6B edges) and domains (KGs are here, too š: Wikidata-based and BioKG link prediction datasets).

š„ OGB leaderboards have already generated several Twitter storms: for instance, suddenly, aĀ simple label propagation algorithmĀ of 10K-100K parametersĀ outperformsĀ big and slow GNNs of 1M+ parameters by a large margin on transductive node classification tasks š. Clearly, there is still an unexplored room of capabilities and limitations of GNNs. Could Cora/Citeseer/Pubmed demonstrate it? Probably not š¤·āāļø.

Okay, we have such a big variety of tasks now! On the other hand, we have dozens of GNN architectures and hundreds of hyperparameters to tune. Is there a sweet spot, a good starting point to dig into a certain task? The space is so large! š¤ÆĀ **You, Ying, and Leskovec**Ā tackle exactly this problem of exploring design spaces of GNNs and introduceĀ **GraphGYM**, a comprehensive suite for creating and evaluating GNNs (and flexing your GNN muscles šŖ). The authors define GNN design and task spaces, each consisting of fine-grained details, e.g., 12 design dimensions: batch norm, dropout rates, aggregation functions, activation functions, node features pre-/postprocessing layers, number of message passing layers, skip-layers, batch size, learning rate, optimizers, and training epochs. Couple that with dozens of tasks, and a Cartesian product of possible combinations surpasses 10M options! š

In the rich experimental agenda, the authors find the best working combinations that you could adopt as good starting points and produce very insightful charts š. TheĀ repoĀ is openly available, you could start experimenting pretty much right away!

š By the way, if youāre looking for something similar in the KG embeddings domain, our team has recently completedĀ a huge surveyĀ of models and hyperparameters concentrating on the link prediction task.

ā”ļø Finally, I would like to outline the work byĀ **Feydy et al**Ā onĀ **KeOps**, a blazing fast kernel operations library with NumPy, PyTorch, R, and Matlab bindings. In addition to widely used dense and sparse matrices, the authors supportĀ *symbolic matrices*Ā (whereĀ *ij-thĀ *member is computed via a certain formulaĀ *F,Ā *often a matrix reduction formula). Symbolic matrices are computed on the fly š and optimized for CUDA computations. The authors do invest into benchmarking: on a rather standard server workstation with a 8-core Xeon, 128 Gb RAM, RTX 2080 Ti, KeOps outperforms isĀ **5x ā 20x**Ā **faster**Ā than PyTorch implementation on the same tasks (then PyTorch crashes with OOM while KeOps works fine).

- You can also perform a kNN search and be competitive with FAISS!
- Some implementations inĀ PyTorch-GeometricĀ already work well with KeOPS

Personally, Iāve been using PyKeOps since summer and find it extremely helpful when working with large-scale KGs. Besides, I compiled the library on a PowerPC + CUDA cluster, please feel my pain š

## Wrapping Up

NeurIPS concludes the line-up of top AI conferences, but ICLR 2021 scores are already out there š. If you want to keep updated on Graph ML topics, you could subscribe to theĀ regular newsletterĀ by Sergey Ivanov or join the TelegramĀ GraphML channel!

Merry Christmas, happy New Year, and stay safe š¤

*This article was originally published onĀ Medium and re-published to TOPBOTS with permission from the author.*

## Enjoy this article? Sign up for more AI research updates.

Weāll let you know when we release more summary articles like this one.

The post Machine Learning on Knowledge Graphs at NeurIPS 2020 appeared first on TOPBOTS.