NeurIPS 2020: Key Research Papers in Natural Language Processing (NLP) & Conversational AI

NeurIPS, held every December, is the largest machine learning conference. It brings together researchers in computational neuroscience, reinforcement learning, deep learning, and their applications such as computer vision, fairness and transparency, natural language processing, robotics, and more.

Our team reviewed the papers accepted to NeurIPS 2020 and shortlisted the most interesting ones across different research areas.

If you’re interested in the remarkable keynote presentations, interesting workshops, and exciting tutorials presented at the conference, check our guide to NeurIPS 2020.

Top Natural Language Processing Research Papers at NeurIPS 2020

Pre-trained language models still dominate NLP research advances in 2020. At NeurIPS 2020, top research teams from Facebook AI Research, Carnegie Mellon University, Microsoft Research, and others introduce approaches to:

  • increasing efficiency of transformers,
  • investigating gender bias in language models,
  • improving language generation performance of pre-trained models,
  • crowdsourced training of large neural networks,
  • pre-training language models for multilingual NLP tasks.

Here are the research papers we recommend reading.

Investigating Gender Bias in Language Models Using Causal Mediation Analysis

Jesse Vig (Salesforce Research), Sebastian Gehrmann (Harvard University), Yonatan Belinkov (Harvard University), Sharon Qian (Harvard University), Daniel Nevo (Tel Aviv University), Yaron Singer (Harvard University), Stuart Shieber (Harvard University)

Many interpretation methods for neural models in natural language processing investigate how information is encoded inside hidden representations. However, these methods can only measure whether the information exists, not whether it is actually used by the model. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. The approach enables us to analyze the mechanisms that facilitate the flow of information from input to output through various model components, known as mediators. As a case study, we apply this methodology to analyzing gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model’s sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are concentrated in specific components of the model that may exhibit highly specialized behavior.
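
To make the idea of a mediator more concrete, here is a minimal sketch on a toy two-layer network. It is not the paper's setup: the network, its random weights, the two inputs, and the use of a plain output difference as the effect measure are all assumptions made for illustration. The total effect replaces the whole input, while the indirect effect changes only one hidden neuron to the value it would take under the alternative input.

```python
import numpy as np

def hidden(x, W1):
    """Hidden activations; each neuron is a candidate mediator."""
    return np.tanh(W1 @ x)

def output(h, w2):
    """Scalar model response computed from the hidden layer."""
    return float(w2 @ h)

def total_effect(x, x_alt, W1, w2):
    """Change in the response when the whole input is replaced."""
    return output(hidden(x_alt, W1), w2) - output(hidden(x, W1), w2)

def indirect_effect(x, x_alt, W1, w2, neuron):
    """Change in the response when only one neuron takes the value it
    would have had under the alternative input; all other neurons keep
    their original values."""
    h = hidden(x, W1)
    h_patched = h.copy()
    h_patched[neuron] = hidden(x_alt, W1)[neuron]
    return output(h_patched, w2) - output(h, w2)

# Toy weights and inputs (hypothetical, for illustration only).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)
x = np.array([1.0, 0.0, 0.5])      # original input
x_alt = np.array([0.0, 1.0, 0.5])  # intervened input

te = total_effect(x, x_alt, W1, w2)
ies = [indirect_effect(x, x_alt, W1, w2, j) for j in range(4)]
```

Because this toy output is linear in the hidden layer, the per-neuron indirect effects add up to the total effect. In a real Transformer they generally do not, which is one reason to measure each component separately.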

ConvBERT: Improving BERT with Span-based Dynamic Convolution

Zi-Hang Jiang (National University of Singapore), Weihao Yu (National University of Singapore), Daquan Zhou (National University of Singapore), Yunpeng Chen (Yitu Technology), Jiashi Feng (National University of Singapore), Shuicheng Yan (Yitu Technology)

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for generating the attention map from a global perspective, we observe some heads only need to learn local dependencies, which means existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, using less than 1/4 training cost. 

Code: official TensorFlow implementation is available here.
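
As a rough illustration of a convolution head whose kernel depends on a local span of the input, here is a simplified NumPy sketch. It is not the official implementation: the weight matrices `Wq`, `Wv`, and `Wf` are hypothetical, and the span summary uses a plain windowed mean as a stand-in for a learned convolution.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def span_dynamic_conv(x, Wq, Wv, Wf, k=3):
    """x: (T, d) token representations. Returns (output, kernels).

    For each position, a kernel of width k is generated from the token's
    query and a summary of its local span, then applied to the values
    inside the same window.
    """
    T, d = x.shape
    half = k // 2
    # Zero-pad so every position has a full window of k tokens.
    xp = np.pad(x, ((half, half), (0, 0)))
    # Span summary: windowed mean (assumption; stands in for a learned conv).
    span = np.stack([xp[t:t + k].mean(axis=0) for t in range(T)])  # (T, d)
    q = x @ Wq                                                     # (T, d)
    v = x @ Wv                                                     # (T, d)
    kernels = softmax((q * span) @ Wf)                             # (T, k)
    vp = np.pad(v, ((half, half), (0, 0)))
    out = np.stack([kernels[t] @ vp[t:t + k] for t in range(T)])   # (T, d)
    return out, kernels

rng = np.random.default_rng(0)
T, d, k = 6, 4, 3
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
Wf = rng.normal(size=(d, k))
out, kernels = span_dynamic_conv(x, Wq, Wv, Wf, k=k)
```

Each output position here depends only on the tokens inside its window, so the cost grows linearly with sequence length rather than quadratically as in global self-attention.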


Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Zihang Dai (Carnegie Mellon University), Guokun Lai (Carnegie Mellon University), Yiming Yang (Carnegie Mellon University), Quoc V. Le (Google)

With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension.

Code: official TensorFlow and PyTorch implementations are available here.
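
The compression step described in the abstract can be pictured with a short NumPy sketch. This is an illustration under assumptions, not the official code: it uses stride-2 mean pooling along the sequence axis to shorten the hidden states, and simple repetition to bring a compressed sequence back to full length for token-level predictions.

```python
import numpy as np

def pool_sequence(hidden, stride=2):
    """Mean-pool hidden states of shape (seq_len, d_model) along the
    sequence axis, shrinking the length by a factor of `stride`."""
    seq_len, d_model = hidden.shape
    pad = (-seq_len) % stride
    if pad:
        # Repeat the last state so the length divides evenly.
        tail = np.repeat(hidden[-1:], pad, axis=0)
        hidden = np.concatenate([hidden, tail], axis=0)
    return hidden.reshape(-1, stride, d_model).mean(axis=1)

def upsample(hidden, factor):
    """Repeat each compressed state `factor` times to recover the
    original sequence length (a stand-in for the decoder input)."""
    return np.repeat(hidden, factor, axis=0)

h = np.arange(8 * 4, dtype=float).reshape(8, 4)  # 8 tokens, d_model = 4
h1 = pool_sequence(h)       # 4 positions
h2 = pool_sequence(h1)      # 2 positions
restored = upsample(h2, 4)  # back to 8 positions
```

Since full self-attention compares every pair of positions, halving the sequence length cuts the number of attention scores in later layers to roughly a quarter, which is the saving that can be re-invested in depth or width.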

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis (Facebook AI Research), Ethan Perez (New York University), Aleksandra Piktus (Facebook AI), Fabio Petroni (Facebook AI Research), Vladimir Karpukhin (Facebook AI Research), Naman Goyal (Facebook Inc), Heinrich Küttler (Facebook AI Research), Mike Lewis (Facebook AI Research), Wen-tau Yih (Facebook AI Research), Tim Rocktäschel (Facebook AI Research), Sebastian Riedel (Facebook AI Research), Douwe Kiela (Facebook AI Research)

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

Code: unofficial code implementation is available here.
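
The difference between the two RAG formulations comes down to where the sum over retrieved passages sits. The sketch below shows one way to read it, using made-up probabilities: `doc_probs` holds the retriever's weight for each passage, and `token_probs[z, i]` is the generator's probability of the i-th target token given passage z. Both arrays are hypothetical inputs, not output from a real model.

```python
import numpy as np

def rag_sequence_likelihood(doc_probs, token_probs):
    """Same passage for the whole sequence: score the full sequence
    under each passage, then marginalize over passages."""
    per_doc_sequence = np.prod(token_probs, axis=1)  # (K,)
    return float(np.sum(doc_probs * per_doc_sequence))

def rag_token_likelihood(doc_probs, token_probs):
    """Possibly different passage per token: marginalize over passages
    at every step, then multiply the per-token marginals."""
    per_token_marginal = doc_probs @ token_probs     # (T,)
    return float(np.prod(per_token_marginal))

# Two passages, two target tokens. Passage 0 supports the first token,
# passage 1 supports the second (illustrative numbers only).
doc_probs = np.array([0.5, 0.5])
token_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

seq_score = rag_sequence_likelihood(doc_probs, token_probs)
tok_score = rag_token_likelihood(doc_probs, token_probs)
```

With these numbers the per-token variant assigns a higher likelihood, because it can draw on a different passage for each token, while the per-sequence variant has to commit to one passage for the entire output.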

MPNet: Masked and Permuted Pre-training for Language Understanding

Kaitao Song (Nanjing University of Science and Technology), Xu Tan (Microsoft Research), Tao Qin (Microsoft Research), Jianfeng Lu (Nanjing University of Science and Technology), Tie-Yan Liu (Microsoft Research Asia)

BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.

Code: official PyTorch implementation is available here.

Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts

Max Ryabinin (Yandex, Higher School of Economics), Anton Gusev (Independent Researcher)

Many recent breakthroughs in deep learning were achieved by training increasingly larger models on massive datasets. However, training such models can be prohibitively expensive. For instance, the cluster used to train GPT-3 costs over $250 million. As a result, most researchers cannot afford to train state of the art models and contribute to their development. Hypothetically, a researcher could crowdsource the training of large neural networks with thousands of regular PCs provided by volunteers. The raw computing power of a hundred thousand $2500 desktops dwarfs that of a $250M server pod, but one cannot utilize that power efficiently with conventional distributed training methods. In this work, we propose Learning@home: a novel neural network training paradigm designed to handle large amounts of poorly connected participants. We analyze the performance, reliability, and architectural constraints of this paradigm and compare it against existing distributed training techniques.

Code: official PyTorch implementation is available here.

Pre-training via Paraphrasing

Mike Lewis (Facebook AI Research), Marjan Ghazvininejad (Facebook AI Research), Gargi Ghosh (Facebook AI Research), Armen Aghajanyan (Facebook AI Research), Sida Wang (Facebook AI Research), Luke Zettlemoyer (University of Washington, Facebook AI Research)

We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual multi-document paraphrasing objective. MARGE provides an alternative to the dominant masked language modeling paradigm, where we self-supervise the reconstruction of target text by retrieving a set of related texts (in many languages) and conditioning on them to maximize the likelihood of generating the original. We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization. The objective noisily captures aspects of paraphrase, translation, multi-document summarization, and information retrieval, allowing for strong zero-shot performance on several tasks. For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation. We further show that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.

Code: unofficial PyTorch implementation is available here.

Top Conversational AI Research Papers at NeurIPS 2020

Traditionally, NeurIPS focuses more on theory and methodology than on AI applications, so relatively few conversational AI research papers were presented at NeurIPS 2020. Still, several papers from the main conference program focus on dialog systems, introducing new approaches to:

  • building a knowledge-grounded dialog system in a zero-resource setting,
  • developing a task-oriented dialog using a single causal language model,
  • learning visual dialog agents without dialog data.

Here are the abstracts of the corresponding research papers.

A Simple Language Model for Task-Oriented Dialogue

Ehsan Hosseini-Asl (Salesforce Research), Bryan McCann (Salesforce Research), Chien-Sheng Wu (Salesforce Research), Semih Yavuz (Salesforce), Richard Socher (Salesforce)

Task-oriented dialogue is often decomposed into three tasks: understanding user input, deciding actions, and generating a response. While such decomposition might suggest a dedicated model for each sub-task, we find a simple, unified approach leads to state-of-the-art performance on the MultiWOZ dataset. SimpleTOD is a simple approach to task-oriented dialogue that uses a single causal language model trained on all sub-tasks recast as a single sequence prediction problem. This allows SimpleTOD to fully leverage transfer learning from pre-trained, open domain, causal language models such as GPT-2. SimpleTOD improves over the prior state-of-the-art by 0.49 points in joint goal accuracy for dialogue state tracking. More impressively, SimpleTOD also improves the main metrics used to evaluate action decisions and response generation in an end-to-end setting for task-oriented dialog systems: inform rate by 8.1 points, success rate by 9.7 points, and combined score by 7.2 points.

Code: official repository is available here.
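
Recasting the three sub-tasks as one sequence prediction problem can be sketched as plain string construction. The delimiter tokens, the helper names, and the example dialogue below are hypothetical and chosen only for illustration. The point is that context, belief state, actions, and response end up in one left-to-right string that a causal language model can be trained on.

```python
def segment(name, text):
    """Wrap text in hypothetical start/end delimiter tokens."""
    return f"<|{name}|> {text} <|endof{name}|>"

def build_training_sequence(turns, belief_state, actions, response):
    """Concatenate dialogue context, belief state, system actions, and
    response into a single training string."""
    context = " ".join(f"<|{speaker}|> {utterance}" for speaker, utterance in turns)
    belief = ", ".join(" ".join(triple) for triple in belief_state)
    action_text = ", ".join(" ".join(triple) for triple in actions)
    return " ".join([
        segment("context", context),
        segment("belief", belief),
        segment("action", action_text),
        segment("response", response),
    ])

example = build_training_sequence(
    turns=[("user", "i need a cheap hotel in the north")],
    belief_state=[("hotel", "pricerange", "cheap"), ("hotel", "area", "north")],
    actions=[("hotel", "request", "stars")],
    response="how many stars should the hotel have ?",
)
```

At inference time, a model trained this way would be given only the context segment and asked to continue the string, producing the belief state, the actions, and the response in order.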

Zero-Resource Knowledge-Grounded Dialogue Generation

Linxiao Li (Peking University), Can Xu (Microsoft), Wei Wu (Meituan-Dianping Group), Yufan Zhao (Microsoft), Xueliang Zhao (Peking University), Chongyang Tao (Microsoft)

While neural conversation models have shown great potentials towards generating informative and engaging responses via introducing external knowledge, learning such a model often requires knowledge-grounded dialogues that are difficult to obtain. To overcome the data challenge and reduce the cost of building a knowledge-grounded dialogue system, we explore the problem under a zero-resource setting by assuming no context-knowledge-response triples are needed for training. To this end, we propose representing the knowledge that bridges a context and a response and the way that the knowledge is expressed as latent variables, and devise a variational approach that can effectively estimate a generation model from a dialogue corpus and a knowledge corpus that are independent with each other. Evaluation results on three benchmarks of knowledge-grounded dialogue generation indicate that our model can achieve comparable performance with state-of-the-art methods that rely on knowledge-grounded dialogues for training, and exhibits a good generalization ability over different topics and different datasets.

Code: official PyTorch implementation is available here.

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Michael Cogswell (Georgia Tech), Jiasen Lu (Allen Institute for Artificial Intelligence), Rishabh Jain (Georgia Tech), Stefan Lee (Oregon State University), Devi Parikh (Georgia Tech / Facebook AI Research (FAIR)), Dhruv Batra (Georgia Tech / Facebook AI Research (FAIR))

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call “Dialog without Dialog”, which requires agents to develop visually grounded dialog models that can adapt to new tasks without language level supervision. By factorizing intention and language, our model minimizes linguistic drift after fine-tuning for new tasks. We present qualitative results, automated metrics, and human studies that all show our model can adapt to new tasks and maintain language quality. Baselines either fail to perform well at new tasks or experience language drift, becoming unintelligible to humans.

Code: official PyTorch implementation is available here.

Top Research Papers From 2020

To be prepared for NeurIPS, you should be aware of the major research papers published in the last year in popular topics such as computer vision, NLP, and general machine learning approaches, even if they are not being presented at this specific event. 

We’ve shortlisted top research papers in these areas so you can review them quickly: 

The post NeurIPS 2020: Key Research Papers in Natural Language Processing (NLP) & Conversational AI appeared first on TOPBOTS.
