Nlp Highlights

Informações:

Synopsis

Discussing recent and interesting work related to natural language processing. Matt Gardner and Waleed Ammar, research scientists at the Allen Institute for Artificial Intelligence, give short discussions of papers, mostly in interviews with authors about their work.

Episodes

  • 104 - Model Distillation, with Victor Sanh and Thomas Wolf

    03/02/2020 Duration: 31min

    In this episode we talked with Victor Sanh and Thomas Wolf from HuggingFace about model distillation, and DistilBERT as one example of distillation. The idea behind model distillation is compressing a large model by building a smaller model, with much fewer parameters, that approximates the output distribution of the original model, typically for increased efficiency. We discussed how model distillation was typically done previously, and then focused on the specifics of DistilBERT, including training objective, empirical results, ablations etc. We finally discussed what kinds of information you might lose when doing model distillation.

  • 103 - Processing Language in Social Media, with Brendan O'Connor

    27/01/2020 Duration: 43min

    We talked to Brendan O’Connor for this episode about processing language in social media. Brendan started off by telling us about his projects that studied the linguistic and geographical patterns of African American English (AAE), and how obtaining data from Twitter made these projects possible. We then talked about how many tools built for standard English perform very poorly on AAE, and why collecting dialect-specific data is important. For the rest of the conversation, we discussed the issues involved in scraping data from social media, including ethical considerations and the biases that the data comes with. Brendan O’Connor is an Assistant Professor at the University of Massachusetts, Amherst. Warning: This episode contains explicit language (one swear word).

  • 102 - Biomedical NLP research at the National Institute of Health with Dina Demner-Fushman

    20/01/2020 Duration: 36min

    What exciting NLP research problems are involved in processing biomedical and clinical data? In this episode, we spoke with Dina Demner-Fushman, who leads NLP and IR research at the Lister Hill National Center for Biomedical Communications, part of the National Library of Medicine. We talked about processing biomedical scientific literature, understanding clinical notes, and answering consumer health questions, and the challenges involved in each of these applications. Dina listed some specific tasks and relevant data sources for NLP researchers interested in such applications, and concluded with some pointers to getting started in this field.

  • 101 - The lottery ticket hypothesis, with Jonathan Frankle

    14/01/2020 Duration: 41min

    In this episode, Jonathan Frankle describes the lottery ticket hypothesis, a popular explanation of how over-parameterization helps in training neural networks. We discuss pruning methods used to uncover subnetworks (winning tickets) which were initialized in a particularly effective way. We also discuss patterns observed in pruned networks, stability of networks pruned at different time steps and transferring uncovered subnetworks across tasks, among other topics. A recent paper on the topic by Frankle and Carbin, ICLR 2019: https://arxiv.org/abs/1803.03635 Jonathan Frankle’s homepage: http://www.jfrankle.com/

  • 100 - NLP Startups, with Oren Etzioni

    08/01/2020 Duration: 30min

    For our 100th episode, we invite AI2 CEO Oren Etzioni to talk to us about NLP startups. Oren has founded several successful startups, is himself an investor in startups, and helps with AI2's startup incubator. Some of our discussion topics include: What's the similarity between being a researcher and an entrepreneur? How do you transition from being a researcher to doing a startup? How do you evaluate early-stage startups? What advice would you give to a researcher who's thinking about a startup? What are some typical mistakes that you've seen startups make? Along the way, Oren predicts a that we'll see a whole generation of startup companies based on the technology underlying ELMo, BERT, etc.

  • 99 - Evaluating Protein Transfer Learning, With Roshan Rao And Neil Thomas

    16/12/2019 Duration: 44min

    For this episode, we chatted with Neil Thomas and Roshan Rao about modeling protein sequences and evaluating transfer learning methods for a set of five protein modeling tasks. Learning representations using self-supervised pretaining objectives has shown promising results in transferring to downstream tasks in protein sequence modeling, just like it has in NLP. We started off by discussing the similarities and differences between language and protein sequence data, and how the contextual embedding techniques are applicable also to protein sequences. Neil and Roshan then described a set of five benchmark tasks to assess the quality of protein embeddings (TAPE), particularly in terms of how well they capture the structural, functional, and evolutionary aspects of proteins. The results from the experiments they ran with various model architectures indicated that there was not a single best performing model across all tasks, and that there is a lot of room for future work in protein sequence modeling. Neil Thom

  • 98 - Analyzing Information Flow In Transformers, With Elena Voita

    09/12/2019 Duration: 37min

    What function do the different attention heads serve in multi-headed attention models? In this episode, Lena describes how to use attribution methods to assess the importance and contribution of different heads in several tasks, and describes a gating mechanism to prune the number of effective heads used when combined with an auxiliary loss. Then, we discuss Lena’s work on studying the evolution of representations of individual tokens in transformers model. Lena’s homepage: https://lena-voita.github.io/ Blog posts: https://lena-voita.github.io/posts/acl19_heads.html https://lena-voita.github.io/posts/emnlp19_evolution.html Papers: https://arxiv.org/abs/1905.09418 https://arxiv.org/abs/1909.01380

  • 97 - Automated Analysis Of Historical Printed Documents, With Taylor Berg-Kirkpatrick

    27/11/2019 Duration: 44min

    In this episode, we talk to Taylor Berg-Kirkpatrick about optical character recognition (OCR) on historical documents. Taylor starts off by describing some practical issues related to old scanning processes of documents that make performing OCR on them a difficult problem. Then he explains how one can build latent variable models for this data using unsupervised methods, the relative importance of various modeling choices, and summarizes how well the models do. We then take a higher level view of historical OCR as a Machine Learning problem, and discuss how it is different from other ML problems in terms of the tradeoff between learning from data and imposing constraints based on prior knowledge of the underlying process. Finally, Taylor talks about the applications of this research, and how these predictions can be of interest to historians studying the original texts.

  • 96 - Question Answering as an Annotation Format, with Luke Zettlemoyer

    12/11/2019 Duration: 29min

    In this episode, we chat with Luke Zettlemoyer about Question Answering as a format for crowdsourcing annotations of various semantic phenomena in text. We start by talking about QA-SRL and QAMR, two datasets that use QA pairs to annotate predicate-argument relations at the sentence level. Luke describes how this annotation scheme makes it possible to obtain annotations from non-experts, and discusses the tradeoffs involved in choosing this scheme. Then we talk about the challenges involved in using QA-based annotations for more complex phenomena like coreference. Finally, we briefly discuss the value of crowd-labeled datasets given the recent developments in pretraining large language models. Luke is an associate professor at the University of Washington and a Research Scientist at Facebook AI Research.

  • 95 - Common sense reasoning, with Yejin Choi

    07/10/2019 Duration: 35min

    In this episode, we invite Yejin Choi to talk about common sense knowledge and reasoning, a growing area in NLP. We start by discussing a working definition of “common sense” and the practical utility of studying it. We then talk about some of the datasets and resources focused on studying different aspects of common sense (e.g., ReCoRD, CommonsenseQA, ATOMIC) and contrast implicit vs. explicit modeling of common sense, and what it means for downstream applications. To conclude, Yejin shares her thoughts on some of the open problems in this area and where it is headed in the future. Yejin Choi’s homepage: https://homes.cs.washington.edu/~yejin/ ATOMIC: https://homes.cs.washington.edu/~msap/atomic/ ReCoRD: https://sheng-z.github.io/ReCoRD-explorer/ CommonsenseQA: https://www.tau-nlp.org/commonsenseqa

  • 94 - Decompositional Semantics, with Aaron White

    30/09/2019 Duration: 27min

    In this episode, Aaron White tells us about the decompositional semantics initiative (Decomp), an attempt to re-think the prototypical approach to semantic representation and annotation. The basic idea is to decompose complex semantic classes such as ‘agent’ and ‘patient’ into simpler semantic properties such as ‘causation’ and ‘volition’, while embracing the uncertainty inherent in language by allowing annotators to choose answers such as ‘probably’ or ‘probably not’. In order to scale the collection of labeled data, each property is annotated by asking crowd workers intuitive questions about phrases in a given sentence. Aaron White's homepage: http://aaronstevenwhite.io/ Decomp initiative page: http://decomp.io/

  • 93 - NLP/ML for clinical data, with Alistair Johnson

    22/07/2019 Duration: 37min

    In this episode, we invite Alistair Johnson to discuss the main challenge in applying NLP/ML to clinical domains: the lack of data. We discuss privacy concerns, de-identification, synthesizing records, legal liabilities and data heterogeneity. We also discuss how the MIMIC dataset evolved over the years, how it is being used, and some of the under-explored ways in which it can be used. Alistair’s homepage: http://alistairewj.github.io/ MIMIC dataset: https://mimic.physionet.org/

  • 92 - Computational Humanities, with David Bamman

    05/07/2019 Duration: 33min

    In this episode, we invite David Bamman to give an overview of computational humanities. We discuss examples of questions studied in computational humanities (e.g., characterizing fictionality, assessing novelty, measuring the attention given to male vs. female characters in the literature). We talk about the role NLP plays in addressing these questions and how the accuracy and biases of NLP models can influence the results. We also discuss understudied NLP tasks which can help us answer more questions in this domain such as literary scene coreference resolution and constructing a map of literature geography. David Bamman's homepage: http://people.ischool.berkeley.edu/~dbamman/ LitBank dataset: https://github.com/dbamman/litbank

  • 91 - (Executable) Semantic Parsing, with Jonathan Berant

    26/06/2019 Duration: 42min

    In this episode, we invite Jonathan Berant to talk about executable semantic parsing. We discuss what executable semantic parsing is and how it differs from related tasks such as semantic dependency parsing and abstract meaning representation (AMR) parsing. We talk about the main components of a semantic parser, how the formal language affects design choices in the parser, and end with a discussion of some exciting open problems in this space. Jonathan Berant's homepage: http://www.cs.tau.ac.il/~joberant/

  • 90 - Research in Academia versus Industry, with Philip Resnik and Jason Baldridge

    31/05/2019 Duration: 54min

    How is it like to do research in academia vs. industry? In this episode, we invite Jason Baldridge (UT Austin => Google) and Philip Resnik (Sun Microsystems => UMD) to discuss some of the aspects one may want to consider when planning their research careers, including flexibility, security and intellectual freedom. Perhaps most importantly, we discuss how the career choices we make influence and are influenced by the relationships we forge. Check out the Careers in NLP Panel at NAACL'19 on Monday, June 3, 2019 for further discussion. Careers in NLP panel @ NAACL'19: https://naacl2019.org/blog/careers-panel-survey/ Jason Baldridge's homepage: http://www.jasonbaldridge.com/ Philip Resnik's homepage: http://users.umiacs.umd.edu/~resnik/

  • 89 - Dialog Systems, with Zhou Yu

    31/05/2019 Duration: 37min

    In this episode, we invite Zhou Yu to give an overview of dialogue systems. We discuss different types of dialogue systems (task-oriented vs. non-task-oriented), the main building blocks and how they relate to other research areas in NLP, how to transfer models across domains, and the different ways used to evaluate these systems. Zhou also shares her thoughts on exciting future directions such as developing dialogue methods for non-cooperative environments (e.g., to negotiate prices) and multimodal dialogue systems (e.g., using video as well as audio/text). Zhou Yu's homepage: http://zhouyu.cs.ucdavis.edu/

  • 88 - A Structural Probe for Finding Syntax in Word Representations, with John Hewitt

    07/05/2019 Duration: 40min

    In this episode, we invite John Hewitt to discuss his take on how to probe word embeddings for syntactic information. The basic idea is to project word embeddings to a vector space where the L2 distance between a pair of words in a sentence approximates the number of hops between them in the dependency tree. The proposed method shows that ELMo and BERT representations, trained with no syntactic supervision, embed many of the unlabeled, undirected dependency attachments between words in the same sentence. Paper: https://nlp.stanford.edu/pubs/hewitt2019structural.pdf GitHub repository: https://github.com/john-hewitt/structural-probes Blog post: https://nlp.stanford.edu/~johnhew/structural-probe.html Twitter thread: https://twitter.com/johnhewtt/status/1114252302141886464 John's homepage: https://nlp.stanford.edu/~johnhew/

  • 87 - Pathologies of Neural Models Make Interpretation Difficult, with Shi Feng

    25/04/2019 Duration: 33min

    In this episode, Shi Feng joins us to discuss his recent work on identifying pathological behaviors of neural models for NLP tasks. Shi uses input word gradients to identify the least important word for a model's prediction, and iteratively removes that word until the model prediction changes. The reduced inputs tend to be significantly smaller than the original inputs, e.g., 2.3 words instead of 11.5 in the original in SQuAD, on average. We discuss possible interpretations of these results, and a proposed method for mitigating these pathologies. Shi Feng's homepage: http://users.umiacs.umd.edu/~shifeng/ Paper: https://www.semanticscholar.org/paper/Pathologies-of-Neural-Models-Make-Interpretation-Feng-Wallace/8e141b5cb01c88b315c9a94dc97e50738cc7370d Joint work with Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez and Jordan Boyd-Graber

  • 86 - NLP for Evidence-based Medicine, with Byron Wallace

    15/04/2019 Duration: 32min

    In this episode, Byron Wallace tells us about interdisciplinary work between evidence based medicine and natural language processing. We discuss extracting PICO frames from articles describing clinical trials and data available for direct and weak supervision. We also discuss automating the assessment of risks of bias in, e.g., random sequence generation, allocation containment and outcome assessment, which have been used to help domain experts who need to review hundreds of articles. Byron Wallace's homepage: http://www.byronwallace.com/ EBM-NLP dataset: https://ebm-nlp.herokuapp.com/ MIMIC dataset: https://mimic.physionet.org/ Cochrane database of systematic reviews: https://www.cochranelibrary.com/cdsr/about-cdsr The bioNLP workshop at ACL'19 (submission due date was extended to May 10): https://aclweb.org/aclwiki/BioNLP_Workshop The workshop on health text mining and information analysis at EMNLP'19: https://louhi2019.fbk.eu/ Machine learning for healthcare conference: https://www.mlforhc.org/

  • 85 - Stress in Research, with Charles Sutton

    29/03/2019 Duration: 36min

    In this episode, Charles Sutton walks us through common sources of stress for researchers and suggests coping strategies to maintain your sanity. We talk about how pursuing a research career is similar to participating in a life-long international tournament, conflating research worth and self-worth, and how freedom can be both a blessing and a curse, among other stressors one may encounter in a research career. Charles Sutton's homepage: https://homepages.inf.ed.ac.uk/csutton/ A series of blog posts Charles wrote on this topic: http://www.theexclusive.org/tag/stress%20in%20research/

page 3 from 8