Linear Digressions

  • Author: Various
  • Narrator: Various
  • Publisher: Podcast
  • Duration: 96:08:51

Synopsis

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Episodes

  • Google Flu Trends

    26/03/2018 Duration: 12min

    It's been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the Centers for Disease Control by monitoring searches and looking for spikes in searches for flu symptoms, doctor's appointments, and other related terms. It's a cool idea, but after a few years it turned into a cautionary tale of what can go wrong, as Google's algorithm systematically overestimated flu incidence for almost 2 years straight. Relevant link: https://gking.harvard.edu/publications/parable-google-flu%C2%A0traps-big-data-analysis

  • How to pick projects for a professional data science team

    19/03/2018 Duration: 31min

    This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same question: what should a data science team be working on? And how should that decision be made? That's the subject of a talk that I (Katie) gave at Strata Data in early March, about how my co-department head and I select projects for our team to work on. We have several goals in data science project selection at Civis Analytics (where I work), which can be summarized under "balance the best attributes of bottom-up and top-down decision-making." We achieve this balance, or at least get pretty close, using a process we've come to call the Idea Factory (after a great book about Bell Labs). This talk is about that process, how it works in the real world of a data science company, and how we see it working in the data science programs of other companies. Relevant links: https://conferences.oreilly.com/strata/strata-ca/public

  • Autoencoders

    12/03/2018 Duration: 12min

    Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neural nets.
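    As a minimal sketch of that idea, assuming PyTorch is available (the layer sizes and the 64-dimensional toy input are invented for illustration), the "label" is just the input itself:

        import torch
        import torch.nn as nn

        # Toy autoencoder: squeeze 64-dim inputs through an 8-dim bottleneck and back.
        model = nn.Sequential(
            nn.Linear(64, 8),   # encoder: compress to a small code
            nn.ReLU(),
            nn.Linear(8, 64),   # decoder: try to reconstruct the original input
        )
        loss_fn = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        x = torch.rand(256, 64)          # stand-in for real, unlabeled data
        for _ in range(100):
            optimizer.zero_grad()
            loss = loss_fn(model(x), x)  # the target IS the input: no labels needed
            loss.backward()
            optimizer.step()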

  • When Private Data Isn't Private Anymore

    05/03/2018 Duration: 26min

    After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the things that can go wrong when data and code are a little too open. In this episode, we'll talk about two interesting recent examples: a de-identified medical dataset in Australia that was re-identified so specific celebrities and athletes could be matched to their medical records, and a series of military bases that were spotted in a public fitness tracker dataset.

  • What makes a machine learning algorithm "superhuman"?

    26/02/2018 Duration: 34min

    A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algorithm raised some serious questions. We're back again this week, as the author of the original blog post pointed us toward further developments. All in all, there's a lot more clarity now around how the authors arrived at their original "better than doctors" claim, and a number of adjustments and improvements as the original result was de/re-constructed. Anyway, there are a few things that are cool about this. First, it's a worthwhile follow-up to a popular recent episode. Second, it goes *inside* an analysis to see what things like imbalanced classes, outliers, and (possible) signal leakage can do to real science. And last, it raises a really interesting question in an age when computers are often claimed to be better than humans: what do those claims really mean? Relevant links: https://lu
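    As a toy illustration of the imbalanced-class point (the numbers below are invented, not taken from the chest x-ray dataset), raw accuracy can make a do-nothing model look "superhuman":

        import numpy as np

        # Hypothetical test set in which only 5% of patients have the disease.
        y_true = np.array([1] * 50 + [0] * 950)
        y_pred = np.zeros_like(y_true)        # a "model" that always predicts healthy

        accuracy = (y_pred == y_true).mean()  # 0.95 -- sounds impressive
        recall = y_pred[y_true == 1].mean()   # 0.0  -- catches zero sick patients
        print(accuracy, recall)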

  • Open Data and Open Science

    19/02/2018 Duration: 16min

    One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone to repeat the analysis. It's far from universal (for a timely counterpoint, read this article), but we seem to be moving toward a new normal where data science conclusions are expected to be shown, not just told. Relevant links: https://github.com/fivethirtyeight/data https://blog.patricktriest.com/police-data-python/

  • Defining the quality of a machine learning production system

    12/02/2018 Duration: 20min

    Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might want to test or check for your production ML system. Relevant links: https://research.google.com/pubs/pub45742.html

  • Auto-generating websites with deep learning

    04/02/2018 Duration: 19min

    We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of the images. Our episode today tells a similar tale, except this time we're talking about a blog post where the author fed in wireframes of a website design and asked the neural net to generate the HTML and CSS that would actually build a website that looks like the wireframes. If you're a programmer who thinks your job is challenging enough that you're automation-proof, guess again... Link to blog post: https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/

  • The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters

    29/01/2018 Duration: 20min

    Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like last week, when we covered B-trees, we'll walk through both the "classic" implementation of these data structures and how a machine learning model could create the same functionality.
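    For reference, the "classic" Bloom filter can be sketched in a few lines of Python (the bit-array size and number of hashes are arbitrary, chosen only to show the mechanics); it answers membership queries with possible false positives but never false negatives:

        import hashlib

        class BloomFilter:
            def __init__(self, num_bits=1024, num_hashes=3):
                self.num_bits, self.num_hashes = num_bits, num_hashes
                self.bits = [False] * num_bits

            def _positions(self, item):
                # Derive several bit positions from one digest of the item.
                digest = hashlib.sha256(item.encode()).digest()
                for i in range(self.num_hashes):
                    chunk = digest[4 * i:4 * (i + 1)]
                    yield int.from_bytes(chunk, "big") % self.num_bits

            def add(self, item):
                for pos in self._positions(item):
                    self.bits[pos] = True

            def might_contain(self, item):
                # False means definitely absent; True means "probably present".
                return all(self.bits[pos] for pos in self._positions(item))

        bf = BloomFilter()
        bf.add("hello")
        print(bf.might_contain("hello"), bf.might_contain("goodbye"))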

  • The Case for Learned Index Structures, Part 1: B-Trees

    22/01/2018 Duration: 18min

    Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first part of a two-part series, we'll go through a data structure called b-trees. The structural form of b-trees makes them efficient for searching, but if you squint at a b-tree and look at it a little bit sideways then the search functionality starts to look a little bit like a regression model--hence the relevance of machine learning models. If this sounds kinda weird, or we lost you at b-tree, don't worry--lots more details in the episode itself.
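    A toy numpy sketch of that squint-sideways view (an illustration of the concept, not the paper's actual models): on sorted keys, a regression from key to array position, plus a small error-bounded search, can stand in for the tree traversal.

        import numpy as np

        keys = np.sort(np.random.uniform(0, 1_000_000, size=10_000))
        positions = np.arange(len(keys))

        # "Learned index": a linear fit that predicts each key's position in the array.
        slope, intercept = np.polyfit(keys, positions, deg=1)
        max_err = int(np.max(np.abs(slope * keys + intercept - positions))) + 1

        def lookup(key):
            guess = int(slope * key + intercept)         # model predicts the slot...
            lo = max(0, guess - max_err)                 # ...then a small, bounded
            hi = min(len(keys), guess + max_err + 1)     # window is searched exactly
            return lo + int(np.searchsorted(keys[lo:hi], key))

        print(lookup(keys[4242]))  # 4242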

  • Challenges with Using Machine Learning to Classify Chest X-Rays

    15/01/2018 Duration: 18min

    Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we have a high-profile blog post that has been making the rounds for the last few weeks, in which a neural network trained to visually recognize various diseases in chest x-rays is called into question by a radiologist with machine learning expertise. As it seemingly always does, it comes down to the dataset that's used for training--medical records assume a lot of context that may or may not be available to the algorithm, so it's tough to build something that (in this case) actually helps predict disease that wasn't already diagnosed.

  • The Fourier Transform

    08/01/2018 Duration: 15min

    The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what the amplitude, frequency and offset of those component waves are. It's a really handy way of re-expressing periodic data--you'll never look at a time series graph the same way again.
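    A quick numpy sketch of that decomposition (the 5 Hz and 12 Hz components are made up for illustration): the FFT re-expresses the time series as per-frequency amplitudes, so the component waves show up as peaks.

        import numpy as np

        fs = 1000                                    # sampling rate in Hz
        t = np.arange(0, 1, 1 / fs)                  # one second of samples
        signal = 3 * np.sin(2 * np.pi * 5 * t) + 1.5 * np.sin(2 * np.pi * 12 * t)

        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
        amplitudes = 2 * np.abs(spectrum) / len(signal)

        # The two component sine waves appear as peaks at 5 Hz and 12 Hz.
        print(freqs[amplitudes > 1.0])  # [ 5. 12.]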

  • Statistics of Beer

    02/01/2018 Duration: 15min

    What better way to kick off a new year than with an episode on the statistics of brewing beer?

  • Re-Release: Random Kanye

    24/12/2017 Duration: 09min

    We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a markov chain that generates mashup Kanye West lyrics with Bible verses? Exactly what you think.
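    For the curious, a stripped-down word-level Markov chain of the kind behind the mashup might look like this (the one-line corpus is a stand-in, not the actual lyrics-plus-verses data):

        import random
        from collections import defaultdict

        corpus = "the rain in spain stays mainly in the plain"  # stand-in text
        words = corpus.split()

        # Transition table: each word maps to the list of words observed after it.
        transitions = defaultdict(list)
        for current, following in zip(words, words[1:]):
            transitions[current].append(following)

        # Generate text by repeatedly sampling a follower of the current word.
        word = random.choice(words)
        output = [word]
        for _ in range(10):
            followers = transitions.get(word)
            if not followers:
                break
            word = random.choice(followers)
            output.append(word)
        print(" ".join(output))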

  • Debiasing Word Embeddings

    18/12/2017 Duration: 18min

    When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal--in particular, gender bias from our society can creep into the embeddings and give results that are sexist. For example, occupational words like "doctor" and "nurse" end up more closely aligned with "man" and "woman," respectively, which can create problems because these word embeddings are used in algorithms that help people find information or make decisions. However, a group of researchers has released a new paper detailing ways to de-bias the embeddings, so we retain gender info that's not particularly problematic (for example, "king" vs. "queen") while correcting bias.
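    The core "neutralize" step from that line of work fits in a few lines of numpy (the 3-dimensional toy vectors below are invented; real embeddings have hundreds of dimensions): estimate a gender direction, then project it out of words that shouldn't carry it.

        import numpy as np

        # Toy "embeddings" -- invented numbers, purely for illustration.
        he     = np.array([ 1.0, 0.2, 0.1])
        she    = np.array([-1.0, 0.2, 0.1])
        doctor = np.array([ 0.4, 0.9, 0.3])   # has picked up a spurious "he" component

        gender_direction = he - she
        gender_direction /= np.linalg.norm(gender_direction)

        def neutralize(vector, direction):
            # Remove the component of the vector lying along the bias direction.
            return vector - np.dot(vector, direction) * direction

        debiased_doctor = neutralize(doctor, gender_direction)
        print(np.dot(debiased_doctor, gender_direction))  # ~0: gender component removed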

  • The Kernel Trick and Support Vector Machines

    11/12/2017 Duration: 17min

    Picking up after last week's episode about maximal margin classifiers, this week we'll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted support vector machine.
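    A small scikit-learn sketch of why the kernel matters (the dataset and parameters are chosen purely for illustration): concentric rings can't be split by a straight line, but an RBF-kernel SVM separates them easily.

        from sklearn.datasets import make_circles
        from sklearn.svm import SVC

        # Two concentric rings: not linearly separable in the original 2-d space.
        X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

        linear_svm = SVC(kernel="linear").fit(X, y)
        rbf_svm = SVC(kernel="rbf").fit(X, y)   # kernel trick: implicit feature map

        print(linear_svm.score(X, y))  # roughly 0.5 -- no better than guessing
        print(rbf_svm.score(X, y))     # close to 1.0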

  • Maximal Margin Classifiers

    04/12/2017 Duration: 14min

    Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the distance from any given point to the boundary. It's a neat way to think about statistical learning and a prerequisite for understanding support vector machines, which we'll cover next week--stay tuned!
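    In symbols (the standard textbook statement, not anything specific to this episode), for linearly separable points (x_i, y_i) with labels y_i in {-1, +1}, the maximal margin classifier solves

        \min_{w, b} \; \tfrac{1}{2} \lVert w \rVert^2
        \quad \text{subject to} \quad
        y_i \, (w \cdot x_i + b) \ge 1 \ \text{ for all } i

    and since the resulting margin is 2 / \lVert w \rVert, shrinking \lVert w \rVert is exactly what pushes the boundary as far as possible from the nearest points.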

  • Re-Release: The Cocktail Party Problem

    27/11/2017 Duration: 13min

    Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

  • Clustering with DBSCAN

    20/11/2017 Duration: 16min

    DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!
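    A short scikit-learn sketch (the two-moons data and the parameter values are just illustrative): the two parameters the episode mentions are eps, the neighborhood radius, and min_samples, the point count that makes a neighborhood "dense".

        import numpy as np
        from sklearn.cluster import DBSCAN
        from sklearn.datasets import make_moons

        # Irregular, crescent-shaped clusters that k-means would struggle with.
        X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

        labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

        print(np.unique(labels))  # cluster ids; -1 marks outlier/noise points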

  • The Kaggle Survey on Data Science

    13/11/2017 Duration: 25min

    Want to know what's going on in data science these days? There's no better way than to analyze a survey with over 16,000 responses that Kaggle recently released. Kaggle asked practicing and aspiring data scientists about themselves, their tools, how they find jobs, what they find challenging about their jobs, and many other questions. Then Kaggle released an interactive summary of the data, as well as the anonymized dataset itself, to help data scientists understand the trends in the data. In this episode, we'll go through some of the survey toplines that we found most interesting and counterintuitive.
