Synopsis
Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.
Episodes
-
Scikit + Optimization = Scikit-Optimize
12/09/2016 Duration: 15minWe're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words with someone who's out there making it happen for python. Relevant links: https://scikit-optimize.github.io/ http://www.wildtreetech.com/
-
Two Cultures: Machine Learning and Statistics
05/09/2016 Duration: 17minIt's a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it's about predictive accuracy. But usually not both--optimizing for one tends to compromise the other. Leo Breiman was one of the titans of both kinds of modeling, a statistician who helped bring machine learning into statistics and vice versa. In this episode, we unpack one of his seminal papers from 2001, when machine learning was just beginning to take root, and talk about how he made clear what machine learning could do for statistics and why it's so important. Relevant links: http://www.math.snu.ac.kr/~hichoi/machinelearning/(Breiman)%20Statistical%20Modeling--The%20Two%20Cultures.pdf
-
Optimization Solutions
29/08/2016 Duration: 20minYou've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What do? Use a heuristic optimization algorithm, like a hill climber or simulated annealing--we cover both in this episode! Relevant link: http://www.lizsander.com/programming/2015/08/04/Heuristic-Search-Algorithms.html
-
Optimization Problems
22/08/2016 Duration: 17minIf modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that's straightforward, but sometimes... not so much. What makes an optimization problem easy or hard, and what are some of the methods for finding optimal solutions to problems? Glad you asked! May we recommend our latest podcast episode to you?
-
Multi-level modeling for understanding DEADLY RADIOACTIVE GAS
15/08/2016 Duration: 23minOk, this episode is only sort of about DEADLY RADIOACTIVE GAS. It's mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. What are multilevel models used for? Elections (we can't get enough of 'em these days), understanding the effect that a good teacher can have on their students, and DEADLY RADIOACTIVE GAS. Relevant links: http://www.stat.columbia.edu/~gelman/research/published/multi2.pdf
-
How Polls Got Brexit "Wrong"
08/08/2016 Duration: 15minContinuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum was, by and large, expected to shake out for "remain", but when the votes were counted, "leave" came out ahead. Everyone was shocked (SHOCKED!) but maybe the polls weren't as wrong as the pundits like to claim. Relevant links: http://www.slate.com/articles/news_and_politics/moneybox/2016/07/why_political_betting_markets_are_failing.html http://andrewgelman.com/2016/06/24/brexit-polling-what-went-wrong/
-
Election Forecasting
01/08/2016 Duration: 28minNot sure if you heard, but there's an election going on right now. Polls, surveys, and projections about, as far as the eye can see. How to make sense of it all? How are the projections made? Which are some good ones to follow? We'll be your trusty guides through a crash course in election forecasting. Relevant links: http://www.wired.com/2016/06/civis-election-polling-clinton-sanders-trump/ http://election.princeton.edu/ http://projects.fivethirtyeight.com/2016-election-forecast/ http://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html?rref=collection%2Fsectioncollection%2Fupshot&action=click&contentCollection=upshot®ion=rank&module=package&version=highlights&contentPlacement=5&pgtype=sectionfront
-
Machine Learning for Genomics
25/07/2016 Duration: 20minGenomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us different for everyone else. This episode touches on some of the things that make machine learning on genomics data so challenging, and the algorithms designed to do it anyway.
-
Climate Modeling
18/07/2016 Duration: 19minHot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we do are about fun studies we hear about, like "if you're interested, this is kinda cool"--this episode is much more important than that. Understanding these models, and taking action on them where appropriate, will have huge implications in the years to come. Relevant links: https://climatesight.org/
-
Reinforcement Learning Gone Wrong
11/07/2016 Duration: 28minLast week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent actors? You bet. Collateral damage? Of course. Reward hacking? Naturally! It’s fun to think about, and the discussion starting now will have reverberations for decades to come. https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/ http://arxiv.org/abs/1605.02817 https://arxiv.org/abs/1606.06565
-
Reinforcement Learning for Artificial Intelligence
03/07/2016 Duration: 18minThere’s a ton of excitement about reinforcement learning, a form of semi-supervised machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash course in the algorithmic machinery behind AlphaGo, and self-driving cars, and major logistical optimization projects—and the robots that, tomorrow, will clean our houses and (hopefully) not take over the world…
-
Differential Privacy: how to study people without being weird and gross
27/06/2016 Duration: 18minApple wants to study iPhone users' activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea of this data being collected? Maybe not, if it's being collected on you--but you probably also realize that there is some benefit to be had from the improved iPhones and web browsers. Differential privacy is a set of policies that walks the line between individual privacy and better data, including even some old-school tricks that scientists use to get people to answer embarrassing questions honestly. Relevant links: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42852.pdf
-
How the sausage gets made
20/06/2016 Duration: 29minSomething a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade process of RSS feeds, links to downloads, and lots of hand-waving when it comes to trying to figure out how many of you (listeners) are out there.
-
SMOTE: makin' yourself some fake minority data
13/06/2016 Duration: 14minMachine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manufacturing new minority class examples for yourself, to help your algorithm better identify them in the wild. Relevant links: https://www.jair.org/media/953/live-953-2037-jair.pdf
-
Conjoint Analysis: like AB testing, but on steroids
06/06/2016 Duration: 18minConjoint analysis is like AB tester, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options. Where might you use something like this? Well, if you wanted to design an entire hotel chain completely from scratch, and to do it in a data-driven way. You'll never look at Courtyard by Marriott the same way again. Relevant link: https://marketing.wharton.upenn.edu/files/?whdmsaction=public:main.file&fileID=466
-
Traffic Metering Algorithms
30/05/2016 Duration: 17minThis episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome). Relevant links: http://its.berkeley.edu/sites/default/files/publications/UCB/99/PWP/UCB-ITS-PWP-99-19.pdf http://www.its.uci.edu/~lchu/ramp/Final_report_mou3013.pdf
-
Um Detector 2: The Dynamic Time Warp
23/05/2016 Duration: 14minOne tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a little bit stretched and squeezed relative to the other. Besides having an amazing name, the dynamic time warp is a handy algorithm for aligning two time series sequences that are close in shape, but don't quite line up out of the box. Relevant link: http://www.aaai.org/Papers/Workshops/1994/WS-94-03/WS94-03-031.pdf
-
Inside a Data Analysis: Fraud Hunting at Enron
16/05/2016 Duration: 30minIt's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt for signatures of fraud at Enron, one of the biggest cases of corporate fraud in history; that description makes the project sound pretty clean but getting the data into the right shape, and even doing some dataset merging (that hadn't ever been done before), made this project much more interesting to design than it might appear. Here's the story of what a data analysis like this looks like...from the inside.
-
What's the biggest #bigdata?
09/05/2016 Duration: 25minData science and is often mentioned in the same breath as big data. But how big is big data? And who has the biggest big data? CERN? Youtube? ... Something (or someone) else? Relevant link: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
-
Data Contamination
02/05/2016 Duration: 20minSupervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easier said than done. In this episode, we'll talk about the many (and diverse!) cases where label information contaminates features, ruining data science competitions along the way. Relevant links: https://www.researchgate.net/profile/Claudia_Perlich/publication/221653692_Leakage_in_data_mining_Formulation_detection_and_avoidance/links/54418bb80cf2a6a049a5a0ca.pdf