Linear Digressions

  • Author: Various
  • Narrator: Various
  • Publisher: Podcast
  • Duration: 96:08:51
  • More information

Information:

Synopsis

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Episodes

  • Random Kanye

    04/03/2015 Duration: 08min

    Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically outspoken uncle, etc.? Wonder no more, there's a way to do this exact type of thing: it's called a Markov Chain, and it's probably the most powerful way to generate made-up data that you can then use for fun and profit. The idea behind a Markov Chain is that you probabilistically generate a sequence of steps, numbers, words, etc., where each next step/number/word depends only on the previous one, which makes the sequence fast and efficient to generate computationally. Usually Markov Chains are put to serious academic use, but this ain't one of those times: here they're used to randomly generate rap lyrics based on Kanye West lyrics.
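    To make the idea concrete, here's a minimal sketch of a word-level Markov chain text generator. It's an illustration of the general technique, not the episode's actual code, and the tiny corpus string is a stand-in for whatever lyrics you scrape:

        import random
        from collections import defaultdict

        def build_chain(text):
            """Map each word to the list of words observed right after it."""
            chain = defaultdict(list)
            words = text.split()
            for current_word, next_word in zip(words, words[1:]):
                chain[current_word].append(next_word)
            return chain

        def generate(chain, seed_word, length=20):
            """Walk the chain: each next word depends only on the current one."""
            word = seed_word
            output = [word]
            for _ in range(length - 1):
                followers = chain.get(word)
                if not followers:  # dead end: this word was never followed by anything
                    break
                word = random.choice(followers)
                output.append(word)
            return " ".join(output)

        corpus = "placeholder lyrics go here and placeholder lyrics repeat here"  # stand-in corpus
        print(generate(build_chain(corpus), seed_word="placeholder"))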

  • Lie Detectors

    25/02/2015 Duration: 09min

    Often machine learning discussions center around algorithms, or features, or datasets--this one centers around interpretation, and ethics. Suppose you could use a technology like fMRI to see what regions of a person's brain are active when they answer questions. And also suppose that you could run trials where you watch their brain activity as they lie about some minor issue (say, whether the card in their hand is a spade or a club)--could you use machine learning to analyze those images, and use the patterns in them for lie detection? Well you certainly can try, and indeed researchers have done just that. There are important problems though--the images of brains can be high variance, meaning that for any given person, there might not be a lot of certainty about whether they're lying or not. It's also open to debate whether the training set (in this case, test subjects with playing cards in their hands) really generalizes well to the more important cases, like a person accused of a crime.

  • The Enron Dataset

    09/02/2015 Duration: 12min

    In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared. In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron email corpus. Hundreds of thousands of emails amongst top executives were made public; there's no realistic chance any dataset like this will ever be made public again. But the dataset that was released has gone on to immortality, serving as the basis for a huge variety of advances in machine learning and other fields. http://www.technologyreview.com/news/515801/the-immortal-life-of-the-enron-e-mails/

  • Labels and Where To Find Them

    04/02/2015 Duration: 13min

    Supervised classification is built on the backs of labeled datasets, but a good set of labels can be hard to find. Great data is everywhere, but the corresponding labels can sometimes be really tricky to come by. Take a few examples we've already covered, like lie detection with an fMRI machine (you have to take pictures of someone's brain while they try to lie, not a trivial task) or automated image captioning (so many images! so many valid labels!). In this episode, we'll dig into this topic in depth, talking about some of the standard ways to get a labeled dataset if your project requires labels and you don't already have them. www.higgshunters.org

  • Um Detector 1

    23/01/2015 Duration: 13min

    So, um... what about machine learning for audio applications? In the course of starting this podcast, we've edited a lot of "um"s out of our raw audio files. It's now gotten to the point that, when we see the waveform in soundstudio, we can almost identify an "um" by eye. Which makes it an interesting problem for machine learning--is there a way we can train an algorithm to recognize the "um" pattern, too? This has become a little side project for Katie, which is very much still a work in progress. We'll talk about what's been accomplished so far, some design choices Katie made in getting the project off the ground, and (of course) mistakes made and hopefully corrected. We always say that the best way to learn something is by doing it, and this is our chance to try our own machine learning project instead of just telling you about what someone else did!

  • Better Facial Recognition with Fisherfaces

    07/01/2015 Duration: 11min

    Now that we know about eigenfaces (if you don't, listen to the previous episode), let's talk about where they break down. Variations that are trivial to humans when identifying faces can really mess up computer-driven facial ID--expressions, lighting, and angle are a few. An algorithm can easily end up optimizing for one of those traits rather than for the underlying question of whether it's the same person (for example, if the training image is me smiling, it may reject an image of me frowning but accidentally approve an image of another woman smiling). Fisherfaces uses Fisher's linear discriminant to project onto the dimensions that best separate the classes--maximizing the spread between people relative to the spread within images of the same person, rather than maximizing the overall variation (we'll unpack this statement)--and it is much more robust than our pal eigenfaces when there are shadows, cut-off images, expressions, etc. http://www.cs.columbia.edu/~belhumeur/journal/fisherface-pami97.pdf
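    As a rough illustration of that "unpack this" point, here's a toy sketch on made-up 2D data (not faces, and not the paper's method): PCA keeps the direction of largest overall variance, while a Fisher-style linear discriminant keeps the direction that actually separates the two classes.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        rng = np.random.default_rng(0)
        # Two elongated clusters: most of the variance runs along the long axis,
        # but the classes are actually separated along the short axis.
        class_a = rng.normal(loc=[0.0, 0.0], scale=[5.0, 0.3], size=(200, 2))
        class_b = rng.normal(loc=[0.0, 2.0], scale=[5.0, 0.3], size=(200, 2))
        X = np.vstack([class_a, class_b])
        y = np.array([0] * 200 + [1] * 200)

        pca_direction = PCA(n_components=1).fit(X).components_[0]
        lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)

        print("PCA direction (max overall variance):", pca_direction)
        print("LDA direction (best class separation):", np.ravel(lda.scalings_))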

  • Facial Recognition with Eigenfaces

    07/01/2015 Duration: 10min

    A true classic topic in ML: facial recognition data is very high-dimensional, meaning that each picture can have millions of pixels, each of which can be a single feature. It's computationally expensive to deal with all these features, and it invites overfitting problems. PCA (principal components analysis) is a classic dimensionality reduction tool that compresses these many dimensions into the few that contain the most variation in the data, and those principal components are often then fed into a classic ML algorithm like an SVM. One of the best things about eigenfaces is the great example code that you can find in sklearn--you can distinguish pictures of world leaders yourself in just a few minutes! http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
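    If you want a taste before opening the link, here's a minimal sketch of that recipe (PCA to compress the pixel features, then an SVM on the principal components)--a simplified version in the spirit of the scikit-learn example, not the official code, and the parameter choices are just illustrative:

        from sklearn.datasets import fetch_lfw_people
        from sklearn.decomposition import PCA
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        # Labeled Faces in the Wild, restricted to people with many photos
        # (mostly world leaders); downloads the dataset on first use.
        lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
        X_train, X_test, y_train, y_test = train_test_split(
            lfw.data, lfw.target, stratify=lfw.target, random_state=0)

        # 150 "eigenfaces" stand in for thousands of raw pixel features.
        model = make_pipeline(PCA(n_components=150, whiten=True),
                              SVC(kernel="rbf", class_weight="balanced"))
        model.fit(X_train, y_train)
        print("people:", lfw.target_names)
        print("test accuracy:", model.score(X_test, y_test))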

  • Stats of World Series Streaks

    17/12/2014 Duration: 12min

    Baseball is characterized by a high level of equality between teams; even the best teams might only have 55% win percentages (contrast this with college football, where teams go undefeated pretty regularly). In this regime, where the 2 outcomes (Giants win/Giants lose) are approximately equally likely, we can model the win/loss chances with a binomial distribution. Using the binomial distribution, we can calculate an interesting little result: what's the chance of the world series going to only 4 games? 5? 6? All the way to 7? Then we can compare to decades' worth of world series data, to see how well the data follows the binomial assumption. The result tells us a lot about sports psychology--if each game were an independent coin flip, the model says roughly 12.5% of series should end in 4 games, 25% in 5, and about 31% each in 6 and 7. The data shows a different trend: sweeps and 7-game series turn up noticeably more often than the model predicts. There's a powerful psychological effect at play--everybody loves the 7th game of the world series, or a good sweep.
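    Here's a minimal sketch of that coin-flip calculation (assuming evenly matched teams, p = 0.5, and independent games--the same simplifications discussed above):

        from math import comb

        p = 0.5  # per-game win probability for either team
        for k in range(4, 8):
            # One team wins 3 of the first k-1 games, then wins game k; either team can do it.
            prob = 2 * comb(k - 1, 3) * p**4 * (1 - p)**(k - 4)
            print(f"{k}-game series: {prob:.1%}")
        # Prints roughly 12.5%, 25.0%, 31.2%, 31.2%--the baseline to compare
        # against the historical series lengths.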

  • Computers Try to Tell Jokes

    26/11/2014 Duration: 09min

    Computers are capable of many impressive feats, but making you laugh is usually not one of them. Or could it be? This episode will talk about a custom-built machine learning algorithm that searches through text and writes jokes based on what it finds. The jokes are formulaic: they're all of the form "I like my X like I like my Y: Z" where X and Y are nouns, and Z is an adjective that can describe both X and Y. For (dumb) example, "I like my men like I like my coffee: steaming hot." The joke is funny when the phrases "Z X" and "Z Y" are both very common, but X and Y are rarely seen together. So, given a large enough corpus of text, the algorithm looks for triplets of words that fit this description and writes jokes based on them. Are the jokes funny? You be the judge... http://homepages.inf.ed.ac.uk/s0894589/petrovic13unsupervised.pdf
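    Here's a toy sketch of that triplet search, scoring candidates over some made-up co-occurrence counts (purely illustrative--the linked paper uses a more careful unsupervised model):

        from itertools import combinations

        # Hypothetical counts of "adjective noun" phrases from some corpus.
        adj_noun_counts = {
            ("hot", "coffee"): 900, ("hot", "men"): 400, ("hot", "soup"): 700,
            ("strong", "coffee"): 800, ("strong", "men"): 600, ("bitter", "coffee"): 500,
        }
        # Hypothetical counts of the two nouns appearing in the same sentence.
        noun_noun_counts = {("coffee", "men"): 3, ("coffee", "soup"): 80, ("men", "soup"): 5}

        nouns = {noun for (_, noun) in adj_noun_counts}
        adjectives = {adj for (adj, _) in adj_noun_counts}

        def score(x, y, z):
            """High when Z is common with both X and Y, but X and Y rarely co-occur."""
            zx = adj_noun_counts.get((z, x), 0)
            zy = adj_noun_counts.get((z, y), 0)
            xy = noun_noun_counts.get(tuple(sorted((x, y))), 0)
            return (zx * zy) / (1 + xy)

        candidates = [(score(x, y, z), x, y, z)
                      for x, y in combinations(sorted(nouns), 2)
                      for z in adjectives]
        _, x, y, z = max(candidates)
        print(f"I like my {x} like I like my {y}: {z}.")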

  • How Outliers Helped Defeat Cholera

    22/11/2014 Duration: 10min

    In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak in London killed scores of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms that were spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”). By tracing the geography of all the deaths from the outbreak, Snow was practicing elementary data science--and stumbled upon one of history’s most famous outliers. In this episode, we’ll tell you more about this single data point, a case of cholera that cracked the case wide open for Snow and provided critical validation for the germ theory of disease. http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak

  • Hunting for the Higgs

    16/11/2014 Duration: 10min

    Machine learning and particle physics go together like peanut butter and jelly--but this is a relatively new development. For many decades, physicists looked through their fairly large datasets using the laws of physics to guide their exploration; that tradition continues today, but as ever-larger datasets are produced, machine learning becomes a more tractable way to deal with the deluge. With this in mind, ATLAS (one of the major experiments at CERN, the European Organization for Nuclear Research and home laboratory of the recently discovered Higgs boson) ran a machine learning contest over the summer, to see what advances could be found from opening up the dataset to non-physicists. The results were impressive--physicists are smart folks, but there are clearly lots of advances yet to be made as machine learning and physics learn from one another. And who knows--maybe more Nobel prizes to win as well! https://www.kaggle.com/c/higgs-boson
