I had a conversation today about the curse of dimensionality and ended up pulling out ESL as a resource for some more talk on this. In there I found a great little derivation that seems ridiculously complicated on its face, has deep practical implications, and can be done entirely with facts you learn in an intro probability and statistics course.
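The derivation itself is for the post, but a two-line computation gives the flavour of the curse: to capture a fraction $r$ of a unit hypercube's volume with a sub-cube, you need a sub-cube of edge length $r^{1/d}$, which races toward 1 as the dimension $d$ grows. A quick sketch (the 10% target is an arbitrary choice for illustration):

```python
# Edge length of a sub-cube capturing a fraction r of the unit
# hypercube's volume in d dimensions: r ** (1/d). Even to grab 10%
# of the volume in 100 dimensions you must span ~98% of each axis.
for d in (1, 2, 10, 100):
    edge = 0.10 ** (1 / d)
    print(f"d={d}: edge={edge:.3f}")
```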
The other day I pip installed Google's or-tools and started madly optimizing every facet of my life using constraint programming. Er, well, actually I've been futzing about, solving toy problems and trying to figure out how to use or-tools.
It's been a crazy week of analysis on the US Presidential election and how it came to be that Trump won the Electoral College while Clinton won the popular vote. There are lots of threads to pull on, but I'm going to focus on one particular theory I read online: the Electoral College is skewed, with some states, like Nebraska, having more EC votes than you would expect given their populations, while others, like California, have many fewer. California, by population share alone, should have 11 more EC votes. So did this skew conspire to give Trump the election?
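The arithmetic behind that "11 more EC votes" claim is worth making explicit. With rough 2016 figures (my population numbers here are approximate, for illustration only), California's proportional share of the 538 EC votes comes out around 65, roughly ten or eleven more than its actual 55:

```python
# Rough check of California's EC votes vs its population share.
# Population figures are approximate 2016 values, for illustration.
ca_pop = 39.25e6     # ~California population
us_pop = 323.1e6     # ~US population
total_ec, ca_ec = 538, 55

proportional = total_ec * ca_pop / us_pop
print(f"proportional: {proportional:.1f}, shortfall: {proportional - ca_ec:.1f}")
```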
The U.S. Presidential election was on Tuesday and the result was not what many people were expecting, to put it mildly. Let's take a voyage of discovery to find out how we ended up here, and what to make of the apparent failure of election forecasters to predict...this...
I was at a Halloween party last night and we played the game Codenames, which is a very fun party game. In the game, two teams compete to find their colored squares on a 5x5 grid. The grid starts with cards on it, each with a single word, and only the Spymasters have the map of which square corresponds to which color. The twist is that the Spymasters (there's one for each team) can only communicate to their team using a single word and a number, where the number is the number of cards that correspond to the word in some way. So, for example, the Spymaster might say "ocean, four" and their team now has to find four cards that somehow correspond to "ocean". If you pick the wrong card your turn ends, and you may end up revealing one of the opposing team's cards by mistake (which is a bonus to them).
I read Cathy O'Neil's Weapons of Math Destruction a few weeks ago and I continue to mull it over, and I want to spend some time expanding on one small part of the book. In WMD Ms. O'Neil talks a fair bit about how models can lead to terrible outcomes due to a fusion of their particular blind spots and perverse incentives. One thing I would like to expand upon is how these blind spots can develop naturally and be obscured by the naive performance metrics one typically uses to judge how well a model is functioning.
We are in the final months of the 2016 US presidential election and people have gone a little nuts analyzing and overanalyzing the various election forecasts out there on the internets. I want to take a minute and explore what exactly the election forecasters are doing, using silly toy examples.
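Here's the flavour of toy example I have in mind: pretend a single poll pins down candidate A's support up to some Gaussian noise, simulate a pile of elections consistent with it, and count how often A wins. All the numbers are invented:

```python
# A silly toy forecast: treat one poll as a noisy read on candidate A's
# true support, simulate many elections, and report a win probability.
import random

random.seed(42)
poll_mean, poll_sd = 0.52, 0.03   # hypothetical poll: 52% +/- 3 points
trials = 10_000
wins = sum(random.gauss(poll_mean, poll_sd) > 0.5 for _ in range(trials))
print(f"P(A wins) ~ {wins / trials:.2f}")
```

Note how a 2-point lead turns into a win probability nowhere near 100%, which is most of what separates a forecast from a poll.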
In a previous post I plotted some transit statistics and waved my hands around at the apparent and obvious correlation without doing any math. I figured I should re-address this and do some minimal statistics (what? I'm lazy) to show that there is some relation between transit usage and how easy it is to get around by transit (a shocking assertion). Part of the reason I didn't bother with any deeper analysis last time was that I figured my claim was obvious. But, on reflection, there is a lot going on there that could be analyzed to death, and since I have nothing to do this afternoon...
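By "minimal statistics" I mean something on the order of a single correlation coefficient. A sketch, with made-up numbers standing in for the real transit data:

```python
# Pearson correlation between a (hypothetical) per-neighbourhood
# transit time penalty and transit ridership share. The numbers are
# invented for illustration, not drawn from the actual data.
import math

transit_penalty_min = [5, 10, 15, 20, 25, 30, 35]          # extra minutes vs car
ridership_share     = [0.30, 0.25, 0.22, 0.15, 0.12, 0.08, 0.05]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r(transit_penalty_min, ridership_share)
print(round(r, 3))  # strongly negative: longer penalty, lower ridership
```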
This is one of those questions I could probably answer with research, but I'm lazy so I am going to do simulations. Anyway, it came up in my life that I needed to check a data set to see if its values are normally distributed. There are a couple of ways of doing this (I lean towards a KS test) but one that was recommended was a $\chi^2$ test. Of course the $\chi^2$ test typically requires the data to be in discrete bins, and this got me thinking: surely the test is highly dependent on the binning I choose, so, presumably, I could fiddle with that variable to get whatever answer I wanted. Presumably.
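To make the worry concrete before simulating it properly, here's a stdlib-only sketch: bin one normal sample into different numbers of equal-probability bins and compute the $\chi^2$ goodness-of-fit statistic against N(0,1) for each. The statistic (and the critical value it gets compared to) moves with the bin count:

```python
# Chi^2 goodness-of-fit statistic for the same sample under different
# bin counts. Stdlib only: NormalDist supplies the bin edges.
import random
from statistics import NormalDist

random.seed(0)
nd = NormalDist(0, 1)
sample = [random.gauss(0, 1) for _ in range(1000)]

def chi2_stat(data, k):
    """Chi^2 statistic vs N(0,1) using k equal-probability bins."""
    edges = [nd.inv_cdf(i / k) for i in range(1, k)]
    counts = [0] * k
    for x in data:
        counts[sum(x > e for e in edges)] += 1   # bin index of x
    expected = len(data) / k
    return sum((c - expected) ** 2 / expected for c in counts)

for k in (5, 10, 20, 50):
    print(k, round(chi2_stat(sample, k), 1))
```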
Recently I've been posting about Edmonton Transit and drawing a lot of my data from the 2014 Edmonton Census. Well, the 2016 census data is now available on the Edmonton Open Data Portal, so I should take a peek and see what's changed.
I recently discovered a bug in how I was generating choropleth maps with cartopy. Basically, the add_geometries() function for adding shapes to matplotlib maps was not mapping the correct face colours to the correct geometries.
Yesterday I put together some maps showing some results culled from Google Maps on how effective Edmonton Transit is vs driving your own car. It looked pretty grim for the 'burbs, with average transit times being 20-30 min longer than the equivalent trip by car (almost twice as long!). But I didn't really answer whether or not this has any impact on actual transit ridership. Intuitively we think it should, but there are lots of other factors as well, such as economics. If you've got no money you still have to take the bus (because it is slightly cheaper than driving), even if it takes you hours.
I was having a conversation, the other day, about how much of a pain it is to take ETS to and from the 'burbs. I am a big fan of not driving as much as possible, and I resisted owning and driving a car for years in Edmonton (notably when I lived downtown and worked either downtown or at the University), but no longer: I have a car. My particular breaking point was working in a business park that wasn't really transit accessible -- by bus, train, bus, and then walking, my trip to work took over an hour each way, while by car it dropped to 20 minutes max. I figure this experience generalizes well and explains why transit ridership is really low in Edmonton. Transit takes forever and it sucks, whereas everywhere is a 20-30 min drive from everywhere else in this town.
I want to take a minute and talk about risk analysis and some of the underlying assumptions that usually go unexamined.
Recently I sat down and made some maps of Edmonton with overlays for various and sundry bits of the census. This got me interested in looking into the API for the open data portal and seeing what I could do with that.
I've been working on a project for the past few weeks that involves parsing a bunch of data sets to generate some aggregate statistics at the neighbourhood level in my hometown of Edmonton. Staring at tables of numbers and scrutinizing a ROC curve can only inspire you so much. Today I'm taking a break and making some maps of Edmonton's latest property values dataset and most recent city census (done in 2014).
A few days ago I decided to play around with Markov chains (to improve the playlists generated on my computer, actually) but somehow I got sidetracked and ended up making a horrible twitter bot.
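The bot's guts are basically this: a word-level Markov chain that records which words follow which, then takes a random walk through the table. A minimal sketch (the training text here is a stand-in, not the bot's actual corpus):

```python
# Word-level Markov chain: build a transition table from observed
# word pairs, then walk it to babble out new text.
import random
from collections import defaultdict

text = "the cat sat on the mat and the cat ran on the mat"
words = text.split()

# transition table: word -> list of words observed to follow it
chain = defaultdict(list)
for a, b in zip(words, words[1:]):
    chain[a].append(b)

random.seed(1)
current = "the"
out = [current]
for _ in range(8):
    followers = chain.get(current)
    if not followers:
        break
    current = random.choice(followers)
    out.append(current)
print(" ".join(out))
```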
Matt Parker recently posted a video with a neat mathematical card trick. Like all good things in life it involves lots of tedious counting and shuffling, so brace yourself.
In a previous post I looked at various ways of partitioning a group of people into teams, such that each individual's preferences for teammates are taken into consideration and the overall happiness of the team, and thus the corporation, is maximized. I've been spinning ever more complicated ways of doing this in my head, so why not try one out?
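The simplest version of "try one out" is brute force over all partitions. A toy sketch with four people and an invented preference matrix (real instances blow up combinatorially, which is half the fun):

```python
# Brute-force team partitioning: split four people into two pairs so
# the sum of mutual preference scores is maximized. Preference numbers
# are hypothetical, for illustration only.
from itertools import combinations

people = ["A", "B", "C", "D"]
# pref[x][y]: how much x likes being teamed with y
pref = {
    "A": {"B": 3, "C": 1, "D": 0},
    "B": {"A": 2, "C": 0, "D": 1},
    "C": {"A": 1, "B": 0, "D": 3},
    "D": {"A": 0, "B": 1, "C": 2},
}

def happiness(team):
    return sum(pref[x][y] for x in team for y in team if x != y)

best = None
for first in combinations(people, 2):
    second = tuple(p for p in people if p not in first)
    score = happiness(first) + happiness(second)
    if best is None or score > best[0]:
        best = (score, first, second)

print(best)  # best pairing and its total happiness
```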
Partially Derivative had a great segment on AI artists -- honestly: hilarious and thought provoking -- discussing a new project wherein a team of programmers and art historians managed to train up an algorithm to reproduce the style of Rembrandt. The computer doesn't replicate existing Rembrandts; it paints new ones that replicate the style of Rembrandt. Vidya brought up the question: who is the artist here? Is it the programmer? Or are we willing to grant artistic ownership to the computer itself?