Finding the hot singles in my neighbourhood

Posted in civics on Sunday, May 29, 2016

I've been working on a project for the past few weeks that involves parsing a bunch of data sets to generate some aggregate statistics at the neighbourhood level in my hometown of Edmonton. Staring at tables of numbers and scrutinizing a ROC curve can only inspire you so much. Today I'm taking a break and making some maps of Edmonton's latest property values dataset and most recent city census (done in 2014).

Making maps of Edmonton

Loading the shapefile and indexing it by neighbourhood

Edmonton's open data portal is super convenient: it has a shapefile you can download with the boundaries of all the neighbourhoods in town, all properly aligned with latitude and longitude.

I loaded the shapefile into geopandas, and now I need to think about how I'm going to join this dataframe with whatever other data source I end up with. I know I want to index by neighbourhood, and conveniently each neighbourhood is assigned a unique number, so I can use that as the index. Assigning an index is important with pandas because it makes combining dataframes easy: pandas can take two dataframes and stick them together by lining up their indexes.
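
To illustrate what I mean by lining up indexes (toy numbers, not real assessments):

left = pd.DataFrame({'median': [350000, 425000]}, index=[1010, 1020])
right = pd.DataFrame({'name': ['DOWNTOWN', 'OLIVER']}, index=[1020, 1010])

# concat aligns rows by index, not by position:
# 1010 -> (350000, 'OLIVER'), 1020 -> (425000, 'DOWNTOWN')
combined = pd.concat([left, right], axis=1)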

However, not all the datasets I'm going to look at include the neighbourhood number, so I also need a way to map neighbourhood names to numbers. To do this I set up a defaultdict that returns NaN if the neighbourhood doesn't exist in the shapefile. This matters because neighbourhoods change over time, especially along the border of the city, where new neighbourhoods are being made every year; I can't map neighbourhoods that no longer exist.

import glob
import wget
import zipfile
import pandas as pd
import numpy as np

from geopandas import GeoDataFrame
from collections import defaultdict
from itertools import repeat

shapefiles_url = 'https://data.edmonton.ca/api/geospatial/jfvj-x253?method=export&format=Shapefile'
shapefiles_zip = wget.download(shapefiles_url)

# Unpack the zip and grab the .shp file inside
with zipfile.ZipFile(shapefiles_zip, 'r') as zip_ref:
    zip_ref.extractall('Neighbourhood_shapefiles')

edmonton_hoods_shapefile = glob.glob('Neighbourhood_shapefiles/*.shp')[0]

edmonton_hoods = GeoDataFrame.from_file(edmonton_hoods_shapefile)
# Upper-case the names so they match the other datasets, and index by the
# unique neighbourhood number
edmonton_hoods['Neighbourhood'] = edmonton_hoods['name'].str.upper()
edmonton_hoods['number'] = pd.to_numeric(edmonton_hoods['number'])
edmonton_hoods.index = edmonton_hoods['number']

# A name -> number lookup that hands back NaN for unknown neighbourhoods
neighbourhood_table = defaultdict(repeat(np.nan).__next__)
neighbourhood_table.update(zip(edmonton_hoods['Neighbourhood'], edmonton_hoods['number']))
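
The lookup behaves like a plain dict for neighbourhoods that exist and quietly returns NaN otherwise (Griesbach's number comes from the shapefile; the second lookup is obviously made up):

neighbourhood_table['GRIESBACH']   # 3111
neighbourhood_table['NARNIA']      # nan (note: a defaultdict lookup also inserts the key)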

Building maps in matplotlib

Building maps with matplotlib is not difficult, but it can be tedious if you want to make a lot of them, so I'm going to define a helper function, make_map.

%matplotlib inline

import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

import cartopy.crs as ccrs
from cartopy.io.img_tiles import GoogleTiles
from cartopy.feature import ShapelyFeature

# Fetch Stamen Toner tiles by overriding the URL that GoogleTiles requests
class StamenToner(GoogleTiles):
    def _image_url(self, tile):
        x, y, z = tile
        return 'http://tile.stamen.com/toner/{}/{}/{}.png'.format(z, x, y)

def make_map(df, key, style='YlGnBu', imagery=StamenToner(), extent=(-113.8, -113.2, 53.35, 53.75), zoom=11):
    df = df.loc[:, [key, 'geometry']].dropna(subset=[key, 'geometry'])
    norm = Normalize(vmin=df[key].min(), vmax=df[key].max())
    cmap = ScalarMappable(norm=norm, cmap=style)

    fig, ax = plt.subplots(figsize=(15,15), subplot_kw={'projection':imagery.crs})
    ax.set_extent(extent)
    ax.add_image(imagery,zoom)

    # There's a weird bug in cartopy where, if I add the whole series of
    # geometries with the whole series of colours, i.e.
    #   ax.add_geometries(df['geometry'], ..., facecolor=df['colour'])
    # the geometries don't get coloured correctly, so add them one at a time.
    for index, row in df.iterrows():
        # add_geometries expects an iterable, so wrap bare Polygons in a tuple
        try:
            iter(row['geometry'])
        except TypeError:
            geom = (row['geometry'],)
        else:
            geom = row['geometry']

        colour = cmap.to_rgba(row[key])
        ax.add_geometries(geom, ccrs.PlateCarree(), facecolor=colour, edgecolor='#555555', alpha=0.8)

    cmap.set_array([])  # the ScalarMappable needs an array before it can back a colorbar
    plt.colorbar(cmap)

Making maps in Folium

Folium is a crazy convenient way of making real maps: it lets you build overlays on a Leaflet map. The take-home for those of you who don't know what either of those is: I can put overlays on a map pretty much like Google Maps, but more customizable.

Building a map is much easier with Folium than with matplotlib, as you can see from my little helper function, which takes a dataframe and builds a fancy interactive map.

import folium

def make_folium_map(df, key, style='YlGnBu'):
    df = df.loc[:, [key, 'geometry']].dropna(how='any')
    # Hacky: stringify the index into a column that choropleth can join on
    df.loc[:,'index'] = df.index.astype(str)
    gjson = df.loc[:,['geometry']].to_json()

    folmap = folium.Map(location=[53.55, -113.5], tiles="Stamen Toner", zoom_start=11)
    folmap.choropleth(geo_str=gjson, data=df,
                      columns=['index', key],
                      key_on='feature.id',
                      fill_color=style, fill_opacity=0.7, line_opacity=0.2,
                      reset=True)
    return folmap
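
The hacky index column is there because geopandas' to_json writes the dataframe index out as each GeoJSON feature's id, and choropleth joins the data columns against key_on='feature.id'; turning the index into a string column gives the two sides a common key to match on.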

Mapping Property Values Data

Property assessment data is also available from the Edmonton open data portal, this time as a CSV. The table is huge: it contains the assessed value of over 300,000 properties in the city, complete with address and lat/long coordinates.

property_values_url = 'https://data.edmonton.ca/api/views/q7d6-ambg/rows.csv?accessType=DOWNLOAD'
property_values_csv = wget.download(property_values_url, out='datasets/')

One irritating thing is that the assessed value comes with a dollar sign attached, so pandas interprets it as a string instead of a number. I need to fix up the values by extracting the digits, leaving any NaNs in place.

import re

def getvals(valstring):
    # Pull the digits out of strings like '$7488500'; pass NaN floats through
    if isinstance(valstring, str):
        return float(re.findall(r'\d+', valstring)[0])
    else:
        return valstring

def fix_vals(df):
    series = df['Assessed Value'].apply(getvals)
    return pd.to_numeric(series, errors='coerce')
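
A quick sanity check, using a real value from the dataset and the NaN case:

getvals('$7488500')   # 7488500.0
getvals(np.nan)       # nan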

Another issue is that the 'Neighbourhood' column contains a handful of non-standard neighbourhood names, typos really, which need to be cleaned up if I want to aggregate by neighbourhood. After a quick search through the dataframe I found the following five neighbourhoods whose names don't match the shapefile, and I can easily map them to the right neighbourhood number.

neighbourhood_table['ANTHONY HENDAY SOUTHEAST'] = neighbourhood_table['ANTHONY HENDAY SOUTH EAST']
neighbourhood_table['PLACE LA RUE'] = neighbourhood_table['PLACE LARUE']
neighbourhood_table['RAPPERSWIL'] = neighbourhood_table['RAPPERSWILL']
neighbourhood_table['SOUTHEAST (ANNEXED) INDUSTRIAL'] = neighbourhood_table['SOUTHEAST INDUSTRIAL']
neighbourhood_table['WESTBROOK ESTATE'] = neighbourhood_table['WESTBROOK ESTATES']

Now I'm ready to load the property values datafile, keep only the residential properties, clean up the assessed values, and assign the proper neighbourhood numbers, then group by neighbourhood.

residential = (pd.read_csv(property_values_csv)
                 .where(lambda df: df['Assessment Class'] == 'Residential')  # blank out non-residential rows
                 .assign(value=fix_vals,
                         name=lambda df: df['Neighbourhood'])
                 .replace({'Neighbourhood': neighbourhood_table})  # names -> numbers
                 .dropna(subset=['value', 'Neighbourhood'])
                 .groupby(['Neighbourhood']))
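
At this point residential is a pandas groupby object keyed by neighbourhood number, so every aggregate() call below collapses it down to one row per neighbourhood.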

The Median Value of a Residential Property

The first and most obvious thing to map is the median value of homes. Using the aggregate() function I can collapse the table of every property down to a table of neighbourhoods and their median values. Attaching the neighbourhood boundaries, I can go ahead and map the median value for each neighbourhood.

aggregated = pd.DataFrame()
aggregated.loc[:,'median'] = residential['value'].aggregate(np.median)
aggregated = GeoDataFrame( pd.concat([aggregated, edmonton_hoods], axis=1) )

make_map(aggregated, 'median')

[Map: median residential property value by neighbourhood]

Well, one neighbourhood certainly stands out, with a median property value of $10,000,000. If I take a peek at the data, I see that the neighbourhood of Crystallina Nera East has only two residential properties, and they are both multimillion dollar properties.
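
One way to take that peek, pulling that neighbourhood's group out of the groupby from earlier (a sketch; any inspection method would do):

residential.get_group(neighbourhood_table['CRYSTALLINA NERA EAST'])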

Neighbourhoods with only one or two properties are going to be irritating outliers, so if I drop the neighbourhoods with, say, five or fewer properties I should get a better picture of what's going on. The two properties in question:

                   274446                  307103
Account Number     3771052.0               10396744.0
Suite              NaN                     NaN
House Number       18104.0                 17350.0
Street Name        66 STREET NW            66 STREET NW
Assessed Value     $7488500                $14130000
Assessment Class   Residential             Residential
Neighbourhood      2462                    2462
Garage             N                       N
Latitude           53.644388               53.639656
Longitude          -113.449489             -113.448943
name               CRYSTALLINA NERA EAST   CRYSTALLINA NERA EAST
value              7488500.0               14130000.0
# Count the properties in each neighbourhood, then map only those with more than five
aggregated.loc[:, 'count'] = residential['Assessment Class'].aggregate(len)

make_map(aggregated.loc[aggregated['count'] > 5], 'median')

[Map: median residential property value, neighbourhoods with more than five properties]

The Extremes of Residential Property Values

Now I want to know where the most expensive properties are. If I map the max property value in each neighbourhood I get the following.
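
Mapping the max follows the same recipe as the median (a sketch reusing the aggregated dataframe from above):

aggregated.loc[:, 'max'] = residential['value'].aggregate(np.max)

make_map(aggregated, 'max')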

[Map: maximum residential property value by neighbourhood]

a 45 million dollar PMQ

There is a $45,009,500 former PMQ in Griesbach, though the street address doesn't line up with the lat/long coordinates; in fact, according to Google Maps, the street address doesn't appear to correspond to an actual place.

Apparently there is a lot of variability in Griesbach: the lowest-value residential property is assessed at $500 (what is that? a dog house? a parking spot? part of a patio?) and it sits in some condo complex off Castle Downs Road.

Either that or there is a data quality problem.
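
Digging those two rows out of the Griesbach group is straightforward (a sketch, using idxmax and idxmin on the assessed values):

griesbach = residential.get_group(neighbourhood_table['GRIESBACH'])
griesbach.loc[[griesbach['value'].idxmax(), griesbach['value'].idxmin()]]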

                   The Most Valuable Property   The Least Valuable Property
Suite              N/A                          N/A
House Number       14604                        N/A
Street Name        97 STREET NW                 N/A
Garage             N                            N
Assessed Value     $45009500                    $500
Assessment Class   Residential                  Residential
Neighbourhood      3111                         3111
name               GRIESBACH                    GRIESBACH
Latitude           53.6104                      53.6125
Longitude          -113.498                     -113.515

Mapping Some of the Edmonton 2014 Census Datasets

The Edmonton data portal has a whole series of subsets of the 2014 city census. These datasets are much easier to play with, since they are already aggregated by neighbourhood and come with the neighbourhood number ready to use as an index. The numbers are also stored as plain numbers (unlike the property values CSV), so pandas can recognize what they are.

One downside, for my particular use, comes with any voluntary census: a rather high non-response rate. In many neighbourhoods all or almost all of the inhabitants just didn't answer, which sucks. I'm going to ignore that and play fast and loose with the data for the sake of mapping here.

Neighbourhoods in Edmonton also vary greatly in population, and if I mapped the raw prevalence of almost any variable I would expect Oliver and Downtown to feature prominently, because those are the two most populous neighbourhoods. What I really want to look at is neighbourhood makeup as a fraction. To make this easier I'm going to define a function that takes all the columns in a dataframe and gives back the columns divided by the row-wise sum, i.e. turns them into fractions. It also tacks on the row total, which is convenient since I will want to filter out neighbourhoods with very few respondents.

def frequentize(df):
    df = df.copy()
    dftotal = df.sum(axis=1)            # total respondents per neighbourhood
    return (df.div(dftotal, axis=0)     # counts -> fractions, row by row
              .assign(total = dftotal)
              .fillna(0))
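
A toy example of what frequentize does (made-up numbers):

counts = pd.DataFrame({'transit': [30, 5], 'walk': [10, 5]}, index=[1010, 1020])
frequentize(counts)
#       transit  walk  total
# 1010     0.75  0.25     40
# 1020     0.50  0.50     10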

Transit Usage

This is one of my favourite datasets: How do people get around?

This is probably pretty hinky, but I'm just going to pretend the 'No Response' people don't exist and assume that the people who responded to the census are actually representative of Edmontonians.

transit_url = 'https://data.edmonton.ca/api/views/9f7k-3qk6/rows.csv?accessType=DOWNLOAD'
transit_csv = wget.download(transit_url, out='datasets/')

transit = (pd.read_csv(transit_csv)
             .drop(['WARD', 'NEIGHBOURHOOD_NAME', 'NO_RESPONSE'], axis=1)
             .rename(columns={'CAR/TRUCK/VAN_(AS_DRIVER)': 'car driver',
                              'CAR/TRUCK/VAN_(AS_PASSENGER)': 'car passenger'})
             .rename(columns=str.lower)
             .set_index('neighbourhood_number')
             .sort_index()
             .where(lambda df: ~(df == 0).all(axis=1))  # blank out all-zero neighbourhoods...
             .dropna(how='all')                         # ...then drop them
             .pipe(frequentize))

merged_transit = GeoDataFrame( pd.concat([transit, edmonton_hoods], axis=1) )

I can make a nice map of transit usage by neighbourhood, again as a fraction of the neighbourhood population. If we mapped the raw numbers we would get a map that mostly highlights the core of the city, which is where the most densely populated neighbourhoods are.
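
The map itself is one call to the helper from earlier (a sketch; the cutoff of 50 respondents is an arbitrary choice to filter out nearly-empty neighbourhoods):

make_map(merged_transit.loc[merged_transit['total'] > 50], 'transit')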

[Map: fraction of commuters using transit, by neighbourhood]

A more interesting map might be where the "not by car" people live, combining transit, walking, cycling, and "other" (whatever those are; I'm putting my money on hot air balloons, autogyros, etc.).

merged_transit.loc[:, 'not car'] = (merged_transit['transit'] + merged_transit['walk']
                                    + merged_transit['bicycle'] + merged_transit['other'])

make_folium_map(merged_transit, 'not car')