In a previous post I plotted some transit statistics and waved my hands around at the apparent and obvious correlation without doing any math. I figured I should re-address this and do some minimal statistics (what? I'm lazy) to show that there is *some* relation between transit usage and how easy it is to get around by transit (a shocking assertion). Part of the reason I didn't bother with any deeper analysis last time was that I figured my claim was *obvious*. But, on reflection, there is a lot going on there that could be analyzed to death, and since I have nothing to do this afternoon...

A brief recap: I abused the Google Maps API into giving me transit times for several thousand properties in Edmonton, with 10 residential properties chosen at random from each of the ~300 neighbourhoods. I gathered the average travel time to the city center from each of these properties, both by car and by public transit. The idea is that the city center (Churchill Square specifically) is at the convergence of the most bus routes, LRT lines, etc., so it is probably the easiest place to get to in Edmonton by transit. It also happens to be right in the heart of Downtown, where a lot of people would want to commute to anyways. The difference between the time it takes to get downtown by car and the time it takes to get downtown by transit is, I claim, reflective of how "effective" public transit is for a given neighbourhood. I averaged the results for each neighbourhood to get a `MeanDiff`, the mean difference between public transit and driving.

In some neighbourhoods, like where I live, the difference is a few minutes. In others it can be as much as 70 minutes. That is, you spend an additional 1 hour 10 minutes on the bus commuting to work versus just driving.
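As a minimal sketch of that averaging step: the neighbourhood names and travel times below are made up for illustration, and the helper `mean_diff_by_neighbourhood` is a hypothetical name, not necessarily how the original numbers were produced.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample: (neighbourhood, transit minutes, driving minutes).
# These values are invented for illustration; they are not from the real data.
trips = [
    ("Neighbourhood A", 14.0, 9.0),
    ("Neighbourhood A", 16.0, 10.0),
    ("Neighbourhood B", 85.0, 22.0),
    ("Neighbourhood B", 95.0, 24.0),
]

def mean_diff_by_neighbourhood(rows):
    """Average (transit - driving) travel time, in minutes, per neighbourhood."""
    diffs = defaultdict(list)
    for name, transit_min, drive_min in rows:
        diffs[name].append(transit_min - drive_min)
    return {name: mean(values) for name, values in diffs.items()}

mean_diff = mean_diff_by_neighbourhood(trips)
print(mean_diff)
```

A small value means transit is nearly as fast as driving from that neighbourhood; a large one means the bus costs you a lot of extra time.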

Naturally neighbourhoods that are centrally located tend to have the smallest difference, probably because they are already close to the endpoint anyways, but also there are more transit routes and transit options in the core of the city. A lot of bus routes end up downtown, so as they all converge, there are some roads with a ton of bus routes on them. Jasper Ave, for example, has innumerable bus routes heading to and from downtown just a block from my apartment, then once they hit the west end they branch out to whichever neighbourhood out there they serve.

Anyways, I want to see if there is a relationship between the number of people who take transit as their primary mode of transportation and the mean difference between travel times by public transit and travel times by car. To start I need a model and some more assumptions.

Suppose the number of people in neighbourhood $i$ who reported taking transit on the census is $y_{i}$ and there are $n_{i}$ people overall who responded to the census. I am going to model the number of transit users observed as a random variable drawn from a Binomial distribution with probability $p_{i}$, which is itself drawn from a beta distribution. The mean of the beta distribution is basically a logistic regression on `MeanDiff`, and its spread is controlled by a single parameter $\sigma$ that is shared (pooled) across all neighbourhoods.

$$ Y_{i} \text{ | } n_{i}, p_{i} \sim \mathcal{B} \left( n_{i}, p_{i} \right) $$ $$ p_{i} \text{ | } \mu_i, \sigma \sim \text{Beta} \left( \mu_i, \sigma \right) $$

$$ \mu_{i} = \text{logit}^{-1} \left( a + b\cdot \text{MeanDiff}_{i} \right) $$ $$ a \sim \mathcal{N} \left( \mu = 0, \tau = 10^{-12} \right) $$ $$ b \sim \mathcal{N} \left( \mu = 0, \tau = 10^{-12} \right) $$

$$ \sigma \sim \text{HalfCauchy} \left( 5 \right) $$
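To make the generative story concrete, here is a rough simulation of one neighbourhood using only the standard library. It assumes the beta distribution is parameterized as in the code further down ($\alpha = \mu/\sigma$, $\beta = (1-\mu)/\sigma$); the values of `a`, `b`, `sigma`, `n` and `mean_diff` are invented for illustration and are not fitted values.

```python
import math
import random

def inv_logit(x):
    """Inverse logit (logistic) function; fine for moderate values of x."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_neighbourhood(n, mean_diff, a, b, sigma, rng):
    """Draw (mu, p, y) for one neighbourhood from the beta-binomial model."""
    mu = inv_logit(a + b * mean_diff)      # regression mean
    alpha = mu / sigma                     # assumed parameterization
    beta = (1.0 - mu) / sigma
    p = rng.betavariate(alpha, beta)       # neighbourhood-level transit share
    y = sum(rng.random() < p for _ in range(n))  # binomial count as n Bernoulli draws
    return mu, p, y

rng = random.Random(0)
# Made-up parameter values, chosen only so the output looks plausible.
a, b, sigma = -1.0, -0.04, 0.02
mu, p, y = simulate_neighbourhood(n=500, mean_diff=30.0, a=a, b=b, sigma=sigma, rng=rng)
print(f"mu={mu:.3f}  p={p:.3f}  transit users={y} of 500")
```

The point of the extra beta layer is that two neighbourhoods with the same `MeanDiff` can still have different underlying transit shares, which a plain binomial regression would not allow.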

My priors for *a* and *b* are reasonably uninformative (a precision of $\tau = 10^{-12}$ corresponds to a standard deviation of $10^{6}$), and hopefully I have enough data that it won't matter too much (I'm too lazy to look up what the Jeffreys priors are for this). The prior for $\sigma$ was chosen to be plausibly uninformative while also strictly >0.
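For what it's worth, here is what $\sigma$ actually controls, assuming the beta distribution is parameterized the way the code further down does it, with $\alpha_i = \mu_i / \sigma$ and $\beta_i = (1 - \mu_i) / \sigma$. This is a quick derivation from the standard beta moments, not something taken from the fitted model:

$$ \alpha_i + \beta_i = \frac{\mu_i}{\sigma} + \frac{1 - \mu_i}{\sigma} = \frac{1}{\sigma} $$

$$ \mathbb{E} \left[ p_i \right] = \frac{\alpha_i}{\alpha_i + \beta_i} = \mu_i $$

$$ \text{Var} \left[ p_i \right] = \frac{\alpha_i \beta_i}{\left( \alpha_i + \beta_i \right)^2 \left( \alpha_i + \beta_i + 1 \right)} = \mu_i \left( 1 - \mu_i \right) \frac{\sigma}{1 + \sigma} $$

So under that parameterization $\sigma$ is not a variance itself but a dispersion knob: as $\sigma \to 0$ every neighbourhood's $p_i$ collapses onto the regression line $\mu_i$, and as $\sigma$ grows the variance approaches its maximum of $\mu_i (1 - \mu_i)$.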

A note on my assumptions: I am assuming that the people who responded to the census are essentially a *random draw* from the people who live in the neighbourhood. This, I imagine, is not entirely true, as there are systematic biases in the census process. For example, it could be that civic-minded people who respond to the census are also more likely to take transit. Barring any particular knowledge of these biases, I am assuming *they don't matter* and blundering ahead, but you should keep this assumption in the back of your head for later.

Also, I am using the newest census data (from 2016) whereas previously I used data from the 2014 census.

```
import pymc3 as pm

# `df` holds one row per neighbourhood with the columns used below:
# MeanDiff (minutes), total (census respondents), transit (transit users)
with pm.Model() as logistic_model:
    # Uninformative hyper-priors
    a = pm.Normal('a', mu=0, tau=1e-12)
    b = pm.Normal('b', mu=0, tau=1e-12)
    s = pm.HalfCauchy('sigma', beta=5)
    # probability of taking public transit as a logistic regression
    mu = pm.invlogit(a + b*df.MeanDiff.values)
    # beta distribution parameters
    alpha = mu/s
    beta = (1-mu)/s
    # observations (count of people who use public transit)
    y = pm.BetaBinomial('observations', alpha=alpha, beta=beta, n=df.total.values, observed=df.transit.values)
    # run the MCMC, starting from the MAP estimate
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    # keyword arguments: the third positional argument of pm.sample is not `start` in later pymc3 releases
    trace = pm.sample(20000, step=step, start=start, random_seed=111)
```

Above is a plot of the observed percentage of people who use public transit (as their primary mode of transport) versus the mean excess travel time over driving. Added in is the posterior predictive distribution for the model I generated. It clearly captures something, but a fair amount of the variability in public transit usage is not captured by my (pathologically) simple model. The $R^2$ for this model is 0.365, which is not great (insert the infinite caveats and warnings about even seeing an $R^2$ out of the corner of your eye, let alone *interpreting* one).
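For readers who want to compute something similar, one common definition of $R^2$ compares the observed proportions to the model's predicted proportions. The post does not say which definition produced 0.365, so treat this as a generic sketch; the `observed` and `predicted` lists are made-up numbers.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Invented transit shares for four neighbourhoods, for illustration only.
observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.15, 0.20, 0.25, 0.40]
print(round(r_squared(observed, predicted), 3))
```

With this definition a model that always predicts the overall mean scores 0 and a perfect model scores 1.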

For reference, the point estimate for *p*, the proportion of people using transit in a given neighbourhood, is:

$$ \hat{p}_{i} = { \exp \left( a + b \cdot \text{MeanDiff}_{i} \right) \over {1 + \exp \left( a + b \cdot \text{MeanDiff}_{i} \right) }} $$
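Here is a small sketch of that point estimate in code, together with the usual logistic-regression reading of *b* as a multiplier on the odds. The coefficients `a_hat` and `b_hat` are made-up placeholders, not my fitted values.

```python
import math

def transit_share(mean_diff, a, b):
    """Point estimate of the transit share: exp(z) / (1 + exp(z)) with z = a + b * mean_diff."""
    z = a + b * mean_diff
    return 1.0 / (1.0 + math.exp(-z))  # algebraically the same as exp(z) / (1 + exp(z))

def odds_multiplier(b, extra_minutes=1.0):
    """Factor by which the odds of taking transit change per `extra_minutes` of MeanDiff."""
    return math.exp(b * extra_minutes)

# Placeholder coefficients, chosen only for illustration.
a_hat, b_hat = -1.0, -0.04
for minutes in (5, 30, 70):
    print(minutes, round(transit_share(minutes, a_hat, b_hat), 3))
print("odds multiplier per extra 10 minutes:", round(odds_multiplier(b_hat, 10), 3))
```

A negative *b* means every extra minute on the bus (relative to driving) shrinks the odds of someone taking transit by a constant factor.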

So what can examining the model parameters tell us? Well, if transit usage is *completely unrelated* to the mean difference we would expect *b* to be pretty close to zero. Specifically we would expect the posterior distribution for *b* to sit squarely on zero. However, looking at the sampled posterior distributions, this is not the case. In fact $P(b < -0.03 \text{ | } \text{data}) > 0.999$. So we have some pretty good evidence that there is some link between the two.
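A posterior probability like that is just the fraction of MCMC draws that fall below the threshold. In the sketch below the draws are synthetic stand-ins generated from a normal distribution (made-up mean and spread); with a real pymc3 trace you would use the array `trace['b']` instead.

```python
import random

def tail_probability(samples, threshold):
    """Fraction of posterior draws strictly below `threshold`."""
    return sum(s < threshold for s in samples) / len(samples)

# Synthetic stand-in for posterior draws of b; NOT the real posterior.
rng = random.Random(111)
fake_b_draws = [rng.gauss(-0.045, 0.004) for _ in range(20000)]
print(tail_probability(fake_b_draws, -0.03))
```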

**NOTE** This is not a *causal* claim. I cannot say, given the data I have, that public transit being terrible is *why* people don't take transit. There could be some confounding factor, some *other thing* that is correlated with both transit usage *and* the mean difference in travel time. It could also be the case, for example, that people who would take transit self-select into neighbourhoods that already have good transit. In that case, improving transit in a neighbourhood where people don't take transit would be futile, at least in the short term.