A few days ago I decided to play around with Markov chains, ostensibly to improve the playlists generated on my computer, but somehow I got sidetracked and ended up making a horrible Twitter bot.
It started innocently enough, just building a Markov chain generator. For my Markov chain to work I need some sort of frequency table where, for any given track on my playlist, I can look up the possible next tracks and the probability of picking each. In Python, what I want is a dictionary where each element is a collection of the possible next elements in the chain, which I can sample from with probability equal to their observed frequency.
The first thing that comes to mind is a dictionary of Counter objects. A Counter would give me a nice count of each time an element came next in the chain, but it doesn't lend itself to the random sampling I want. So I extended the Counter class with a method to calculate the frequency of each value in the counter, and a method to randomly sample from the counter with the appropriate probability distribution. I call this ProbsCounter:
```python
import numpy as np
from collections import defaultdict, Counter

class ProbsCounter(Counter):
    '''Counter that tracks count and frequency, can randomly sample from
    the counter with a probability equal to the element's frequency'''

    def probabilitize(self):
        '''calculate the probabilities for each element in the counter'''
        self.probs = np.array(list(self.values()), dtype=np.uint) / sum(self.values())

    def sample(self):
        '''sample from the counter, with probs proportional to count'''
        try:
            self.probs
        except AttributeError:
            self.probabilitize()
        finally:
            return np.random.choice(list(self.keys()), p=self.probs)
```
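To make the behaviour concrete, here is the normalize-and-sample step in isolation, using a hypothetical play history (the track names are made up for illustration):

```python
import numpy as np
from collections import Counter

# Hypothetical play history: how often each track followed "Song A"
follows = Counter({"Song B": 3, "Song C": 1})

# Normalize the counts to probabilities, as probabilitize() does
probs = np.array(list(follows.values())) / sum(follows.values())

# Sample a next track with probability proportional to its count,
# so "Song B" comes up roughly three times as often as "Song C"
next_track = np.random.choice(list(follows.keys()), p=probs)
```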
Of course, now I need to build a frequency table and try all this out. Instead of diving right into building playlists, I figured I would try it on a corpus of text (as coded, the generator is agnostic to what the links in the chain actually are), just to see if it works properly.
Initially I tried it out on the full text of Moby Dick, and it worked in so far as it generated text. But this wasn't very satisfying, so I fed in my back history of tweets instead (which I have archived on my computer for...reasons...). By just picking a random word to get the Markov chain going I got...
So, not terrible, but largely nonsensical. Then I got to thinking: why just bigrams? So far I was taking pairs of words and building my table that way, so the Markov chain was picking from existing runs of two words. Instead I could work with trigrams, runs of three words, which should make it simulate real sentences better.
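The difference is easy to see on a toy sentence (a made-up example, not from my corpus):

```python
# Slice-and-zip is a handy way to extract n-grams from a word list
words = "why just bigrams I thought".split()

bigrams = list(zip(words, words[1:]))               # pairs of adjacent words
trigrams = list(zip(words, words[1:], words[2:]))   # runs of three words

print(bigrams[:2])   # [('why', 'just'), ('just', 'bigrams')]
print(trigrams[:2])  # [('why', 'just', 'bigrams'), ('just', 'bigrams', 'I')]
```

With trigrams, each step of the chain is conditioned on the previous two words instead of one, so the generated runs are much more likely to be grammatical.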
Another issue is that it often starts in the middle of a sentence, so the fake tweets can be grammatically bizarre. I could generate a table for the first two words of each tweet, so that when I start a chain I always sample from a real tweet opening. This should make the tweets sound more plausible.
To tease you a bit, here are some of the better outcomes that I tweeted.
The following function walks through a list of tweets and prepares two tables: one keyed by the first word and containing a ProbsCounter of second words (the starter table), and a second keyed by pairs of words, giving a ProbsCounter for the third word in the chain. After building the tables it calculates the probabilities and returns the tables ready for use.
```python
def trigram_frequency_table(tweets):
    '''take a list of tweets and build a frequency table'''
    ftable = defaultdict(ProbsCounter)
    startertable = defaultdict(ProbsCounter)
    for tweet in tweets:
        words = tweet.split()
        if len(words) < 2:
            continue  # need at least two words to seed a chain
        # record the first two words as a possible chain starter
        startertable[words[0]][words[1]] += 1
        # pad with None so the final pairs map to an end-of-tweet marker
        for w1, w2, w3 in zip(words, words[1:] + [None], words[2:] + [None, None]):
            ftable[(w1, w2)][w3] += 1
    # precompute the sampling probabilities for every counter
    for counter in ftable.values():
        counter.probabilitize()
    for counter in startertable.values():
        counter.probabilitize()
    return ftable, startertable
```
The Markov chain for this is fairly simple: given a frequency table and a key (a tuple of two words), the following generator iterates until it hits a None (indicating the end of an actual tweet).
```python
def trigram_chain(ftable, key):
    w1, w2 = key
    while w2 is not None:
        yield w2
        # sample the next word given the current pair, then slide the window
        w1, w2 = w2, ftable[key].sample()
        key = (w1, w2)
```
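To see how the None sentinel terminates a walk, here's a toy run over a hand-built table (hypothetical data, with plain lists standing in for the ProbsCounter values):

```python
import random

# A hypothetical hand-built table: (w1, w2) -> candidate third words.
# None marks the end of a tweet, just like in the real frequency table.
toy_table = {
    ("spring", "has"): ["been"],
    ("has", "been"): ["a"],
    ("been", "a"): ["lie"],
    ("a", "lie"): [None],
}

def toy_chain(table, key):
    '''same shape as trigram_chain, sampling with random.choice instead'''
    w1, w2 = key
    while w2 is not None:
        yield w2
        w1, w2 = w2, random.choice(table[(w1, w2)])

generated = ["spring"] + list(toy_chain(toy_table, ("spring", "has")))
print(" ".join(generated))  # spring has been a lie
```

Each pair here has only one candidate successor, so the walk is deterministic; the real table offers several candidates and the chain wanders.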
To build an actual tweet, though, I need to pick two words to start off, then iterate until I hit the end condition. This is fairly straightforward given the startertable I already constructed:
```python
import random

def make_tweet(ftable, startertable=None, key=None):
    assert key is not None or startertable is not None, \
        "Need to provide a key or starter table"
    if key is None:
        # pick a real tweet-opening word, then sample its follower
        w1 = random.choice(list(startertable))
        w2 = startertable[w1].sample()
        key = (w1, w2)
    else:
        w1, w2 = key
    tweet = [w1]
    for word in trigram_chain(ftable, key):
        tweet.append(word)
    return " ".join(tweet)
```
Now, finally, to put it all together.
```python
import json

with open("twidiocy/selenized.json") as mytweetjson:
    mytweets = json.load(mytweetjson)

freqtable, starts = trigram_frequency_table(mytweets)

for i in range(20):
    print(make_tweet(freqtable, startertable=starts))
    print("")
```
Stealing White https://t.co/KMIt5q0GrD Like last night: I dreamed I had no idea we had 24/7 bail hearings in Alberta. That's actually really amazing and worth preserving if possible https://t.co/0PUnR71lyo Having a lazy Sunday, drinking coffee and watching some Fortran compile. Theranos Is Wrong: We Don’t Need More Blood Tests https://t.co/kyrel631BL It's all about the value for money proposition How I like the new twitter interface. Difficult to use, but it has become natural (3/n) Right when I finished arranging the pastry balls but before I nearly vomited from repressed snark I'm glad I wasn't the only one. When I got a new machine, they can call it the Keurig Klan Oxidation States: https://t.co/EgeDgcGQfE (1/2) https://t.co/j… Reminds me of when my mom fwd me about how you totally failed at your Jetsons references? Something went wrong with 704058271443918848 @partiallyd I'm late to the croque-en-bouche, looked me straight in the backyard in your life, this fridge has got your back https://t.co/nyiJT0mVzU Sorry ARIMA, but I’m Going Bayesian https://t.co/F59etBKjD6 And then everyone left, being polite but obviously disappointed with me (6/n, n=6) @poop_tally well your first mistake was showering in poop... Sorry ARIMA, but I’m Going Bayesian https://t.co/F59etBKjD6 Now I kinda want to hang out with menswear dog https://t.co/uz8S9kGFHo Artisanal integers, & how to ensure your hipster Brooklyn integer is unique from your London high street integer https://t.co/Phw5FpkRjD Skating at the cabin. Watching the unbreakable Kimmy Schmidt, I am strong and independent and don't need no man, but then some hick made fun of my daily coffee regime where I go, if I want to relive my 20s by dancing and make it look good everyone arrived and filled in the sun. Spring has been a lie https://t.co/8CmYpdXTg6 Life does nothing but disappoint. It's *not* a book on the floor and left, visibly disgusted (5/n)
The results are crazy and delightful. They run the gamut from replicating an actual tweet verbatim (I guess one with a particularly unique word ordering) to bizarre mashups. One thing you will notice is that I deliberately have not ensured these are actually tweetable; they are often much longer than 140 characters, and I kind of don't care? The point was just to goof off and see what my twitter corpus (which admittedly includes all sorts of weird things) would look like when randomly mashed up and spat out.
As usual, the IPython notebook is on GitHub.