After a month's break I am ready to re-enter the award-winning blog-writing world with another installment of "cool stuff I have added to KevBot lately". WARNING: This post is about algorithms and math. I love algorithms and math. If you do not like algorithms or math then I will try to make this post make sense regardless but know that I am disappointed in you for not liking algorithms and math.
As you'll recall, a Markov process is one where the next state's probabilities are dependent on only the previous state that the system was in. So taking the example I've used before, a Markov process is like looking at a book by looking at every two adjacent words and coming up with the next word based solely on the current one. Depending on the book used, the results will be different: a Markov-based Harry Potter generator will write about Harry Potter, while an Old Man's War generator will be totally badass.
This is the basis to impersonation: each IRC user is treated as its own source material. So KevBot can be asked to generate a Dean-like sentence or a qed-like sentence and the results will be as different as qed is from me (which is to say his will talk a lot more about Linux and Perl while mine will bitch about accessibility). Treating each user as a different source turns out to be the key to adding statistical guesswork to KevBot's already phenomenal repertoire of functionality.
You see, KevBot's non-impersonating Markov structure stores all word pairs globally. If I say, "a cat", KevBot will store the pair ("a","cat") in its database and assign it a weight of 1. If qed then says, "a dog", KevBot will store ("a","dog"), also with a weight of 1. Then, if KevBot is asked to find a word following "a", both "cat" and "dog" are equally likely. The weights are assigned based on how often particular phrases are said; ergo, a pair like "five dollars" is much more likely to be generated than a pair like "five forearms". (For the extraterrestrials who may be reading this, humans have only two forearms most of the time.)
At the same time, KevBot stores additional information specifically about who said which pair. In parallel with the global structure is the impersonation structure, which stores the same word pairs but with the additional datum of who actually owns that pair. In this case, if KevBot is asked to generate a Dean-like word following "a", then "cat" is the only option. For qed, "dog" is the only option.
That's how it works when you're generating sentences, anyway. The new functionality that I have added effectively reverses this process by asking the following question: for a given sentence, whom does it resemble statistically the most? How do we answer that question, given that we have the kind of data structures I've described?
Let's look at it with an example. Someone asks Kevbot, "Who said 'a cat'?" KevBot looks at its database and sees that Dean said "a cat", so KevBot replies, "I think Dean said it." That's simple enough. But what if other people had also said it? If you're anything like me you can already think of dozens of ways to approach this problem. I'll just tell you the way I've been doing it, which I find is fairly efficient and yields good results.
I use weights to determine who says a given phrase the most, as a ratio of how many times it has been said in total to how often a given user has said it. So for a given word pair like "a cat", I check the global database and record how often it was said, regardless of who said it. Then for each of the most prolific users I calculate the ratio of how many times that user said the pair compared to everyone else. The weights of those pairs are the number of times, so really the winner is whoever has the highest weight.
But something that simple doesn't scale to entire sentences. (Such as, "I ate every piece of cheese in the southern hemisphere last Tuesday.") For longer sentences, I break them up into word pairs and find the averages of all of the weights for all of the pairs for all of the users. Then, at the end, whoever has the biggest average wins, and the average is printed out and called KevBot's "confidence" that a given user said the phrase. The higher the confidence, the more statistically likely it was for the winning user to have actually said the phrase.
For the record,
<KevBot> I am 10.66% confident that Dean said, "I ate every piece of cheese in the southern hemisphere last Tuesday."
Even when I try to make up dumb-sounding sentences, KevBot still knows it's me. I guess deep down I'm nothing more than a second-order Markov process.