Is it possible to supplement Naive Bayes text classification algorithm with author information?
I am working on a text classification project where I am trying to assign topic classifications to speeches from the Congressional Record.
Using topic codes from the Congressional Bills Project (http://congressionalbills.org/), I've tagged speeches that mention a specific bill as belonging to the topic of the bill. I'm using this as my "training set" for the model.
I have a "vanilla" Naive Bayes classifier working well-enough, but I keep feeling like I could get better accuracy out of the algorithm by incorporating information about the member of Congress who is making the speech (e.g. certain members are much more likely to talk about Foreign Policy than others).
One possibility would be to replace the prior in the NB classifier (usually defined as the proportion of documents in the training set that have the given classification) with speaker's observed prior speeches.
Is this worth pursuing? Are there existing approaches that have followed this same kind of logic? I'm a little bit familiar with the "author-topic models" that come out of Latent Dirichlet Allocation models, but I like the simplicity of the NB model.
There is no need to modify anything, simply add this information to your Naive Bayes and it will work just fine.
And as it was previously mentioned in the comment - do not change any priors - prior probability is P(class), this has nothing to do with actual features.
Just add to your computations another feature corresponding to the authorship, e.g. "author:AUTHOR" and train Naive Bayes as usual, ie. compute P(class|author:AUTHOR) for each class and AUTHOR and use it later on in your classification process.If your current representation is a bag of words, it is sufficient to add a "artificial" word of form "author:AUTHOR" to it.
One other option would be to train independent classifier for each AUTHOR, which would capture person-specific type of speech, for example - one uses lots of words "environment" only when talking about "nature", while other simply likes to add this word in each speach "Oh, in our local environment of ...". Independent NBs would capture these kind of phenomena.