Is it possible to supplement Naive Bayes text classification algorithm with author information?

I am working on a text classification project where I am trying to assign topic classifications to speeches from the Congressional Record.

Using topic codes from the Congressional Bills Project (http://congressionalbills.org/), I've tagged speeches that mention a specific bill as belonging to the topic of the bill. I'm using this as my "training set" for the model.

I have a "vanilla" Naive Bayes classifier working well-enough, but I keep feeling like I could get better accuracy out of the algorithm by incorporating information about the member of Congress who is making the speech (e.g. certain members are much more likely to talk about Foreign Policy than others).

One possibility would be to replace the prior in the NB classifier (usually defined as the proportion of documents in the training set that have the given classification) with speaker's observed prior speeches.

Is this worth pursuing? Are there existing approaches that have followed this same kind of logic? I'm a little bit familiar with the "author-topic models" that come out of Latent Dirichlet Allocation models, but I like the simplicity of the NB model.

Answers


There is no need to modify anything, simply add this information to your Naive Bayes and it will work just fine.

And as it was previously mentioned in the comment - do not change any priors - prior probability is P(class), this has nothing to do with actual features.

Just add to your computations another feature corresponding to the authorship, e.g. "author:AUTHOR" and train Naive Bayes as usual, ie. compute P(class|author:AUTHOR) for each class and AUTHOR and use it later on in your classification process.If your current representation is a bag of words, it is sufficient to add a "artificial" word of form "author:AUTHOR" to it.

One other option would be to train independent classifier for each AUTHOR, which would capture person-specific type of speech, for example - one uses lots of words "environment" only when talking about "nature", while other simply likes to add this word in each speach "Oh, in our local environment of ...". Independent NBs would capture these kind of phenomena.


Need Your Help

Merge MySql table and a external XML to have them ORDER BY timestamp

php mysql xml

From an external party we receive a constant updating XML newsfeed. We also have our own news in a MySql database. What i want to do is merge them both together so that i can sort them by timestamp...

Interposing on OSX, my function not being called

c osx interposing

So, I'm messing around with some interposing code on OSX (gcc 4.2.1) and I'm trying to get the following to work:

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.