Algorithm for determining which words make a phrase popular
Suppose I had a list of slogans (short, multi-word phrases), and people had voted for the ones they liked best, and I wanted to assess which words, if any, made some slogans more popular than others. What would be the best way to achieve this? My first thought was just to find all the unique words in the set of slogans and score each one as the average number of votes of all the slogans that contain said word, but frequency should also come into play in some fashion, I think, so that the following should be true:
- If Word A occurs in only the slogan that got the most votes and Word B only occurs in the slogan that got the second-most, Word A is more "popularity-generating"
- However, if Word A occurs only in the top-ranked slogan and Word B occurs in both the second- and third-ranked slogans, Word B should win, since it pushed more slogans to the top.
- However, a single occurrence of Word A in the top slogan should still trump three appearances of Word B in other slogans if they're, say, in the middle, or bottom half, of the pack (that is to say, there needs to be a balance of vote-getting and frequency in scoring).
I also want to eliminate words that are generally common (e.g., "the" or "of"). This is sort of related to questions about identifying trending words that have been asked in the past, but different because change over time isn't a factor. I'd be happy just to be pointed in the right direction about this as far as literature is concerned, but I'm not really sure what to look for. Is this a class of problem that other people deal with?
This is a machine learning question. You are trying to learn a model from supervised data. To do this, you could run a simple algorithm that's like Perceptron or SampleRank (pdf):
First you define features that apply to the words in a slogan. Features can be shared across words, e.g. features of the word "peace" could be:
- "starts with p",
- "ends in 's'-sound",
The first feature "peace" is a unique feature that fires only on "peace", whereas the other features can also fire on other words.
Each feature has a weight (higher is better). So you have a feature vector and a weight vector. This will enable you to assign a weight (score) to any slogan (just the sum of all weighted features that fire on the words in the slogan). All weights are initialized to 0.0.
Now you start training:
You loop over all pairs of slogans. For each pair you know the true ranking (according to the votes you already have). Then you compute the ranking according to the features and their current weights. If the true ranking and the ranking according to your current feature weights (i.e., according to your current model) is the same you just move on to the next pair. If your model assigned the wrong ranking you correct the feature weights: You add 1.0 to the weights of the features that fire on the better slogan (the one that's better according to the people's vote) and subtract 1.0 from the weights of the features that fire on the worse slogan (its score was obviously too high, so you're lowering it now). These weight updates will affect the scores that your model assigns to the next pairs, and so on.
You run this loop several times, until your model got most of the pairs right (or some other convergence criterion).
Typically, you don't really add or subtract 1.0, but eta times 1.0, where eta is the learning rate, which you can set experimentally. Typically it is higher at the beginning of training and is gradually lowered during training, as your weights are moving in the right direction. (See also stochastic gradient descent.) To get started, you could just set it to 0.1 as a constant.
This procedure takes care of the stop words ("the", "of", ...) as they should occur equally often in good and in bad slogans (and if they really don't then you learn that too).
After training, you can compute the score for each word according to the learned feature weights.