How to combine different NLP features for machine learning?
I'm trying to do some KNN learning using different NLP features. For example, I want to use bag-of-words and local POS tags.
Separately, I have some idea of how to calculate similarity with a single feature, e.g., cosine similarity on counts (for bag-of-words vectors) or perhaps Hamming distance for POS tags.
However, I don't know how to combine the two. How do people in this area normally do this? Could anyone help me with that?
Thanks in advance.
I would use a simple linear combination of both features. Compare the bag-of-words vectors using cosine similarity and the POS tags using Hamming distance, each on its own, and then take the average of the two outcomes. Suppose the cosine comparison and the Hamming distance produce the following rankings:
rank score   cosine   Hamming
-------------------------------
1            red      blue
2            blue     yellow
3            yellow   orange
4            orange   red
Then the final ranking is as follows, with a lower score being better. Each label's total is the sum of its rank scores above; you can of course change the scoring to, e.g., an exponential scale if you want to put more emphasis on the higher-ranked labels:
label    total score
--------------------
blue     3
red      5
yellow   5
orange   7
So the output label would be blue. In this case the linear combination puts 50% weight on the cosine similarity output and 50% on the Hamming distance output. You can test different weights (e.g., 70% cosine, 30% Hamming) to find the best balance between the two measures.
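As a rough illustration of the weighted rank combination described above, here is a minimal Python sketch. The function name `combine_rankings` and the variable names are made up for this example, and it assumes every label appears in both rankings. With equal weights of 1 it reproduces the totals in the table above; the 0.9/0.1 split is just an arbitrary choice to show that the winner can change.

```python
from collections import defaultdict


def combine_rankings(rankings, weights):
    """Return {label: weighted sum of rank positions}; lower is better.

    Hypothetical helper: each ranking is a list of labels ordered from
    best (rank 1) to worst, and weights holds one weight per ranking.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    totals = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Rank positions start at 1, as in the tables above.
        for position, label in enumerate(ranking, start=1):
            totals[label] += weight * position
    return dict(totals)


# Rankings taken from the example above.
cosine_ranking = ["red", "blue", "yellow", "orange"]
hamming_ranking = ["blue", "yellow", "orange", "red"]

# Equal weights: plain sum of the two rank scores.
equal = combine_rankings([cosine_ranking, hamming_ranking], [1, 1])
best_equal = min(equal, key=equal.get)

# Heavier weight on the cosine ranking (arbitrary values for illustration).
skewed = combine_rankings([cosine_ranking, hamming_ranking], [0.9, 0.1])
best_skewed = min(skewed, key=skewed.get)

print(equal, best_equal)
print(skewed, best_skewed)
```

In practice you would probably pick the weights by evaluating the KNN classifier on held-out data for a handful of weight settings, rather than fixing them by hand.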