Arranging documents in a grid according to content similarity
How can I arrange documents in a space (say, a grid with several cells) so that the position of each document carries information about how similar it is to the other documents? I looked into K-means clustering, but it is computationally intensive when the data set is large. I'm looking for something like hashing the contents of each document, so that the documents fit into a large space, similar documents get similar hashes, and the distance between them is small. It would then be easy to find the documents similar to a given document without much extra work.
The result could look something like the picture below. In this case, music documents are near film documents but far from documents related to computers. The box can be considered the whole world of documents.
Any help would be greatly appreciated.
One way to introduce a distance or similarity measure between documents is:
first encode your documents as vectors, e.g. using TF-IDF (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
the scalar product of the vectors of two documents gives you a measure of their similarity: the larger the value, the more similar the documents. If you normalize the vectors to unit length first, long documents do not dominate, and the scalar product is the cosine similarity.
Applying MDS (http://en.wikipedia.org/wiki/Multidimensional_scaling) to these similarities, converted to dissimilarities where the MDS variant expects distances, should then let you visualize the documents in a two-dimensional plot.
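The steps above can be sketched in plain Python as follows. This is only an illustration under several assumptions of mine, not a reference implementation: tokens are obtained by naive whitespace splitting, the IDF is a smoothed variant, and the MDS step is classical (Torgerson) MDS computed by power iteration, which for unit-length vectors amounts to double-centering the similarity matrix. The function names `tfidf_vectors`, `similarity_matrix` and `classical_mds` are made up for this example. For a real corpus you would more likely use a library implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]  # naive tokeniser (assumption)
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (assumption): log((1 + n) / (1 + df)) + 1
        vec = {t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_matrix(vectors):
    """Scalar products of all pairs; cosine similarity, since vectors have unit length."""
    return [[sum(w * v.get(t, 0.0) for t, w in u.items()) for v in vectors]
            for u in vectors]


def classical_mds(sim, dims=2, iters=500):
    """Embed items in `dims` dimensions from a symmetric similarity (Gram) matrix."""
    n = len(sim)
    row_mean = [sum(row) / n for row in sim]
    total = sum(row_mean) / n
    # Double centering: B = J S J, where J = I - (1/n) * ones
    b = [[sim[i][j] - row_mean[i] - row_mean[j] + total for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [math.sin(i + 1 + d) for i in range(n)]  # deterministic start vector
        eigval = 0.0
        for _ in range(iters):  # power iteration for the largest eigenpair
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
            eigval = norm
        for i in range(n):
            coords[i][d] = v[i] * math.sqrt(eigval)
        # Deflate so the next pass finds the next eigenpair
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


docs = [
    "guitar music song band concert",
    "music band album song guitar",
    "computer software code compiler program",
    "computer program code software bug",
]
sim = similarity_matrix(tfidf_vectors(docs))
for doc, (x, y) in zip(docs, classical_mds(sim)):
    print(f"{x:+.3f} {y:+.3f}  {doc}")
```

In this toy corpus the two music documents should land close together and far from the two computer documents. Note that the full similarity matrix grows quadratically with the number of documents, so for a large collection this exact approach runs into the same scaling concern raised in the question.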