Arranging documents in a grid according to content similarity

How can documents be arranged in a space (say, multiple grids) so that the position of each document carries information about how similar it is to other documents? I looked into K-means clustering, but it is computationally intensive when the data is large. I'm looking for something like hashing the contents of each document, so that the documents fit in a large space, similar documents have similar hashes, and the distance between them is small. It would then be easy to find documents similar to a given document without much extra work.

The result could be something similar to the picture below. In this case music documents are near film documents but far from documents related to computers. The box can be considered as the whole world of documents.

Any help would be greatly appreciated.

Thanks

jvc007

Answers


One way to introduce a distance or similarity measure between documents is:

  • first encode your documents as vectors, e.g. using TF-IDF (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

  • the scalar product of the vectors for two documents then gives you a measure of their similarity: the larger the value, the more similar the documents. If the vectors are normalized to unit length, this is the cosine similarity.
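The two steps above can be sketched in plain Python as follows. This is only a minimal illustration, not a reference implementation: the whitespace tokenizer, the smoothed IDF formula, the unit-length normalization, and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a term -> weight dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (assumption), so terms found in every document keep a positive weight.
        vec = {t: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def dot(u, v):
    """Scalar product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Made-up sample documents, purely for illustration.
docs = [
    "guitar music album concert",
    "film soundtrack music score",
    "computer processor memory software",
]
vecs = tfidf_vectors(docs)
sim_music_film = dot(vecs[0], vecs[1])      # the two documents share the term "music"
sim_music_computer = dot(vecs[0], vecs[2])  # no shared terms
```

Because the vectors are normalized, each similarity lies between 0 and 1, and a document compared with itself scores 1.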

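Using MDS (http://en.wikipedia.org/wiki/Multidimensional_scaling) on these similarities should help to visualize the documents in a two-dimensional plot. MDS is usually formulated in terms of distances, so the similarities may first need to be converted into dissimilarities, for example 1 minus the similarity when the vectors are normalized.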
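As a rough sketch of what MDS does with such values, the snippet below implements classical (Torgerson) MDS in plain Python, using power iteration in place of a library eigen-solver. The similarity matrix is invented for illustration, and converting similarity to distance as 1 minus the similarity is an assumption rather than something the answer prescribes. The input matrix is assumed to be symmetric.

```python
import math

def classical_mds(dist, dims=2, iters=200):
    """Embed n points in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Deterministic, normalized start vector for the power iteration.
        v = [math.sin(i + 1 + k) for i in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue belonging to the unit vector v.
        eigval = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        if eigval <= 1e-12:
            break  # no further positive eigenvalue: remaining coordinates stay 0
        for i in range(n):
            coords[i][k] = math.sqrt(eigval) * v[i]
        # Deflate so that the next pass finds the next eigenvector.
        b = [[b[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Invented similarities for three documents: music, film, computers.
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
distance = [[1.0 - s for s in row] for row in similarity]  # assumed conversion
points = classical_mds(distance)  # one [x, y] pair per document
```

In the resulting layout the music and film documents land close together and the computer document lands far from both. For large collections a library implementation of MDS would be the more practical choice, since this sketch works on the full n x n matrix.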

