# Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, each represented by a set of extracted features. Clustering algorithms generally require all input data to have the same dimensionality (the same number of features); that is, the input X is an n*d matrix of n data points, each of which has d features. In my case, I have already extracted features from my data, but the number of extracted features most likely differs from one data point to the next, so the points in my dataset X do not all have the same number of features. Is there any way to adapt them so that I can cluster them with common clustering algorithms that require data of the same dimensionality?

Thanks

## Answers

Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.

1. Reduce the dimensionality of the input data set using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA. Then apply traditional clustering to the dense, lower-dimensional outputs.

2. Directly apply a sparse clustering algorithm, such as sparse k-means. You can probably find a PDF of the paper online if you look hard enough (try scholar.google.com).

*[Updated after problem clarification]*

In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.

**Suggested solution:**

First classify the connected components into, say, 1000(*) distinct component classes. Then classify the words by the component classes they contain (the sparse problem described above).

(*) Note: the exact number of component classes you choose doesn't really matter, as long as it is high enough, because the MDS analysis will reduce them to the essential 'orthogonal' classes.

There are also clustering algorithms, such as DBSCAN, that do not care how your data is represented. All DBSCAN needs is a *distance function*. So if you can specify a distance function for your features, you can use DBSCAN (or OPTICS, an extension of DBSCAN that doesn't need the epsilon parameter).
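To illustrate that only a distance function is required, here is a minimal sketch of DBSCAN in plain Python. It assumes each data point is a *set* of feature labels of arbitrary size and uses Jaccard distance to compare them. The toy data, the `eps` and `min_pts` values, and the choice of Jaccard distance are illustrative assumptions, not something taken from the question.

```python
NOISE = -1

def jaccard_distance(a, b):
    """Distance between two feature sets of any size (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def dbscan(items, dist, eps, min_pts):
    """Return one label per item: a cluster id (0, 1, ...) or NOISE."""
    n = len(items)
    labels = [None] * n

    def neighbors(i):
        # Brute-force neighbourhood query; the point itself is included.
        return [j for j in range(n) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(j_nbrs)
    return labels

# Hypothetical data points with 1 to 4 features each.
items = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b"},
    {"x", "y"},
    {"x", "y", "z"},
    {"x", "y", "z", "w"},
    {"q"},
]

labels = dbscan(items, jaccard_distance, eps=0.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The data points never have to be padded to a common length; the distance function is the only place where their representation matters.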

So the key question here is **how you want to compare your features**. This doesn't have much to do with clustering, and is highly domain dependent. If your features are e.g. word occurrences, *cosine distance* is a good choice (using 0s for non-present features). But if you have, say, a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features to each other efficiently, because the features have no order (so one cannot simply compare the first keypoint of one picture with the first keypoint of another, and so on).

A possible approach here is to derive another - uniform - set of features. Typically, **bag of words** features are used in such a situation. For images, this is also known as **visual words**. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
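The bag-of-words idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the 2-D "descriptors", the vocabulary size `k=3`, and the simple k-means with farthest-first initialisation are all made up for the example. Real SIFT descriptors are 128-dimensional and would normally be clustered with a library implementation.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; assumes at least k distinct points."""
    # Farthest-first initialisation: deterministic, good enough for a toy example.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def bag_of_words(descriptors, vocabulary):
    """Fixed-length histogram: how often each 'word' occurs in one object."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(d, vocabulary)] += 1
    return hist

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

# Three hypothetical objects with 3, 4 and 2 sub-features respectively.
objects = [
    [(0.0, 0.0), (0.5, 0.2), (10.0, 0.1)],
    [(0.1, 0.3), (0.2, 0.1), (9.8, 0.0), (0.3, 0.3)],
    [(0.0, 10.0), (0.2, 9.9)],
]

# Step 1: cluster all sub-features together to obtain the vocabulary.
all_descriptors = [d for obj in objects for d in obj]
vocabulary = kmeans(all_descriptors, k=3)

# Step 2: describe every object by its word counts (same length for all).
histograms = [bag_of_words(obj, vocabulary) for obj in objects]
print(histograms)  # [[2, 1, 0], [3, 1, 0], [0, 0, 2]]

# Step 3: compare objects with cosine distance on the histograms.
print(cosine_distance(histograms[0], histograms[1]))  # small: similar word mix
print(cosine_distance(histograms[0], histograms[2]))  # 1.0: no words in common
```

Every object ends up as a vector of length `k`, regardless of how many sub-features it started with, so any standard clustering algorithm can be applied to the histograms.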

I see two options here:

- Restrict yourself to those features for which all your data-points have a value.
- See if you can generate sensible default values for missing features.

However, if possible, you should probably resample all your data points so that each of them has a value for every feature.
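The two options listed above can be sketched as follows. The sketch assumes the features are *named*, so that each data point is a dict from feature name to value. The feature names and values are hypothetical, and using the mean of the observed values as the "sensible default" is just one possible choice.

```python
# Hypothetical data points; not every feature is present in every point.
samples = [
    {"height": 1.0, "width": 2.0, "area": 2.0},
    {"height": 3.0, "width": 1.0},
    {"height": 2.0, "width": 2.0, "slant": 0.5},
]

def common_features(samples):
    """Option 1: keep only the features that every data point has."""
    keys = set(samples[0])
    for s in samples[1:]:
        keys &= set(s)
    return sorted(keys)

def all_features(samples):
    """Every feature that occurs in at least one data point."""
    return sorted(set().union(*samples))

def observed_means(samples):
    """Option 2 helper: per-feature mean over the points that have the feature."""
    totals, counts = {}, {}
    for s in samples:
        for name, value in s.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}

def to_matrix(samples, features, defaults=None):
    """Build an n*d matrix, filling missing entries from `defaults`."""
    defaults = defaults or {}
    return [[s.get(f, defaults.get(f, 0.0)) for f in features] for s in samples]

# Option 1: restrict to the shared features.
X_common = to_matrix(samples, common_features(samples))
print(X_common)  # [[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]]

# Option 2: keep all features and fill the gaps with default values.
X_filled = to_matrix(samples, all_features(samples), observed_means(samples))
print(X_filled)
# [[2.0, 1.0, 0.5, 2.0], [2.0, 3.0, 0.5, 1.0], [2.0, 2.0, 0.5, 2.0]]
```

Either way the result is a regular n*d matrix. Option 1 throws information away, while option 2 invents values, so which one is acceptable depends on how many features are missing and on what they mean.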