How to cluster large datasets
I have a large dataset of documents (500 million) and want to cluster them according to their content.
What is the best way to approach this? I tried k-means, but it doesn't seem suitable because it needs all documents at once in order to do its calculations.
Are there clustering algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store the data.
According to Prof. J. Han, who teaches a cluster analysis in data mining class at Coursera, common methods for clustering text data are:
- a combination of k-means and agglomerative clustering (bottom-up)
- topic modeling
- co-clustering
But I can't tell you how to apply these methods to a dataset that big. Good luck.
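One way around k-means needing all documents at once is a mini-batch variant that updates the centroids incrementally. Here is a minimal sketch using scikit-learn's `MiniBatchKMeans` with a stateless `HashingVectorizer` (the toy document list and all parameter values are placeholders; a real setup would stream the corpus from Elasticsearch, e.g. via its scroll API, instead):

```python
# Sketch: streaming k-means over a document collection with
# scikit-learn's MiniBatchKMeans, so the full corpus never has
# to be in memory at once. The docs list is a stand-in for a
# real stream of documents from Elasticsearch.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

docs = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "stock markets fell sharply today",
    "the economy shows signs of recession",
    "cats and dogs are common pets",
    "investors worry about inflation",
]

# HashingVectorizer is stateless (no fitted vocabulary), which is
# what makes it usable on a stream of documents.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

# Feed the data in small batches; partial_fit never needs the full set.
batch_size = 3
for start in range(0, len(docs), batch_size):
    batch = docs[start:start + batch_size]
    km.partial_fit(vectorizer.transform(batch))

labels = km.predict(vectorizer.transform(docs))
print(labels)  # one cluster id per document
```

The same pattern scales to arbitrarily many documents, since each `partial_fit` call only ever sees one batch.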
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). This guy is the developer of the tm package (used in R) for text mining via document-term matrices.
The thesis contains case studies (chapters 8.1.4 and 9) on applying k-means and a support vector machine classifier to documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process involves lots of intermediate steps of manual inspection.
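For the topic-modeling route mentioned above, a hedged sketch using scikit-learn's `LatentDirichletAllocation` with online (mini-batch) updates, which likewise avoids holding the whole corpus in memory during training (the documents, topic count, and parameters are illustrative assumptions, not a recipe for 500M documents):

```python
# Sketch: topic modeling as a clustering approach, using online LDA.
# Note: CountVectorizer here builds its vocabulary from the full toy
# corpus; for true streaming you would use a fixed vocabulary or a
# hashing-based vectorizer instead.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "genes dna protein biology cell",
    "cell protein expression gene",
    "stocks market trading economy",
    "economy inflation market prices",
]

# LDA expects raw term counts, not tf-idf weights.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,            # number of topics (a guess; tune on real data)
    learning_method="online",  # mini-batch variational Bayes
    random_state=0,
)
doc_topics = lda.fit_transform(X)  # each row is a per-document topic mixture

# Treat the dominant topic of each document as its cluster label.
labels = doc_topics.argmax(axis=1)
print(labels)
```

Unlike hard clustering, each document gets a full topic distribution, so you can also keep the mixture weights instead of collapsing to a single label.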