algorithm - How to cluster large datasets


I have a large dataset (500 million documents) and want to cluster the documents according to their content.

What is the best way to approach this? I tried k-means, but it doesn't seem suitable because it needs all documents at once in order to do its calculations.

Are there clustering algorithms suitable for larger datasets?

For reference: I'm using Elasticsearch to store the data.

According to Prof. J. Han, who teaches a cluster analysis in data mining class at Coursera, the common methods for clustering text data are:

  • a combination of k-means and agglomerative clustering (bottom-up)
  • topic modeling
  • co-clustering
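Topic modeling, the second method listed above, can be sketched in a few lines with scikit-learn. This is a toy illustration on four made-up sentences, not a recipe for 500 million documents; at that scale you would stream batches through `LatentDirichletAllocation.partial_fit` rather than fit in memory.

```python
# Minimal topic-modeling sketch: fit LDA on a tiny toy corpus and
# assign each document to its most probable topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)            # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)      # per-document topic mixture

# Each row is a topic distribution summing to 1; the argmax is a
# hard cluster assignment.
labels = doc_topics.argmax(axis=1)
```

The resulting `labels` group documents by dominant topic, which is effectively a soft clustering turned hard.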

But I can't tell how to apply these methods to a dataset this big. Good luck.

For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package (used in R) for text mining via document-term matrices.

The thesis contains case studies (ch. 8.1.4 and 9) on applying k-means and a support vector machine classifier to documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.

The process involves many intermediate steps of manual inspection.
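On the original concern that k-means "needs all documents at once": scikit-learn's `MiniBatchKMeans` updates centroids from small batches via `partial_fit`, and a `HashingVectorizer` is stateless, so batches could be streamed (for instance from an Elasticsearch scroll) without a global vocabulary pass. The batch contents below are toy data standing in for such a stream.

```python
# Sketch of out-of-core k-means: centroids are updated one small
# batch at a time, so the full corpus never has to fit in memory.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

batches = [
    ["cheap flights to rome", "hotel booking discount"],
    ["python list comprehension", "java null pointer exception"],
]
for batch in batches:                  # in practice: scroll over the index
    km.partial_fit(vec.transform(batch))

# New documents can be assigned to clusters without refitting.
labels = km.predict(vec.transform(["debugging java code"]))
```

Because the hashing vectorizer never builds a vocabulary, each batch can be vectorized independently, which is what makes the streaming loop possible.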

