How to cluster large datasets


I have a large dataset of documents (500 million) and want to cluster them according to their content.

What is the best way to approach this? I tried using k-means, but it does not seem suitable because it needs all documents at once in order to do the calculations.

Are there clustering algorithms suitable for larger datasets?

For reference: I am using Elasticsearch to store the data.

According to Prof. J. Han, who teaches cluster analysis in the Data Mining class at Coursera, the common methods for clustering text data are:

  • a combination of k-means and agglomerative clustering (bottom-up)
  • topic modeling
  • co-clustering
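On the memory concern from the question: one common workaround is the mini-batch variant of k-means, which updates centroids from small batches instead of holding the whole corpus at once. Below is a minimal sketch using scikit-learn's `HashingVectorizer` and `MiniBatchKMeans`; the `document_batches` generator and its toy documents are hypothetical stand-ins for paging through a real index (e.g. with Elasticsearch's scroll API), and the cluster count is made up for illustration.

```python
# Sketch: out-of-core k-means on document batches, assuming documents
# are fetched page by page from some store (e.g. an Elasticsearch index).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

def document_batches():
    # Hypothetical stand-in for paging through the real document store.
    yield ["the cat sat on the mat", "dogs and cats are pets"]
    yield ["stock markets fell sharply", "investors sold shares today"]

# HashingVectorizer is stateless, so it never needs the whole corpus
# in memory to build a vocabulary.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
km = MiniBatchKMeans(n_clusters=2, random_state=0)

for batch in document_batches():
    X = vectorizer.transform(batch)
    km.partial_fit(X)  # update centroids from this batch only

# Assign a new document to the nearest centroid.
labels = km.predict(vectorizer.transform(["my cat chased the dog"]))
```

Because the vectorizer is stateless and `partial_fit` only needs one batch at a time, memory use stays bounded regardless of corpus size.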

But I can't tell how to apply these on a dataset that big.

For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package (used in R) for text mining via document-term matrices.

The thesis contains case studies (ch. 8.1.4 and 9) on applying k-means and a support vector machine classifier to documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.

The process involves lots of intermediate steps of manual inspection.

