How to cluster large datasets
I have a large dataset of documents (500 million) and want to cluster them according to their content.
What is the best way to approach this? I tried k-means, but it doesn't seem suitable because it needs all documents at once in order to do its calculations.
Are there clustering algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store the data.
According to Prof. J. Han, who teaches a cluster analysis in data mining class at Coursera, common methods for clustering text data are:
- a combination of k-means and agglomerative clustering (bottom-up)
- topic modeling
- co-clustering
But I can't tell you how to apply these methods to a dataset that big. Good luck.
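One way around k-means needing all documents at once is a mini-batch variant that updates the centroids incrementally. Here is a minimal sketch using scikit-learn's `MiniBatchKMeans` with a stateless `HashingVectorizer` (the toy document list and all parameter values are placeholders; a real setup would stream the corpus from Elasticsearch, e.g. via its scroll API, instead):

```python
# Sketch: streaming k-means over a document collection with
# scikit-learn's MiniBatchKMeans, so the full corpus never has
# to be in memory at once. The docs list is a stand-in for a
# real stream of documents from Elasticsearch.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

docs = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "stock markets fell sharply today",
    "the economy shows signs of recession",
    "cats and dogs are common pets",
    "investors worry about inflation",
]

# HashingVectorizer is stateless (no fitted vocabulary), which is
# what makes it usable on a stream of documents.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

# Feed the data in small batches; partial_fit never needs the full set.
batch_size = 3
for start in range(0, len(docs), batch_size):
    batch = docs[start:start + batch_size]
    km.partial_fit(vectorizer.transform(batch))

labels = km.predict(vectorizer.transform(docs))
print(labels)  # one cluster id per document
```

The same pattern scales to arbitrarily many documents, since each `partial_fit` call only ever sees one batch.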
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). This guy is the developer of the tm package (used in R) for text mining via document-term matrices.
The thesis contains case studies (chapters 8.1.4 and 9) on applying k-means and a support vector machine classifier to documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process involves lots of intermediate steps of manual inspection.
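For the topic-modeling route mentioned above, a hedged sketch using scikit-learn's `LatentDirichletAllocation` with online (mini-batch) updates, which likewise avoids holding the whole corpus in memory during training (the documents, topic count, and parameters are illustrative assumptions, not a recipe for 500M documents):

```python
# Sketch: topic modeling as a clustering approach, using online LDA.
# Note: CountVectorizer here builds its vocabulary from the full toy
# corpus; for true streaming you would use a fixed vocabulary or a
# hashing-based vectorizer instead.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "genes dna protein biology cell",
    "cell protein expression gene",
    "stocks market trading economy",
    "economy inflation market prices",
]

# LDA expects raw term counts, not tf-idf weights.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,            # number of topics (a guess; tune on real data)
    learning_method="online",  # mini-batch variational Bayes
    random_state=0,
)
doc_topics = lda.fit_transform(X)  # each row is a per-document topic mixture

# Treat the dominant topic of each document as its cluster label.
labels = doc_topics.argmax(axis=1)
print(labels)
```

Unlike hard clustering, each document gets a full topic distribution, so you can also keep the mixture weights instead of collapsing to a single label.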