php - Using k-means for document clustering, should clustering be on cosine similarity or on term vectors?
Apologies if the answer is obvious; please be kind, this is my first time on here :-)
I would gratefully appreciate it if someone could give me a steer on the appropriate input data structure for k-means. I am working on a master's dissertation in which I am proposing a new TF-IDF term weighting approach specific to my domain. I want to use k-means to cluster the results and then apply a number of internal and external evaluation criteria to see if my new term weighting method has any merit.
My steps so far (implemented in PHP), all working:
Step 1: Read in the document collection
Step 2: Clean the document collection, feature extraction, feature selection
Step 3: Term frequency (TF)
Step 4: Inverse document frequency (IDF)
Step 5: TF * IDF
Step 6: Normalise TF-IDF to fixed-length vectors
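For readers who want to follow steps 3 to 6 concretely, here is a minimal sketch in Python rather than the PHP mentioned above. It assumes the textbook weighting tf * log(N / df) and whitespace tokenisation as a stand-in for step 2; it is not the new weighting scheme being proposed, and the function name `tfidf_vectors` is made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, rows); rows[i] is the L2-normalised TF-IDF vector of docs[i].

    Assumption: plain tf * log(N / df) weighting and whitespace tokenisation.
    """
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = [tf[term] * idf[term] for term in vocab]
        # Step 6: scale each document vector to unit (L2) length.
        norm = math.sqrt(sum(x * x for x in vec))
        rows.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, rows


vocab, rows = tfidf_vectors(["cat sat mat", "cat ate fish", "dog ate bone"])
print(len(rows), "documents,", len(vocab), "terms per vector")
```

Every document ends up as one row with one column per vocabulary term, so the whole collection is a documents-by-terms matrix.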
Where I am struggling:
Step 7: Vector space model – cosine similarity
The only examples I can find compare an input query to each document and find the similarity. As there is no input query (this is not an information retrieval system), do I compare every single document in the corpus with every other document in the corpus (every pair of documents)? I cannot find any example of cosine similarity applied to a full document collection rather than a single example/query compared to the collection.
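To make the "every pair of documents" idea concrete, here is a rough Python sketch of an all-pairs cosine similarity matrix. The function names are hypothetical and the three toy vectors stand in for real TF-IDF rows. It only illustrates what such a matrix looks like; it does not imply that this matrix is the right input for k-means.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_cosine(vectors):
    """Return the symmetric n x n matrix of cosine similarities."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Cosine similarity is symmetric, so fill both halves at once.
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in pairwise_cosine(toy_vectors):
    print([round(value, 3) for value in row])
```

For n documents this takes n * (n + 1) / 2 comparisons and n * n storage, which grows quickly with collection size.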
Step 8: K-means
I am struggling here to understand whether the input to k-means should be a matrix of the cosine similarity score of every document in the collection against every other document (a matrix of cosine similarities), or whether k-means is supposed to be applied to the term vector model. If the latter, every example I can find of k-means is quite basic and plots only singular terms. How do I handle the fact that there are multiple terms in my document collection?
Cosine similarity and k-means are implied as the solution to document clustering in so many examples that I must be missing something very obvious.
If anyone could give me a steer I would be forever grateful.
Thanks,
Claire
K-means cannot operate on a similarity matrix.
That is because k-means computes point-to-mean distances, not pairwise distances.
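A small Python sketch of one k-means iteration may make the point-to-mean idea clearer. It assumes squared Euclidean distance and hand-picked starting centers, and the names are illustrative only. Each center is an average of document vectors, so it lives in the same term space as the documents. That is why the input is the documents-by-terms matrix, with one dimension per term, rather than a documents-by-documents similarity matrix.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_step(points, centers):
    """Run one assignment + update step; return (labels, new_centers)."""
    # Assignment: each point goes to its nearest center (point-to-mean distance).
    labels = [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]
    # Update: each center becomes the mean of the points assigned to it.
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(list(old_center))  # keep an empty cluster's center
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return labels, new_centers


# Four toy "documents" over a three-term vocabulary.
points = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels, centers = kmeans_step(points, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(labels)
print(centers)
```

Nothing in either step needs a document-to-document distance, which is why a precomputed similarity matrix has no place in the loop.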
You need an implementation of spherical k-means if you want to use cosine distance: at every iteration, the centers should be L2 normalized.
If I'm not mistaken, it should be equivalent to run k-means with cosine similarity and normalize the centers to unit length at the end. Regular spherical k-means may be faster, because it can exploit the data normalization to simplify the cosine distance to a dot product.
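As a rough illustration, spherical k-means could be sketched in Python as below. The function names and the `init_indices` parameter (which rows to use as starting centers) are made up for the example, and a real implementation would want better initialisation and handling of empty clusters. For unit-length vectors x and c, ||x - c||^2 = 2 - 2 (x . c), so picking the center with the largest dot product is the same as picking the one with the smallest cosine distance.

```python
import math


def normalise(v):
    """Scale v to unit L2 length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(rows, init_indices, max_iter=100):
    """Cluster document vectors by cosine similarity; return (labels, centers)."""
    points = [normalise(r) for r in rows]
    centers = [list(points[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment: on unit vectors, cosine similarity is just the dot product.
        new_labels = [
            max(range(len(centers)), key=lambda k: dot(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update: sum the members, then L2-normalise the center every iteration.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = normalise([sum(col) for col in zip(*members)])
    return labels, centers


toy_rows = [[2.0, 0.0, 0.0], [3.0, 1.0, 0.0], [0.0, 0.0, 5.0], [0.0, 1.0, 4.0]]
labels, centers = spherical_kmeans(toy_rows, init_indices=[0, 2])
print(labels)
```

Normalising the sum of the members gives the same direction as normalising their mean, so the division by cluster size can be skipped.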
You may want to reconsider using PHP. It is one of the worst possible choices for this type of programming task. It's fine for an interactive web page, but it doesn't shine at data analysis at all.