apache spark - K-means clustering for mixed categorical and numeric values


Any help, please.

I want to provide a simple framework for identifying and cleaning duplicate data in a big data context. The preprocessing must be performed in real time (streaming).

We represent our database as file.csv; the file contains patient (medical) records without duplicates.

We want to cluster file.csv into 4 clusters using incremental parallel k-means clustering for mixed categorical and numeric values, so that each cluster contains similar records.

Every time a structured record arrives from the data stream, we must compare it with the representatives of the clusters (m1, m2, m3, m4). If the record is not a duplicate, it is saved to file.csv; if it is a duplicate, it is not saved to file.csv.
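A minimal sketch of the workflow described above, in plain Python rather than Spark. Everything specific here is an assumption for illustration and not part of the original question: the field names, the value ranges, the Gower-style mixed distance (range-scaled absolute difference for numeric fields, simple mismatch for categorical ones), the threshold of 0.05, and the choice to search for duplicates only inside the cluster whose representative is nearest.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[float, str]]

# Hypothetical schema: numeric fields with an assumed value range, plus categorical fields.
NUMERIC_RANGES = {"age": 100.0, "weight_kg": 200.0}
CATEGORICAL = ["sex", "blood_type"]


def mixed_distance(a: Record, b: Record) -> float:
    """Gower-style dissimilarity in [0, 1], averaged over all fields."""
    total = 0.0
    for field, value_range in NUMERIC_RANGES.items():
        # Numeric fields: absolute difference scaled by the assumed range, capped at 1.
        total += min(abs(float(a[field]) - float(b[field])) / value_range, 1.0)
    for field in CATEGORICAL:
        # Categorical fields: 0 if equal, 1 otherwise.
        total += 0.0 if a[field] == b[field] else 1.0
    return total / (len(NUMERIC_RANGES) + len(CATEGORICAL))


def nearest_cluster(record: Record, representatives: List[Record]) -> int:
    """Index of the representative (m1, m2, ...) closest to the record."""
    return min(
        range(len(representatives)),
        key=lambda i: mixed_distance(record, representatives[i]),
    )


def process(record: Record, representatives: List[Record],
            clusters: List[List[Record]], threshold: float = 0.05) -> bool:
    """Store the record unless a near-identical one exists in its nearest cluster.

    Returns True if the record was stored, False if it was treated as a duplicate.
    """
    idx = nearest_cluster(record, representatives)
    for existing in clusters[idx]:
        if mixed_distance(record, existing) <= threshold:
            return False
    clusters[idx].append(record)
    return True
```

In this sketch the representatives only narrow down which stored records an incoming record is compared against; how the representatives themselves are computed and updated is left open.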

1) Which is the more efficient tool in this case, Hadoop or Spark?
2) How can I implement clustering for mixed categorical and numeric values with MLlib (Spark) or Mahout (Hadoop)?
3) Does incremental clustering mean the same thing as streaming clustering?

As noted dozens of times here on SO/CV:

k-means computes means

Unless you can define a least-squares mean for categorical data (one that is still useful in practice), using k-means on such data doesn't work.
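For reference, the standard k-means objective makes the dependence on means explicit. The notation below (clusters $C_j$, centers $\mu_j$, points $x_i$) is introduced here for illustration and does not come from the original post.

```latex
% k-means minimizes the within-cluster sum of squared Euclidean distances:
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

% For a fixed assignment, the minimizing center is the arithmetic mean:
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Both expressions need subtraction, squaring, and averaging of the $x_i$, and none of these operations is defined for a purely categorical attribute.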

Sure, you can use one-hot encoding and similar hacks, but they make the results next to meaningless. "Least-squares" is not a meaningful objective on binary input data.
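A small sketch of why the one-hot hack is awkward: the mean of one-hot vectors is generally a vector of fractions that does not correspond to any category. The field (`CATEGORIES`) and the sample values are made up for illustration; the mode is shown only as a contrast, since it is what mode-based methods use in place of the mean.

```python
from collections import Counter

# Hypothetical categorical field with four possible values.
CATEGORIES = ["A", "B", "AB", "O"]


def one_hot(value):
    """Encode a category as a 0/1 vector over CATEGORIES."""
    return [1.0 if value == c else 0.0 for c in CATEGORIES]


def centroid(vectors):
    """Component-wise arithmetic mean, as k-means would compute it."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


values = ["A", "A", "O", "B"]
mean_vector = centroid([one_hot(v) for v in values])

# The centroid is a mix of fractions, not the encoding of any real category.
is_valid_category = mean_vector in [one_hot(c) for c in CATEGORIES]

# The mode, by contrast, is always an actual category.
mode_value = Counter(values).most_common(1)[0][0]

print(mean_vector, is_valid_category, mode_value)
```

The squared distance from a record to such a fractional centroid has no obvious interpretation in terms of the original categories.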

KMeans dealing with categorical variable

Why am I not getting points around clusters in this kmeans implementation?

https://stats.stackexchange.com/questions/58910/kmeans-whether-to-standardise-can-you-use-categorical-variables-is-cluster-3

