Apache Spark - k-means clustering for mixed categorical and numeric values
Can anyone help, please?
I want to provide a simple framework for identifying and cleaning duplicate data in the context of big data. The preprocessing must be performed in real time (streaming).
We represent our database as file.csv; the file contains patient (medical) records without duplication.
We want to cluster file.csv into 4 clusters using incremental parallel k-means clustering for mixed categorical and numeric values, so that each cluster contains similar records.
Every time a structured record arrives on the data stream, we must compare it with the representatives of the clusters (m1, m2, m3, m4). If the record is not a duplicate, it is saved in file.csv; if it is a duplicate, it is not saved in file.csv.
1) Which is the more efficient tool in this case, Hadoop or Spark?
2) How can I implement clustering for mixed categorical and numeric values with MLlib (Spark) or Mahout (Hadoop)?
3) Does incremental clustering mean the same thing as streaming clustering?
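The per-record check described above (compare each incoming record with the cluster representatives, then store it only if it is not a duplicate) could be sketched roughly as follows in plain Python. This is only an illustration under stated assumptions, not Spark or Mahout code: the field names, the `Cluster` class, the `process` function, the `gamma` weight and the `threshold` are all hypothetical, and the dissimilarity (squared numeric distance plus a count of categorical mismatches) is a swapped-in choice rather than anything k-means itself defines.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a patient record (assumption, not from the question).
NUMERIC = ("age", "weight")
CATEGORICAL = ("sex", "blood_type")


def dissimilarity(a, b, gamma=1.0):
    """Squared Euclidean distance on numeric fields plus gamma times the
    number of mismatching categorical fields (an assumed mixed measure)."""
    numeric = sum((a[k] - b[k]) ** 2 for k in NUMERIC)
    mismatches = sum(a[k] != b[k] for k in CATEGORICAL)
    return numeric + gamma * mismatches


@dataclass
class Cluster:
    representative: dict                         # m1, m2, ... in the question
    members: list = field(default_factory=list)  # records already stored


def process(record, clusters, threshold=0.0):
    """Store the record and return True if it is new; return False if a
    stored record in the nearest cluster is within `threshold` of it."""
    nearest = min(clusters, key=lambda c: dissimilarity(record, c.representative))
    if any(dissimilarity(record, m) <= threshold for m in nearest.members):
        return False
    nearest.members.append(record)
    return True
```

With `threshold=0.0` only exact duplicates are rejected, and an exact duplicate always lands in the same nearest cluster as the original. With a positive threshold, two near-duplicates can fall on opposite sides of a cluster boundary, so this sketch may miss them; numeric fields would also need scaling before their distance is comparable to the mismatch count.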
As noted dozens of times here on SO/CV:
k-means computes means
Unless you can define a least-squares mean for categorical data (one that is still useful in practice), using k-means on such data doesn't work.
Sure, you can use one-hot encoding and similar hacks, but they make the results next to meaningless. "Least-squares" is not a meaningful objective on binary input data.
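A small plain-Python sketch may make the one-hot point concrete. The category list and the sample values are made up for illustration. The mean of one-hot vectors is a vector of category frequencies, which is not itself a category, whereas the mode (the representative used by k-modes-style methods) stays inside the category set.

```python
from collections import Counter

CATEGORIES = ("A", "B", "O")  # hypothetical blood-type values


def one_hot(value):
    """Encode a category as a 0/1 vector over CATEGORIES."""
    return [1.0 if value == c else 0.0 for c in CATEGORIES]


def mean_vector(vectors):
    """Component-wise mean, which is what a k-means centroid update computes."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mode(values):
    """Most frequent category in the list."""
    return Counter(values).most_common(1)[0][0]


values = ["A", "A", "B", "O"]
centroid = mean_vector([one_hot(v) for v in values])
print(centroid)      # [0.5, 0.25, 0.25] -> not a valid one-hot vector
print(mode(values))  # A
```

The centroid `[0.5, 0.25, 0.25]` corresponds to no actual blood type, so squared distances to it mostly measure how common each category is rather than whether two records are similar.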