Link to a survey.
1. Classical
A. Partition
k-means, k-medoids, PAM, CLARA, CLARANS
not suitable for non-convex data
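A minimal sketch of the partition idea (Lloyd's k-means in plain Python; the deterministic "first k points" seeding and all names are my own, not from the survey). The assignment/update loop below also shows why these methods fail on non-convex data: each cluster is effectively a Voronoi cell around its center.

```python
def kmeans(points, k, iters=50):
    """Minimal Lloyd's k-means on 2-D tuples; first k points seed the centers."""
    centers = list(points[:k])
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0]-centers[c][0])**2 + (p[1]-centers[c][1])**2)
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl)/len(cl),
                              sum(p[1] for p in cl)/len(cl))
    return centers, clusters
```

On two well-separated blobs this converges in one pass; k-medoids/PAM replace the mean with the most central member, which is more robust to outliers.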
B. Hierarchy
BIRCH (CF tree), CURE (random sampling, large-scale clustering), ROCK (improvement of CURE, for categorical/enumerated data), Chameleon
C. Fuzzy
FCM: minimizes an 'objective function'
FCS: distance function by multi-dimensional hypersphere
MM: Mountain function → find the center of cluster
relatively high accuracy.
easily drawn into local optima.
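A sketch of the FCM idea for the two-cluster 1-D case (pure Python; the min/max seeding and the name `fcm2` are mine). Each point gets a soft membership in every cluster, and centers are membership-weighted means; with bad seeding this same loop lands in the local optima mentioned above.

```python
def fcm2(xs, m=2.0, iters=100):
    """Two-cluster fuzzy c-means on 1-D data; centers seeded at min/max."""
    centers = [min(xs), max(xs)]
    for _ in range(iters):
        u = []  # u[j][i]: fuzzy membership of point j in cluster i
        for x in xs:
            d = [abs(x - ctr) or 1e-12 for ctr in centers]  # guard zero distance
            u.append([1.0 / sum((d[i]/d[j]) ** (2/(m-1)) for j in range(2))
                      for i in range(2)])
        # centers re-estimated as membership-weighted means
        # (the minimizer step of the fuzzy 'objective function')
        for i in range(2):
            w = [u[j][i] ** m for j in range(len(xs))]
            centers[i] = sum(wj*x for wj, x in zip(w, xs)) / sum(w)
    return centers, u
```

The fuzzifier `m > 1` controls how soft the memberships are; `m → 1` recovers hard k-means-like assignments.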
D. Distribution
DBCLASD
same distribution → same cluster
GMM
several Gaussian distributions; high scalability.
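The GMM "same distribution → same cluster" idea, sketched as EM for two 1-D Gaussians (pure Python; the min/max seeding, the variance floor, and the name `gmm_em_1d` are my own simplifications):

```python
import math

def gmm_em_1d(xs, iters=50):
    """EM for a 2-component 1-D Gaussian mixture; means start at min/max."""
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi
```

The soft responsibilities are what distinguish this from k-means; points near a boundary split their weight between components.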
E. Density
high density → same cluster
DBSCAN *
OPTICS : improvement of DBSCAN
1) radius of neighborhood
2) min number of pts in a neighborhood
Mean-Shift *
pros
high efficiency
arbitrary shape
cons
results can be poor depending on the density distribution
big size memory
sensitive to parameters
DENCLUE : large scale data clustering
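A compact DBSCAN sketch in plain Python, built directly around the two parameters above: `eps` is the neighborhood radius (1) and `min_pts` the minimum neighborhood size (2). Function and variable names are mine.

```python
def dbscan(points, eps, min_pts):
    """Basic DBSCAN on 2-D tuples; returns one label per point (-1 = noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if (points[i][0]-q[0])**2 + (points[i][1]-q[1])**2 <= eps*eps]
    labels = [None] * len(points)
    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:       # parameter 2): too few points in radius
            labels[i] = -1            # provisionally noise
            continue
        cid += 1                      # i is a core point: start a new cluster
        labels[i] = cid
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:       # previously noise: becomes a border point
                labels[j] = cid
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is a core point too: keep expanding
                seeds.extend(jn)
    return labels
```

The naive neighbor scan makes this O(n²); the "sensitive to parameters" con above shows up directly as the choice of `eps` and `min_pts`.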
F. GRAPH Theory
1) CLICK : divides the graph by minimum-weight cuts
2) MST(minimum spanning tree)-based clustering
pros
high efficiency
high accuracy
cons
time complexity grows as graph's complexity grows.
SM and NJW algorithms need to be considered
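A sketch of MST-based clustering (pure Python, Kruskal + union-find; all names mine): build the minimum spanning tree over all points, then cut the k-1 heaviest edges, which leaves exactly k connected components.

```python
def mst_clusters(points, k):
    """Build the MST (Kruskal), drop the k-1 longest edges, return labels."""
    n = len(points)
    edges = sorted(
        ((points[i][0]-points[j][0])**2 + (points[i][1]-points[j][1])**2, i, j)
        for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst = []
    for w, i, j in edges:                   # Kruskal: add non-cycle edges
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    # cutting the k-1 heaviest MST edges leaves k clusters
    keep = sorted(mst)[:len(mst) - (k - 1)]
    parent = list(range(n))
    for w, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]      # labels are component roots
```

The O(n² log n) edge sort is exactly the "time complexity grows with graph complexity" con noted above.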
G. Grid
original data space → grid structure
1) STING
a. divided into many rectangular cells by constructing a hierarchical structure
b. parallelizable
2) CLIQUE
grid based + density based
pros
low time complexity
high scalability
cons
accuracy-efficiency trade-off
related to WaveCluster
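A sketch of the grid idea shared by STING/CLIQUE (pure Python; the specific merge rule and names are mine, not from either algorithm): map the original data space onto a grid, keep only dense cells, and merge adjacent dense cells into clusters. The cell size controls the accuracy-efficiency trade-off noted above.

```python
from collections import defaultdict

def grid_clusters(points, cell, density):
    """Bin points into square cells, keep cells holding >= density points,
    and merge edge/corner-adjacent dense cells into clusters."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell), int(y // cell))].append(idx)
    dense = {c for c, pts in cells.items() if len(pts) >= density}
    labels, cid = {}, 0
    for c in dense:                       # flood-fill over dense cells
        if c in labels:
            continue
        stack = [c]
        labels[c] = cid
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in labels:
                        labels[nb] = cid
                        stack.append(nb)
        cid += 1
    out = [-1] * len(points)              # points in sparse cells = noise
    for c, lab in labels.items():
        for idx in cells[c]:
            out[idx] = lab
    return out
```

Since the pass over cells is independent per region, this style of algorithm parallelizes well, which is the STING point above.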
H. Fractal Theory
FC : a change to any inner data of a cluster has no influence on the intrinsic quality of the fractal dimension. O(n)
sensitive to parameters
I. Model
1) Statistical learning
COBWEB : assumes the probability distribution of each attribute is independent
2) Neuralnet
a. SOM : builds a dimension-reducing mapping (assumption: a topology exists in the input)
b. ART : incremental algorithm. generates a new neuron.
time complexity : COBWEB < ART < SOM
2. MODERN
A. Kernel
kernel k-means
kernel SOM
kernel FCM
SVC : sphere in feature space → original data space (isoline: border of clusters)
MMC : hyperplane with maximum margin
MKC : improvement of MMC
pros
can separate overlapping clusters.
no need for prior knowledge about the topology of the data
cons
not good for large scale data(time complexity high)
sensitive to the kernel and its parameters
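A kernel k-means sketch with an RBF kernel (pure Python; alternating initial labels and all names are mine). Distances to cluster means are computed entirely through the kernel matrix, so clusters that overlap in the input space can separate in feature space; the O(n²) kernel matrix is also why this scales poorly, and `gamma` is the kernel parameter the cons warn about.

```python
import math

def kernel_kmeans(points, k, gamma=0.5, iters=20):
    """Kernel k-means with an RBF kernel; initial labels alternate 0..k-1."""
    n = len(points)
    K = [[math.exp(-gamma * sum((a-b)**2 for a, b in zip(points[i], points[j])))
          for j in range(n)] for i in range(n)]
    labels = [i % k for i in range(n)]
    for _ in range(iters):
        members = [[j for j in range(n) if labels[j] == c] for c in range(k)]
        # ||mean_c||^2 in feature space, shared by all points
        self_term = [sum(K[a][b] for a in m for b in m) / (len(m)**2)
                     if m else float("inf") for m in members]
        new = []
        for i in range(n):
            def dist(c):
                m = members[c]
                if not m:
                    return float("inf")
                # ||phi(x_i) - mean_c||^2 via the kernel trick
                return K[i][i] - 2*sum(K[i][j] for j in m)/len(m) + self_term[c]
            new.append(min(range(k), key=dist))
        if new == labels:
            break
        labels = new
    return labels
```

Swapping the RBF for another kernel changes the induced cluster shapes without touching the loop, which is the appeal of the whole kernel family above.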
B. Ensemble
initial clustering → (consensus function) → final result.
In most cases this is probably just the best choice.
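One common consensus function is the co-association matrix, sketched here in pure Python (the threshold rule and names are my own choice of consensus function, not the only one): count how often each pair of points shares a cluster across the initial clusterings, link pairs above a threshold, and take connected components as the final result.

```python
def consensus(labelings, threshold=0.5):
    """Co-association consensus: link points that share a cluster in at
    least `threshold` of the base clusterings, then take components."""
    n = len(labelings[0])
    # co[i][j]: fraction of base clusterings putting i and j together
    co = [[sum(lab[i] == lab[j] for lab in labelings) / len(labelings)
           for j in range(n)] for i in range(n)]
    label = [-1] * n
    cid = 0
    for i in range(n):                 # connected components over the links
        if label[i] != -1:
            continue
        stack = [i]
        label[i] = cid
        while stack:
            u = stack.pop()
            for v in range(n):
                if label[v] == -1 and co[u][v] >= threshold:
                    label[v] = cid
                    stack.append(v)
        cid += 1
    return label
```

Because the base clusterings can come from any of the algorithms in these notes, the ensemble inherits their diversity, which is why it tends to be robust.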
C. Swarm
simulate the changing process of the biological population
time complexity high
easy to understand
low scalability
D. Quantum Theory
QC : obtains a potential by solving the Schrödinger equation.
DQC : improvement of QC; uses the time-dependent Schrödinger equation.
E. Spectral Graph
graph partitioning.
1) recursive spectral : SM (similar to image segmentation using Normalized Cut)
2) multiway spectral : NJW (feature space by k-largest eigenvalues of the Laplacian Matrix)
time complexity high
Not sensitive to outliers
number of clusters needed to be preset
F. Affinity Propagation (AP ★)
simple, clear idea
N is not needed (N: number of clusters)
G. Density and Distance (DD ★)
arbitrary shape
insensitive to outliers(but sensitive to parameters)
time complexity high
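A sketch of the density-and-distance idea (pure Python; the rho*delta center pick and the assumption that the globally densest point is one of the centers are my simplifications): each point gets a density rho (neighbors within `dc`) and a distance delta to its nearest denser point; points with large rho*delta become centers, and everyone else takes the label of their nearest denser neighbor.

```python
def density_peaks(points, dc, k):
    """Density-peaks sketch: k centers by largest rho*delta, then assign
    each point the label of its nearest higher-density neighbor."""
    n = len(points)
    def d2(i, j):
        return (points[i][0]-points[j][0])**2 + (points[i][1]-points[j][1])**2
    # rho: how many other points lie within radius dc
    rho = [sum(d2(i, j) <= dc*dc for j in range(n) if j != i) for i in range(n)]
    order = sorted(range(n), key=lambda i: -rho[i])   # descending density
    delta, nearest = [0.0]*n, [None]*n
    for rank, i in enumerate(order):
        if rank == 0:                                 # densest point overall
            delta[i] = max(d2(i, j) for j in range(n)) ** 0.5
        else:                                         # nearest denser point
            j = min(order[:rank], key=lambda jj: d2(i, jj))
            nearest[i] = j
            delta[i] = d2(i, j) ** 0.5
    centers = sorted(range(n), key=lambda i: -(rho[i]*delta[i]))[:k]
    labels = [None]*n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:                                   # propagate downhill
        if labels[i] is None:
            labels[i] = labels[nearest[i]]
    return labels
```

The all-pairs distances give the high time complexity noted above, and the cutoff `dc` is the parameter the method is sensitive to.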
H. Spatial Data
DBSCAN, STING, Wavecluster(parallel processing possible), CLARANS(sample based CLARA + clustering by PAM)
I. Data Stream
STREAM
CluStream ┐
HPStream ┼ incremental
DenStream ┘
J. Large Scale Data
4V ┬ large in volume
├ rich in variety
├ high in velocity
└ doubt in veracity (truthfulness, accuracy)
1) sample
2) data merge
3) dimension reduction
4) parallel
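Strategy 1) can be sketched in a few lines (pure Python; `cluster_fn` is a hypothetical pluggable base clusterer, whereas CLARA specifically uses PAM on each sample): cluster only a random sample, then assign every point to its nearest sample-derived center.

```python
import random

def sample_cluster(points, k, sample_size, cluster_fn, seed=0):
    """CLARA-style sampling: run cluster_fn on a random sample to get k
    centers, then label all points by their nearest center."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)
    centers = cluster_fn(sample, k)     # any base clusterer returning k centers
    def nearest(p):
        return min(range(k),
                   key=lambda i: (p[0]-centers[i][0])**2 + (p[1]-centers[i][1])**2)
    return centers, [nearest(p) for p in points]
```

The expensive clustering touches only `sample_size` points, so cost no longer scales with the full volume; the risk is that a small sample misses a small cluster entirely.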