Monday, June 27, 2016

Clustering

http://link.springer.com/article/10.1007/s40745-015-0040-1

Link to a survey paper.

1. Classical

    A. Partition

        k-means, k-medoids, PAM, CLARA, CLARANS
        not suitable for non-convex data
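
        As a quick illustration (a minimal sketch, assuming scikit-learn; the dataset and parameters are mine, not the survey's): k-means still splits two half-moons with an essentially linear boundary.

            from sklearn.cluster import KMeans
            from sklearn.datasets import make_moons

            # Two interleaving half-moons: a classic non-convex shape.
            X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
            # Each resulting cluster mixes points from both moons, because
            # partition methods assume roughly convex, centroid-shaped clusters.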

    B. Hierarchy

        BIRCH (CF tree), CURE (random sampling, for large-scale clustering), ROCK (improvement of CURE, for data of enumeration type), Chameleon
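
        A minimal BIRCH sketch (assuming scikit-learn's Birch; parameter values are illustrative):

            from sklearn.cluster import Birch
            from sklearn.datasets import make_blobs

            X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)
            # A single pass builds the CF tree; a global step then clusters
            # the leaf entries into the requested number of clusters.
            labels = Birch(threshold=0.5, branching_factor=50, n_clusters=5).fit_predict(X)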

    C. Fuzzy

        FCM: minimizes a fuzzy 'objective function'
        FCS: distance function based on a multi-dimensional hypersphere
        MM: mountain function → used to find the cluster centers
        relatively high accuracy.
        easily drawn into a local optimum.
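
        A minimal FCM sketch in plain NumPy, written from the standard objective J = sum_i sum_j u_ij^m ||x_i - c_j||^2 (a toy version, not a reference implementation; the random initialization is exactly what makes it prone to local optima):

            import numpy as np

            def fcm(X, k, m=2.0, iters=100, seed=0):
                rng = np.random.default_rng(seed)
                U = rng.random((len(X), k))
                U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
                for _ in range(iters):
                    Um = U ** m
                    C = (Um.T @ X) / Um.sum(axis=0)[:, None] # fuzzily weighted centers
                    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
                    # Standard update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
                    U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
                return U, C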

    D. Distribution

        DBCLASD
            same distribution → same cluster
        GMM
            mixture of several Gaussian distributions; high scalability.
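
        A minimal GMM sketch (assuming scikit-learn's GaussianMixture; parameters are illustrative). Each component is one Gaussian, and every point gets a soft posterior over the components:

            from sklearn.datasets import make_blobs
            from sklearn.mixture import GaussianMixture

            X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
            gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)
            hard = gmm.predict(X)        # hard cluster labels
            soft = gmm.predict_proba(X)  # per-point probabilities over the 3 Gaussians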

    E. Density

        high density → same cluster
        DBSCAN *
            two parameters: 1) radius of the neighborhood, 2) min number of points in a neighborhood
        OPTICS: improvement of DBSCAN that is less sensitive to those two parameters
        Mean-Shift *

        pros
            high efficiency
            arbitrary cluster shapes
        cons
            results can be poor when cluster densities vary
            large memory requirement
            sensitive to parameters

        DENCLUE: large-scale data clustering
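
        A minimal DBSCAN sketch (assuming scikit-learn; eps and min_samples are exactly the two parameters listed above, and the values here are illustrative):

            from sklearn.cluster import DBSCAN
            from sklearn.datasets import make_moons

            X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
            labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # noise points get -1
            # Unlike k-means, this typically recovers both non-convex moons;
            # but nudging eps can merge or shatter them (parameter sensitivity).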

    F. Graph Theory

        1) CLICK: minimum-weight division of the graph
        2) MST(minimum spanning tree)-based clustering

        pros
            high efficiency
            high accuracy
        cons
            time complexity grows as the graph grows more complex.

        the SM and NJW algorithms (see Spectral Graph below) also need to be considered
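
        A minimal sketch of idea 2), MST-based clustering (my own toy illustration, not CLICK): build the MST of the pairwise-distance graph, delete the k-1 heaviest edges, and read the clusters off as connected components.

            import numpy as np
            from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
            from scipy.spatial import distance_matrix

            def mst_cluster(X, k):
                D = distance_matrix(X, X)                  # complete distance graph
                mst = minimum_spanning_tree(D).toarray()   # n-1 edges (upper triangle)
                edges = np.argwhere(mst > 0)
                weights = mst[mst > 0]
                # Cutting the k-1 longest MST edges leaves k components.
                for i, j in edges[np.argsort(weights)[len(weights) - (k - 1):]]:
                    mst[i, j] = 0
                _, labels = connected_components(mst, directed=False)
                return labels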

    G. Grid

        original data space → grid structure
        1) STING
            a. divides the space into rectangular cells organized in a hierarchical structure
            b. parallelizable
        2) CLIQUE
            grid-based + density-based

        pros
            low time complexity
            high scalability
        cons
            accuracy-efficiency trade-off

        related to WaveCluster
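
        A rough sketch of the grid idea (a toy in the spirit of STING/CLIQUE, not either actual algorithm): quantize points into cells, keep the dense cells, and flood-fill adjacent dense cells into clusters. The cell size sets the accuracy-efficiency trade-off noted above.

            import numpy as np
            from itertools import product

            def grid_cluster(X, cell=0.5, min_pts=5):
                cells = {}
                for idx, key in enumerate(map(tuple, np.floor(X / cell).astype(int))):
                    cells.setdefault(key, []).append(idx)
                dense = {c for c, pts in cells.items() if len(pts) >= min_pts}
                cell_label, n_clusters = {}, 0
                for c in dense:
                    if c in cell_label:
                        continue
                    stack = [c]                     # flood fill over adjacent dense cells
                    while stack:
                        cur = stack.pop()
                        if cur in cell_label:
                            continue
                        cell_label[cur] = n_clusters
                        for off in product((-1, 0, 1), repeat=X.shape[1]):
                            nb = tuple(np.add(cur, off))
                            if nb in dense and nb not in cell_label:
                                stack.append(nb)
                    n_clusters += 1
                labels = np.full(len(X), -1)        # -1 = points in sparse cells
                for c, pts in cells.items():
                    if c in cell_label:
                        labels[pts] = cell_label[c]
                return labels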

    H. Fractal Theory

        FC: a change to any inner data of a cluster has no influence on the intrinsic quality of the fractal dimension. O(n)
        sensitive to parameters

    I. Model

        1) Statistical learning
            COBWEB: assumes the probability distribution of each attribute is independent
        2) Neuralnet
            a. SOM: builds a dimension-reducing mapping (assumption: a topology exists in the input)
            b. ART: incremental algorithm; generates a new neuron when needed.

        time complexity : COBWEB < ART < SOM
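
        A minimal 1-D SOM sketch in NumPy (a toy illustration, not a reference SOM): each input pulls its best-matching unit (BMU) and that unit's grid neighbors toward it, so the grid's topology comes to mirror the input's.

            import numpy as np

            def som_1d(X, n_units=10, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
                rng = np.random.default_rng(seed)
                W = X[rng.choice(len(X), n_units)].astype(float)   # init units from data
                grid = np.arange(n_units)                          # 1-D neuron grid
                for t in range(iters):
                    x = X[rng.integers(len(X))]
                    bmu = np.argmin(np.linalg.norm(W - x, axis=1)) # best-matching unit
                    lr = lr0 * (1 - t / iters)                     # decaying learning rate
                    sigma = sigma0 * (1 - t / iters) + 1e-3        # shrinking neighborhood
                    h = np.exp(-(grid - bmu) ** 2 / (2 * sigma ** 2))
                    W += lr * h[:, None] * (x - W)                 # pull units toward x
                return W   # cluster a point afterwards by its nearest unit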

2. Modern

    A. Kernel

        kernel k-means
        kernel SOM
        kernel FCM
        SVC: minimal sphere in feature space → mapped back to the original data space (isolines form the borders of clusters)
        MMC : hyperplane with maximum margin
        MKC : improvement of MMC

        pros
            can separate overlapping clusters.
            no preliminary knowledge about the topology of the data is needed
        cons
            not good for large-scale data (high time complexity)
            sensitive to the choice of kernel and its parameters
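
        A minimal kernel k-means sketch in NumPy (my own illustration; SVC/MMC/MKC are different methods). Distances are computed in feature space through the kernel matrix alone, which is also why the cost scales badly with n:

            import numpy as np

            def kernel_kmeans(K, k, iters=50, seed=0):
                # ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_{j in c} K_ij + mean_{j,l in c} K_jl
                n = len(K)
                labels = np.random.default_rng(seed).integers(k, size=n)
                for _ in range(iters):
                    dist = np.zeros((n, k))
                    for c in range(k):
                        idx = np.flatnonzero(labels == c)
                        if len(idx) == 0:          # guard against empty clusters
                            dist[:, c] = np.inf
                            continue
                        dist[:, c] = (np.diag(K) - 2 * K[:, idx].mean(axis=1)
                                      + K[np.ix_(idx, idx)].mean())
                    new = dist.argmin(axis=1)
                    if np.array_equal(new, labels):
                        break
                    labels = new
                return labels

            # e.g. with an RBF kernel (the kernel and its parameters drive everything):
            #   from sklearn.metrics.pairwise import rbf_kernel
            #   labels = kernel_kmeans(rbf_kernel(X, gamma=2.0), k=2)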

    B. Ensemble

        initial clustering → (consensus function) → final result.
        When in doubt, this is probably the best overall choice.
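
        A minimal ensemble sketch using evidence accumulation (a co-association matrix is one common consensus function, chosen here for illustration): run k-means many times with varying k and seeds, count how often each pair of points lands together, then cluster that matrix hierarchically.

            import numpy as np
            from scipy.cluster.hierarchy import fcluster, linkage
            from scipy.spatial.distance import squareform
            from sklearn.cluster import KMeans

            def ensemble_cluster(X, k_final, runs=20, seed=0):
                rng = np.random.default_rng(seed)
                n = len(X)
                coassoc = np.zeros((n, n))
                for _ in range(runs):              # diverse base clusterings
                    lab = KMeans(n_clusters=int(rng.integers(2, 10)), n_init=1,
                                 random_state=int(rng.integers(1 << 30))).fit_predict(X)
                    coassoc += lab[:, None] == lab[None, :]
                D = 1 - coassoc / runs             # co-occurrence -> distance
                np.fill_diagonal(D, 0)
                Z = linkage(squareform(D), method='average')  # consensus step
                return fcluster(Z, t=k_final, criterion='maxclust')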

    C. Swarm

        simulates the evolution of a biological population
        time complexity high
        easy to understand
        low scalability

    D. Quantum Theory

        QC: obtains the potential energy via the Schrödinger equation.
        DQC: improvement of QC; uses the time-dependent Schrödinger equation.

    E. Spectral Graph

        graph partitioning.
        1) recursive spectral: SM (similar to image segmentation using normalized cuts)
        2) multiway spectral: NJW (feature space built from the eigenvectors of the k largest eigenvalues of the Laplacian matrix)
        time complexity high
        Not sensitive to outliers
        number of clusters needed to be preset
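
        A minimal NJW-style sketch in NumPy (an illustration of the multiway recipe above, not the authors' code; sigma is a sensitive scale parameter, and k must be preset as noted):

            import numpy as np
            from sklearn.cluster import KMeans

            def njw_spectral(X, k, sigma=1.0, seed=0):
                d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
                W = np.exp(-d2 / (2 * sigma ** 2))        # Gaussian affinities
                np.fill_diagonal(W, 0)
                d = W.sum(axis=1)
                Dinv = np.diag(1 / np.sqrt(d + 1e-12))
                L = Dinv @ W @ Dinv                       # normalized matrix D^-1/2 W D^-1/2
                _, vecs = np.linalg.eigh(L)
                U = vecs[:, -k:]                          # top-k eigenvectors as features
                U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
                return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)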

    F. Affinity Propagation (AP ★)

        simple, clear idea
        N is not needed (N: number of clusters)
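
        A minimal sketch (assuming scikit-learn's AffinityPropagation; damping value is illustrative). No cluster count is passed in; the preference value indirectly controls how many exemplars emerge:

            from sklearn.cluster import AffinityPropagation
            from sklearn.datasets import make_blobs

            X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
            ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
            n_found = len(ap.cluster_centers_indices_)  # cluster count emerges, not preset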

    G. Density and Distance (DD ★)

        arbitrary shape
        insensitive to outliers (but sensitive to parameters)
        time complexity high
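
        A minimal density-peaks sketch in NumPy (my reading of the density-and-distance idea; the cutoff d_c is the sensitive parameter noted above): centers are points with high local density rho AND high distance delta to any denser point; everyone else inherits the label of their nearest denser neighbor.

            import numpy as np

            def density_peaks(X, d_c=0.5, k=2):
                D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
                rho = (D < d_c).sum(axis=1) - 1            # local density (cutoff kernel)
                order = np.argsort(-rho)                   # densest point first
                delta = np.zeros(len(X))
                up = np.zeros(len(X), dtype=int)           # nearest denser neighbor
                delta[order[0]], up[order[0]] = D[order[0]].max(), order[0]
                for i, p in enumerate(order[1:], 1):
                    denser = order[:i]
                    j = denser[np.argmin(D[p, denser])]
                    delta[p], up[p] = D[p, j], j
                centers = np.argsort(-rho * delta)[:k]     # high rho AND high delta;
                labels = np.full(len(X), -1)               # the densest point scores highest
                labels[centers] = np.arange(k)
                for p in order:                            # descending density, so the
                    if labels[p] == -1:                    # denser neighbor is labeled
                        labels[p] = labels[up[p]]
                return labels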

    H. Spatial Data

        DBSCAN, STING, WaveCluster (parallel processing possible), CLARANS (based on the sampling-based CLARA, which clusters samples with PAM)

    I. Data Stream

        STREAM
        CluStream ┐
        HPStream  ┼ incremental
        DenStream ┘

    J. Large Scale Data

        4V ┬ large in volume
           ├ rich in variety
           ├ high in velocity
           └ doubt in veracity (truthfulness, accuracy)
        1) sample
        2) data merge
        3) dimension reduction
        4) parallel
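
        A minimal sketch of strategies 1) and 4) combined (assuming scikit-learn's MiniBatchKMeans; my choice of example, not the survey's): centers are updated incrementally from small chunks, so memory stays bounded regardless of volume.

            import numpy as np
            from sklearn.cluster import MiniBatchKMeans
            from sklearn.datasets import make_blobs

            X, _ = make_blobs(n_samples=200000, centers=10, random_state=0)
            mbk = MiniBatchKMeans(n_clusters=10, random_state=0)
            for chunk in np.array_split(X, 100):   # pretend the data arrives in chunks
                mbk.partial_fit(chunk)             # incremental center updates
            labels = mbk.predict(X)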
