Monday, June 27, 2016

Clustering

http://link.springer.com/article/10.1007/s40745-015-0040-1

Link to a survey paper.

1. Classical

    A. Partition

        k-means, k-medoids, PAM, CLARA, CLARANS
        not suitable for non-convex data
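
        As a quick illustration (a minimal sketch, assuming scikit-learn; the dataset and parameters are mine, not the survey's): k-means still splits two half-moons with an essentially linear boundary.

            from sklearn.cluster import KMeans
            from sklearn.datasets import make_moons

            # Two interleaving half-moons: a classic non-convex shape.
            X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
            # Each resulting cluster mixes points from both moons, because
            # partition methods assume roughly convex, centroid-shaped clusters.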

    B. Hierarchy

        BIRCH (CF tree), CURE (random sampling, for large-scale clustering), ROCK (improvement of CURE, for data of enumeration type), Chameleon
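
        A minimal BIRCH sketch (assuming scikit-learn's Birch; parameter values are illustrative):

            from sklearn.cluster import Birch
            from sklearn.datasets import make_blobs

            X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)
            # A single pass builds the CF tree; a global step then clusters
            # the leaf entries into the requested number of clusters.
            labels = Birch(threshold=0.5, branching_factor=50, n_clusters=5).fit_predict(X)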

    C. Fuzzy

        FCM: minimizes a fuzzy 'objective function'
        FCS: distance function based on a multi-dimensional hypersphere
        MM: mountain function → used to find the cluster centers
        relatively high accuracy.
        easily drawn into a local optimum.
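
        A minimal FCM sketch in plain NumPy, written from the standard objective J = sum_i sum_j u_ij^m ||x_i - c_j||^2 (a toy version, not a reference implementation; the random initialization is exactly what makes it prone to local optima):

            import numpy as np

            def fcm(X, k, m=2.0, iters=100, seed=0):
                rng = np.random.default_rng(seed)
                U = rng.random((len(X), k))
                U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
                for _ in range(iters):
                    Um = U ** m
                    C = (Um.T @ X) / Um.sum(axis=0)[:, None] # fuzzily weighted centers
                    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
                    # Standard update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
                    U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
                return U, C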

    D. Distribution

        DBCLASD
            same distribution → same cluster
        GMM
            mixture of several Gaussian distributions; high scalability.
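
        A minimal GMM sketch (assuming scikit-learn's GaussianMixture; parameters are illustrative). Each component is one Gaussian, and every point gets a soft posterior over the components:

            from sklearn.datasets import make_blobs
            from sklearn.mixture import GaussianMixture

            X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
            gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)
            hard = gmm.predict(X)        # hard cluster labels
            soft = gmm.predict_proba(X)  # per-point probabilities over the 3 Gaussians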

    E. Density

        high density → same cluster
        DBSCAN *
            two parameters: 1) radius of the neighborhood, 2) min number of points in a neighborhood
        OPTICS: improvement of DBSCAN that is less sensitive to those two parameters
        Mean-Shift *

        pros
            high efficiency
            arbitrary cluster shapes
        cons
            results can be poor when cluster densities vary
            large memory requirement
            sensitive to parameters

        DENCLUE: large-scale data clustering
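
        A minimal DBSCAN sketch (assuming scikit-learn; eps and min_samples are exactly the two parameters listed above, and the values here are illustrative):

            from sklearn.cluster import DBSCAN
            from sklearn.datasets import make_moons

            X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
            labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # noise points get -1
            # Unlike k-means, this typically recovers both non-convex moons;
            # but nudging eps can merge or shatter them (parameter sensitivity).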

    F. Graph Theory

        1) CLICK: minimum-weight division of the graph
        2) MST(minimum spanning tree)-based clustering

        pros
            high efficiency
            high accuracy
        cons
            time complexity grows as the graph grows more complex.

        the SM and NJW algorithms (see Spectral Graph below) also need to be considered
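
        A minimal sketch of idea 2), MST-based clustering (my own toy illustration, not CLICK): build the MST of the pairwise-distance graph, delete the k-1 heaviest edges, and read the clusters off as connected components.

            import numpy as np
            from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
            from scipy.spatial import distance_matrix

            def mst_cluster(X, k):
                D = distance_matrix(X, X)                  # complete distance graph
                mst = minimum_spanning_tree(D).toarray()   # n-1 edges (upper triangle)
                edges = np.argwhere(mst > 0)
                weights = mst[mst > 0]
                # Cutting the k-1 longest MST edges leaves k components.
                for i, j in edges[np.argsort(weights)[len(weights) - (k - 1):]]:
                    mst[i, j] = 0
                _, labels = connected_components(mst, directed=False)
                return labels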

    G. Grid

        original data space → grid structure
        1) STING
            a. divides the space into rectangular cells organized in a hierarchical structure
            b. parallelizable
        2) CLIQUE
            grid-based + density-based

        pros
            low time complexity
            high scalability
        cons
            accuracy-efficiency trade-off

        related to WaveCluster
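
        A rough sketch of the grid idea (a toy in the spirit of STING/CLIQUE, not either actual algorithm): quantize points into cells, keep the dense cells, and flood-fill adjacent dense cells into clusters. The cell size sets the accuracy-efficiency trade-off noted above.

            import numpy as np
            from itertools import product

            def grid_cluster(X, cell=0.5, min_pts=5):
                cells = {}
                for idx, key in enumerate(map(tuple, np.floor(X / cell).astype(int))):
                    cells.setdefault(key, []).append(idx)
                dense = {c for c, pts in cells.items() if len(pts) >= min_pts}
                cell_label, n_clusters = {}, 0
                for c in dense:
                    if c in cell_label:
                        continue
                    stack = [c]                     # flood fill over adjacent dense cells
                    while stack:
                        cur = stack.pop()
                        if cur in cell_label:
                            continue
                        cell_label[cur] = n_clusters
                        for off in product((-1, 0, 1), repeat=X.shape[1]):
                            nb = tuple(np.add(cur, off))
                            if nb in dense and nb not in cell_label:
                                stack.append(nb)
                    n_clusters += 1
                labels = np.full(len(X), -1)        # -1 = points in sparse cells
                for c, pts in cells.items():
                    if c in cell_label:
                        labels[pts] = cell_label[c]
                return labels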

    H. Fractal Theory

        FC: a change to any inner data of a cluster has no influence on the intrinsic quality of the fractal dimension. O(n)
        sensitive to parameters

    I. Model

        1) Statistical learning
            COBWEB: assumes the probability distribution of each attribute is independent
        2) Neuralnet
            a. SOM: builds a dimension-reducing mapping (assumption: a topology exists in the input)
            b. ART: incremental algorithm; generates a new neuron when needed.

        time complexity : COBWEB < ART < SOM
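
        A minimal 1-D SOM sketch in NumPy (a toy illustration, not a reference SOM): each input pulls its best-matching unit (BMU) and that unit's grid neighbors toward it, so the grid's topology comes to mirror the input's.

            import numpy as np

            def som_1d(X, n_units=10, iters=2000, lr0=0.5, sigma0=2.0, seed=0):
                rng = np.random.default_rng(seed)
                W = X[rng.choice(len(X), n_units)].astype(float)   # init units from data
                grid = np.arange(n_units)                          # 1-D neuron grid
                for t in range(iters):
                    x = X[rng.integers(len(X))]
                    bmu = np.argmin(np.linalg.norm(W - x, axis=1)) # best-matching unit
                    lr = lr0 * (1 - t / iters)                     # decaying learning rate
                    sigma = sigma0 * (1 - t / iters) + 1e-3        # shrinking neighborhood
                    h = np.exp(-(grid - bmu) ** 2 / (2 * sigma ** 2))
                    W += lr * h[:, None] * (x - W)                 # pull units toward x
                return W   # cluster a point afterwards by its nearest unit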

2. Modern

    A. Kernel

        kernel k-means
        kernel SOM
        kernel FCM
        SVC: minimal sphere in feature space → mapped back to the original data space (isolines form the borders of clusters)
        MMC : hyperplane with maximum margin
        MKC : improvement of MMC

        pros
            can separate overlapping clusters.
            no preliminary knowledge about the topology of the data is needed
        cons
            not good for large-scale data (high time complexity)
            sensitive to the choice of kernel and its parameters
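
        A minimal kernel k-means sketch in NumPy (my own illustration; SVC/MMC/MKC are different methods). Distances are computed in feature space through the kernel matrix alone, which is also why the cost scales badly with n:

            import numpy as np

            def kernel_kmeans(K, k, iters=50, seed=0):
                # ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_{j in c} K_ij + mean_{j,l in c} K_jl
                n = len(K)
                labels = np.random.default_rng(seed).integers(k, size=n)
                for _ in range(iters):
                    dist = np.zeros((n, k))
                    for c in range(k):
                        idx = np.flatnonzero(labels == c)
                        if len(idx) == 0:          # guard against empty clusters
                            dist[:, c] = np.inf
                            continue
                        dist[:, c] = (np.diag(K) - 2 * K[:, idx].mean(axis=1)
                                      + K[np.ix_(idx, idx)].mean())
                    new = dist.argmin(axis=1)
                    if np.array_equal(new, labels):
                        break
                    labels = new
                return labels

            # e.g. with an RBF kernel (the kernel and its parameters drive everything):
            #   from sklearn.metrics.pairwise import rbf_kernel
            #   labels = kernel_kmeans(rbf_kernel(X, gamma=2.0), k=2)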

    B. Ensemble

        initial clustering → (consensus function) → final result.
        When in doubt, this is probably the best overall choice.
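
        A minimal ensemble sketch using evidence accumulation (a co-association matrix is one common consensus function, chosen here for illustration): run k-means many times with varying k and seeds, count how often each pair of points lands together, then cluster that matrix hierarchically.

            import numpy as np
            from scipy.cluster.hierarchy import fcluster, linkage
            from scipy.spatial.distance import squareform
            from sklearn.cluster import KMeans

            def ensemble_cluster(X, k_final, runs=20, seed=0):
                rng = np.random.default_rng(seed)
                n = len(X)
                coassoc = np.zeros((n, n))
                for _ in range(runs):              # diverse base clusterings
                    lab = KMeans(n_clusters=int(rng.integers(2, 10)), n_init=1,
                                 random_state=int(rng.integers(1 << 30))).fit_predict(X)
                    coassoc += lab[:, None] == lab[None, :]
                D = 1 - coassoc / runs             # co-occurrence -> distance
                np.fill_diagonal(D, 0)
                Z = linkage(squareform(D), method='average')  # consensus step
                return fcluster(Z, t=k_final, criterion='maxclust')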

    C. Swarm

        simulates the evolution of a biological population
        time complexity high
        easy to understand
        low scalability

    D. Quantum Theory

        QC: obtains the potential energy via the Schrödinger equation.
        DQC: improvement of QC; uses the time-dependent Schrödinger equation.

    E. Spectral Graph

        graph partitioning.
        1) recursive spectral: SM (similar to image segmentation using normalized cuts)
        2) multiway spectral: NJW (feature space built from the eigenvectors of the k largest eigenvalues of the Laplacian matrix)
        time complexity high
        Not sensitive to outliers
        number of clusters needed to be preset
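
        A minimal NJW-style sketch in NumPy (an illustration of the multiway recipe above, not the authors' code; sigma is a sensitive scale parameter, and k must be preset as noted):

            import numpy as np
            from sklearn.cluster import KMeans

            def njw_spectral(X, k, sigma=1.0, seed=0):
                d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
                W = np.exp(-d2 / (2 * sigma ** 2))        # Gaussian affinities
                np.fill_diagonal(W, 0)
                d = W.sum(axis=1)
                Dinv = np.diag(1 / np.sqrt(d + 1e-12))
                L = Dinv @ W @ Dinv                       # normalized matrix D^-1/2 W D^-1/2
                _, vecs = np.linalg.eigh(L)
                U = vecs[:, -k:]                          # top-k eigenvectors as features
                U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
                return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)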

    F. Affinity Propagation (AP ★)

        simple, clear idea
        N is not needed (N: number of clusters)
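
        A minimal sketch (assuming scikit-learn's AffinityPropagation; damping value is illustrative). No cluster count is passed in; the preference value indirectly controls how many exemplars emerge:

            from sklearn.cluster import AffinityPropagation
            from sklearn.datasets import make_blobs

            X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
            ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
            n_found = len(ap.cluster_centers_indices_)  # cluster count emerges, not preset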

    G. Density and Distance (DD ★)

        arbitrary shape
        insensitive to outliers (but sensitive to parameters)
        time complexity high
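
        A minimal density-peaks sketch in NumPy (my reading of the density-and-distance idea; the cutoff d_c is the sensitive parameter noted above): centers are points with high local density rho AND high distance delta to any denser point; everyone else inherits the label of their nearest denser neighbor.

            import numpy as np

            def density_peaks(X, d_c=0.5, k=2):
                D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
                rho = (D < d_c).sum(axis=1) - 1            # local density (cutoff kernel)
                order = np.argsort(-rho)                   # densest point first
                delta = np.zeros(len(X))
                up = np.zeros(len(X), dtype=int)           # nearest denser neighbor
                delta[order[0]], up[order[0]] = D[order[0]].max(), order[0]
                for i, p in enumerate(order[1:], 1):
                    denser = order[:i]
                    j = denser[np.argmin(D[p, denser])]
                    delta[p], up[p] = D[p, j], j
                centers = np.argsort(-rho * delta)[:k]     # high rho AND high delta;
                labels = np.full(len(X), -1)               # the densest point scores highest
                labels[centers] = np.arange(k)
                for p in order:                            # descending density, so the
                    if labels[p] == -1:                    # denser neighbor is labeled
                        labels[p] = labels[up[p]]
                return labels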

    H. Spatial Data

        DBSCAN, STING, WaveCluster (parallel processing possible), CLARANS (based on the sampling-based CLARA, which clusters samples with PAM)

    I. Data Stream

        STREAM
        CluStream ┐
        HPStream  ┼ incremental
        DenStream ┘

    J. Large Scale Data

        4V ┬ large in volume
           ├ rich in variety
           ├ high in velocity
           └ doubt in veracity (truthfulness, accuracy)
        1) sample
        2) data merge
        3) dimension reduction
        4) parallel
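
        A minimal sketch of strategies 1) and 4) combined (assuming scikit-learn's MiniBatchKMeans; my choice of example, not the survey's): centers are updated incrementally from small chunks, so memory stays bounded regardless of volume.

            import numpy as np
            from sklearn.cluster import MiniBatchKMeans
            from sklearn.datasets import make_blobs

            X, _ = make_blobs(n_samples=200000, centers=10, random_state=0)
            mbk = MiniBatchKMeans(n_clusters=10, random_state=0)
            for chunk in np.array_split(X, 100):   # pretend the data arrives in chunks
                mbk.partial_fit(chunk)             # incremental center updates
            labels = mbk.predict(X)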
