Name

sxk_means - K-means classification of a set of images

Usage

Usage on the command line:

sxk_means.py stack outdir <maskfile> --K=10 --trials=2 --debug --opt_method='cla' --maxit=100 --rand_seed=10 --crit='all' --F=0.9 --T0=2.0 --init_method='rnd' --normalize --CTF --MPI --CUDA

Usage from Python:

k_means_main(stack, out_file, maskname, opt_method, K1, K2, rand_seed, maxit, trials, CTF, F, T0, MPI=False, CUDA=False, DEBUG=False, flagnorm=False)

  • To use the MPI (parallel) version:

  • 1. Set the --MPI flag on the command line.
  • 2. Launch the program through mpirun: mpirun -np 32 sxk_means.py followed by the remaining parameters.
  • The above example is for mympi.

Example:

sxk_means.py hri_stack.hdf RES mask2d_23.hdf --opt_method="SSE" --K=128 --maxit=500 --crit="D"

sxk_means.py bdb:hri_stack RES mask2d_23.hdf --opt_method="SSE" --K=128 --maxit=1000 --rand_seed=100 --T0=2.5 --F=0.995 --MPI

sxk_means.py bdb:hri_stack RES mask2d_13.hdf --K=212 --maxit=10000 --rand_seed=10 --T0=-1 --F=0.995 --CUDA

Note: the 2D input images have to be aligned (see sxali2d).

Input

stack
The input stack of images
maskfile
optional mask file to be used
outdir
name of the directory where the results are written
  • The parameters preceded by -- are optional; default values are given in parentheses.

  • K
    The requested number of clusters (2).
    trials
    number of trials of K-means (see description below) (default one trial). NOT USED in CUDA version.
    opt_method
    optimization method: 'SSE' or 'cla' (default is SSE) (see description below). NOT USED in CUDA version.
    maxit
    maximum number of iterations the program will perform (default 100)
    CTF
    if set, CTF information stored in file headers will be used (default no CTF). NOT USED in CUDA version.
    rand_seed
    the seed used to generate random numbers (default -1, which gives a different pseudo-random sequence each time)
    crit

    names of the criteria to compute: 'all' for all criteria, 'C' for Coleman, 'H' for Harabasz, or 'D' for Davies-Bouldin. These criteria return values of classification quality; see also sxk_means_groups. The letters can be combined freely, e.g. 'CD', 'HC', 'CHD'. The CUDA version always returns all criteria, equivalent to 'CHD'.

    T0
    simulated annealing: the initial temperature at which the algorithm starts (default 0.0, which turns simulated annealing off)
    F
    simulated annealing: the cooling factor by which the temperature is decreased after each iteration, T = T * F (default 0.0, which turns simulated annealing off)
    MPI
    use the MPI version of K-means (default False; can be combined with the CUDA option to run on a GPU cluster)
    CUDA
    use the CUDA version of K-means (default False; can be combined with the MPI option to run on a GPU cluster)
    normalize
    Normalize images under the mask
    init_method
    Method used to initialize partition: "rnd" randomize or "d2w" for d2 weighting initialization (default is rnd)

    Output

    outdir
    The directory to which the averages and the variances of the K clusters are written. The classification charts are written to the logfile. The CUDA version does not export the classification charts or the variances.

    Warning: If the output directory already exists, the program will crash with an error message. Choose a different directory name and restart the program.
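Since an existing output directory stops the program, a wrapper script can pick an unused name before launching it. The following is only a minimal sketch and is not part of sxk_means; the helper name fresh_outdir and the numeric-suffix scheme are assumptions made for illustration.

```python
import os

def fresh_outdir(base):
    """Return `base` if no such path exists, otherwise the first unused
    name among base_1, base_2, ... (hypothetical naming scheme)."""
    if not os.path.exists(base):
        return base
    n = 1
    while os.path.exists("%s_%d" % (base, n)):
        n += 1
    return "%s_%d" % (base, n)
```

The returned name can then be passed as the outdir argument, for instance fresh_outdir("RES") in place of a fixed RES.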

    The program will write two kinds of image stack files:

    • the averages of each cluster (averages.hdf) and

    • the variance of each cluster (variances.hdf).

    The averages have the following attributes set:

    • 'Class_average': 1 (indicates that the image is a class average, not raw data),
    • 'nobjects': number of objects in a given class,
    • 'members': list of images assigned to this class.

    The variances have the following attributes set:

    • 'Class_average': 1 and
    • 'nobjects'.

    Description

    • The command implements two minimization methods, each with two different algorithms selected by the CTF flag. In each case, random initialization is used, i.e., images are initially assigned at random to K classes.
    • Minimization methods:
    • cla - classical K-means, in which class averages are updated only after all images have been reassigned. The method is fast, but except in trivial cases it fails to find a good assignment.

    • SSE - Sum-of-Squared-Error K-means, in which class averages are updated after the reassignment of each object. The method is slower (with CTF it is painfully slow), but yields better classification results.
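The difference between the two update schedules can be sketched on plain 1-D numbers. This is not the sxk_means code: the function names are hypothetical, the batch version is the textbook K-means in which averages are recomputed once per pass, and the sequential version follows the iterative-optimization transfer rule of Duda, Hart and Stork, which is assumed here to correspond to the SSE option. Both functions expect an initial assignment that leaves no class empty.

```python
def mean(values):
    return sum(values) / float(len(values))

def batch_kmeans(data, assign, K, maxit=100):
    """Batch sketch: reassign every point, then recompute all averages."""
    assign = list(assign)
    centers = [None] * K
    for _ in range(maxit):
        for k in range(K):
            members = [x for x, a in zip(data, assign) if a == k]
            if members:  # a class that became empty keeps its previous average
                centers[k] = mean(members)
        new = [min(range(K), key=lambda k: (x - centers[k]) ** 2) for x in data]
        if new == assign:  # no point changed class: converged
            break
        assign = new
    return assign

def sequential_kmeans(data, assign, K, maxit=100):
    """Sequential sketch: averages are updated right after each single move."""
    assign = list(assign)
    counts = [assign.count(k) for k in range(K)]
    centers = [mean([x for x, a in zip(data, assign) if a == k]) for k in range(K)]
    for _ in range(maxit):
        moved = False
        for idx, x in enumerate(data):
            i = assign[idx]
            if counts[i] <= 1:
                continue  # never empty a class
            # change in the sum of squared errors if x leaves i or joins k
            cost = []
            for k in range(K):
                d = (x - centers[k]) ** 2
                if k == i:
                    cost.append(counts[k] / (counts[k] - 1.0) * d)
                else:
                    cost.append(counts[k] / (counts[k] + 1.0) * d)
            j = min(range(K), key=lambda k: cost[k])
            if j != i and cost[j] < cost[i]:
                # incremental update of the two affected averages
                centers[i] = (centers[i] * counts[i] - x) / (counts[i] - 1)
                centers[j] = (centers[j] * counts[j] + x) / (counts[j] + 1)
                counts[i] -= 1
                counts[j] += 1
                assign[idx] = j
                moved = True
        if not moved:
            break
    return assign
```

The sequential version does more work per pass, because the averages change after every accepted move, which is consistent with the SSE method being the slower of the two.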

    • The results of K-means classification are (in most cases) not reproducible: if classification is repeated for the same number of classes but with a different initial assignment (as in this implementation), the result will be different. To obtain a reproducible result, repeat K-means many times and accept the 'best' solution, as identified by the criterion value. For a sufficiently large number of trials and reasonable data, it is possible to find the optimum solution. The trials option facilitates this: the program repeats the classification the specified number of times and returns the best solution found.
    • To look for a better classification you can use simulated annealing, which is implemented for all minimization methods of K-means. To turn it on, set the initial temperature T0 and the cooling factor F. The more slowly the temperature decreases, i.e. the closer the cooling factor F is to 1.0 (e.g. 0.9995), the better the classification should be. Slower cooling also means slower convergence, so the algorithm needs more iterations and more time. Choose a maximum number of iterations (maxit) high enough that it is not reached before simulated annealing has converged. The initial temperature T0 is set empirically, or determined automatically by the program.
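The trade-off between the cooling factor and the iteration limit follows from the update T = T * F: after n iterations the temperature is T0 * F**n. A rough way to size the iteration limit is to count how many updates are needed to cool below some small temperature. This is a sketch only; the threshold T_end and the function name are assumptions, since the page does not say when the program considers annealing finished.

```python
import math

def iterations_to_cool(T0, F, T_end=1e-3):
    """Smallest n with T0 * F**n <= T_end (T_end is an assumed threshold)."""
    if T0 <= 0.0 or not (0.0 < F < 1.0):
        raise ValueError("need T0 > 0 and 0 < F < 1")
    if T0 <= T_end:
        return 0
    return int(math.ceil(math.log(T_end / T0) / math.log(F)))
```

Comparing the returned count with the intended iteration limit gives a quick check that the limit will not cut the cooling short; the count grows quickly as F approaches 1.0.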

    • The program calculates and returns values of classification quality; see sxk_means_groups.
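As an illustration of what such a quality value measures, here is the textbook Davies-Bouldin index on 1-D data. It is a sketch under assumptions: the function name is hypothetical, and the 'D' criterion computed by the program may differ in detail (for example in how distances are measured). Lower values indicate compact, well-separated classes.

```python
def davies_bouldin(clusters):
    """Textbook Davies-Bouldin index for a list of non-empty 1-D clusters.

    Requires at least two clusters with distinct averages.
    """
    centers = [sum(c) / float(len(c)) for c in clusters]
    # scatter: average distance of the members to their class average
    scatter = [sum(abs(x - m) for x in c) / float(len(c))
               for c, m in zip(clusters, centers)]
    K = len(clusters)
    total = 0.0
    for i in range(K):
        # worst (largest) similarity of class i to any other class
        total += max((scatter[i] + scatter[j]) / abs(centers[i] - centers[j])
                     for j in range(K) if j != i)
    return total / K
```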

    • The program can also cluster a text file containing columns of numbers; the elements to cluster are indexed by row number. For example, if infile is a text file with N columns, then running the program with infile as input instead of a stack of images makes sxk_means cluster on K columns, where K<N is determined by the maskfile. The elements of the i-th cluster are written to kmeans_grp_00i.txt in the output directory.

    • The maskfile has to be a binary file and determines which columns the program will cluster on. For example, if the input text file has 4 columns, the following will produce a binary mask maskone.hdf for clustering on the first column of infile:
    • maskone = model_blank(4,bckg=0)
    • maskone[0]=1
    • drop_image(maskone,'maskone.hdf')
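The effect of such a mask on the rows of the text file can be pictured in plain Python. This only illustrates the column selection described above and is not the program's code; the function name is hypothetical.

```python
def select_columns(rows, mask):
    """Keep, in each row, only the columns whose mask entry is nonzero."""
    keep = [i for i, m in enumerate(mask) if m]
    return [[row[i] for i in keep] for row in rows]
```

With the mask [1, 0, 0, 0] from the example, each 4-column row is reduced to its first value, so clustering would use that column alone.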

    Reference

    • Richard O. Duda, Peter E. Hart, David G. Stork - Pattern Classification, 2nd Edition

    Author / Maintainer

    Julien Bert

    Keywords

    category 1
    APPLICATIONS

    Files

    statistics.py, sxk_means.py

    See also

    sxk_means_groups sxk_means_stable

    Maturity

    beta
    works for author, often works for others.

    Bugs

    HDF file: an HDF header is limited in the number of items it can contain (~16000). If 'members' (the list of images assigned to a class) has more than 16000 elements, all assignments are automatically exported to text files: kmeans_grp_00.txt, kmeans_grp_01.txt, etc. Each file contains the list of image IDs assigned to that class.
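When the assignments end up in text files, they can be read back with a few lines of Python. The exact layout of these files is not documented here, so this sketch assumes integer image IDs separated by whitespace or newlines; the function name is hypothetical.

```python
def read_members(path):
    """Read image IDs from one group file (assumed whitespace-separated integers)."""
    with open(path) as f:
        return [int(tok) for tok in f.read().split()]
```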

    sxk means (last edited 2010-07-27 20:01:40 by ranlin)