Name
sxk_means - K-means classification of a set of images
Usage
Command-line usage:
sxk_means.py stack outdir <maskfile> --K=10 --trials=2 --debug --opt_method='cla' --maxit=100 --rand_seed=10 --crit='all' --F=0.9 --T0=2.0 --init_method='rnd' --normalize --CTF --MPI --CUDA
Usage in Python:
k_means_main(stack, out_file, maskname, opt_method, K1, K2, rand_seed, maxit, trials, CTF, F, T0, MPI=False, CUDA=False, DEBUG=False, flagnorm=False)
To use the MPI (parallel) version:
- 1. Set the --MPI flag on the command line.
- 2. Run mpirun -np 32 sxk_means.py followed by the remaining parameters.
- The above example is for mympi.
Example:
sxk_means.py hri_stack.hdf RES mask2d_23.hdf --opt_method="SSE" --K=128 --maxit=500 --crit="D"
sxk_means.py bdb:hri_stack RES mask2d_23.hdf --opt_method="SSE" --K=128 --maxit=1000 --rand_seed=100 --T0=2.5 --F=0.995 --MPI
sxk_means.py bdb:hri_stack RES mask2d_13.hdf --K=212 --maxit=10000 --rand_seed=10 --T0=-1 --F=0.995 --CUDA
Note: the 2D input images have to be aligned (see sxali2d).
Input
- stack
- The input stack of images
- maskfile
- optional mask file to be used
- outdir
- name of the directory to which the results are written
The parameters preceded by -- are optional; default values are given in parentheses.
Names of the criteria used (--crit): 'all' for all criteria, 'C' for Coleman, 'H' for Harabasz, or 'D' for Davies-Bouldin. These criteria return values of classification quality; see also sxk_means_groups. Criteria can be combined freely, e.g. 'CD', 'HC', 'CHD'. The CUDA version always returns all criteria, equivalent to 'CHD'.
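As an illustration of what one such quality criterion measures, here is a minimal sketch of the textbook Davies-Bouldin index in plain Python 3.8+. It is not taken from sxk_means, whose implementation may differ in detail; the function names centroid and davies_bouldin are hypothetical, and points are plain sequences of floats rather than images. Lower values indicate tighter, better separated classes.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def davies_bouldin(clusters):
    """Textbook Davies-Bouldin index for a list of K >= 2 non-empty clusters.

    Each cluster is a list of points; each point is a sequence of floats.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean Euclidean distance of its members to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    K = len(clusters)
    total = 0.0
    for i in range(K):
        # Worst-case ratio of combined scatter to centroid separation for cluster i.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(K)
            if j != i
        )
    return total / K

tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (0.0, 8.0)], [(10.0, 0.0), (10.0, 8.0)]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

The second data set has the same centroid separation but four times the scatter, so its index is larger.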
Output
- outdir
- The directory to which the averages of the K clusters and their variances are written. The classification charts are written to the logfile. The CUDA version does not export the classification charts or the variances.
Warning: If the output directory already exists, the program will crash and display an error message. Choose a different directory name and restart the program.
The program will write two kinds of image stack files:
the averages of each cluster (averages.hdf) and the
variance of each cluster (variances.hdf).
The averages have the following attributes set:
- 'Class_average': 1 (indicates that the image is a class average, not raw data),
- 'nobjects': number of objects in a given class,
- 'members': list of images assigned to this class.
The variances have the following attributes set:
- 'Class_average': 1 and
- 'nobjects'.
Description
- The command implements two minimization methods and two different algorithms depending on the CTF flag. In each case, random initialization is used, i.e., initially, images are randomly assigned to K classes.
- Minimization methods:
cla - classical K-means, in which class averages are updated after reassignment of each image. The method is fast but, except in trivial cases, it fails to find a good assignment.
SSE - Sum-of-Squared-Error K-means, in which class averages are updated after reassignment of each object. The method is slower (with CTF it is painfully slow), but yields better classification results.
- The results of K-means classification are (in most cases) irreproducible, i.e., if classification is repeated for the same number of classes but with a different initial assignment (as in this implementation), the result will be different. To obtain reproducible results, one is advised to repeat K-means many times and accept the 'best' solution, as identified by the criterion value. For a sufficiently large number of trials and reasonable data, it is possible to find the optimum solution. This process is facilitated by the number of trials the user can provide: the program will repeat the classification the specified number of times and return the best solution found.
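The repeat-and-keep-the-best strategy described above can be sketched as follows. This is a minimal plain-Python illustration, not the sxk_means implementation: the function names kmeans_once and best_of_trials are hypothetical, the sum of squared errors stands in for the quality criterion, and points are lists of floats rather than images.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans_once(data, K, maxit, rng):
    """One K-means run from a random initial assignment; returns (sse, assignment)."""
    if not 1 <= K <= len(data):
        raise ValueError("K must be between 1 and the number of points")
    # Random initial assignment: shuffle, then deal round-robin so no class is empty.
    order = list(range(len(data)))
    rng.shuffle(order)
    assign = [0] * len(data)
    for pos, idx in enumerate(order):
        assign[idx] = pos % K
    centers = [mean([p for p, a in zip(data, assign) if a == k]) for k in range(K)]
    for _ in range(maxit):
        # Reassign every point to its nearest class average.
        new_assign = [min(range(K), key=lambda k: sq_dist(p, centers[k])) for p in data]
        if new_assign == assign:
            break  # converged: no point changed class
        assign = new_assign
        for k in range(K):
            members = [p for p, a in zip(data, assign) if a == k]
            if members:  # an emptied class keeps its previous average
                centers[k] = mean(members)
    sse = sum(sq_dist(p, centers[a]) for p, a in zip(data, assign))
    return sse, assign

def best_of_trials(data, K, trials, maxit=100, rand_seed=10):
    """Repeat K-means `trials` times and keep the run with the lowest SSE."""
    rng = random.Random(rand_seed)
    best = None
    for _ in range(trials):
        result = kmeans_once(data, K, maxit, rng)
        if best is None or result[0] < best[0]:
            best = result
    return best

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
best_sse, best_assign = best_of_trials(data, K=2, trials=5)
print(best_sse, best_assign)
```

Fixing the seed makes the whole sequence of trials repeatable, which is the role a parameter such as --rand_seed would play in this sketch.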
To find a better classification you can use simulated annealing, which is implemented for all minimization methods of K-means. To turn it on, set the initial temperature T0 and the cooling factor F. The more slowly the temperature decreases, i.e., the closer the cooling factor F is to 1.0 (e.g., 0.9995), the better the classification should be. However, slower cooling also means slower convergence, so the algorithm needs more iterations and more time. You must choose a maximum number of iterations (--maxit) high enough that it is not reached before simulated annealing has converged. The value of the initial temperature T0 is chosen empirically or determined automatically by the program.
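To get a feel for how the cooling factor affects the required number of iterations, here is a small sketch. It assumes a geometric cooling schedule, T_n = T0 * F**n, which is a common choice but is not confirmed by this page; the function name iterations_to_cool and the threshold T_min are hypothetical and serve only to illustrate the trade-off.

```python
import math

def iterations_to_cool(T0, F, T_min=1e-3):
    """Smallest n with T0 * F**n < T_min, assuming geometric cooling.

    T_min is an arbitrary illustrative threshold, not a sxk_means parameter.
    """
    if not 0.0 < F < 1.0:
        raise ValueError("F must lie strictly between 0 and 1")
    if T0 <= 0.0 or T_min <= 0.0:
        raise ValueError("T0 and T_min must be positive")
    if T0 < T_min:
        return 0  # already below the threshold
    n = math.ceil(math.log(T_min / T0) / math.log(F))
    # Guard against floating-point rounding at the boundary.
    while T0 * F ** n >= T_min:
        n += 1
    return n

fast = iterations_to_cool(2.0, 0.9)
slow = iterations_to_cool(2.0, 0.995)
print(fast, slow)
```

Under this assumption, moving F from 0.9 to 0.995 multiplies the number of cooling steps roughly twentyfold, which is why the iteration limit has to grow along with F.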
The program calculates and returns values of classification quality; see sxk_means_groups.
The program can also cluster data in a text file containing columns of numbers; the elements to cluster are indexed by row number. For example, if infile is a text file with N columns, then running the program with infile as input in place of an image stack makes sxk_means cluster on K columns, where K<N is determined by the maskfile. The elements of the i-th cluster are written to kmeans_grp_00i.txt in the output directory.
- The maskfile has to be a binary file and determines which columns the program clusters on. For example, if the input text file has 4 columns, the following will produce a binary mask maskone.hdf for clustering on the first column of infile:
- maskone = model_blank(4,bckg=0)
- maskone[0]=1
- drop_image(maskone,'maskone.hdf')
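The effect of such a binary mask on a table of numbers can be sketched in plain Python. This only illustrates the idea of column selection and is not how sxk_means reads its input; the function name select_columns and the sample rows are hypothetical.

```python
def select_columns(rows, mask):
    """Keep only the columns whose mask entry is non-zero.

    rows: list of equal-length rows of numbers (one row per element to cluster).
    mask: sequence of 0/1 flags, one per column.
    """
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("every row must have as many columns as the mask")
    keep = [j for j, flag in enumerate(mask) if flag]
    return [[row[j] for j in keep] for row in rows]

rows = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]
mask_first = [1, 0, 0, 0]   # same pattern as maskone above: first column only
print(select_columns(rows, mask_first))
```

Each row remains one element to cluster; only the features used to compare rows change.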
Reference
Richard O. Duda, Peter E. Hart, David G. Stork. Pattern Classification, 2nd Edition.
Author / Maintainer
Julien Bert
Keywords
- category 1
- APPLICATIONS
Files
statistics.py, sxk_means.py
See also
sxk_means_groups sxk_means_stable
Maturity
- beta
- works for author, often works for others.
Bugs
HDF file: the HDF format limits the number of items contained in the header (~16000). If 'members' (the list of images assigned to a class) has more than 16000 elements, all assignments are automatically exported to text files: kmeans_grp_00.txt, kmeans_grp_01.txt, etc. Each file contains the list of IDs of the images assigned to that class.
