How to run bootstrap real-space variance estimation method using SPARX?
Bootstrap real-space variance estimation method (from now on referred to as "bootstrap") is a method to evaluate the flexibility of a 3D object from its cryo-EM projection images proposed by Penczek et al. (Penczek 2006). The input of this method is a set of 2D projections while the output is a 3D variance map. The theory of bootstrap method is rather complicated, please refer to the paper mentioned above for details. The algorithm itself however is fairly simple: assuming N 2D projections are given, for each step of bootstrap, N projections are randomly selected with replacements (one projection can appear multiple times while other might be omitted) and 3D reconstruction using these selected projections yields a bootstrap volume. The 3D variance map is calculated as the variance of a set of bootstrap volumes. We will discuss later how many bootstrap volumes are needed to obtain a reliable variance map.
The whole procedure to use SPARX implementation of bootstrap method can be divided into 5 steps:
1. normalize the projections (using sxnormal_prj.py, MPI version availiable).
2. calculate weights to account for uneven distribution of projections (sxbootstrap_calcwgts.py, MPI version availiable).
3. generate buffer file containing 2D images (using sxbootstrap_genbuf.py).
4. generate bootstrap volumes (using sxbootstrap_bigdisk.py, MPI version availiable).
5. run analysis to get variance map (using sxvar.py)
Step 1. normalize the projections
I AM NOT CERTAIN ABOUT THIS STEP PAP 08/03/09/
To normalize the projections, use the command sxnormal_prj.py:
sxnormal_prj.py stack outdir <refvol> --r=radius --niter=number_of_iteration --snr=Signal-to-Noise Ratio --sym=symmetry
The results are stored in the directory outdir. Normally, one would use this command on a cluster. In that case, the results are saved in the directory outdir with filenames prj0000.hdf, prj0001.hdf, prj0002.hdf, etc. To concatenate all files, use command pack.py in directory sparxroot/sparx/test/bootstrap:
pack.py file_prefix nproc output.hdf
Here, file_prefix is the prefix of the files you wish to concatenate (should be prj here), nproc is the number of processors used for normalization, output.hdf is the file to save the result of concatenation.
Step 2. calculate weights accounting for uneven distribution of projections
Calculate weights to account for uneven distribution of projections:
sxbootstrap_calcwgts.py bdb:data wgts.txt --voronoi
Output is the text file wgtx.txt.
Step 3. generate buffer file
The reconstruction algorithm used with bootstrap method of SPARX is the nearest neighboring scheme with two times padding, which is a type of direct Fourier inversion reconstruction algorithm. Its basic idea is to do 2D Fourier transform for each projection and insert them into the 3D Fourier volume. The advantage of this algorithm is that one can pre-calculate the 2D Fourier transforms of projection data and store the results as a buffer file. This buffer file is then be used for bootstrap calculations.
The program to generate buffer file is sxbootstrap_genbuf.py , the usage is very simple:
sxbootstrap_genbuf.py prj_stack.hdf file_prefix
two files will be generated by this program: file_prefix.bin which is a binary file containing the result of 2D Fourier transform, the other is file_prefix.txt which records the header information.
An very important issue about sxbootstrap_genbuf.py is the size of the scratch file will be about 4 times bigger than the original projection file because of the two times padding method used to gain finer grid and to reduce interpolation error, which may sometimes use up the free disk spaces and cause problems.
Step 4. generate bootstrap volumes
The program which generates bootstrap volumes is sxbootstrap_bigdisk.py. The usage is:
sxbootstrap_bigdisk.py prj_stack.hdf buf_prefix vol_prefix nvol --snr=signal_noise_ratio --sym=symmetry --verbose=(0|1)
user needs to specify number of bootstrap volumes to be generated (see step 5), signal-to-noise ratio and the symmetry. The volume will be generated as vol_prefix00xx.hdf in which xx is the id of the cpu (please keep in mind bootstrap_run is an MPI program and each cpu has an ID and generate bootstrap volumes independently).
Step 5. determine the number of the bootstrap volumes and calculate the variance
After the first step, user should now have a set of files containing bootstrap volumes. The next step is to calculate the variance map. The program which provides this feature is sxincvar.py. The program generates 3 sets of variances: on odd volumes, on even volumes and on all volumes. The variance is generated incrementally. For every 100 bootstrap volumes (50 for odd and 50 for even), the odd and even variance will be generated and the cross-correlation coefficient between them will be printed as a criterion of the stability of the variance. Usually ccc > 0.80 could indicate a reliable variance map.
The usage of sxvar.py is:
sxvar.py vol_prefix outdir --fl=filt_l --aa=filt_h --radccc=radius_of_ccc --writelp --writestack
Here, vol_prefix are the bootstrap volume files generated in step 4 (under MPI, cannot be smaller than the number of processors used). outputfile should be an .hdf file. filt_l and filt_h are the parameters for low-pass filtration (using filt_tanl, see description). They are related to the resolution of the projections. In the FSC curve of the projections, find out the frequency corresponding to FSC equals 0.5. The suggested value of low pass filtration is around half of the frequency. radccc is the radius for cross-corelation coefficient evaluation which should be a tighter mask of the object, i.e. inside this radius should be the core of the object.
The variance map is stored in a directory outdir.
One can also do a Principal Component Analysis (PCA) using the results from calculating the variance.
THIS HAS TO BE CORRECTED.
sxvar.py filt_vol_prefix n_begin n_end average_output.hdf pca_output.hdf mask n_eigen
Here, filt_vol_prefix is the filtered volume results from sxincvar.py, currently hard-wired to btwl_cir_vol_prefix, where vol_prefix should be same as the one used in sxincvar.py. n_begin and n_end determine how many volume files should be used, average_output is the average output file generated from sxincvar.py, pca_output.hdf stores a stack of principal componenets, n_eigen is the number of eigenvectors you want to generate.
After generating the pca_output.hdf, it is recommended that each volume be multipled by the square root of its eigenvalues, which is stored in the header of each volume.
Typical protocol
The user is well advised to follow the protocol included in the demo run-through example.
Sample PBS job file
Considering most bootstrap calculation is carried on PC clusters, here we provide a sample script for bootstrap calculation using PBS system. It can be easily changed to other systems such as SUN grid engine.
#PBS -r n -l cput=100:00:00 -l walltime=100:00:00
#PBS -q workq
#PBS -l nodes=32:ppn=2
cd /your_directory
cp $PBS_NODEFILE nodelist
uniq nodelist > singlenodelist
mpirun -machinefile singlenodelist -np 32 SPARXROOT/sparx/bin/sxbootstrap_genbuf.py prj.hdf $PBSTMPDIR/tmpslice
mpirun -machinefile $PBS_NODEFILE -np 64 SPARXROOT/sparx/bin/sxbootstrap_run.py prj.hdf $PBSTMPDIR/tmpslice $PBSTMPDIR/bs_efg_vol 10240 --snr=1.0
foreach i (`uniq nodelist`)
ssh $i cp "$PBSTMPDIR/bs_efg_vol*.hdf" /garibaldi/people-a/jbrown/sparx/test/EFG-ALL-2/run1/
end
touch done
