=============================================================
==== README for JSGD
=============================================================

What is this?
=============

This is a Stochastic Gradient Descent algorithm used to train linear
multiclass classifiers. It is biased towards large classification problems
(many classes, many examples, high-dimensional data).

The reference papers are:

@article{akata:hal-00835810,
  AUTHOR = {Akata, Zeynep and Perronnin, Florent and Harchaoui, Zaid and Schmid, Cordelia},
  TITLE = {{Good Practice in Large-Scale Learning for Image Classification}},
  JOURNAL = {{IEEE Transactions on Pattern Analysis and Machine Intelligence}},
  PUBLISHER = {David A. Forsyth},
  YEAR = {2013},
  MONTH = Jun,
  URL = {http://hal.inria.fr/hal-00835810}
}

@inproceedings{perronnin:hal-00690014,
  AUTHOR = {Perronnin, Florent and Akata, Zeynep and Harchaoui, Zaid and Schmid, Cordelia},
  TITLE = {{Towards Good Practice in Large-Scale Learning for Image Classification}},
  BOOKTITLE = {{Computer Vision and Pattern Recognition}},
  YEAR = {2012},
  MONTH = Jun,
  ADDRESS = {Providence, Rhode Island, United States},
  URL = {http://hal.inria.fr/hal-00690014}
}

The code uses acceleration tricks similar to the ones in L. Bottou's SGD
package (http://leon.bottou.org/projects/sgd), adapted to large-scale image
classification.

Files
=====

The directories in the package are:

  ref/      contains a basic Matlab implementation of the SGD algorithm
            based on the papers.

  c/        contains the C implementation.

  matlab/   contains the MEX wrappers for Octave/Matlab. There is a
            precompiled version for 64-bit Linux.

  python/   contains the Python wrappers, and the optimization code that
            uses cross-validation to find the best parameters of the method.

Compiling / requirements
========================

Requirements:

- 64-bit Linux. The code should be reasonably portable, but no other
  platform was tested.

- an implementation of the Basic Linear Algebra Subroutines (BLAS),
  interfaced with the Fortran conventions.

- SSE vector operations. They are declared using the gcc intrinsics.

The library is compiled using a makefile. First try the following commands.
If they do not work, edit the makefile to adjust the configuration.

To compile the Matlab interface, do

  make mex

For the Octave (Matlab work-alike) interface:

  make octave

For the Python interface:

  make py

On Ubuntu,

  sudo apt-get install liboctave-dev

should install both Octave and ATLAS. Be sure to change the line in the
Makefile concerning Octave to the required version.

Additional software
-------------------

Although the library is self-contained, the examples rely on:

- Yael (https://gforge.inria.fr/projects/yael/), and its Matlab interface.

- H. Jegou's Matlab implementation of PQcodes
  (http://www.irisa.fr/texmex/people/jegou/ann.php) for the PQ encoding.

- some data files. See the JSGD homepage (http://lear.inrialpes.fr/src/jsgd/).

All examples are provided in a simplistic format: a file containing the
descriptor matrix plus one with the label vector, for train and test (4
files in total). The files are in the .fvecs and .ivecs formats. The
examples assume that they are installed in the jsgd directory, next to c/,
matlab/, etc.
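The .fvecs/.ivecs layout is not described in this README. The sketch below
shows a common way to read these files with numpy, under the assumption that
each vector is stored as an int32 dimension followed by its components
(float32 for .fvecs, int32 for .ivecs); Yael also includes readers for these
formats, so this is only for reference.

  # Minimal .fvecs / .ivecs readers (a sketch, assuming the usual layout:
  # each vector is an int32 dimension followed by that many components).
  import numpy as np

  def fvecs_read(filename):
      raw = np.fromfile(filename, dtype=np.float32)
      d = raw[:1].view(np.int32)[0]            # dimension stored in the first 4 bytes
      return raw.reshape(-1, d + 1)[:, 1:]     # one vector per row, dim field dropped

  def ivecs_read(filename):
      raw = np.fromfile(filename, dtype=np.int32)
      d = raw[0]
      return raw.reshape(-1, d + 1)[:, 1:]

Each reader returns one vector per row, which matches the transposed
convention of the Python interface described below.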
On a normal Linux installation, the following commands download and install
all the required files (run them from the jsgd-xx directory):

  # get Yael (needs the svn version)
  svn checkout --username anonsvn --password anonsvn --no-auth-cache https://scm.gforge.inria.fr/svn/yael/trunk yael

  # compile Yael
  cd yael
  ./configure.sh --enable-numpy
  make

  # compile the Matlab interface of Yael
  cd matlab
  make
  cd ../..

  # get pqcodes
  wget http://www.irisa.fr/texmex/people/jegou/src/pqsearch_2188.tar.gz
  tar xvzf pqsearch_2188.tar.gz

Interfaces
==========

There are two high-level interfaces to jsgd_train: a Matlab one and a Python
one. Each provides a jsgd_train function that takes Matlab/Python matrices as
input. The algorithm parameters are passed as optional arguments to the
function.

Matlab interface
----------------

The Matlab interface uses single matrices for floating-point data and int32
matrices for indices, labels, etc.

Python interface
----------------

The Python interface uses numpy matrices for I/O. The matrices should be
"c_compact", which means that all matrix dimensions in the documentation
should be transposed. The matrices should be of type numpy.float32 or
numpy.int32 for floating-point and integer matrices respectively.

You should also set the PYTHONPATH to point to jsgd-xx/yael, e.g.

  export PYTHONPATH=../yael

Examples
========

Examples are provided in Matlab and Python for readability. After installing
the package (see "Compiling" above) and its optional dependencies (see
"Additional software" above), you can get started with some examples.

Download a dataset available on lear.inrialpes.fr/src/jsgd/, e.g.

  wget http://pascal.inrialpes.fr/data2/douze/jsgd_data/groupFungus_k64_nclass10_nex10.tgz

Uncompress it to your local folder, e.g.

  tar xzf groupFungus_k64_nclass10_nex10.tgz

To train a One-versus-Rest classifier, use

  cd matlab
  octave --eval "test_jsgd_train"

or

  matlab -nojvm -nodesktop -r "test_jsgd_train; exit"

Other useful options and datasets are shown, commented out, in the
test_jsgd_train.m file.

Product quantizer encoding
--------------------------

The script matlab/test_pq_encoded.m shows how to learn a product quantizer,
how to encode the training set and how to train a classifier on the encoded
training data.

  # follow the "Additional software" steps first
  wget http://pascal.inrialpes.fr/data2/douze/jsgd_data/groupFungus_k64_nclass134_nex50.tgz
  tar xzf groupFungus_k64_nclass134_nex50.tgz
  cd matlab
  matlab -nojvm -nodesktop -r "test_pq_encoded; exit;"

Cross-validation of parameters
------------------------------

The classifier output by SGD is only as good as the parameters it is
computed with. The default parameters will produce a suboptimal classifier.
The main parameters are:

  lambda        penalty term for the norm of W
  bias_term     this term is appended to each training vector
  eta0          initial value of the gradient step
  beta          for OVR, how many negatives to sample per training vector
  temperature   for STF, the stiffness of the loss

The script python/crossval.py optimizes the parameters by stepping from one
parameter set to another, monitoring the cross-validated accuracy in the
meantime. Unfortunately, this relatively exhaustive exploration of the
parameter space is slow (see below for a few tips on determining optimal
parameters). A minimal illustration of this kind of search follows.
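The sketch below illustrates the idea of stepping two parameters by powers of
10 and monitoring accuracy on a held-out split. It is an illustration only,
not crossval.py's actual interface; the import name jsgd, the exact call form
of jsgd_train and the (nclass, d + 1) layout of the returned w are
assumptions, and xt, yt, xv, yv stand for training and validation matrices
following the Python conventions above.

  # A minimal sketch of a parameter search by powers of 10 (not crossval.py).
  import numpy as np
  import jsgd                                   # assumed import name

  best_acc, best_params = -1.0, None
  for lam in 10.0 ** np.arange(-6, -2):         # lambda stepped by powers of 10
      for eta0 in 10.0 ** np.arange(-3, 1):     # eta0 stepped by powers of 10
          w, stats = jsgd.jsgd_train(xt, yt, algo='ovr', _lambda=lam,
                                     eta0=eta0, n_epoch=10)    # assumed call form
          # w is assumed to be (nclass, d + 1) in the transposed Python view
          scores = xv.dot(w[:, :-1].T) + w[:, -1]
          acc = float(np.mean(scores.argmax(axis=1) == yv))
          if acc > best_acc:
              best_acc, best_params = acc, (lam, eta0)
  print("best accuracy %.3f with lambda=%g, eta0=%g" % ((best_acc,) + best_params))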
Comparison with supplementary material of the CVPR paper
---------------------------------------------------------

The script python/example_BOW.py reproduces the results from the
supplementary material of the CVPR paper (table 2, Ungulate with 50 examples
per class). It shows how to optimize a subset of parameters and how to use
sparse descriptors (it is based on a BOW in 4096D).

  # follow the "Additional software" steps first
  wget http://pascal.inrialpes.fr/data2/douze/jsgd_data/groupUngulate_BOW4096.tgz
  tar xzf groupUngulate_BOW4096.tgz
  cd python
  python example_BOW.py

Library layout
==============

The main entry point of the library, in C, Matlab and Python alike, is

  jsgd_train

There are a few mandatory parameters to jsgd_train, and many optional ones.
Dense matrices are stored contiguously in memory with a column-major
(Fortran/Matlab) convention, hence x(d, n) means that matrix x has d rows
and n columns.

Mandatory parameters
--------------------

The examples should already be shuffled randomly on input.

  x(d, n)   floating-point matrix containing the n example vectors of
            dimension d. It can be encoded with a product quantizer if
            necessary (see below).

  y(n)      integer array that contains the label associated with each
            column of x. Labels are 0-based in C and Python, 1-based in
            Matlab.

Return values
-------------

The function returns

  w(d + 1, nclass)   a floating-point array such that max(w(:, i)' * [x ; 1])
                     gives the class of vector x (see the sketch below).

  stats              a structure (Matlab) or class (Python) that contains
                     statistics on the training run (timings, evaluations on
                     the validation set, etc.)

Optional parameters
-------------------

In Matlab, optional parameters are passed as 'option_name', option_value
pairs. In Python, use option_name = option_value. In C, just set the fields
of the jsgd_params_t structure.

The most important ones are:

  lambda                 penalty term for the norm of W (_lambda in Python)
  bias_term              this term is appended to each training vector
  eta0                   initial value of the gradient step
  beta                   for OVR, how many negatives to sample per training
                         vector
  temperature            for STF, the stiffness of the loss (beta on the
                         website)
  fixed_eta              (boolean) use a fixed step
  verbose                verbosity level (integer, default 0)
  n_thread               if non-zero, try to use threads (does not set the
                         number of threads)
  random_seed            running with the same random seed should give the
                         same results
  algo                   string, 'ovr', 'mul', 'rnk', 'war', etc., selecting
                         the algorithm (in C, use the JSGD_ALGO_* constants)
  n_epoch                number of passes through the data
  valid                  matrix of validation examples
  valid_labels           corresponding labels
  eval_freq              evaluate on the validation set every that many
                         epochs
  stop_valid_threshold   stop if the accuracy is lower than the accuracy at
                         the previous evaluation plus this much. Set it to
                         something like -0.05 to stop iterating when all
                         hope of reaching a better operating point is lost.
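The sketch below ties the pieces together: it trains with a validation set
and early stopping, then classifies a vector with the returned w. The exact
call form of the Python wrapper, the import name jsgd and the (nclass, d + 1)
layout of w are assumptions (the scripts in python/ are the authoritative
usage); the option names are the ones listed above and the parameter values
are for illustration only.

  # End-to-end sketch: train with early stopping, then classify (assumptions
  # as stated above). xt: (n, d) float32 descriptors, yt: (n,) int32 0-based
  # labels, xv, yv: a held-out validation split with the same conventions.
  import numpy as np
  import jsgd                                  # assumed import name

  w, stats = jsgd.jsgd_train(
      xt, yt,
      algo='ovr',
      _lambda=1e-4, eta0=0.1, bias_term=0.1,   # illustrative values
      n_epoch=100,
      valid=xv, valid_labels=yv,
      eval_freq=10,                            # evaluate every 10 epochs
      stop_valid_threshold=-0.05)              # give up when clearly past the best point

  # In the transposed Python view, w is assumed to be (nclass, d + 1), so the
  # predicted class of a vector x is the argmax over i of w[i, :].dot([x, 1]).
  x_test = xv[0]
  scores = w.dot(np.append(x_test, np.float32(1)))
  print("predicted class:", int(scores.argmax()))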
Documentation
=============

Matlab code for didactic purposes is provided in the ref/ folder, for a
better understanding of the MEX files. It is a straightforward translation
of the methods described in both articles. Do not use this code to run your
large-scale experiments; use the code in the matlab/ folder instead.

Comprehensive Matlab implementation
-----------------------------------

  sgd_simple.m            SGD implementation

  generate_toy_data.m     randomly generates a training and a test set.

  sgd_simple_toy_data.m   runs both, computes the accuracy and shows the
                          results graphically.

Determining optimal parameters of SGD
=====================================

Here are a few practical remarks on setting the parameters.

The optimization routine implemented in crossval.py normally outputs optimal
parameters. Unfortunately, it is very slow. Below are a few notes on how to
choose reasonable parameters in finite time.

It is possible to fix some parameters and optimize over the others, which
speeds up the estimation. See example_BOW.py for an example. An intermediate
result of optimize() can be used (see the "keep" messages of the
cross-validation).

For OVR, beta is an important parameter. Setting it to the square root of
the number of classes seems a good starting point (if beta = 0 on input, the
algorithm sets this automatically).

L. Bottou's optimization of eta0 (choosing the eta0 that minimizes the loss
on a subset) is implemented in jsgd_train. It is used if the passed-in eta0
is 0.

It is not useful to fine-tune the parameters very precisely. The
optimization criterion (top-1 accuracy on cross-validated datasets) is quite
noisy, so choosing small steps on the parameters leads to local minima.
Therefore, crossval.py optimizes lambda, eta0, temperature and bias_term by
powers of 10.

If a validation set is available, do early stopping, and evaluate every 10
or so epochs. For this, set stop_valid_threshold = -0.05, which will abandon
the run when the current accuracy is 5 points below the best one seen so
far.

The CVPR paper is also a good source on the influence of the various
parameters.

Low-level library
=================

The C library has one main function:

  int jsgd_train(int nclass,
                 const x_matrix_t *x,
                 const int *y,
                 float *w,
                 float *bias,
                 jsgd_params_t *params);

where:

  nclass          number of classes
  x               matrix of training vectors
  y(n)            labels of each vector. Should be in 0:nclass-1
  w(d, nclass)    output W matrix
  bias(nclass)    output bias coefficients
  params          all other parameters

A matrix a(m, n) of n float vectors of dimension m is represented as a

  float *a

The (i, j)'th element of the matrix, where 0 <= i < m and 0 <= j < n, can be
accessed with

  a[ j * m + i ]

x_matrix_t
----------

The matrix of training vectors is an x_matrix_t structure, because it is not
necessarily a dense matrix. Several encodings are possible for an x_matrix_t:

** encoding = JSGD_X_FULL

This is a plain dense matrix. The array data(d, n) contains the elements.

** encoding = JSGD_X_PQ

This is a product-quantizer encoded matrix. There are nsq subquantizers. The
quantization centroids are given by the 3D matrix

  centroids(d / nsq, 256, nsq)

The corresponding codes are in the array

  codes(nsq, n)

** encoding = JSGD_X_SPARSE

The matrix is a sparse column matrix (a la Matlab), in Compressed Sparse
Column (CSC) format:

  indptr(n + 1)      indptr[j] points to the first non-zero cell in column j

  indices(nnz)       indices[indptr[j]] .. indices[indptr[j + 1] - 1] are the
                     non-zero rows of column j (nnz = indptr[n])

  sparse_data(nnz)   sparse_data[indptr[j]] .. sparse_data[indptr[j + 1] - 1]
                     are the values of the cells in column j.

This is similar to the Matlab encoding and to that of CSR in scipy.sparse
(the Python interface sees the transposed matrix). See
http://www.mathworks.fr/help/techdoc/apiref/mxsetir. A small illustration
with scipy follows at the end of this section.

jsgd_params_t
-------------

The jsgd_params_t structure contains all the parameters of the training
algorithm, as well as output statistics produced when running it. See above
and c/jsgd.h for an explanation of its fields.
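The CSC layout above uses the same triplet of arrays as scipy.sparse. The
snippet below just prints them for a tiny matrix to make the roles of
indptr, indices and data concrete; it demonstrates the storage format only,
not the C API, which receives the three raw arrays directly.

  # Illustration of the CSC triplet (indptr / indices / data) described above.
  import numpy as np
  import scipy.sparse as sp

  dense = np.array([[1., 0., 0., 2.],
                    [0., 0., 3., 0.],
                    [4., 0., 0., 5.]], dtype=np.float32)
  m = sp.csc_matrix(dense)     # compressed sparse column

  print(m.indptr)    # [0 2 2 3 5]      : column j spans indptr[j] .. indptr[j+1]-1
  print(m.indices)   # [0 2 1 0 2]      : row index of each stored value
  print(m.data)      # [1. 4. 3. 2. 5.] : the values, column by column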
Depending on the algorithm, each iteration does Nb of dot products Nb of W updates OVR 1 + beta 1 + beta MUL C 0 or 2 (1.08) RNK 2 0 or 2 (0.43) WAR 2 to C (6) 0 or 2 (0.72) where C is the number of classes, the numbers in parentheses are measured on a run with C = 10 classes. The library (especially the OVR version) is optimized in several ways: - by using multipliers. In this case, w is represented as (w_factor * w). Updates like w := scalar * w are never actually performed, because only w_factor is updated. This optimization is borrowed from Leon Bottou's reference implementation. - by using SSE instructions. Operations are performed on 4 vectors components at a time for dense and PQ-encoded matrices. If addresses are not aligned on a 16-byte boundary or the vector size is not a multiple of 4, the library falls back on the slow implementation. For the caller, this means that if the vector dimensions are not a multiple of 4, it is a good idea to pad them with 0s to the next multiple of 4. - for dense matrix-matrix and matrix-vector multiplications, BLAS is used. Unfortunately, most operations cannot be expressed in these terms (else the Matlab implementation would be fast enough). - the performance bottleneck that remains is cache accesses. Assuming that there is 1MB of L3 cache per processor core, with 4096D single precision descriptors, only 64 of them fit in cache. If a_set_of_classes is all classes (like with MUL) and if C > 64, then there will be cache misses at each iteration, which cost more than the operations performed on the vectors. Therefore, for OVR, the code is laid out in blocks like for t0 = 1 to n by t_step for c = 1 to C for t = t_block to t_block + t_step - 1 # compute all scores depending on t, c # perform updates on W(:, c) end for end for end for where t_step is chosen so that all x(:, t) accessed in the inner loop fit in cache. A split in blocks around the d dimension is also implemented. It works by precomputing self_dp(t1, t2) = x(:, t1)' * x(:, t2) for all t0 <= t1, t2 < t0 + t_step. Then, at each t0 block start, we compute score2(t, c) := W_t0(:, c)' * x(:, t) for all c and t0 <= t < t0 + t_step where W_t0 means "the state of W at t = t0". This operation is easily split around the dimension d. The score2 table is not the same as the score table because it is computed with W's state at the start of the t block. However W_t can be represented as W_t(:, c) = W_t0(:, c) + alpha_t0 * x(:, t0) + ... + alpha_{t-1} * x(:, t - 1) Where the alpha_t's are computed at each W update. Therefore the current score expands to score(t, c) = W_t(:, c)' * x(:, t) = score2(t, c) + alpha_t0 * self_dp(t0, t) + ... + alpha_{t-1} * self_dp(t, t-1) which makes it possible to compute alpha_t. Computations of scores and alpha's do not involve any operation in d. The actual update of W is performed at the end of the t block using the operation above. This method is implemented (use_self_dotprods = 1 and set d_step). Unfortunately, it is quite complicated, and even worse, it does not seem to improve the speed, so it is disabled by default. - when possible, multithreading is used. This is not easy when there are dependencies between the factors and the scores across classes. This is not the case for OVR: for OVR multithreading is peformed along the classes (ie. each thread takes care of a subset of classes). 
Legal, contact
==============

This code is copyright (C) INRIA 2012.

Homepage: http://lear.inrialpes.fr/src/jsgd/

License: CeCILL (see http://www.cecill.info/licences/Licence_CeCILL_V1.1-US.txt),
similar to the GPL.

Contact:
  Mattis Paulin, mattis.paulin@inria.fr
  Matthijs Douze, matthijs.douze@inria.fr

Last update 2013-06-17