Matrix Factorization
The Matrix Factorization app decomposes an N by M matrix D into two lower-rank matrices L and R such that L * R approximately equals D, where L has size N by K and R has size K by M (for a user-specified rank K). The app uses Stochastic Gradient Descent (SGD), which optimizes an objective function by applying gradient descent to random subsets of the data.
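To make the idea concrete, here is a minimal single-machine sketch of SGD matrix factorization in plain Python. It is not the app's implementation: the squared-error loss with L2 regularization, the per-entry update order, and the names `sgd_factorize` and `squared_error` are assumptions made for illustration. The parameter defaults mirror the app's documented option defaults (K, lambda, learningRateEta0, learningRateDecay, numEpochs).

```python
import random


def sgd_factorize(entries, n, m, k=20, lam=0.1, eta0=0.001, decay=1.0,
                  num_epochs=10, seed=0):
    """Factor the observed entries (i, j, value) of an n-by-m matrix.

    Returns L (n rows of length k) and R (k rows of length m).
    Assumed objective: squared error plus L2 regularization.
    """
    rng = random.Random(seed)
    # Small random initialization so the factors can move away from zero.
    L = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(k)]
    entries = list(entries)
    eta = eta0
    for _ in range(num_epochs):
        rng.shuffle(entries)  # visit the data in random order
        for i, j, d in entries:
            pred = sum(L[i][r] * R[r][j] for r in range(k))
            err = d - pred
            for r in range(k):
                l_ir, r_rj = L[i][r], R[r][j]
                # Gradient step on row i of L and column j of R.
                L[i][r] += eta * (err * r_rj - lam * l_ir)
                R[r][j] += eta * (err * l_ir - lam * r_rj)
        eta *= decay  # assumed: decay applied once per pass over the data
    return L, R


def squared_error(entries, L, R):
    """Sum of squared differences between observed entries and L * R."""
    k = len(R)
    return sum((d - sum(L[i][r] * R[r][j] for r in range(k))) ** 2
               for i, j, d in entries)
```

For a small dense matrix a larger learning rate than the default is usually needed, for example `sgd_factorize(entries, n=3, m=3, k=2, lam=0.0, eta0=0.02, num_epochs=500)`.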
Running Matrix Factorization on Local Clusters
- Read and follow this section of the Wiki.
- Build the app:
    gradle buildMatrixFact
- Prepare a host file. Each line in the file has the format <ip address>:<port>.
- Prepare a data file. A data file generator is provided at app/matrixfact/scripts/generate_data.py. Please follow this guide to prepare a data file.
- Run the app:
    python scripts/jbosen_run.py <path to host file> app/matrixfact/build/libs/MatrixFact.jar org.petuum.app.matrixfact.MatrixFact --app_args "-dataFile <path to data file>"
  If you haven't prepared a host file, you may use machinefile/sample_machinefile to simulate two hosts on your local machine.
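As an illustration of the host file format, the following sketch writes one `<ip address>:<port>` entry per line. The helper name `write_host_file` and the port numbers in the usage note are hypothetical; only the line format comes from the steps above.

```python
def write_host_file(path, hosts):
    """Write a host file with one '<ip address>:<port>' entry per line.

    hosts is an iterable of (ip, port) pairs. Returns the lines written.
    """
    lines = []
    for ip, port in hosts:
        port = int(port)
        if not 0 < port < 65536:
            raise ValueError(f"invalid port for {ip}: {port}")
        lines.append(f"{ip}:{port}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

For example, `write_host_file("my_hosts", [("127.0.0.1", 29000), ("127.0.0.1", 29001)])` would describe two hosts on the local machine, assuming those ports are free.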
Besides the required command line argument -dataFile for the app, the Matrix Factorization app also offers other optional command line arguments for further configuration.
-staleness: Staleness of parameter tables. Default value: 0
-numEpochs: Number of passes over data. Default value: 10
-K: Rank of factor matrices. Default value: 20
-lambda: Regularization parameter lambda. Default value: 0.1
-learningRateDecay: Learning rate is multiplied by this factor each iteration. Default value: 1
-learningRateEta0: Learning rate parameter. If you are getting garbage output or you see the objective value increase rapidly, try lowering this parameter by 10x or more. Default value: 0.001
-numMiniBatchesPerEpoch: Equal to the number of clock() calls per data sweep. Default value: 1
-outputPrefix: Output to outputPrefix.L, outputPrefix.W. Default value: Does not output.
-numLocalWorkerThreads: Number of worker threads to use per host/machine. Default value: 1
To use these, simply add them to the --app_args argument. For example:
    python scripts/jbosen_run.py ... --app_args "-dataFile <path to data file> -numEpochs 10 -staleness 0 ... -outputPrefix output"
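When launching runs from a script, the argument string can be assembled programmatically. The sketch below is one possible way to do so; `build_app_args` and `build_command` are hypothetical helpers, and they only use the script path, jar path, main class, and option names shown above.

```python
def build_app_args(data_file, **options):
    """Build the string passed to --app_args.

    Each keyword becomes '-<name> <value>', e.g. numEpochs=10 -> '-numEpochs 10'.
    """
    parts = ["-dataFile", str(data_file)]
    for name, value in options.items():
        parts += [f"-{name}", str(value)]
    return " ".join(parts)


def build_command(host_file, data_file, **options):
    """Return the jbosen_run.py invocation as an argument list."""
    return [
        "python", "scripts/jbosen_run.py", str(host_file),
        "app/matrixfact/build/libs/MatrixFact.jar",
        "org.petuum.app.matrixfact.MatrixFact",
        "--app_args", build_app_args(data_file, **options),
    ]
```

The returned list can be handed to `subprocess.run` from the repository root; passing a list rather than a shell string keeps the whole `--app_args` value as a single argument without manual quoting.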
Running Matrix Factorization on YARN
- Read and follow this section of the Wiki.
- Read and follow the section above to make sure the Matrix Factorization app runs correctly on the local machine.
- After the previous step, you should have used gradle to build the Matrix Factorization app jar file at app/matrixfact/build/libs/MatrixFact.jar and prepared a data file for the app.
- Move the data file onto HDFS by running:
    hadoop fs -copyFromLocal <path to data file on local> <a directory on HDFS>
  Make sure the directory on HDFS has the appropriate permissions for the user and YARN.
- Run:
    gradle buildYarn
  This should create two jar files, yarnClient.jar and yarnApplicationMaster.jar, under jbosen_yarn/build/libs.
- Run:
    python scripts/jbosen_yarn_run.py --client_jar_path jbosen_yarn/build/libs/yarnClient.jar --app_master_jar_path jbosen_yarn/build/libs/yarnApplicationMaster.jar --ps_app_jar_local_path app/matrixfact/build/libs/MatrixFact.jar --ps_app_args "-dataFile <Path to data file on HDFS> -outputPrefix <Prefix of output files on HDFS>" --num_nodes X
  where X is the number of nodes (machines) to use. Note: HDFS paths must be in this format: hdfs://<domain>/<path>
  Record the application id from the logs.
- Obtain the results by running:
    yarn logs -applicationId <application id>
Furthermore, as explained in the “Running Matrix Factorization on Local Clusters” section, the Matrix Factorization app accepts additional optional arguments. Users can pass these arguments in --ps_app_args. Note that all HDFS paths provided to the app must be in the format hdfs://<domain>/<path>.
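Since a malformed path is an easy mistake to make, a launcher script could check the hdfs://<domain>/<path> shape before submitting the job. The following is a minimal sketch using the standard library; `is_hdfs_path` is a hypothetical helper, and it checks only the shape of the string, not whether the path exists on HDFS.

```python
from urllib.parse import urlparse


def is_hdfs_path(path):
    """Return True if path looks like hdfs://<domain>/<path>."""
    parsed = urlparse(path)
    return (
        parsed.scheme == "hdfs"
        and parsed.netloc != ""           # the <domain> part must be present
        and parsed.path.startswith("/")
        and len(parsed.path) > 1          # something must follow the slash
    )
```

For instance, `is_hdfs_path("hdfs://namenode/user/me/data.txt")` is accepted, while a bare `/user/me/data.txt` or `hdfs:///user/me/data.txt` (no domain) is rejected. The host name `namenode` here is only a placeholder.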
In addition to these optional arguments, the JBösen system can be configured using extra command line arguments. When running on local machines, the jbosen_run.py script takes care of these JBösen system configurations. However, when using YARN, users need to pass these system arguments as part of --ps_app_args. A detailed explanation of these arguments is here.
Note: when using -numLocalWorkerThreads to set the number of worker threads on each client, or when using the -numLocalCommChannels option, be sure to set the --container_vcores option to an appropriate number.