Deploy Tall Arrays to a CLOUDERA Spark Enabled Hadoop Cluster

This example shows how to deploy a MATLAB® application containing tall arrays to a CLOUDERA® Spark™ enabled Hadoop® cluster.

Deploying MATLAB applications against a CLOUDERA distribution of Spark requires a special wrapper type that you generate using the mcc command. This wrapper type generates a jar file and a shell script that calls spark-submit. The spark-submit script, located in the Spark bin directory, starts applications on a cluster and supports both yarn-client mode and yarn-cluster mode.

The inputs to the application are:

  • master — URL to the Spark cluster

  • inputFile — the file containing the input data

  • outputFile — the file containing the results of the computation

Note

The complete code for this example is in the file meanArrivalDemo.m.

Prerequisites

  1. Install the MATLAB Runtime in the default location on the desktop. This example uses /usr/local/MATLAB/MATLAB_Runtime/v91 as the default location for the MATLAB Runtime. If you do not have the MATLAB Runtime, see Install and Configure MATLAB Runtime for installation instructions.

  2. Install the MATLAB Runtime on every worker node.

  3. Copy the file airlinesmall.csv from the toolbox/matlab/demos folder of your MATLAB installation into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
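
The copy in step 3 can be done with the standard hdfs dfs commands. The following is a minimal sketch rather than part of the original example: the function name stage_airline_data, its MATLAB root argument, and the HDFS_CMD override (useful for a dry run) are assumptions introduced for illustration.

```shell
# Sketch: copy airlinesmall.csv from a MATLAB installation into HDFS.
# stage_airline_data and HDFS_CMD are hypothetical names, not MathWorks tooling.
# Set HDFS_CMD="echo hdfs" to print the commands instead of running them.
stage_airline_data() {
    matlab_root=$1                      # root of your MATLAB installation
    target=${2:-/datasets/airlinemod}   # HDFS folder used by this example
    src="$matlab_root/toolbox/matlab/demos/airlinesmall.csv"

    # Create the target folder (and any parents) if it does not exist yet.
    ${HDFS_CMD:-hdfs} dfs -mkdir -p "$target" || return 1
    # Upload the CSV file into the target folder.
    ${HDFS_CMD:-hdfs} dfs -put "$src" "$target/"
}
```

Call the function with the root of your MATLAB installation as its argument. Running it first with HDFS_CMD="echo hdfs" shows the two commands without touching the cluster.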

Deploy Tall Arrays

  1. At the MATLAB command prompt, use the mcc command to generate a jar file and shell script for the MATLAB application meanArrivalDemo.m.

    >> mcc -vCW 'Spark:meanArrivalDemoApp' meanArrivalDemo.m

    This action creates a jar file named meanArrivalDemoApp.jar and a shell script named run_meanArrivalDemoApp.sh.

    Note

    To use the shell script, set the environment variables HADOOP_PREFIX, HADOOP_CONF_DIR, and SPARK_HOME.

  2. Execute the shell script in either yarn-client mode or yarn-cluster mode. In yarn-client mode, the driver runs on the desktop. In yarn-cluster mode, the driver runs in the Application Master process in the cluster.

    The general syntax to execute the shell script is:

    ./run_meanArrivalDemoApp.sh <runtime install root> [Spark arguments] [Application arguments] 

    yarn-client mode

    Run the following command from a Linux® terminal:

    $ ./run_meanArrivalDemoApp.sh \
        /usr/local/MATLAB/MATLAB_Runtime/v91 \
        yarn-client \
        hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
        hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
    

    To examine the result, enter the following from the MATLAB command prompt:

    >> ds = datastore('hdfs:///user/someuser/meanArrivalResult/*');
    >> readall(ds)
    

    yarn-cluster mode

    Run the following command from a Linux terminal:

    $ ./run_meanArrivalDemoApp.sh \
       /usr/local/MATLAB/MATLAB_Runtime/v91 \
       --deploy-mode cluster --master yarn yarn-cluster \
       hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
       hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
    

    In yarn-cluster mode, because the driver runs on a worker node in the cluster, standard output from the MATLAB function is not displayed on your desktop. In addition, files that the application saves without an explicit location can end up anywhere. To avoid this, the example uses the write function to save the results explicitly to a specified location in HDFS.
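
The environment note in step 1 and the two invocations in step 2 can be combined into a small pre-flight helper. This is a sketch under stated assumptions, not part of the generated tooling: check_spark_env and build_launch_cmd are hypothetical names, and the helper only prints the command line so that you can inspect it before running it. The argument order mirrors the commands shown in this example.

```shell
# Sketch: pre-flight check and command builder for run_meanArrivalDemoApp.sh.
# check_spark_env and build_launch_cmd are hypothetical helper names.

# Report any of the three required environment variables that are unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_spark_env() {
    missing=0
    for name in HADOOP_PREFIX HADOOP_CONF_DIR SPARK_HOME; do
        eval "value=\${$name:-}"        # read the variable whose name is in $name
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Print the launch command for the requested mode: client or cluster.
# Arguments: mode, MATLAB Runtime root, input file, output location.
build_launch_cmd() {
    mode=$1; runtime=$2; input=$3; output=$4
    case "$mode" in
        client)
            printf '%s\n' "./run_meanArrivalDemoApp.sh $runtime yarn-client $input $output" ;;
        cluster)
            printf '%s\n' "./run_meanArrivalDemoApp.sh $runtime --deploy-mode cluster --master yarn yarn-cluster $input $output" ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}
```

For example, check_spark_env && build_launch_cmd client /usr/local/MATLAB/MATLAB_Runtime/v91 followed by the input and output HDFS URLs prints the yarn-client command from step 2. Printing instead of executing keeps the helper safe to run on a machine where the cluster is not yet reachable.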