Installing Spark with Cloudera Manager

Apache Spark is a fast and general-purpose cluster computing system with support for in-memory computation. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.

Prerequisites

Spark 0.9 is available as a parcel that can be installed from within the Cloudera Manager Admin Console. It requires Cloudera Manager 4.8 and CDH4, version 4.4 or later, installed with Cloudera Manager using parcels.

If you do not already have Cloudera Manager 4.8 installed, you must install it, as well as the CDH4 parcel (version 4.4 or later). If you installed CDH4 using packages, you must reinstall or upgrade it using parcels.

If you already have Cloudera Manager 4.8 and CDH 4.4 or later installed using parcels, you can skip to Install the Spark parcel using Cloudera Manager.

Install Cloudera Manager, if not already done

If you have not installed Cloudera Manager, you must do so. You also need to install a CDH 4 parcel as part of that process.

Follow the documentation at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_install_path_A.html to install and configure Cloudera Manager 4.8 with CDH 4.5 (the latest version of CDH).

Make sure you choose Parcels as the installation method.

Install the Spark parcel using Cloudera Manager

Install the Spark parcel (download, distribute, and activate it through the Cloudera Manager Admin Console). Spark parcels are available for CDH 4 (4.4 or later) and CDH 5 (beta 1 or later).

  1. In the Cloudera Manager Admin Console, from the Administration tab, select Settings, then go to the Parcels category.
  2. Find the Remote Parcel Repository URLs property and add the location of the parcel repository.
    1. Click the Plus sign to open a new field.
    2. Enter the URL of the location of the parcel you need to install (typically archive.cloudera.com/spark/parcels).
    3. Click Save Changes.
  3. From the Hosts page, click the Parcels tab. The Spark parcel should appear in the set of parcels available for download.
  4. Download, distribute, and activate the parcel:
    1. Click the Download button for the Spark parcel to initiate the parcel download. When the parcel has finished downloading, the button changes to say Distribute.
    2. Click Distribute to start distributing the parcel to the other hosts in your cluster. When the distribution process has completed, the button changes to say Activate.
    3. Click Activate to update Cloudera Manager to point to the new software, ready to run upon the next service restart. In the confirmation pop-up that appears, do not elect to restart the service.

    For general information about installing parcels with Cloudera Manager, see http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Managing-Clusters/cm5mc_parcels.html

Configure and start the Spark service

The following steps must be performed as the root user from the command line on the host that will run the Spark Master role.
  1. Edit /etc/spark/conf/spark-env.sh (a sample configuration covering steps 1 through 3 is sketched after this list):
    • Change the environment variable STANDALONE_SPARK_MASTER_HOST to the fully qualified domain name of the master host.
    • Set the environment variable DEFAULT_HADOOP_HOME to the Hadoop installation, which is /opt/cloudera/parcels/CDH/lib/hadoop for a parcel installation.
    • You can optionally set the Spark Master's port and Web UI port with SPARK_MASTER_PORT and SPARK_MASTER_WEBUI_PORT respectively.
  2. Edit the slaves file, /etc/spark/conf/slaves.
    • Enter the fully qualified domain names of all Spark worker nodes, one name per line.
  3. Sync the contents of /etc/spark/conf to all nodes.
  4. Start the Master role on the host that will act as the Spark Master. The Master role is responsible for coordinating the different Spark applications (Spark contexts).
    /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-master.sh
  5. Run the start-slaves.sh script to start all the worker roles:
    /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-slaves.sh
    For this script to work, you must have passwordless SSH configured for root from the Master host to the worker hosts. Otherwise, run the following on every worker node (as root) to start each Worker manually:
    /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master_ip>:<master_port>
    
    (The default Master port is 7077.)
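
For reference, the resulting configuration might look like the following sketch. The host names master.example.com, worker1.example.com, and worker2.example.com are placeholders for your own fully qualified domain names, and the rsync loop is just one convenient way to copy the configuration directory; any equivalent mechanism works.

# /etc/spark/conf/spark-env.sh (excerpt)
export STANDALONE_SPARK_MASTER_HOST=master.example.com
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
# Optional: override the default Master RPC and Web UI ports
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=18080

# /etc/spark/conf/slaves -- one fully qualified worker name per line
worker1.example.com
worker2.example.com

# Copy the configuration to the worker nodes (run as root on the Master host)
for host in worker1.example.com worker2.example.com; do
  rsync -a /etc/spark/conf/ ${host}:/etc/spark/conf/
done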

Testing your Spark Setup

To test your Spark setup, start spark-shell on one of the nodes. For example, you can run a word count:

val file = sc.textFile("hdfs://namenode:8020/path/to/file")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://namenode:8020/output")
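
Note that spark-shell must be pointed at the standalone Master in order to run on the cluster; in Spark 0.9 this is done through the MASTER environment variable. A minimal launch sketch, where master.example.com is a placeholder for your Master host and 7077 is the default Master port:

MASTER=spark://master.example.com:7077 /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-shell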

You can monitor the application in the Spark Master UI, by default at http://spark-master:18080, where you can see the Spark Shell application, its executors, and their logs.

Ports Used by Spark

  • 7077 – Default Master RPC port
  • 7078 – Default Worker RPC port
  • 18080 – Default Master web UI port
  • 18081 – Default Worker web UI port

For further information, see the Apache Spark 0.9 documentation.