Cloudera Manager REST API Tutorial

The Cloudera Manager API is an HTTP REST API using JSON serialization. This topic walks through an example of setting up a four-node HDFS and MapReduce cluster via the Cloudera Manager REST API.

API Clients

The API can be called from a variety of HTTP clients. This section shows two: curl and the Python client.

curl Client

The simplest way to use the API is by making HTTP calls using tools like curl. For example, to obtain the status of the service hdfs2 in the cluster dev01, run:

$ curl -u 'admin:admin' http://cm_host:7180/api/v1/clusters/dev01/services/hdfs2 
{ "name" : "hdfs2", "type" : "HDFS", "clusterRef" : { "clusterName" :  "dev01" }, 
"serviceState" : "STARTED", "healthSummary" : "GOOD", "configStale" : false, 
... } 

Python Client

The API also has a Python client. To make the same request using Python:

>>> from cm_api.api_client import ApiResource 
>>> api = ApiResource('cm_host', username='admin', password='admin')
>>> dev01 = api.get_cluster('dev01') 
>>> hdfs = dev01.get_service('hdfs2') 
>>> print hdfs.serviceState, hdfs.healthSummary 
STARTED GOOD
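
The Python client can also be used to explore an existing deployment. The following is a minimal sketch; it assumes the client's get_all_clusters() and get_all_services() calls, which are not shown in the example above:

>>> for c in api.get_all_clusters():
...   for s in c.get_all_services():
...     print c.name, s.name, s.serviceState, s.healthSummary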

Setting up a Cluster

Next, we demonstrate a Python script that uses the API to define, configure, and start a cluster. You are about to see some of the low-level details of Cloudera Manager. Compared with the UI wizard, the API route is more tedious, but it provides flexibility and programmatic control. You will also notice that this setup process does not require the cluster hosts to be online until the very last step, where the services are started. This has proven useful to administrators who are stamping out pre-configured clusters.

Step 1. Define the Cluster

#!/usr/bin/env python
import socket
from cm_api.api_client import ApiResource

CM_HOST = "centos56-17.ent.cloudera.com"

api = ApiResource(CM_HOST, username="admin", password="admin")
cluster = api.create_cluster("prod01", "CDH4")

This creates a handle on the API. The ApiResource object also accepts other optional arguments, such as the port, TLS, and API version. With the ApiResource, we then create a cluster called prod01 running CDH4; the call returns a handle to the new cluster.
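
For example, a connection over TLS on the non-default port 7183, pinned to API version 1, might look like the following sketch (the server_port, use_tls, and version keyword names are assumptions about the Python client, not taken from the example above):

api = ApiResource(CM_HOST,
                  server_port=7183,     # assumed TLS port for Cloudera Manager
                  username="admin",
                  password="admin",
                  use_tls=True,         # assumed keyword to enable HTTPS
                  version=1)            # pin to API v1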

Step 2. Create HDFS Service and Roles

Now create the services. HDFS comes first:

hdfs = cluster.create_service("hdfs01", "HDFS")

At this point, if you query the different role types supported by hdfs01, you get:

>>> print hdfs.get_role_types() 
[u'DATANODE', u'NAMENODE', u'SECONDARYNAMENODE', u'BALANCER', u'GATEWAY', u'HTTPFS', u'FAILOVERCONTROLLER']

Now create one NameNode, one Secondary NameNode, and four DataNodes:

HOSTNAMES = [
  "centos56-17.ent.cloudera.com",
  "centos56-18.ent.cloudera.com",
  "centos56-19.ent.cloudera.com",
  "centos56-20.ent.cloudera.com"
]
hosts = [ ]                             # API host handles

for name in HOSTNAMES:
  host = api.create_host(
      name,                             # Host id
      name,                             # Host name (FQDN)
      socket.gethostbyname(name),       # IP address
      "/default_rack")                  # Rack
  hosts.append(host)

nn = hdfs.create_role("hdfs01-nn", "NAMENODE", hosts[0].hostId)
snn = hdfs.create_role("hdfs01-snn", "SECONDARYNAMENODE", hosts[0].hostId)
for i in range(4):
  hdfs.create_role("hdfs01-dn" + str(i), "DATANODE", hosts[i].hostId)

Most of the code above performs host creation, which is required for role creation because each role must be assigned to a host. In the end, the first host runs the NameNode, the Secondary NameNode, and a DataNode; the remaining hosts run only DataNodes.

At this point, if you query the first host, you can see the correct roles assigned to it:

>>> print api.get_host(HOSTNAMES[0]).roleRefs 

Returns:

[{'clusterName': 'prod01', 'roleName': 'hdfs01-snn', 'serviceName': 'hdfs01'},  
{'clusterName': 'prod01', 'roleName': 'hdfs01-dn0', 'serviceName': 'hdfs01'},  
{'clusterName': 'prod01', 'roleName': 'hdfs01-nn', 'serviceName': 'hdfs01'}]

Step 3. Configure HDFS

Service configuration is separated into service-wide configuration and role type configuration. Service-wide configuration covers settings, such as the HDFS replication factor, that affect multiple role types. Role type configuration is a template that is inherited by specific role instances. For example, at the role type level you can set all DataNodes to use three data directories, and then override that for specific DataNodes with role-level configuration.

hdfs_service_config = {
  'dfs_replication': 2,
}
nn_config = {
  'dfs_name_dir_list': '/dfs/nn',
  'dfs_namenode_handler_count': 30,
}
snn_config = {
  'fs_checkpoint_dir_list': '/dfs/snn',
}
dn_config = {
  'dfs_data_dir_list': '/dfs/dn1,/dfs/dn2,/dfs/dn3',
  'dfs_datanode_failed_volumes_tolerated': 1,
}
hdfs.update_config(
    svc_config=hdfs_service_config,
    NAMENODE=nn_config,
    SECONDARYNAMENODE=snn_config,
    DATANODE=dn_config)

# Use a different set of data directories for DataNode3
hdfs.get_role('hdfs01-dn3').update_config({'dfs_data_dir_list': '/dn/data1,/dn/data2' })

How do you find out which configuration keys Cloudera Manager uses? For example, how do you know that dfs_replication is the key for the replication factor? You can query the service as follows:

>>> service_conf, roletype_conf = hdfs.get_config(view="full") 
>>> print service_conf
{u'catch_events': ..., u'dfs_block_access_token_enable': ..., u'dfs_block_size': ..., ...
>>> for k, v in sorted(service_conf.items()):
...   print "\n------ ", v.displayName, \
...     "\n Key:", k, \
...     "\n Value:", v.value, \
...     "\n Default:", v.default, \
...     "\n AKA:", v.relatedName, \
...     "\n Desc:", v.description
...
which returns:
------ Enable log event capture   
Key: catch_events   
Value: None   
Default: true   
AKA: None   
Desc: When set, each role will identify important log events and forward them to Cloudera Manager.
------ Enable block access token   
Key: dfs_block_access_token_enable   
Value: None
Default: true
AKA: dfs.block.access.token.enable   
Desc: If true, access tokens are used as capabilities for accessing DataNodes. If false, no access tokens are checked on accessing DataNodes. 
------ Block Size   
Key: dfs_block_size   
Value: None   
Default:  134217728   
AKA: dfs.blocksize   
Desc: The default block size for new HDFS files.
...

Note the view="full" argument passed to hdfs.get_config(). Without it, the API returns only the configurations that are set to non-default values:

>>> hdfs.get_config()
which returns:
({u'dfs_ha_fencing_cloudera_manager_secret_key': u'xz5Yr2inDI8vWEzf16EQpIKPoBMoTg', 
u'dfs_replication': u'2'}, {u'BALANCER': {}, 
u'DATANODE': {u'dfs_data_dir_list': u'/dfs/dn1,/dfs/dn2,/dfs/dn3', u'dfs_datanode_failed_volumes_tolerated': u'1'},
u'FAILOVERCONTROLLER': {}, u'GATEWAY': {}, u'HTTPFS': {},   
u'NAMENODE': {u'dfs_name_dir_list': u'/dfs/nn', u'dfs_namenode_handler_count': u'30'},
u'SECONDARYNAMENODE': {u'fs_checkpoint_dir_list': u'/dfs/snn'}})
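
To confirm that the role-level override on hdfs01-dn3 took effect, you can read the configuration back at the role level. This sketch assumes that roles expose a get_config() call analogous to the service-level one used above:

>>> hdfs.get_role('hdfs01-dn3').get_config()   # should show the two /dn data directories set earlier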

Step 4. Create MapReduce Service and Roles

This step is similar to creating the HDFS service. We assign a TaskTracker to each node and the JobTracker to the first node.

mr = cluster.create_service("mr01", "MAPREDUCE")
jt = mr.create_role("mr01-jt", "JOBTRACKER", hosts[0].hostId)
for i in range(4):
  mr.create_role("mr01-tt" + str(i), "TASKTRACKER", hosts[i].hostId)

Step 5. Configure MapReduce

Here is the code to configure the "mr01" service:

mr_service_config = {
  'hdfs_service': 'hdfs01',
}

jt_config = {
  'jobtracker_mapred_local_dir_list': '/mapred/jt',
  'mapred_job_tracker_handler_count': 40,
}
tt_config = {
  'tasktracker_mapred_local_dir_list': '/mapred/local',
  'mapred_tasktracker_map_tasks_maximum': 10,
  'mapred_tasktracker_reduce_tasks_maximum': 6,
}

gateway_config = {
  'mapred_reduce_tasks': 10,
  'mapred_submit_replication': 2,
}
mr.update_config(
    svc_config=mr_service_config,
    JOBTRACKER=jt_config,
    TASKTRACKER=tt_config,
    GATEWAY=gateway_config)

Two items deserve elaboration. First is the hdfs_service property: rather than being given the equivalent of fs.defaultFS, a MapReduce service depends on an HDFS service and derives its HDFS access parameters from how that HDFS service is configured. Second, the gateway role type is unique to Cloudera Manager. It represents a client: a gateway role runs no daemons and simply receives client configuration as part of the "deploy client configuration" process, which we perform later.
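
For example, to deliver MapReduce client configuration to a machine that runs no MapReduce daemons, you would register that host and give it a gateway role. This is only an illustration; the client01 hostname and the mr01-gw0 role name are made up:

# Hypothetical client-only machine; not one of the four cluster hosts above
gw_host = api.create_host("client01.ent.cloudera.com",
                          "client01.ent.cloudera.com",
                          socket.gethostbyname("client01.ent.cloudera.com"),
                          "/default_rack")
mr.create_role("mr01-gw0", "GATEWAY", gw_host.hostId)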

Step 6. Start HDFS

Before you can start HDFS, the cluster nodes must be up, CDH must be installed, and the Cloudera Manager agents must be running. (The API does not perform software installation.) In this example, that preparation was done ahead of time, and the agents were pointed at the Cloudera Manager server by setting the server_host property in /etc/cloudera-scm-agent/config.ini.
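
Before formatting HDFS, it can be worth confirming that all four agents have registered with the server. A minimal sketch, assuming the client exposes a get_all_hosts() call:

# Sanity check: every cluster node should have checked in with Cloudera Manager
registered = set(h.hostname for h in api.get_all_hosts())
missing = [name for name in HOSTNAMES if name not in registered]
if missing:
  raise Exception("Hosts not registered with Cloudera Manager: %s" % missing)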

To format HDFS and start it, run:

CMD_TIMEOUT = 180

# format_hdfs takes a list of NameNodes and returns one command per NameNode
cmd = hdfs.format_hdfs('hdfs01-nn')[0]
if not cmd.wait(CMD_TIMEOUT).success:
  raise Exception("Failed to format HDFS")

cmd = hdfs.start()
if not cmd.wait(CMD_TIMEOUT).success: 
  raise Exception("Failed to start HDFS")

Each cmd object represents an asynchronous command. The wait() call blocks until the command finishes or the timeout expires, and success reports whether it succeeded. Once HDFS is formatted and started, deploy the HDFS client configuration to the host running hdfs01-nn:

cmd = hdfs.deploy_client_config('hdfs01-nn')
if not cmd.wait(CMD_TIMEOUT).success: 
  raise Exception("Failed to deploy HDFS client config")

Step 7. Start MapReduce

First, create the /tmp directory in HDFS (the JobTracker will not start unless /tmp exists in HDFS):

$ sudo -u hdfs hadoop fs -mkdir /tmp 
$ sudo -u hdfs hadoop fs -chmod 1777 /tmp

Then start the MapReduce service and deploy the client configuration:

cmd = mr.start()
if not cmd.wait(CMD_TIMEOUT).success:
  raise Exception("Failed to start MapReduce")

cmd = mr.deploy_client_config('mr01-jt')
if not cmd.wait(CMD_TIMEOUT).success:
  raise Exception("Failed to deploy MapReduce client config")

Note that no users have been set up and no home directories exist yet, but you can run a job as the user "hdfs":

[root@centos56-17 ~]# sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 2
Number of Maps = 2 
Samples per Map = 2 
Wrote input for Map #0 
Wrote input for Map #1 
Starting Job 
12/09/03 23:38:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
Applications should implement Tool for the same. 
12/09/03 23:38:35 INFO mapred.FileInputFormat: Total input paths to process : 2 
12/09/03 23:38:36 INFO mapred.JobClient: Running job: job_201209032320_0001 ...
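
Once the job completes, you can re-run the status query from the beginning of this topic against the new cluster. Both services should eventually report STARTED and GOOD once monitoring data arrives:

>>> prod01 = api.get_cluster('prod01')
>>> hdfs = prod01.get_service('hdfs01')
>>> mr = prod01.get_service('mr01')
>>> print hdfs.serviceState, hdfs.healthSummary, mr.serviceState, mr.healthSummary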