Recovering from Cloudera Manager Host Failures

Cloudera Manager uses databases to store information about the Cloudera Manager system and jobs. If the machine hosting Cloudera Manager fails, it is possible to re-establish the installation if the database information is still available. Database information is typically available for either of the following reasons:

  • You backed up the database.
  • The database and Cloudera Manager are on separate servers and the database server is still available.

Before beginning this process, find the failed machine's IP address and hostname. Having the old Cloudera Manager server's hostname and IP address is not strictly necessary, but it simplifies the process. You could use a new IP address and hostname, but this would require updating the configuration of every agent to use the new information. Because reusing the old hostname and address is easier in most cases, using a new hostname and IP address is not described here.
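
Whether the agents refer to the server by name or by address determines how much of the old identity the new machine must take over. The following is a minimal sketch of that check, assuming the agent configuration is an INI-style file (commonly /etc/cloudera-scm-agent/config.ini) with a server_host entry in a [General] section. The sample contents and the host name cm01.example.com are hypothetical, and the helper names are illustrative only.

```python
import configparser
import ipaddress

# Sample contents modeled on an agent configuration file.
# The host value is hypothetical; read the real file on one of your agents.
SAMPLE_AGENT_CONFIG = """
[General]
server_host=cm01.example.com
server_port=7182
"""


def agent_server_host(config_text):
    """Return the server_host value from the [General] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.get("General", "server_host")


def is_ip_literal(host):
    """Return True if host is an IPv4 or IPv6 literal rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


host = agent_server_host(SAMPLE_AGENT_CONFIG)
if is_ip_literal(host):
    print(f"{host}: agents use an IP address, so the new server needs that address")
else:
    print(f"{host}: agents use a hostname, so reassigning the hostname is enough")
```

To run the check against a real agent, read the file's text and pass it to agent_server_host in place of the sample string.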

To restore Cloudera Manager when the database server is available

  1. Identify a new server on which to install Cloudera Manager. Assign the failed Cloudera Manager server's IP address and hostname to the new server.
     Note: If the agents were configured with the server's hostname, you do not need to assign the old machine's IP address to the new host. Assigning the hostname is sufficient.

  2. Install Cloudera Manager on the new server, using the method described under Step 3: Install the Cloudera Manager Server. Do not install the other components, such as CDH and databases, as those should still exist in your environment.
  3. Update /etc/cloudera-scm-server/db.properties with the necessary information so that the Cloudera Manager server connects to the existing database. This information is typically the database name, database instance name, user name, and password.
  4. Start the Cloudera Manager server.
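
Before starting the server, it can help to confirm that db.properties carries every connection setting. The sketch below parses properties-style text and reports keys that are missing or empty. The key names reflect what db.properties files commonly contain, but treat them as an assumption and compare them with the file your installation generated. All sample values are hypothetical.

```python
# Keys commonly found in db.properties (an assumption; verify against your file).
REQUIRED_KEYS = (
    "com.cloudera.cmf.db.type",
    "com.cloudera.cmf.db.host",
    "com.cloudera.cmf.db.name",
    "com.cloudera.cmf.db.user",
    "com.cloudera.cmf.db.password",
)

# Hypothetical file contents with the password deliberately left blank.
SAMPLE = """\
# hypothetical values
com.cloudera.cmf.db.type=postgresql
com.cloudera.cmf.db.host=db01.example.com
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_or_empty(props):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


problems = missing_or_empty(parse_properties(SAMPLE))
print(problems)
```

With the sample above, the only reported key is the blank password entry. An empty list means every expected setting has a value, though it does not prove that the values are correct.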

To restore a Cloudera Manager deployment from database backups when the database server is not available

  1. Identify a new server on which to install Cloudera Manager. Assign the failed Cloudera Manager server's IP address and hostname to the new server.
     Note: If the agents were configured with the server's hostname, you do not need to assign the old machine's IP address to the new host. Assigning the hostname is sufficient.

  2. Install Cloudera Manager on the new server, using the same method you used before, as described in Step 3: Install the Cloudera Manager Server.
  3. Install the database packages on the machines that will host the restored database. This could be the same server on which you have just installed Cloudera Manager or it could be a different server. The packages to install vary depending on which database was originally installed on your system. If you used an external MySQL, PostgreSQL, or Oracle database, reinstall that now. If you used the embedded PostgreSQL database, install the cloudera-manager-server-db package as described in Installing an Embedded PostgreSQL Database. After installing that package, you must initialize and start the database as described in Configuring Your Systems to Support PostgreSQL.
  4. Restore the backed-up databases to the new database installations.
  5. Update /etc/cloudera-scm-server/db.properties with the necessary information so that the Cloudera Manager server connects to the restored database. This information is typically the database name, database instance name, user name, and password.
  6. Start the Cloudera Manager server.
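
Because the database now lives on a freshly installed host, a quick reachability check before starting the server can rule out basic network or service problems. This is a minimal sketch using only a TCP connection attempt. It does not authenticate or verify the restored data, and the helper name can_connect is illustrative. The demonstration uses a throwaway local listener so that it runs without a real database.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demonstrate against a temporary local listener instead of a real database.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

print(can_connect("127.0.0.1", port))  # checked while the listener is up
listener.close()
print(can_connect("127.0.0.1", port))  # checked after the listener is closed
```

In practice you would call can_connect with the database host and port recorded in db.properties, for example the default PostgreSQL port 5432 or MySQL port 3306 if your installation uses the defaults.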

At this point, Cloudera Manager should resume functioning as it did before the failure. Because you restored the database from the backup, the server should accept the running state of the agents, meaning it will not terminate any running Hadoop processes.

This process is similar for secure clusters, except that additional files in /etc/cloudera-scm-server must be restored along with the database.