Introduction

Sqoop is a tool designed to help users import data from existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine each table’s schema, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read tables from the database via the DBInputFormat (JDBC-based InputFormat). Tables are read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported. Sqoop also supports high-performance imports from select databases including MySQL.

This document describes how to get started using Sqoop to import your data into Hadoop.

The Sqoop Command Line

To execute Sqoop, run with Hadoop:

$ bin/hadoop jar contrib/sqoop/hadoop-$(version)-sqoop.jar (arguments)

NOTE: Throughout this document, we will use sqoop as shorthand for the above, i.e., $ sqoop (arguments)

You pass this program options describing the import job you want to perform. If you need a hint, running Sqoop with --help will print out a list of all the command line options available. The sqoop(1) manual page will also describe Sqoop’s available arguments in greater detail. The manual page is built in $HADOOP_HOME/build/contrib/sqoop/doc/sqoop.1.gz. The following subsections will describe the most common modes of operation.

Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. As such, it requires a connect string that describes how to connect to the database. The connect string looks like a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. e.g.:

$ sqoop --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the host database.example.com. It’s important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if they’re told to connect to the literal name localhost, they’ll each reach a different database (or more likely, no database at all)! Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.

You may need to authenticate against the database before you can access it. The --username and --password or -P parameters can be used to supply a username and a password to the database. e.g.:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345
Warning
Password security
The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps. The -P argument will read a password from a console prompt, and is the preferred method of entering credentials. Credentials may still be transferred between nodes of the MapReduce cluster using insecure means.
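
For example, to be prompted for the password at the console instead of supplying it on the command line:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --username aaron -P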

Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically by Sqoop, though you may need to install the driver yourself. (A full list of databases with built-in support is provided in the "Supported Databases" section, below.)

You can use Sqoop with any other JDBC-compliant database as well. First, download the appropriate JDBC driver for the database you want to import from, and install the .jar file in the $HADOOP_HOME/lib directory on all machines in your Hadoop cluster, or some other directory which is in the classpath on all nodes. Each driver jar also has a specific driver class which defines the entry-point to the driver. For example, MySQL’s Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a PostgreSQL database, first download the driver from http://jdbc.postgresql.org and install it in your Hadoop lib path. Then run Sqoop with something like:

$ sqoop --connect jdbc:postgresql://postgres-server.example.com/employees \
    --driver org.postgresql.Driver

Listing Available Databases

Once connected to a database server, you can list the available databases with the --list-databases parameter. Note that in this case, the connect string does not include a database name, just a server address.

$ sqoop --connect jdbc:mysql://database.example.com/ --list-databases
information_schema
employees

This only works with HSQLDB and MySQL. A vendor-agnostic implementation of this function has not yet been implemented.

Listing Available Tables

Within a database, you can list the tables available for import with the --list-tables command. The following example shows four tables available within the "employees" example database:

$ sqoop --connect jdbc:mysql://database.example.com/employees --list-tables
employee_names
payroll_checks
job_descriptions
office_supplies

Automatic Full-database Import

If you want to import all the tables in a database, you can use the --all-tables command to do so:

$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables

This will query the database for the available tables, generate an ORM class for each table, and run a MapReduce job to import each one. Hadoop uses the DBInputFormat to read from a database into a Mapper instance. To read a table into a MapReduce program requires creating a class to hold the fields of one row of the table. One of the benefits of Sqoop is that it generates this class definition for you, based on the table definition in the database.

The generated .java files are, by default, placed in the current directory. You can supply a different directory with the --outdir parameter. These are then compiled into .class and .jar files for use by the MapReduce job that it launches. These files are created in a temporary directory. You can redirect this target with --bindir.

Each table will be imported into a separate directory in HDFS, with the same name as the table. For instance, if your Hadoop username is aaron, the above command would generate the following directories in HDFS:

/user/aaron/employee_names
/user/aaron/payroll_checks
/user/aaron/job_descriptions
/user/aaron/office_supplies

You can change the base directory under which the tables are loaded with the --warehouse-dir parameter. For example:

$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables \
    --warehouse-dir /common/warehouse

This would create the following directories instead:

/common/warehouse/employee_names
/common/warehouse/payroll_checks
/common/warehouse/job_descriptions
/common/warehouse/office_supplies

By default the data will be read into text files in HDFS. Each of the columns will be represented as comma-delimited text. Each row is terminated by a newline. See the section on "Controlling the Output Format" below for information on how to change these delimiters.

If you want to leverage compression and binary file formats, the --as-sequencefile argument to Sqoop will import the table to a set of SequenceFiles instead. This stores each row of the table as a single object in a SequenceFile, using the record class generated by Sqoop. This representation is also likely to be higher performance when used as an input to subsequent MapReduce programs as it does not require parsing. For completeness, Sqoop provides an --as-textfile option, which is implied by default. An --as-textfile on the command-line will override a previous --as-sequencefile argument.
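
For example, to import every table in the example database as SequenceFiles:

$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables \
    --as-sequencefile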

The SequenceFile format will embed the records from the database as objects using the code generated by Sqoop. It is important that you retain the .java file for this class, as you will need to be able to instantiate the same type to read the objects back later, in other user-defined applications.

Importing Individual Tables

In addition to full-database imports, Sqoop will allow you to import individual tables. Instead of using --all-tables, specify the name of a particular table with the --table argument:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --table employee_names

You can further specify a subset of the columns in a table by using the --columns argument. This takes a list of column names, delimited by commas, with no spaces in between. e.g.:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --table employee_names --columns employee_id,first_name,last_name,dept_id

Sqoop will use a MapReduce job to read sections of the table in parallel. For the MapReduce tasks to divide the table space, the results returned by the database must be orderable. Sqoop will automatically detect the primary key for a table and use that to order the results. If no primary key is available, or (less likely) you want to order the results along a different column, you can specify the column name with --split-by.

Important
Row ordering
To guarantee correctness of your input, you must select an ordering column for which each row has a unique value. If duplicate values appear in the ordering column, the results of the import are undefined, and Sqoop will not be able to detect the error.
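
For example, to split the import of the employee_names table on its employee_id column rather than the automatically detected primary key:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --table employee_names --split-by employee_id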

Finally, you can control which rows of a table are imported via the --where argument. With this argument, you may specify a clause to be appended to the SQL statement used to select rows from the table, e.g.:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
  --table employee_names --where "employee_id > 40 AND active = 1"

The --columns, --split-by, and --where arguments are incompatible with --all-tables. If you require special handling for some of the tables, then you must manually run a separate import job for each table.

Controlling the Output Format

The delimiters used to separate fields and records can be specified on the command line, as can a quoting character and an escape character (for quoting delimiters inside a value). Data imported with --as-textfile will be formatted according to these parameters. Classes generated by Sqoop will encode this information, so using toString() on a data record stored with --as-sequencefile will reproduce your specified formatting.

The (char) argument for each parameter in this section can be specified either as a normal character (e.g., --fields-terminated-by ,) or via an escape sequence. Arguments of the form \0xhhh will be interpreted as a hexadecimal representation of a character with hex number hhh. Arguments of the form \0ooo will be treated as an octal representation of a character represented by octal number ooo. The special escapes \n, \r, \", \b, \t, and \\ act as they do inside Java strings. \0 will be treated as NUL. This will insert NUL characters between fields or lines (if used for --fields-terminated-by or --lines-terminated-by), or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
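
For example, to use a pipe character as the field delimiter, you could supply its hexadecimal escape (0x7c) using the \0xhhh form described above:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --table employee_names --fields-terminated-by '\0x7c'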

The default delimiters are , for fields, \n for records, no quote character, and no escape character. Note that this can lead to ambiguous or unparsable records if you import database records containing commas or newlines in the field data. For unambiguous parsing, both enclosing and escaping must be enabled, e.g., via --mysql-delimiters.

The following arguments allow you to control the output format of records:

--fields-terminated-by (char)

Sets the field separator character

--lines-terminated-by (char)

Sets the end-of-line character

--optionally-enclosed-by (char)

Sets a field-enclosing character which may be used if a value contains delimiter characters.

--enclosed-by (char)

Sets a field-enclosing character which will be used for all fields.

--escaped-by (char)

Sets the escape character

--mysql-delimiters

Uses MySQL’s default delimiter set:

fields: , lines: \n escaped-by: \ optionally-enclosed-by: '

For example, we may want to separate fields with tab characters, terminate each record with a newline, surround every field with "double quotes", and escape internal quote marks with a backslash (\) character:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
  --table employee_names --fields-terminated-by '\t' \
  --lines-terminated-by '\n' --enclosed-by '\"' --escaped-by '\\'

Generated Class Names

By default, classes are named after the table they represent. e.g., sqoop --table foo will generate a file named foo.java. You can override the generated class name with the --class-name argument.

$ sqoop --connect jdbc:mysql://database.example.com/employees \
  --table employee_names --class-name com.example.EmployeeNames

This generates a file named com/example/EmployeeNames.java

If you want to specify a package name for generated classes, but still want them to be named after the table they represent, you can instead use the argument --package-name:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
  --table employee_names --package-name com.example

This generates a file named com/example/employee_names.java

Miscellaneous Additional Arguments

If you want to generate the Java classes to represent tables without actually performing an import, supply a connect string and (optionally) credentials as above, as well as --all-tables or --table, but also use the --generate-only argument. This will generate the classes and cease further operation.
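
For example, to generate the class for the employee_names table without importing any data:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --table employee_names --generate-only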

You can override the $HADOOP_HOME environment variable within Sqoop with the --hadoop-home argument. You can override the $HIVE_HOME environment variable with --hive-home.

Data emitted to HDFS is by default uncompressed. You can instruct Sqoop to use gzip to compress your data by providing either the --compress or -z argument (both are equivalent).

Using --verbose will instruct Sqoop to print more details about its operation; this is particularly handy if Sqoop appears to be misbehaving.

Direct-mode Imports

While the JDBC-based import method used by Sqoop provides it with the ability to read from a variety of databases using a generic driver, it is not the highest-performance method available. Sqoop can read from certain database systems faster by using their built-in export tools.

For example, Sqoop can read from a MySQL database by using the mysqldump tool distributed with MySQL. You can take advantage of this faster import method by running Sqoop with the --direct argument. This combined with a connect string that begins with jdbc:mysql:// will inform Sqoop that it should select the faster access method.

If your delimiters exactly match the delimiters used by mysqldump, then Sqoop will use a fast-path that copies the data directly from mysqldump's output into HDFS. Otherwise, Sqoop will parse mysqldump's output into fields and transcode them into the user-specified delimiter set. This incurs additional processing, so performance may suffer. For convenience, the --mysql-delimiters argument will set all the output delimiters to be consistent with mysqldump's format.

Sqoop also provides a direct-mode backend for PostgreSQL that uses the COPY TO STDOUT protocol from psql. No specific delimiter set provides better performance; Sqoop will forward delimiter control arguments to psql.

The "Supported Databases" section provides a full list of database vendors which have direct-mode support from Sqoop.

When writing to HDFS, direct mode will open a single output file to receive the results of the import. You can instruct Sqoop to use multiple output files by using the --direct-split-size argument which takes a size in bytes. Sqoop will generate files of approximately this size. e.g., --direct-split-size 1000000 will generate files of approximately 1 MB each. If compressing the HDFS files with --compress, this will allow subsequent MapReduce programs to use multiple mappers across your data in parallel.
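
For example, a direct-mode import that compresses its output and splits it into files of roughly 100 MB each might look like the following:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --table employee_names --direct --direct-split-size 100000000 --compress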

Tool-specific arguments

Sqoop will generate a set of command-line arguments with which it invokes the underlying direct-mode tool (e.g., mysqldump). You can specify additional arguments which should be passed to the tool by passing them to Sqoop after a single - argument. e.g.:

$ sqoop --connect jdbc:mysql://localhost/db --table foo --direct - --lock-tables

The --lock-tables argument (and anything else to the right of the - argument) will be passed directly to mysqldump.

Importing Data Into Hive

Sqoop’s primary function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

By default the data is imported into HDFS, but you can skip this operation by using the --hive-create option. Optionally, you can specify the --hive-overwrite option to indicate that an existing table in Hive must be replaced. After your data is imported into HDFS (or this step is omitted), Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory if the --hive-create option is not used. The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from here.

Note
This function is incompatible with --as-sequencefile.

Hive’s text parser does not support escaping or enclosing characters. Sqoop will print a warning if you use --escaped-by, --enclosed-by, or --optionally-enclosed-by, since Hive does not know how to parse these. It will pass the field and record terminators through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.

The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.
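
For example, the following imports the employee_names table and registers it in Hive under the (hypothetical) table name emps:

$ sqoop --connect jdbc:mysql://database.example.com/employees \
    --table employee_names --hive-import --hive-table emps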

Hive’s Type System

Hive users will note that there is not a one-to-one mapping between SQL types and Hive types. In general, SQL types that do not have a direct mapping (e.g., DATE, TIME, and TIMESTAMP) will be coerced to STRING in Hive. The NUMERIC and DECIMAL SQL types will be coerced to DOUBLE. In these cases, Sqoop will emit a warning in its log messages informing you of the loss of precision.

Exporting to a Database

In addition to importing database tables into HDFS, Sqoop can also work in "reverse," reading the contents of a file or directory in HDFS, interpreting the data as database rows, and inserting them into a specified database table.

To run an export, invoke Sqoop with the --export-dir and --table options. e.g.:

$ sqoop --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data

This will take the files in /results/bar_data and inject their contents into the bar table in the foo database on db.example.com. The target table must already exist in the database. Sqoop will perform a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows which violate constraints in the database (e.g., a particular primary key value already exists), then the export will fail.

As in import mode, Sqoop will auto-generate an interoperability class to use with the particular table in question. This will be used to parse the records in HDFS files before loading their contents into the database. You must specify the same delimiters (e.g., with --fields-terminated-by, etc.) as are used in the files to export in order to parse the data correctly. If your data is stored in SequenceFiles (created with an import in the --as-sequencefile format), then you do not need to specify delimiters.

If you have an existing auto-generated jar and class that you intend to use with Sqoop, you can specify these with the --jar-file and --class-name parameters. Providing these options will disable autogeneration of a new class based on the target table.
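
For example, to export using a previously generated class (the jar path below is hypothetical; the class name defaults to the table name unless it was overridden at generation time):

$ sqoop --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --jar-file /tmp/sqoop-generated/bar.jar --class-name bar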

Supported Databases

Sqoop uses JDBC to connect to databases. JDBC is a compatibility layer that allows a program to access many different databases through a common API. Slight differences in the SQL language spoken by each database, however, may mean that Sqoop can’t use every database out of the box, or that some databases may be used in an inefficient manner.

When you provide a connect string to Sqoop, it inspects the protocol scheme to determine appropriate vendor-specific logic to use. If Sqoop knows about a given database, it will work automatically. If not, you may need to specify the driver class to load via --driver. This will use a generic code path which will use standard SQL to access the database. Sqoop provides some databases with faster, non-JDBC-based access mechanisms. These can be enabled by specifying the --direct parameter.

Sqoop includes vendor-specific code paths for the following databases:

Database     version   --direct support?   connect string matches
HSQLDB       1.8.0+    No                  jdbc:hsqldb:*//
MySQL        5.0+      Yes                 jdbc:mysql://
Oracle       10.2.0+   No                  jdbc:oracle:*//
PostgreSQL   8.3+      Yes                 jdbc:postgresql://

Sqoop may work with older versions of the databases listed, but we have only tested it with the versions specified above.

Even if Sqoop supports a database internally, you may still need to install the database vendor’s JDBC driver in your $HADOOP_HOME/lib path.

Developer API Reference

This section is intended to specify the APIs available to application writers integrating with Sqoop, and those modifying Sqoop. The next three subsections are written from the following three perspectives: those using classes generated by Sqoop, and its public library; those writing Sqoop extensions (i.e., additional ConnManager implementations that interact with more databases); and those modifying Sqoop’s internals. Each section describes the system in successively greater depth.

The External API

Sqoop auto-generates classes that represent the tables imported into HDFS. The class contains member fields for each column of the imported table; an instance of the class holds one row of the table. The generated classes implement the serialization APIs used in Hadoop, namely the Writable and DBWritable interfaces. They also contain other convenience methods: a parse() method that interprets delimited text fields, and a toString() method that preserves the user’s chosen delimiters. The full set of methods guaranteed to exist in an auto-generated class are specified in the interface org.apache.hadoop.sqoop.lib.SqoopRecord.
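
As a rough illustration of this API, the following sketch (not taken from the Sqoop source) round-trips one record through the hypothetical com.example.EmployeeNames class generated earlier with --class-name. The exact parse() overloads are assumptions; consult the generated .java file and the SqoopRecord interface for the authoritative method set.

import com.example.EmployeeNames;  // hypothetical generated class from the earlier example

public class RecordRoundTrip {
  public static void main(String[] args) throws Exception {
    EmployeeNames record = new EmployeeNames();

    // parse() interprets one line of delimited text using the delimiters that
    // were in effect when the class was generated (commas by default).
    record.parse("7,Aaron,Example,3");

    // toString() re-emits the record with the same delimiters, so this should
    // print the line parsed above.
    System.out.println(record.toString());
  }
}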

Instances of SqoopRecord may depend on Sqoop’s public API, which comprises all classes in the org.apache.hadoop.sqoop.lib package. These are briefly described below. Clients of Sqoop should not need to directly interact with any of these classes, although classes generated by Sqoop will depend on them. Therefore, these APIs are considered public and care will be taken when forward-evolving them.

The Extension API

This section covers the API and primary classes used by extensions for Sqoop which allow Sqoop to interface with more database vendors.

While Sqoop uses JDBC and DBInputFormat (and DataDrivenDBInputFormat) to read from databases, differences in the SQL supported by different vendors, as well as in JDBC metadata, necessitate vendor-specific codepaths for most databases. Sqoop’s solution to this problem is the ConnManager API (org.apache.hadoop.sqoop.manager.ConnManager).

ConnManager is an abstract class defining all methods that interact with the database itself. Most implementations of ConnManager will extend the org.apache.hadoop.sqoop.manager.SqlManager abstract class, which uses standard SQL to perform most actions. Subclasses are required to implement the getConnection() method which returns the actual JDBC connection to the database. Subclasses are free to override all other methods as well. The SqlManager class itself exposes a protected API that allows developers to selectively override behavior. For example, the getColNamesQuery() method allows the SQL query used by getColNames() to be modified without needing to rewrite the majority of getColNames().
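
As a rough sketch of what such an extension might look like (illustrative only; the SqlManager constructor, method visibility, and the SqoopOptions accessor names used below are assumptions, so check the Sqoop source for the exact API):

package com.example.sqoop;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.sqoop.SqoopOptions;
import org.apache.hadoop.sqoop.manager.SqlManager;

// A hypothetical manager for a database that SqlManager's standard SQL already
// handles well; only the JDBC connection logic is supplied here.
public class ExampleDbManager extends SqlManager {

  private final SqoopOptions opts;

  public ExampleDbManager(final SqoopOptions opts) {
    super(opts);        // assumed constructor signature
    this.opts = opts;
  }

  // SqlManager's standard-SQL code paths issue their queries over the
  // connection returned here.
  public Connection getConnection() throws SQLException {
    return DriverManager.getConnection(
        opts.getConnectString(),   // assumed accessor names on SqoopOptions
        opts.getUsername(),
        opts.getPassword());
  }
}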

ConnManager implementations receive a lot of their configuration data from a Sqoop-specific class, SqoopOptions. While SqoopOptions does not currently contain many setter methods, clients should not assume SqoopOptions are immutable. More setter methods may be added in the future. SqoopOptions does not directly store specific per-manager options. Instead, it contains a reference to the Configuration returned by Tool.getConf() after parsing command-line arguments with the GenericOptionsParser. This allows extension arguments via "-D any.specific.param=any.value" without requiring any layering of options parsing or modification of SqoopOptions.

All existing ConnManager implementations are stateless. Thus, the system which instantiates ConnManagers may create multiple instances of the same ConnManager class over Sqoop’s lifetime. If a caching layer is required, we can add one later, but it is not currently available.

ConnManagers are currently created by instances of the abstract class ManagerFactory (See MAPREDUCE-750). One ManagerFactory implementation currently serves all of Sqoop: org.apache.hadoop.sqoop.manager.DefaultManagerFactory. Extensions should not modify DefaultManagerFactory. Instead, an extension-specific ManagerFactory implementation should be provided with the new ConnManager. ManagerFactory has a single method of note, named accept(). This method will determine whether it can instantiate a ConnManager for the user’s SqoopOptions. If so, it returns the ConnManager instance. Otherwise, it returns null.
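
A minimal sketch of such a factory, reusing the hypothetical ExampleDbManager above (the exact accept() signature and the connect-string accessor are assumptions), might look like:

package com.example.sqoop;

import org.apache.hadoop.sqoop.SqoopOptions;
import org.apache.hadoop.sqoop.manager.ConnManager;
import org.apache.hadoop.sqoop.manager.ManagerFactory;

public class ExampleManagerFactory extends ManagerFactory {

  public ConnManager accept(SqoopOptions options) {
    String connectStr = options.getConnectString();  // assumed accessor
    if (connectStr != null && connectStr.startsWith("jdbc:exampledb:")) {
      // This connect string is ours; hand back a manager for it.
      return new ExampleDbManager(options);
    }
    return null;  // not recognized; let another ManagerFactory try
  }
}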

The ManagerFactory implementations used are governed by the sqoop.connection.factories setting in sqoop-site.xml. Users of extension libraries can install the 3rd-party library containing a new ManagerFactory and ConnManager(s), and configure sqoop-site.xml to use the new ManagerFactory. The DefaultManagerFactory principally discriminates between databases by parsing the connect string stored in SqoopOptions.

Extension authors may make use of classes in the org.apache.hadoop.sqoop.io, mapred, mapreduce, and util packages to facilitate their implementations. These packages and classes are described in more detail in the following section.

Sqoop Internals

This section describes the internal architecture of Sqoop.

The Sqoop program is driven by the org.apache.hadoop.sqoop.Sqoop main class. A limited number of additional classes are in the same package: SqoopOptions (described earlier) and ConnFactory (which manipulates ManagerFactory instances).

General program flow

The general program flow is as follows:

org.apache.hadoop.sqoop.Sqoop is the main class and implements Tool. A new instance is launched with ToolRunner. It parses its arguments using the SqoopOptions class. Within the SqoopOptions, an ImportAction will be chosen by the user. This may be to import all tables, to import one specific table, to execute a SQL statement, or other actions.

A ConnManager is then instantiated based on the data in the SqoopOptions. The ConnFactory is used to get a ConnManager from a ManagerFactory; the mechanics of this were described in an earlier section.

Then in the run() method, using a case statement, it determines which actions the user needs performed based on the ImportAction enum. Usually this involves determining a list of tables to import, generating user code for them, and running a MapReduce job per table to read the data. The import itself does not specifically need to be run via a MapReduce job; the ConnManager.importTable() method is left to determine how best to run the import. Each of these actions is controlled by the ConnManager, except for the generating of code, which is done by the CompilationManager and ClassWriter. (Both in the org.apache.hadoop.sqoop.orm package.) Importing into Hive is also taken care of via the org.apache.hadoop.sqoop.hive.HiveImport class after importTable() has completed. This is done without concern for the ConnManager implementation used.

A ConnManager’s importTable() method receives a single argument of type ImportJobContext which contains parameters to the method. This class may be extended with additional parameters in the future, which optionally further direct the import operation. Similarly, the exportTable() method receives an argument of type ExportJobContext. These classes contain the name of the table to import/export, a reference to the SqoopOptions object, and other related data.

Subpackages

The following subpackages under org.apache.hadoop.sqoop exist:

The io package contains OutputStream and BufferedWriter implementations used by direct writers to HDFS. The SplittableBufferedWriter allows a single BufferedWriter to be opened to a client which will, under the hood, write to multiple files in series as they reach a target threshold size. This allows unsplittable compression libraries (e.g., gzip) to be used in conjunction with Sqoop import while still allowing subsequent MapReduce jobs to use multiple input splits per dataset.

Code in the mapred package should be considered deprecated. The mapreduce package contains DataDrivenImportJob, which uses the DataDrivenDBInputFormat introduced in 0.21. The mapred package contains ImportJob, which uses the older DBInputFormat. Most ConnManager implementations use DataDrivenImportJob; DataDrivenDBInputFormat does not currently work with Oracle in all circumstances, so it remains on the old code-path.

The orm package contains code used for class generation. It depends on the JDK’s tools.jar which provides the com.sun.tools.javac package.

The util package contains various utilities used throughout Sqoop:

In several places, Sqoop reads the stdout from external processes. The most straightforward cases are direct-mode imports as performed by the LocalMySQLManager and DirectPostgresqlManager. After a process is spawned by Runtime.exec(), its stdout (Process.getInputStream()) and potentially stderr (Process.getErrorStream()) must be handled. Failure to read enough data from both of these streams will cause the external process to block before writing more. Consequently, these must both be handled, and preferably asynchronously.

In Sqoop parlance, an "async sink" is a thread that takes an InputStream and reads it to completion. These are realized by AsyncSink implementations. The org.apache.hadoop.sqoop.util.AsyncSink abstract class defines the operations such a sink must perform. processStream() will spawn another thread to immediately begin handling the data read from the InputStream argument; it must read this stream to completion. The join() method allows external threads to wait until this processing is complete.
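
The following standalone sketch illustrates the same contract without extending the real AsyncSink class (whose exact method signatures may differ): processStream() spawns a thread that drains the stream, and join() blocks until it finishes.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class LineCountingSink {

  private Thread worker;
  private volatile long lineCount;

  // Begin draining the stream on a background thread; the stream is read to
  // completion so the external process never blocks on a full pipe.
  public void processStream(final InputStream in) {
    worker = new Thread() {
      public void run() {
        try {
          BufferedReader reader = new BufferedReader(new InputStreamReader(in));
          while (reader.readLine() != null) {
            lineCount++;   // a real sink would log or forward each line
          }
        } catch (IOException ioe) {
          // a real sink would record the failure
        }
      }
    };
    worker.start();
  }

  // Block until the stream has been read to completion.
  public void join() throws InterruptedException {
    worker.join();
  }

  public long getLineCount() {
    return lineCount;
  }
}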

Some "stock" AsyncSink implementations are provided: the LoggingAsyncSink will repeat everything on the InputStream as log4j INFO statements. The NullAsyncSink consumes all its input and does nothing.

The various ConnManagers that make use of external processes have their own AsyncSink implementations as inner classes, which read from the database tools and forward the data along to HDFS, possibly performing formatting conversions in the meantime.