Importing data from SQL Server to Cloudera Hadoop via Sqoop2

Critical points that may or may not be obvious:

1. If you’re using the Sqoop stuff in the web-based interface (Hue), you’re actually using Sqoop2

2. You have to download and install the JDBC driver for SQL Server yourself
– curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
– sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/
– while you’re at it, you may as well put it in the Sqoop (1) directory too: sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop/
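– quick sanity check that the jar landed where you expect (assuming the paths above): ls -l /var/lib/sqoop2/sqljdbc4.jar /var/lib/sqoop/sqljdbc4.jar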

3. Sqoop2 home directory is /var/lib/sqoop2/

4. Restart the Sqoop2 service after copying in the JDBC driver file, or it won’t pick up the new jar
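– on a CDH package install that’s something like: sudo service sqoop2-server restart (if Cloudera Manager runs your services, restart Sqoop2 from there instead)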

5a. “Connector” is the Sqoop2 abstraction for how it talks to a data source. Unless you’ve got a lot more experience, it should just be “generic-jdbc-connector”

5b. Class Path is com.microsoft.sqlserver.jdbc.SQLServerDriver (that’s the driver class name inside the sqljdbc4.jar you installed above)

6. The connection string in “Manage Connections” looks like this: jdbc:sqlserver://192.168.1.102:1433 (you can omit the port; it defaults to 1433)
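– to point at a specific database rather than the login’s default, the Microsoft driver also takes a databaseName property, e.g. (placeholder names): jdbc:sqlserver://192.168.1.102:1433;databaseName=SalesDb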

For the Actions of the job:

7. Schema name: I just leave this blank and instead paste in the TSQL query I want
– if you specify a TSQL statement below, then this needs to be blank

8. Table name: I leave this blank and instead do it all in the TSQL.
– if you specify a TSQL statement below, then this needs to be blank

9. Table SQL statement: Paste in your query (you can craft it in SSMS and paste it in here). Then append this to the end of it: +and+${CONDITIONS}. ${CONDITIONS} expands to a range of values of the Partition column name you specify below this field (example below).
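– e.g. (made-up table and column names), the pasted query ends up looking like: select transId, transDate, amount from dbo.Transactions where amount > 0 +and+${CONDITIONS}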

10. Table Column names: put them in if you want to limit the columns that actually get extracted.
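– e.g. (made-up columns; comma-separated): transId,transDate,amount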

11. Partition column name: Make sure this column is indexed somehow – Sqoop first queries the min and max values of it, then issues a series of queries that each return an evenly-distributed portion of all rows based on this column value. E.g. for a transactions table: I specify the transaction date column as the Partition column name; Sqoop gets the min and max dates; Sqoop then issues a series of queries, replacing ${CONDITIONS} with something like “transDate >= '2015-01-01' and transDate < '2015-04-01'” (moving that window along for each query; see the sketch after item 13). Each query can be sent from any node in your cluster (though I bet you can restrict which nodes those are).

12. Nulls in partition column: if you do have nulls in the partition column, setting this helps Sqoop handle them.

13. You can manually specify the query Sqoop uses to get the min/max of the partition column (by default it looks something like “select min(<partition column>), max(<partition column>) from (<your query>)”).
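– concretely, with the made-up query from item 9 and transDate as the partition column, the SQL Sqoop generates looks roughly like this (the exact form will differ):
– select min(transDate), max(transDate) from (select transId, transDate, amount from dbo.Transactions where amount > 0) t
– select transId, transDate, amount from dbo.Transactions where amount > 0 and transDate >= '2015-01-01' and transDate < '2015-04-01'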

14. If you edit the connection you created in Hue/Sqoop2, note that you have to type the password in again

15. If you get errors, don’t fight it in the UI – log in via SSH and look at /var/log/sqoop2/sqoop2.log

16. If your jobs are failing, and in SQL Server Profiler on the SQL Server you’re querying you only see queries with “where … (1 = 0)” in them, check your firewall rules: those (1 = 0) queries are the metadata probes Sqoop runs up front (they return column info but zero rows) and they come from the Sqoop2 server, while the actual partitioned queries are issued by the mapper nodes – so all the nodes in the cluster need to be able to talk out to the SQL Server instance. Yeah, Sqoop will distribute the various partitioned queries across your cluster 🙂
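– i.e. the probe is your Table SQL statement with ${CONDITIONS} swapped out for (1 = 0), so it parses and returns metadata but no rows; with the made-up query from item 9 that would be: select transId, transDate, amount from dbo.Transactions where amount > 0 and (1 = 0)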