Importing MySQL table data into Hive with Sqoop

I recently started playing with Hadoop again, mainly because a vCenter host with 40 GB of RAM was sitting there going to waste. I set up a CDH5 cluster: 1 namenode, 1 secondary namenode, and 3 datanodes.

Then I learned how to import MySQL data into Hive; the tool used is Sqoop.


Installing the MySQL JDBC driver

The environment tested here is CDH5, so Sqoop is already set up. One thing to note: for licensing reasons the MySQL JDBC driver has to be downloaded separately and copied into the /var/lib/sqoop directory:

cd /dist/dist  
wget -c http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz  
cd /dist/src  
tar xvf /dist/dist/mysql-connector-java-5.1.36.tar.gz  
cp -v mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar /var/lib/sqoop  
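
A quick way to make sure the jar actually landed where Sqoop looks for it (same paths as above), and that the sqoop client itself runs:

ls -l /var/lib/sqoop/mysql-connector-java-5.1.36-bin.jar  
sqoop version  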

Granting privileges on the MySQL side

Sqoop connects to MySQL over JDBC, so the MySQL server has to grant the corresponding privileges:

# mysql -uroot -p<root_pass>
mysql> create user 'hadoop'@'192.168.1.2' IDENTIFIED BY 'hadoop';  
mysql> create user 'hadoop'@'192.168.1.3' IDENTIFIED BY 'hadoop';  
mysql> create user 'hadoop'@'192.168.1.4' IDENTIFIED BY 'hadoop';  
mysql> create user 'hadoop'@'192.168.1.5' IDENTIFIED BY 'hadoop';  
mysql> create user 'hadoop'@'192.168.1.6' IDENTIFIED BY 'hadoop';  
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.2' WITH GRANT OPTION;  
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.3' WITH GRANT OPTION;  
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.4' WITH GRANT OPTION;  
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.5' WITH GRANT OPTION;  
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.6' WITH GRANT OPTION;  
mysql> flush privileges;  
  • 192.168.1.2 - 192.168.1.6 are the Hadoop cluster machines.
  • 192.168.1.7 is the MySQL server.
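
With the grants in place, connectivity can be sanity-checked from one of the cluster nodes before running any import. sqoop list-databases uses the same JDBC driver and credentials, so if it works the import should be able to connect too (server address and hadoop user as above):

sudo -u hdfs sqoop list-databases --connect jdbc:mysql://192.168.1.7 --username hadoop --password hadoop  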

Load the production data into MySQL for testing:

mysql -uroot -p<root_pass>  
mysql> create database hd_test_db;  
mysql> quit;  
mysql -uroot -p<root_pass> hd_test_db < hd_test_db_struct.sql  
cd /data/backup/hd_test_db  
mysqlimport --local -uroot -p<root_pass> hd_test_db *.txt  

The table used is this one:

mysql -uroot -p<root_pass>  
mysql> use hd_test_db;  
mysql> select count(*) from items;  
+----------+
| count(*) |
+----------+
| 14822656 |
+----------+
1 row in set (0.00 sec)  

hive

Create a Hive database

# hive
hive> create database hd_test_db;  
hive> quit;  

sqoop
sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table items --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.items  

The usage of each option can be checked with sqoop help import.
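
One note on the flags: --hive-import triggers the load into Hive after the data lands in HDFS, --create-hive-table makes the job fail if the Hive table already exists, and --hive-overwrite replaces any existing data in the table. Also, --password puts the password on the command line and in the process list; Sqoop can prompt for it instead with -P. The same import with -P would look like:

sudo -u hdfs sqoop import -P --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --table items --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.items  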

After the import finishes, open Hive and check the data:

# sudo -u hdfs hive
hive> show databases;  
OK  
default  
hd_test_db  
Time taken: 0.674 seconds, Fetched: 2 row(s)  
hive> use hd_test_db;  
OK  
Time taken: 0.959 seconds  
hive> show tables;  
OK  
items  
Time taken: 0.533 seconds, Fetched: 1 row(s)  
hive> select count(*) from items;  
Total MapReduce jobs = 1  
Launching Job 1 out of 1  
Number of reduce tasks determined at compile time: 1  
In order to change the average load for a reducer (in bytes):  
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:  
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:  
  set mapred.reduce.tasks=<number>
Starting Job = job_1440428885404_0008, Tracking URL = http://hd1.hadoop.com:8088/proxy/application_1440428885404_0008/  
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1440428885404_0008  
Hadoop job information for Stage-1: number of mappers: 7; number of reducers: 1  
2015-08-30 21:45:24,588 Stage-1 map = 0%,  reduce = 0%  
2015-08-30 21:45:37,436 Stage-1 map = 10%,  reduce = 0%, Cumulative CPU 9.43 sec  
2015-08-30 21:45:38,501 Stage-1 map = 10%,  reduce = 0%, Cumulative CPU 9.43 sec  
2015-08-30 21:45:39,577 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:40,634 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:41,694 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:42,763 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:43,829 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:44,903 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:45,978 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:47,047 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:48,110 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:49,183 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 12.02 sec  
2015-08-30 21:45:50,259 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:51,319 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:52,388 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:53,462 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:54,538 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:55,650 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:56,718 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:57,801 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:58,861 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:45:59,928 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:46:00,994 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 21.97 sec  
2015-08-30 21:46:02,077 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 26.61 sec  
2015-08-30 21:46:03,132 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 33.64 sec  
2015-08-30 21:46:04,196 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 33.64 sec  
2015-08-30 21:46:05,260 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 33.64 sec  
2015-08-30 21:46:06,322 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 33.64 sec  
2015-08-30 21:46:07,382 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 33.64 sec  
2015-08-30 21:46:08,471 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 33.64 sec  
2015-08-30 21:46:09,538 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 35.34 sec  
2015-08-30 21:46:10,594 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 35.34 sec  
2015-08-30 21:46:11,653 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 35.34 sec  
2015-08-30 21:46:12,706 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 35.34 sec  
2015-08-30 21:46:13,786 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 35.34 sec  
2015-08-30 21:46:14,853 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 35.34 sec  
2015-08-30 21:46:15,921 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 37.94 sec  
2015-08-30 21:46:16,977 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 37.94 sec  
MapReduce Total cumulative CPU time: 37 seconds 940 msec  
Ended Job = job_1440428885404_0008  
MapReduce Jobs Launched:  
Job 0: Map: 7  Reduce: 1   Cumulative CPU: 37.94 sec   HDFS Read: 1575346947 HDFS Write: 9 SUCCESS  
Total MapReduce CPU Time Spent: 37 seconds 940 msec  
OK  
14822656  
Time taken: 72.172 seconds, Fetched: 1 row(s)  

Join queries

Import two more tables to try a join query:

# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table users --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.users
# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table pays --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.pays

To run these directly as the root user, a home directory for root needs to be created in HDFS first:

# sudo -u hdfs hadoop fs -mkdir /user/root
# sudo -u hdfs hadoop fs -chown root:root /user/root
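
To confirm the directory was created with the right owner:

# sudo -u hdfs hadoop fs -ls /user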

Run the join query in Hive:

# hive
hive> use hd_test_db;  
hive> describe users;  
OK  
pid                     bigint                  None  
user_id                 bigint                  None  
user_name               string                  None  
create_time             int                     None  
sex                     string                  None  
auto                    int                     None  
Time taken: 0.098 seconds, Fetched: 6 row(s)  
hive> describe pays;  
OK  
user_id                 bigint                  None  
user_pay                int                     None  
zero_dateline           int                     None  
Time taken: 0.079 seconds, Fetched: 4 row(s)  
hive> select t2.user_pay, t1.user_name from users t1 join pays t2 on t1.user_id = t2.user_id order by t2.user_pay desc limit 10;  

Elapsed time, with 26,464 rows in users and 54,316 rows in pays:

Time taken: 47.608 seconds, Fetched: 10 row(s)  
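
With the same schema shown by describe above, a slightly more interesting join, e.g. total payment per user, would look roughly like this (a sketch, not run here):

hive> select t1.user_name, sum(t2.user_pay) as total_pay from users t1 join pays t2 on t1.user_id = t2.user_id group by t1.user_name order by total_pay desc limit 10;  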

Of course this doesn't show off Hadoop's strength at all. I'll find time to test with a much bigger table.