Preliminary ideas on integrating Hive with TSDB

Hive can read data from HBase via StorageHandlers, but I haven't had time to study that yet. So, building on what I've recently learned about Hive, I'll first try exporting TSDB data to a text file and then loading that file into a Hive table.


TSDB data query interface

Use the tsdb query command:

# ./tsdb query
Not enough arguments.  
Usage: query [Gnuplot opts] START-DATE [END-DATE] <query> [queries...]  
A query has the form:  
  FUNC [rate] [counter,max,reset] [downsample FUNC N] SERIES [TAGS]
For example:  
 2010/03/11-20:57 sum my.awsum.metric host=blah sum some.other.metric host=blah state=foo
Dates must follow this format: YYYY/MM/DD-HH:MM[:SS] or Unix Epoch  
 or relative time such as 1y-ago, 2d-ago, etc.
Supported values for FUNC: [min, mimmin, max, mimmax, dev, sum, avg, zimsum]  
Gnuplot options are of the form: +option=value  
  --config=PATH    Path to a configuration file (default: Searches for file see docs).
  --graph=BASEPATH Output data points to a set of files for gnuplot.  The path of the output files will start with BASEPATH.
  --table=TABLE    Name of the HBase table where to store the time series (default: tsdb).
  --uidtable=TABLE Name of the HBase table to use for Unique IDs (default: tsdb-uid).
  --verbose        Print more logging messages and not just errors.
  --zkbasedir=PATH Path under which is the znode for the -ROOT- region (default: /hbase).
  --zkquorum=SPEC  Specification of the ZooKeeper quorum to use (default: localhost).
  -v               Short for --verbose. 

Dump the output to a text file; the format looks like this:

metric timestamp value {tag1=tag_value1, tag2=tag_value2, ...}  
xxxx.yyyy 1441411568000 25866 {unit=u_entry, project=xxxx, role=yyyy}  
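Since the tag string is the only field that itself contains spaces, each line can be parsed by splitting on the first three spaces only. A minimal Python sketch (the sample line is the one from the dump above):

```python
# Split a tsdb dump line into metric, timestamp, value and tag string.
# Only the first three spaces are field separators; everything after
# them (the tag string, which contains spaces) stays in one piece.
def parse_dump_line(line):
    metric, ts, value, tags = line.strip().split(" ", 3)
    return metric, int(ts), float(value), tags

print(parse_dump_line(
    "xxxx.yyyy 1441411568000 25866 {unit=u_entry, project=xxxx, role=yyyy}"))
```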

Text format conversion

The field delimiter needs to be changed from a space to a tab (\t). Note that the tags are also space-separated, so a global sed replacement won't work; replace only the first three spaces:

sed -i -e 's/ /\t/1' -e 's/ /\t/1' -e 's/ /\t/1' /data/dump.txt  
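The same conversion can be written in Python, which makes the intent (replace only the first three spaces) explicit. A sketch, not a replacement for the sed one-liner:

```python
# Convert one dump line to tab-separated fields. split(" ", 3) cuts at
# the first three spaces only, so spaces inside the tag string survive.
def to_tsv(line):
    return "\t".join(line.rstrip("\n").split(" ", 3))

line = "xxxx.yyyy 1441411568000 25866 {unit=u_entry, project=xxxx, role=yyyy}"
print(to_tsv(line))
```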

Creating the Hive table
hive> create table tsdb (metric string, ts bigint, value float, tags string)  
    > row format delimited
    > fields terminated by '\t';
OK  
Time taken: 0.756 seconds  
hive> show tables;  
OK  
tsdb  
Time taken: 0.386 seconds, Fetched: 1 row(s)  
hive> describe extended tsdb;  
OK  
metric                  string                  None  
ts                      bigint                  None  
value                   float                   None  
tags                    string                  None

Detailed Table Information    Table(tableName:tsdb, dbName:default, owner:root, createTime:1441624414, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:metric, type:string, comment:null), FieldSchema(name:timestamp, type:int, comment:null), FieldSchema(name:value, type:float, comment:null), FieldSchema(name:tags, type:string, comment:null)], location:hdfs://hd1.mc.com:8020/user/hive/warehouse/tsdb, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= , field.delim=  
Time taken: 0.104 seconds, Fetched: 6 row(s)  

Load the text file into the Hive table:

hive> load data local inpath "/data/dump.txt" into table tsdb;  
Copying data from file:/data/dump.txt  
Copying file: file:/data/dump.txt  
Loading data to table default.tsdb  
chgrp: changing ownership of '/user/hive/warehouse/tsdb/dump.txt': User does not belong to hive  
Table default.tsdb stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1288241, raw_data_size: 0]  
OK  
Time taken: 1.712 seconds  

Verifying the import
# wc -l /data/dump.txt
13561 /data/dump.txt  
# awk -F'\t' '{if($3>max)max=$3}END{print max}' /data/dump.txt
40329  
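The same two checks can be reproduced in Python over the converted file; a sketch using hypothetical in-memory sample rows rather than the real /data/dump.txt:

```python
# Count lines and find the maximum of the third (value) column,
# mirroring the wc -l and awk checks above.
def summarize(lines):
    n, max_v = 0, float("-inf")
    for line in lines:
        n += 1
        v = float(line.split("\t")[2])
        if v > max_v:
            max_v = v
    return n, max_v

# Hypothetical sample rows in the converted tab-separated format.
sample = [
    "xxxx.yyyy\t1441411568000\t25866\t{unit=u_entry}",
    "xxxx.yyyy\t1441411570000\t40329\t{unit=u_entry}",
]
print(summarize(sample))  # (2, 40329.0)
```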

Check whether Hive returns the same results:

hive> select count(1) from tsdb;  
...
MapReduce Jobs Launched:  
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 5.54 sec   HDFS Read: 1288470 HDFS Write: 6 SUCCESS  
Total MapReduce CPU Time Spent: 5 seconds 540 msec  
OK  
13561  
Time taken: 39.406 seconds, Fetched: 1 row(s)  
hive> select max(value) from tsdb;  
...
MapReduce Jobs Launched:  
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 6.24 sec   HDFS Read: 1288470 HDFS Write: 8 SUCCESS  
Total MapReduce CPU Time Spent: 6 seconds 240 msec  
OK  
40329.0  
Time taken: 38.202 seconds, Fetched: 1 row(s)  

The numbers match, so the import into Hive went fine.


Impala

Hive is still too slow for this; let's test with Impala:

[root@hd2 ~]# impala-shell
Starting Impala Shell without Kerberos authentication  
Connected to hd2.mc.com:21000  
Server version: impalad version 1.4.0-cdh5-INTERNAL RELEASE (build e801bd8c0d134e783c2313c7dd422a5ad06591af)  
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.4.0-cdh5-INTERNAL (e801bd8) built on Sat Jul 12 06:45:04 PDT 2014)
[hd2.mc.com:21000] > invalidate metadata;
Query: invalidate metadata

Returned 0 row(s) in 1.03s  
[hd2.mc.com:21000] > show tables;
Query: show tables  
+------+
| name |
+------+
| tsdb |
+------+
Returned 1 row(s) in 0.01s  
[hd2.mc.com:21000] > select count(1) from tsdb;
Query: select count(1) from tsdb  
+----------+
| count(1) |
+----------+
| 13561    |
+----------+
Returned 1 row(s) in 0.26s  
[hd2.mc.com:21000] > select max(value) from tsdb;
Query: select max(value) from tsdb  
+------------+
| max(value) |
+------------+
| 40329      |
+------------+
Returned 1 row(s) in 0.32s  

Much faster, as expected.