Integrating Spark with Hive + HBase for Real-Time Insertion and Real-Time Query
Published: 2019-06-22



    Preliminaries

        Versions used: Spark 2.0.1, Hive 1.2.1, HBase 1.2.4, Hadoop 2.6.0, and ZooKeeper 3.4.9.

        Installing these dependencies is not covered again here; see my earlier posts or search online if needed. This post focuses on configuration.

    HBase

        HBase needs no special configuration; just start it normally.

    Hadoop

        Hadoop likewise needs no special configuration; just start it normally.

    Hive

        Edit hive-env.sh and add the HBASE_HOME variable:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hive and Hadoop environment variables here. These variables can be used
# to control the execution of Hive. It should be used by admins to configure
# the Hive installation (so that users do not have to set environment variables
# or set command line parameters to get correct behavior).
#
# The hive service being invoked (CLI/HWI etc.) is available via the environment
# variable SERVICE

# Hive Client memory usage can be an issue if a large number of clients
# are running at the same time. The flags below have been useful in
# reducing memory usage:
#
# if [ "$SERVICE" = "cli" ]; then
#   if [ -z "$DEBUG" ]; then
#     export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
#   else
#     export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
#   fi
# fi

# The heap size of the jvm started by the hive shell script can be controlled via:
#
# export HADOOP_HEAPSIZE=1024
#
# Larger heap size may be required when running queries over large number of files or partitions.
# By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be
# appropriate for hive server (hwi etc).

# Set HADOOP_HOME to point to a specific hadoop install directory
export HADOOP_HOME=${HADOOP_HOME}
export HBASE_HOME=/opt/hbase/hbase-1.2.4
# export HIVE_CLASSPATH=$HIVE_CLASSPATH:/opt/hive/apache-hive-1.2.1-bin/lib/*

# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=${HIVE_HOME}/conf

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=

        Edit hive-site.xml and add the HBase-related properties:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop-n,hadoop-d1,hadoop-d2</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
  <description>Property from ZooKeeper's config zoo.cfg. The port at which the clients will connect.</description>
</property>
<property>
  <name>hbase.master</name>
  <value>hadoop-n:60000</value>
</property>

    Spark

        Copy the jars listed below from the HBase installation directory into Spark's classpath ($SPARK_HOME/jars in Spark 2.x). Do not take the shortcut of adding HBase's entire classpath in spark-env.sh; that will prevent Spark from starting. A copy sketch follows the list.

hbase-protocol
hbase-common
hbase-client
hbase-server
hive-hbase-handler-2.1.0
htrace-core
metrics-core
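        A minimal sketch of the copy step, assuming HBase lives at /opt/hbase/hbase-1.2.4 (the HBASE_HOME set in hive-env.sh above) and that Spark 2.x loads extra jars from $SPARK_HOME/jars; the exact jar versions will differ per installation:

# Sketch only: paths and jar versions are assumptions, adjust to your layout.
HBASE_LIB=/opt/hbase/hbase-1.2.4/lib
for j in hbase-protocol hbase-common hbase-client hbase-server htrace-core metrics-core; do
  cp "$HBASE_LIB/$j"-*.jar "$SPARK_HOME/jars/"
done
# hive-hbase-handler normally ships with Hive rather than HBase:
cp "$HIVE_HOME"/lib/hive-hbase-handler-*.jar "$SPARK_HOME/jars/"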

    Testing

        1. Create a table in HBase and insert three rows:

create 'hbase_test', {NAME => 'cf1'}
put 'hbase_test','a','cf1:v1','1'
put 'hbase_test','b','cf1:v1','2'
put 'hbase_test','c','cf1:v1','3'
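        As a quick sanity check (not in the original post), the rows can be inspected from the HBase shell before moving on:

scan 'hbase_test'
count 'hbase_test'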

        

        2. Create the mapped external table in Hive:

create external table hbase_test(key string, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:v1")
TBLPROPERTIES ("hbase.table.name" = "hbase_test");
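        With the mapping in place, writes issued through Hive land directly in the underlying HBase table, which is the insert half of the title. A hedged sketch, not in the original post; INSERT support for storage-handler tables varies by Hive version, so treat this as illustrative:

-- Illustrative only: appends one row through the Hive/HBase mapping;
-- it becomes visible to HBase scans and Spark SQL queries immediately.
INSERT INTO TABLE hbase_test SELECT 'd', '4';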

    

        3. Start spark-sql and run a query:

cd $SPARK_HOME/bin
./spark-sql
spark-sql> select * from hbase_test;
16/11/18 11:20:48 INFO execution.SparkSqlParser: Parsing command: select * from hbase_test
...
16/11/18 11:20:50 INFO hbase.HBaseStorageHandler: Configuring input job properties
16/11/18 11:20:50 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x165634aa connecting to ZooKeeper ensemble=localhost:2181
16/11/18 11:20:50 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x158751d4c19000d, negotiated timeout = 40000
16/11/18 11:20:50 INFO util.RegionSizeCalculator: Calculating region sizes for table "hbase_test".
...
16/11/18 11:20:50 INFO scheduler.DAGScheduler: Got job 3 (processCmd at CliDriver.java:376) with 1 output partitions
16/11/18 11:20:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, 10.5.3.101, partition 0, ANY, 5544 bytes)
16/11/18 11:20:51 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 509 ms on 10.5.3.101 (1/1)
16/11/18 11:20:51 INFO scheduler.DAGScheduler: Job 3 finished: processCmd at CliDriver.java:376, took 0.611485 s
a	1
b	2
c	3
Time taken: 2.33 seconds, Fetched 3 row(s)
spark-sql>
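        To exercise the real-time path end to end, one could (a sketch, not from the original transcript) add a row from the HBase shell and immediately re-run the query in the still-open spark-sql session; every query scans HBase directly, so no refresh or reload step is needed:

hbase(main):001:0> put 'hbase_test','d','cf1:v1','4'

spark-sql> select * from hbase_test;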

    Caveat

        All the dependencies in this example run on three virtual machines with only 2 GB of RAM each, so this setup is suitable only for verifying that the software pipeline works, not for performance testing. None of the timings shown above should be used as a performance reference.

        

Reposted from: https://my.oschina.net/shyloveliyi/blog/790227
