【分享】Hive基本使用
本帖最后由 爆炒小提莫 于 2019-4-3 16:55 编辑【官方】Apache Hive数据仓库软件有助于使用SQL读取,编写和管理驻留在分布式存储中的大型数据集。可以将结构投影到已存储的数据中。提供了命令行工具和JDBC驱动程序以将用户连接到Hive。
Hive 是大数据领域一个重要的数据仓库框架,能够处理存储在Hadoop的分布式文件系统上的结构化数据,将SQL转为MapReduce任务分配到Hadoop集群上并发的对海量数据进行计算,主流的离线计算框架。这是一种空间换时间的计算方式,数据量小的场景就不建议使用大数据了,传统数据库的优势更大,成本也没有大数据昂贵。
基本建表语句
create table t_a(name string, numb int)
row format delimited
fields terminated by ',';
create table t_b(name string, nick string)
row format delimited
fields terminated by ',';
准备两张表的数据
a.txt
a,1
b,2
c,3
d,4
b.txt
a,xx
b,yy
d,zz
e,pp
将a.txt导入到t_a表中,将b.txt导入到t_b表中
load data local inpath '/root/test/a.txt' into table t_a;
load data local inpath '/root/test/b.txt' into table t_b;
查询数据
select * from t_a;
select * from t_b;
各类join1. 内连接笛卡儿积
select
a.*, b.*
from t_a a
join t_b b;两个表的数据行没有指定怎么关联,两个表所有行都相互关联,产生笛卡尔积。左外连接和右外连接没有关联条件也会出现笛卡尔积现象。造成大量的开销,生成大量的数据集。+---------+---------+---------+---------+--+
| a.name| a.numb| b.name| b.nick|
+---------+---------+---------+---------+--+
| a | 1 | a | xx |
| b | 2 | a | xx |
| c | 3 | a | xx |
| d | 4 | a | xx |
| a | 1 | b | yy |
| b | 2 | b | yy |
| c | 3 | b | yy |
| d | 4 | b | yy |
| a | 1 | d | zz |
| b | 2 | d | zz |
| c | 3 | d | zz |
| d | 4 | d | zz |
| a | 1 | e | pp |
| b | 2 | e | pp |
| c | 3 | e | pp |
| d | 4 | e | pp |
+---------+---------+---------+---------+--+
指定join条件select a.*, b.*
from t_a a
join t_b b
on a.name =b.name;
+---------+---------+---------+---------+--+
| a.name| a.numb| b.name| b.nick|
+---------+---------+---------+---------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| d | 4 | d | zz |
+---------+---------+---------+---------+--+
1. 左外连接(左连接)
select
a.*, b.*
from t_a a
left join t_b b
on a.name =b.name;
左表的每一行数据与右表按条件关联,如果符合条件就返回左右两表的行数据,如果不符合关联条件,依旧返回左表的行数据,右表行数据返回null。
+---------+---------+---------+---------+--+
| a.name| a.numb| b.name| b.nick|
+---------+---------+---------+---------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| c | 3 | NULL | NULL |
| d | 4 | d | zz |
+---------+---------+---------+---------+--+
2. 右外连接(右连接)
select
a.*, b.*
from t_a a
right join t_b b
on a.name =b.name;
与左外连接相反
+---------+---------+---------+---------+--+
| a.name| a.numb| b.name| b.nick|
+---------+---------+---------+---------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| d | 4 | d | zz |
| NULL | NULL | e | pp |
+---------+---------+---------+---------+--+
3. 全外连接
select
a.*, b.*
from t_a a
full outer join t_b b
on a.name =b.name;
左表,右表之间关联,全外连接为左外连接与右外连接结果的并集。
+---------+---------+---------+---------+--+
| a.name| a.numb| b.name| b.nick|
+---------+---------+---------+---------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| c | 3 | NULL | NULL |
| d | 4 | d | zz |
| NULL | NULL | e | pp |
+---------+---------+---------+---------+--+
4.左半连接
左表的数据只返回符合和右表关联条件的数据,右表的数据不返回。
select a.*
from t_a a
left semi join t_b b
on a.name =b.name;
+---------+---------+--+
| a.name| a.numb|
+---------+---------+--+
| a | 1 |
| b | 2 |
| d | 4 |
+---------+---------+--+
注意:left semi join的select 字句中,不能有右表的字段
分组聚合查询
准备一个访问日志数据pv_url.log
【url在文档保存的时候被处理了,没法改过来,大家知道是url字符串就行,本来是没有杂的东西加在里面的】
192.168.88.2,https://www.52pojie.cn/2.php,2019-04-03 16:33:07
192.168.88.3,https://www.52pojie.cn/0.php,2019-04-03 16:33:06
192.168.88.2,https://www.52pojie.cn/4.php,2019-04-03 16:33:09
192.168.88.2,https://www.52pojie.cn/3.php,2019-04-03 16:33:06
192.168.88.0,https://www.52pojie.cn/0.php,2019-04-03 16:33:07
192.168.88.1,https://www.52pojie.cn/2.php,2019-04-03 16:33:05
192.168.88.0,https://www.52pojie.cn/1.php,2019-04-03 16:33:14
192.168.88.1,https://www.52pojie.cn/0.php,2019-04-03 16:33:06
192.168.88.3,https://www.52pojie.cn/3.php,2019-04-03 16:33:12
192.168.88.1,https://www.52pojie.cn/0.php,2019-04-03 16:33:07
192.168.88.0,https://www.52pojie.cn/4.php,2019-04-03 16:33:08
192.168.88.3,https://www.52pojie.cn/3.php,2019-04-03 16:33:13
192.168.88.3,https://www.52pojie.cn/4.php,2019-04-03 16:33:06
192.168.88.0,https://www.52pojie.cn/3.php,2019-04-03 16:33:07
192.168.88.2,https://www.52pojie.cn/4.php,2019-04-03 16:33:13
192.168.88.2,https://www.52pojie.cn/0.php,2019-04-03 16:33:05
192.168.88.0,https://www.52pojie.cn/3.php,2019-04-03 16:33:11
192.168.88.3,https://www.52pojie.cn/0.php,2019-04-03 16:33:07
192.168.88.0,https://www.52pojie.cn/3.php,2019-04-03 16:33:09
192.168.88.3,https://www.52pojie.cn/1.php,2019-04-03 16:33:14
将数据导入到hive中create table t_pv_log(ip string,url string, dt string)
row format delimited
fields terminated by ',';
load data local inpath '/root/test/pv_url.log' into table t_pv_log;
select * fromt_pv_log;
+---------------+-------------------------------+----------------------+--+
|t_pv_log.ip| t_pv_log.url | t_pv_log.dt |
+---------------+-------------------------------+----------------------+--+
| 192.168.88.2| https://www.52pojie.cn/2.php| 2019-04-03 16:33:07|
| 192.168.88.3| https://www.52pojie.cn/0.php| 2019-04-03 16:33:06|
| 192.168.88.2| https://www.52pojie.cn/4.php| 2019-04-03 16:33:09|
| 192.168.88.2| https://www.52pojie.cn/3.php| 2019-04-03 16:33:06|
| 192.168.88.0| https://www.52pojie.cn/0.php| 2019-04-03 16:33:07|
| 192.168.88.1| https://www.52pojie.cn/2.php| 2019-04-03 16:33:05|
| 192.168.88.0| https://www.52pojie.cn/1.php| 2019-04-03 16:33:14|
| 192.168.88.1| https://www.52pojie.cn/0.php| 2019-04-03 16:33:06|
| 192.168.88.3| https://www.52pojie.cn/3.php| 2019-04-03 16:33:12|
| 192.168.88.1| https://www.52pojie.cn/0.php| 2019-04-03 16:33:07|
| 192.168.88.0| https://www.52pojie.cn/4.php| 2019-04-03 16:33:08|
| 192.168.88.3| https://www.52pojie.cn/3.php| 2019-04-03 16:33:13|
| 192.168.88.3| https://www.52pojie.cn/4.php| 2019-04-03 16:33:06|
| 192.168.88.0| https://www.52pojie.cn/3.php| 2019-04-03 16:33:07|
| 192.168.88.2| https://www.52pojie.cn/4.php| 2019-04-03 16:33:13|
| 192.168.88.2| https://www.52pojie.cn/0.php| 2019-04-03 16:33:05|
| 192.168.88.0| https://www.52pojie.cn/3.php| 2019-04-03 16:33:11|
| 192.168.88.3| https://www.52pojie.cn/0.php| 2019-04-03 16:33:07|
| 192.168.88.0| https://www.52pojie.cn/3.php| 2019-04-03 16:33:09|
| 192.168.88.3| https://www.52pojie.cn/1.php| 2019-04-03 16:33:14|
+---------------+-------------------------------+----------------------+--+
求每条url的访问次数select url, count(1) --该表达式是对很好组的数据进行逐行运算
from t_pv_log
group by url;
0: jdbc:hive2://hdp:10000> select url, count(1) from t_pv_log group by url;
+-------------------------------+------+--+
| url | _c1|
+-------------------------------+------+--+
| https://www.52pojie.cn/0.php| 6 |
| https://www.52pojie.cn/1.php| 2 |
| https://www.52pojie.cn/2.php| 2 |
| https://www.52pojie.cn/3.php| 6 |
| https://www.52pojie.cn/4.php| 4 |
+-------------------------------+------+--+
内容通俗易懂,谢谢 不错不错 感谢分享 谢谢分享,支持好帖!! 谢谢分享,没大明白说的啥 谢谢楼主
页:
[1]