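I. What Is Hive

Hive is a data warehouse tool built on Hadoop and intended for offline (batch) analysis. It maps structured data files to database tables and provides SQL-like queries: it accepts a SQL statement from the user, translates it into MapReduce programs that query and process the data on HDFS, and then returns the result or writes it to HDFS.

Key points: Hive uses HDFS to store the data files, MapReduce to run the analysis, and SQL as the query interface it offers to users.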
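
To make the SQL-to-MapReduce idea concrete, the sketch below mimics in plain Python how a query such as select city,sum(price*amount) from t1 group by city could be evaluated as a map phase, a shuffle, and a reduce phase. It only illustrates the idea and is not Hive's actual execution plan; the sample rows and the function names (map_phase, shuffle, reduce_phase) are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical rows of a comma-delimited data file: id,city,price,amount
LINES = [
    "1,beijing,10.0,3",
    "2,shanghai,20.0,1",
    "3,beijing,5.0,4",
]

def map_phase(lines):
    """Emit one (key, value) pair per input line: (city, price * amount)."""
    for line in lines:
        _id, city, price, amount = line.split(",")
        yield city, float(price) * int(amount)

def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group; here the aggregate is sum()."""
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(LINES)))
print(result)
```

The group by column plays the role of the shuffle key, and the aggregate function is what each reducer computes over its group.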

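II. Installing and Configuring Hive

1.1. Using the embedded Derby database as the metastore

(1) The machine that runs Hive needs a Hadoop environment (an installation directory and the HADOOP_HOME environment variable). (2) Unpacking a Hive release archive is all that is required. A Hive instance installed this way records its metadata in the embedded Derby database. This mode is inconvenient in practice, because embedded Derby allows only one active session at a time.

1.2. Using MySQL as the metastore

Taking a Linux installation as an example: (1) upload the MySQL installation package; (2) unpack it; (3) install the MySQL server package; (4) install the MySQL client package; (5) start the MySQL service; (6) change the initial password; (7) test the login. Note: MySQL must be configured to accept remote logins.

2. Edit the Hive configuration file

Edit hive-site.xml: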

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value><!-- MySQL user name --></value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><!-- MySQL password --></value>
</property>
</configuration>

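3. Upload a MySQL JDBC driver JAR to the lib directory of the Hive installation

4. Add HIVE_HOME to the system environment variables in /etc/profile

5. Reload the profile: source /etc/profile

6. Start Hive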
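
Before starting Hive it can help to confirm that hive-site.xml is well-formed and that the four metastore connection properties from step 2 are all set. The following is a minimal sketch using only the Python standard library; the helper names (read_properties, missing_properties) and the sample user name hive_user are hypothetical, and the sample deliberately omits the password property to show what the check reports.

```python
import xml.etree.ElementTree as ET

# The four metastore connection properties used in hive-site.xml.
REQUIRED = [
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
]

def read_properties(xml_text):
    """Return {name: value} for every <property> directly under <configuration>."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props

def missing_properties(props):
    """List the required properties that are absent or empty."""
    return [name for name in REQUIRED if not props.get(name)]

# Hypothetical sample: hive_user is a placeholder, and the password is left out on purpose.
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive_user</value>
  </property>
</configuration>"""

print(missing_properties(read_properties(SAMPLE)))
```

ET.fromstring raises a ParseError on unbalanced tags, so a file with a stray closing tag is caught before Hive ever reads it.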

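III. Three Ways to Run Hive

1. Start an interactive Hive query shell

bin/hive

hive>

2. Start Hive as a network service, then connect to it with the beeline client to run queries

Start the service: bin/hiveserver2

Start the client and connect to the Hive service: bin/beeline -u jdbc:hive2://slave2:10000 -n root

3. Run Hive from a script

When there are many Hive queries to run, scripting them is more efficient. The core of this mechanism is that Hive can execute a given HQL statement as a one-off command. For example: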

vi t_t.sh

#!/bin/bash

hive -e "insert into table t_max select gender,max(age) from t_1 group by gender"

hive -e 'create table t_sum as select id,sum(amount) from t_1 group by id'
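
When the list of statements is generated by a program rather than typed by hand, the same one-off mechanism can be driven from Python. The sketch below only builds the command lines and prints them; it does not run Hive. The function name hive_commands is hypothetical, and the commented-out subprocess call assumes the hive executable is on PATH.

```python
import shlex

def hive_commands(statements):
    """Build one `hive -e` argument list per HQL statement (nothing is executed)."""
    return [["hive", "-e", stmt] for stmt in statements]

STATEMENTS = [
    "create table t_sum as select id,sum(amount) from t_1 group by id",
]

for argv in hive_commands(STATEMENTS):
    # shlex.join quotes each argument so the line can be pasted into a shell script.
    print(shlex.join(argv))
    # To actually run it on a machine with Hive installed (assumption):
    # subprocess.run(argv, check=True)
```

Passing the statement as a single list element avoids the quoting mistakes that are easy to make when the HQL itself contains quotes.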

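IV. Creating Databases and Tables and Loading Data (DDL)

1. Create a database: create database hello; Database directory: /user/hive/warehouse/hello.db

Hive has one default database. Name: default. Directory: /user/hive/warehouse

2. Create tables:

Managed (internal) table: the table directory is laid out according to Hive's conventions and lives under the Hive warehouse directory /user/hive/warehouse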

create table t1(id int,name string,age int)

row format delimited fields terminated by ',';

External table: the table directory is specified by the user who creates the table

create external table t2(id int,name string,age int)

row format delimited fields terminated by ','

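location '/xx/xx/';

Note: dropping a managed table deletes both the table directory and the table's metadata.

Dropping an external table deletes only the table's metadata; the table directory remains.

3. Drop a table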

drop table t1;

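4. Load data into a table

In fact, placing a data file in the table directory is enough: hadoop fs -put ......

Hive commands:

If the file is on the local disk of the Hive machine (the file is copied into the table directory): load data local inpath '/home/t_t.dat' into table t1;

If the file is on HDFS (the file is moved into the table directory): load data inpath '/t_t.dat' into table t1;

Reminder: Hive does not check or constrain the data that users load. The table definition is applied only when the data is read.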
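
Because nothing is validated at load time, malformed values only show up when a query reads them. The sketch below imitates, in plain Python, how a delimited text line could be interpreted against the schema of t1(id int,name string,age int). It assumes the usual behavior of delimited text tables, where a value that does not fit the declared type, or a missing trailing field, is read back as NULL (None here). The helper names parse_int and read_row and the sample lines are hypothetical.

```python
def parse_int(text):
    """Return an int, or None when the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None

# Column name -> converter, mirroring: create table t1(id int,name string,age int)
SCHEMA = [("id", parse_int), ("name", str), ("age", parse_int)]

def read_row(line, delimiter=","):
    """Interpret one text line against SCHEMA at read time."""
    fields = line.split(delimiter)
    row = {}
    for index, (column, convert) in enumerate(SCHEMA):
        # Missing trailing fields become None (NULL); extra fields are ignored.
        row[column] = convert(fields[index]) if index < len(fields) else None
    return row

print(read_row("1,zhangsan,28"))
print(read_row("2,lisi,unknown"))
print(read_row("3"))
```

A practical consequence is that a count of NULLs in a column is a quick way to spot bad input after loading.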

5. Alter a table definition

Rename a table: alter table table_name rename to new_table_name

alter table t1 rename to t2;

Change a column's name or type: alter table table_name change [column] col_old_name col_new_name column_type [comment col_comment] [first|(after column_name)]

alter table t1 change id oid float first;

Add or replace columns: alter table table_name add|replace columns (col_name data_type [comment col_comment], ...)

alter table t1 add columns (gender string,phone int);

alter table t1 replace columns (id int,age int,name string);

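6. Partitioned tables

Partition subdirectories are created under the table directory to hold the data files. At query time the MapReduce program can then process only the data in the specified partition subdirectories, which narrows the range of data read and improves efficiency.

Creating a partitioned managed table: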

create table t1(id int,uid string,price float,amount int)

partitioned by (day string,city string)

row format delimited fields terminated by ',';

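Map an existing directory to a partition of the table:

alter table t2_ex add partition(day='2018-11-25') location '/2018-11-25';

Note: a partition column must not be a column that already exists in the table definition

Load data into a specific partition:

load data [local] inpath '/home/t_t.txt' into table t2 partition(day='2018-11-25',city='shengzheng');

Query by partition:

select count(*) from t2 where day='2018-11-25' and city='shengzheng';

Partition columns can be used like ordinary table columns, so a where clause can select the partition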
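
The effect of partitioning on the directory layout can be sketched in a few lines of Python. Each partition column adds one column=value directory level under the table directory, and a where clause on partition columns lets the query skip every directory that does not match. The table directory used here and the helper names partition_dir and prune are assumptions for illustration; this is not how Hive implements pruning internally.

```python
import posixpath

def partition_dir(table_dir, spec):
    """Build the partition subdirectory, one `column=value` level per partition column."""
    parts = [f"{column}={value}" for column, value in spec]
    return posixpath.join(table_dir, *parts)

def prune(partition_dirs, wanted):
    """Keep only directories whose path contains every wanted `column=value` level."""
    keep = []
    for path in partition_dirs:
        levels = set(path.split("/"))
        if all(f"{c}={v}" in levels for c, v in wanted.items()):
            keep.append(path)
    return keep

TABLE_DIR = "/user/hive/warehouse/t2"  # assumed location of a managed table t2
dirs = [
    partition_dir(TABLE_DIR, [("day", "2018-11-25"), ("city", "shengzheng")]),
    partition_dir(TABLE_DIR, [("day", "2018-11-25"), ("city", "beijing")]),
    partition_dir(TABLE_DIR, [("day", "2018-11-26"), ("city", "shengzheng")]),
]
print(prune(dirs, {"day": "2018-11-25", "city": "shengzheng"}))
```

Only one of the three directories survives the filter, which is why a partition predicate reduces the amount of data a job has to read.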

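7. Create a table from an existing table

1. create table t_t2 like t_t1; The new table t_t2 has the same structure as the source table t_t1 but contains no data

2. create table t_t2 as select id,name from t_t1; The table is created from the columns of the select query, and the query result is inserted into the new table

8. Export table data to files under a given path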

insert overwrite [local] directory '......'

row format delimited fields terminated by ','

select * from t1;

With local, the data is exported to the local disk; without it, the data is exported to HDFS

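V. SQL Syntax

SQL computation model 1: the row-by-row model (per-row expressions, per-row filtering). Example: select id,upper(name),age from t1;

SQL computation model 2: the grouped model (per-group expressions, per-group filtering)

Example: select gender,avg(money) from t1 where money >=1000 group by gender having avg(age) <= 23;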
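
The order in which the clauses of a grouped query take effect is easy to get wrong, so here is a plain-Python sketch of the grouped example above (where money >= 1000, group by gender, having avg(age) <= 23, output avg(money) per group). The sample rows and the function name grouped_query are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of t1
ROWS = [
    {"gender": "f", "money": 1200, "age": 22},
    {"gender": "f", "money": 1800, "age": 24},
    {"gender": "m", "money": 2000, "age": 30},
    {"gender": "m", "money": 500, "age": 18},
]

def grouped_query(rows):
    # 1. where: row-by-row filter, applied before grouping
    kept = [r for r in rows if r["money"] >= 1000]
    # 2. group by gender
    groups = defaultdict(list)
    for r in kept:
        groups[r["gender"]].append(r)
    # 3. having: filter on whole groups, using an aggregate over each group
    result = {}
    for gender, members in groups.items():
        avg_age = sum(m["age"] for m in members) / len(members)
        if avg_age <= 23:
            # 4. select: one output row per surviving group
            result[gender] = sum(m["money"] for m in members) / len(members)
    return result

print(grouped_query(ROWS))
```

Note that the row with money 500 is removed by where before the group averages are computed, so it never influences the having test.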

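How SQL joins work: a join combines the rows of several tables into one table that serves as the input data set of the query. Hive versions before 2.2.0 do not support non-equi join conditions.

Cartesian product join. Example: select a.*,b.* from a join b;

Inner join. Example: select a.*,b.* from a join b on a.id = b.id;

Left outer join: every row of the left table appears in the query's input data set. Example: select a.*,b.* from a left join b on a.id = b.id;

Right outer join: every row of the right table appears in the query's input data set. Example: select a.*,b.* from a right join b on a.id = b.id;

Full outer join: every row of both tables appears in the query's input data set. Example: select a.*,b.* from a full join b on a.id = b.id;

Left semi join: a Hive extension. Rows are matched by the same rule as an inner join, but only the left table's columns are returned as the query's input set

Equivalent to: select a.* from a where id in (select distinct id from b);

Example: select a.id,a.name from a left semi join b on a.id = b.id;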
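
The difference between an inner join and a left semi join shows up when the right table has several rows with the same key. The following plain-Python sketch, with hypothetical sample tables A and B, contrasts the two.

```python
# Hypothetical sample tables; B contains id 1 twice.
A = [{"id": 1, "name": "a1"}, {"id": 2, "name": "a2"}, {"id": 3, "name": "a3"}]
B = [{"id": 1, "tag": "x"}, {"id": 1, "tag": "y"}, {"id": 3, "tag": "z"}]

def inner_join(left, right):
    """One output row per matching (left, right) pair."""
    return [
        (lrow["id"], lrow["name"], rrow["tag"])
        for lrow in left
        for rrow in right
        if lrow["id"] == rrow["id"]
    ]

def left_semi_join(left, right):
    """Each left row at most once, and only left-side columns."""
    right_ids = {rrow["id"] for rrow in right}
    return [(lrow["id"], lrow["name"]) for lrow in left if lrow["id"] in right_ids]

print(inner_join(A, B))
print(left_semi_join(A, B))
```

The inner join repeats the row with id 1 once per match in B, while the semi join returns it a single time, which is the behavior of the equivalent in (subquery) form.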

Subquery: in essence, the result set of one select query becomes the input data set of the next query

select a.city,a.city_sum

from

(select city,sum(price*amount) as city_sum

from t1

group by city) a

where a.city_sum>300;

order by sorting: order by is always written at the end of a select statement, just before limit. limit n restricts the number of rows the select returns.

select city,sum(price*amount) as city_sum

from t1

group by city

order by city_sum asc

limit 2;

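in filter clause: select a.* from a where id in (select distinct id from b);

distinct keyword for removing duplicates: no expression may appear before distinct, and all the expressions after distinct are de-duplicated as one combination

VI. Data Types

1. Numeric types

tinyint (1-byte integer); smallint (2-byte integer); int/integer (4-byte integer); bigint (8-byte integer); float (4-byte single-precision floating point); double (8-byte double-precision floating point)

2. String types

string; varchar(20) (declared length from 1 to 65535, longer values are truncated); char (fixed-length string, maximum length 255)

3. BOOLEAN

true; false

4. Date and time types

timestamp (a value holding year, month, day, hour, minute, second, and fractional seconds); date (year, month, and day only)

5. array type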

array<data_type>

6. map type

map<primitive_type, data_type>

7. struct type

struct<col_name : data_type, ...>

A struct fits the case where one column should describe a complete user record

VII. Common Built-in Functions

1. Type conversion functions

select cast("10" as int) ;

select cast("2018-11-25" as date) ;

select cast(current_timestamp as date);

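2. Math functions

select round(2.5); ## 3, rounds half up

select round(2.2315,3) ; ## 2.232

select ceil(2.2) ; // select ceiling(2.2) ; ## 3, rounds up

select floor(2.2); ## 2, rounds down

select abs(-2.2) ; ## 2.2, absolute value

select greatest(id1,id2,id3) ; ## row-level function: the largest of the input arguments

select least(2,3,7) ; ## row-level function: the smallest of the input arguments

3. String functions

upper(string str) ## convert to upper case

lower(string str) ## convert to lower case

substr(string str, int start) ## extract a substring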

substring(string str, int start)

substr(string, int start, int len)

substring(string, int start, int len)

concat(string A, string B...) ## concatenate strings

concat_ws(string SEP, string A, string B...)

length(string A)

split(string str, string pat) ## split a string; returns an array

Note: select split("192.168.33.44",".") ; is wrong, because . is a special character in regular-expression syntax

select split("192.168.33.44","\\.") ;
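
The same pitfall can be reproduced with any regular-expression engine. The sketch below uses Python's re module as a stand-in for the Java regular expressions that Hive relies on, so the exact shape of the "wrong" result may differ from Hive's, but the cause is the same: an unescaped dot matches every character.

```python
import re

ip = "192.168.33.44"

# Unescaped "." is a regex wildcard: every character is treated as a separator,
# so no field of the address survives.
unescaped = re.split(".", ip)
print(any(part for part in unescaped))

# Escaped "\." matches a literal dot.
escaped = re.split(r"\.", ip)
print(escaped)
```

In the HQL string literal the backslash itself has to be escaped, which is why the Hive query writes "\\." where the Python raw string writes r"\.".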

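4. Date and time functions

select current_timestamp; ## return type: timestamp; the current timestamp (full date and time)

select current_date; ## return type: date; the current date

Unix timestamp to formatted string: from_unixtime

from_unixtime(bigint unixtime[, string format])

Example: select from_unixtime(unix_timestamp());

select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");

Formatted string to Unix timestamp: unix_timestamp. The return value is a long integer

With no arguments it returns the current time as a timestamp in seconds (the number of seconds elapsed since 1970-01-01 00:00:00 UTC)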

select unix_timestamp();

unix_timestamp(string date, string pattern)

Example: select unix_timestamp("2018-11-25 10:00:00");

select unix_timestamp("2018-08-10 10:00:00","yyyy-MM-dd HH:mm:ss");

Convert a string to a date

select to_date("2018-11-25 10:00:00");
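
The round trip between a formatted string and a Unix timestamp can be sketched with Python's datetime module. The two functions below are hypothetical Python stand-ins that borrow the names unix_timestamp and from_unixtime; they assume UTC, whereas Hive interprets such strings in its own configured time zone, so absolute values may differ on a real cluster. The pattern letters also differ: yyyy-MM-dd HH:mm:ss in Hive corresponds to %Y-%m-%d %H:%M:%S in Python.

```python
from datetime import datetime, timezone

PATTERN = "%Y-%m-%d %H:%M:%S"  # Python spelling of yyyy-MM-dd HH:mm:ss

def unix_timestamp(text, pattern=PATTERN):
    """Formatted string -> seconds since 1970-01-01 00:00:00 (UTC assumed)."""
    parsed = datetime.strptime(text, pattern).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

def from_unixtime(seconds, pattern=PATTERN):
    """Seconds since the epoch -> formatted string (UTC assumed)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(pattern)

ts = unix_timestamp("2018-11-25 10:00:00")
print(ts)
print(from_unixtime(ts))
print(from_unixtime(ts, "%Y/%m/%d %H:%M:%S"))
```

Converting a string to a timestamp and back with the same pattern returns the original string, which is a convenient sanity check for a format pattern.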

5. Conditional functions

if: select id,if(age>18,'adult','children') from t1;

case when:

case

when condition1 then result1

when condition2 then result2

...

when conditionn then resultn

end

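6. Collection functions

array(5,4,6,1) constructs an integer array

array('hello','hi','nihao') constructs a string array

array_contains(Array<T>, value) tests whether the array contains the value; returns a boolean

sort_array(Array<T>) returns the sorted array

size(Array<T>) returns the length of an array, as an int

size(Map<K,V>) returns the number of entries in a map, as an int

map_keys(Map<K,V>) returns all keys of a map column; the result is an array

map_values(Map<K,V>) returns all values of a map column; the result is an array

7. Common grouped aggregate functions

sum(column): the sum of the column's values within a group

avg(column): the average of the column's values within a group

max(column): the largest of the column's values within a group

min(column): the smallest of the column's values within a group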

