Summary of SparkSQL-Related Statements (PDF download)


Download (compiled by this site):

Link: https://pan.baidu.com/s/1fqOhLH_3PHHEEEWEOHAHow

Extraction code: 1234


Main content:

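1. in does not support subqueries, e.g. select * from src where key in (select key from test);

in with a literal list of values is supported, e.g. select * from src where key in (1,2,3,4,5);

An in list of 40,000 values took 25.766 seconds.

An in list of 80,000 values took 78.827 seconds.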

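2. union all / union

A top-level union all is not supported, e.g. select key from src UNION ALL select key from test;

It is supported when wrapped in a subquery: select * from (select key from src union all select key from test) aa;

union (which removes duplicates) is not supported.

The supported equivalent is: select distinct key from (select key from src union all select key from test) aa;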
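The claim that select distinct over a wrapped union all gives the same result as union can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption made only so the example is runnable; SQLite accepts the top-level union that the summary says is unsupported). The table names follow the text, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (key INTEGER);
    CREATE TABLE test (key INTEGER);
    INSERT INTO src VALUES (1), (2), (2);   -- made-up sample rows
    INSERT INTO test VALUES (2), (3);
""")

wrapped = "(SELECT key FROM src UNION ALL SELECT key FROM test) aa"

# union all wrapped in a subquery keeps every row, duplicates included.
union_all_keys = sorted(r[0] for r in conn.execute(f"SELECT * FROM {wrapped}"))

# select distinct over the wrapped union all removes the duplicates.
distinct_keys = sorted(r[0] for r in conn.execute(f"SELECT DISTINCT key FROM {wrapped}"))

# A plain union, for comparison.
union_keys = sorted(
    r[0] for r in conn.execute("SELECT key FROM src UNION SELECT key FROM test")
)

print(union_all_keys)
print(distinct_keys == union_keys)
```

With these rows the wrapped union all keeps all five keys, while the distinct form and the plain union both reduce to the three unique keys.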

3. intersect: not supported

4. minus: not supported

5. except: not supported

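6. inner join / join / left outer join / right outer join / full outer join / left semi join are all supported.

For left outer join / right outer join / full outer join, the outer keyword is required.

join is the simplest join: only rows whose keys match on both sides (the intersection) are returned;

left outer join is driven by the left table: every left row is kept, and the right-side columns are null for keys that do not exist in the right table;

right outer join is driven by the right table: every right row is kept, and the left-side columns are null for keys that do not exist in the left table;

full outer join keeps all rows of both tables: matching rows are joined, and unmatched rows from either side are returned with the other side's columns set to null (this is not a Cartesian product);

left semi join is mainly used as a replacement for in / exists subqueries;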

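Hive, as covered in this summary, does not support subqueries in the where clause, so the in / exists subquery clauses common in SQL are not available.

Unsupported subquery, e.g. select * from src aa where aa.key in (select bb.key from test bb);

It can be replaced in either of the following two ways:

select * from src aa left outer join test bb on aa.key=bb.key where bb.key is not null;

select * from src aa left semi join test bb on aa.key=bb.key;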
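One detail in the left outer join rewrite deserves a check: the null test. In standard SQL, comparing a value with null using <> evaluates to null rather than true, so a filter written as bb.key <> null drops every row, and the test has to be is not null. The sketch below shows this using Python's built-in sqlite3 module as a stand-in engine (an assumption, chosen so the example runs without a cluster); the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (key INTEGER);
    CREATE TABLE test (key INTEGER);
    INSERT INTO src VALUES (1), (2), (3);   -- made-up sample rows
    INSERT INTO test VALUES (2), (3);
""")

base = (
    "SELECT aa.key FROM src aa "
    "LEFT OUTER JOIN test bb ON aa.key = bb.key WHERE "
)

# Comparing with NULL via <> yields NULL, which the WHERE clause treats as false.
with_not_equal = conn.execute(base + "bb.key <> NULL").fetchall()

# IS NOT NULL keeps exactly the src rows that found a match in test.
with_is_not_null = conn.execute(
    base + "bb.key IS NOT NULL ORDER BY aa.key"
).fetchall()

print(with_not_equal)
print(with_is_not_null)
```

The first query returns no rows at all, while the second returns the keys present in both tables.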

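In most cases JOIN ON and LEFT SEMI JOIN are equivalent.

The difference appears when tables A and B are joined and B contains duplicate rows for a key:

with JOIN ON, a matching row of A is returned once for each matching row of B (two rows for two duplicates), because the ON condition holds for each of them;

with LEFT SEMI JOIN, a row of A is returned as soon as one matching row is found in B, and B is not searched any further,

so duplicates in B do not produce duplicate rows in the result.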
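The duplicate-row difference can be reproduced with a short sketch. It uses Python's built-in sqlite3 module as a stand-in engine, and because SQLite has no LEFT SEMI JOIN syntax, the semi join is written as an exists subquery, which has the same "return on first match" meaning. Both choices are assumptions made for the sake of a runnable example; the tables a and b and their rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER);
    CREATE TABLE b (key INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1), (1), (3);   -- key 1 is duplicated in b
""")

# Inner join: the a-row with key 1 is emitted once per matching b-row.
join_rows = conn.execute(
    "SELECT a.key FROM a JOIN b ON a.key = b.key"
).fetchall()

# Semi-join semantics, written with EXISTS because SQLite lacks LEFT SEMI JOIN:
# each a-row appears at most once, however many b-rows match.
semi_rows = conn.execute(
    "SELECT a.key FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.key = a.key)"
).fetchall()

print(join_rows)
print(semi_rows)
```

The inner join yields key 1 twice, whereas the semi-join form yields it once.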

left outer join supports a subquery as the joined table, e.g. select aa.* from src aa left outer join (select * from test111) bb on aa.key=bb.a;

7. Four ways to import data into Hive

1) Importing data from the local file system into a Hive table

create table wyp(id int,name string) ROW FORMAT delimited fields terminated by '\t' STORED AS TEXTFILE;

load data local inpath 'wyp.txt' into table wyp;

2) Importing data from HDFS into a Hive table

[wyp@master /home/q/hadoop-2.2.0]$ bin/hadoop fs -cat /home/wyp/add.txt

hive> load data inpath '/home/wyp/add.txt' into table wyp;

3) Querying data from another table and importing it into a Hive table

hive> create table test(

> id int, name string

> ,tel string)

> partitioned by

> (age int)

> ROW FORMAT DELIMITED

> FIELDS TERMINATED BY '\t'

> STORED AS TEXTFILE;

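Note: the test table uses age as its partition column. Partitions: in Hive, each partition of a table corresponds to a directory under the table's directory, and all data of a partition is stored in that directory.

For example, if the wyp table has two partition columns, dt and city, then the partition dt=20131218, city=BJ corresponds to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ (assuming the default warehouse location and database),

and all data belonging to this partition is stored in that directory.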
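The directory layout described above can be sketched as a small helper. The function name partition_path is hypothetical, and the table directory /user/hive/warehouse/wyp is an assumption (the default warehouse location with the table name appended); the sketch only joins column=value segments in partition-column order and does not handle any escaping of special characters.

```python
def partition_path(table_dir, partition_spec):
    """Build the directory for one partition (hypothetical helper).

    partition_spec is a list of (column, value) pairs, given in the order
    in which the partition columns were declared on the table.
    """
    segments = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + segments)


# Table directory is an assumption: default warehouse location plus table name.
path = partition_path(
    "/user/hive/warehouse/wyp",
    [("dt", "20131218"), ("city", "BJ")],
)
print(path)
```

Each additional partition column adds one more nested directory level, which is why the order of the partition columns matters for the resulting path.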

hive> insert into table test

> partition (age='25')

> select id, name, tel

> from wyp;

The partition can also be specified dynamically, by supplying the partition value in the select statement:

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> insert into table test

> partition (age)

> select id, name,

> tel, age

> from wyp;

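Hive also supports inserting data with insert overwrite, which replaces the existing data in the target partitions instead of appending to it: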

hive> insert overwrite table test

> PARTITION (age)

> select id, name, tel, age

> from wyp;

Hive also supports multi-table inserts:

hive> from wyp

> insert into table test

> partition(age)

> select id, name, tel, age

> insert into table test3
